Crime-Predicting Algorithms May Not Fare Much Better Than Untrained Humans
The American criminal justice system couldn’t get much less fair. Across the country, some 1.5 million people are locked up in state and federal prisons. More than 600,000 people, the vast majority of whom have yet to be convicted of a crime, sit behind bars in local jails. Black people make up 40 percent of those incarcerated, despite accounting for just 13 percent of the US population.
With the size and cost of jails and prisons rising—not to mention the inherent injustice of the system—cities and states across the country have been lured by tech tools that promise to predict whether someone might commit a crime. These so-called risk assessment algorithms, currently used in states from California to New Jersey, crunch data about a defendant’s history—things like age, gender, and prior convictions—to help courts decide who gets bail, who goes to jail, and who goes free.
But as local governments adopt these tools, and lean on them to inform life-altering decisions, a fundamental question remains: What if these algorithms aren’t actually any better at predicting crime than humans are? What if recidivism isn’t actually that predictable at all?
That’s the question that Dartmouth College researchers Julia Dressel and Hany Farid set out to answer in a new paper published today in the journal Science Advances. They found that one popular risk-assessment algorithm, called Compas, predicts recidivism about as well as a random online poll of people who have no criminal justice training at all.
"There was essentially no difference between people responding to an online survey for a buck and this commercial software being used in the courts," says Farid, who teaches computer science at Dartmouth. "If this software is only as accurate as untrained people responding to an online survey, I think the courts should consider that when trying to decide how much weight to put on them in making decisions."
Man Vs Machine
While she was still a student at Dartmouth majoring in computer science and gender studies, Dressel came across a ProPublica investigation that showed just how biased these algorithms can be. That report analyzed Compas's predictions for some 7,000 defendants in Broward County, Florida, and found that the algorithm was more likely to incorrectly categorize black defendants as having a high risk of reoffending. It was also more likely to incorrectly categorize white defendants as low risk.
That was alarming enough. But Dressel also couldn't seem to find any research that studied whether these algorithms actually improved on human assessments.
"Underlying the whole conversation about algorithms was this assumption that algorithmic prediction was inherently superior to human prediction," she says. But little proof backed up that assumption; this nascent industry is notoriously secretive about developing these models. So Dressel and her professor, Farid, designed an experiment to test Compas on their own.
Using Amazon Mechanical Turk, an online marketplace where people get paid small amounts to complete simple tasks, the researchers asked about 400 participants to decide whether a given defendant was likely to reoffend based on just seven pieces of data, not including that person's race. The sample included 1,000 real defendants from Broward County, because ProPublica had already made its data on those people, as well as information on whether they did in fact reoffend, public.
They divided the participants into groups, so that each turk assessed 50 defendants, and gave them the following brief description of each defendant:
The defendant is a [SEX] aged [AGE]. They have been charged with: [CRIME CHARGE]. This crime is classified as a [CRIMINAL DEGREE]. They have been convicted of [NON-JUVENILE PRIOR COUNT] prior crimes. They have [JUVENILE-FELONY COUNT] juvenile felony charges and [JUVENILE-MISDEMEANOR COUNT] juvenile misdemeanor charges on their record.
That's just seven data points, compared to the 137 that Compas amasses through its defendant questionnaire. In a statement, Equivant, the company that makes Compas, says it uses only six of those data points to make its predictions. Still, these untrained online workers were roughly as accurate in their predictions as Compas.
Overall, the turks predicted recidivism with 67 percent accuracy, compared to Compas's 65 percent. Even without access to a defendant's race, they were also more likely to wrongly predict that a black defendant would reoffend than to wrongly predict the same of a white defendant, a disparity in what's known as the false positive rate. That indicates that even when racial data isn't available, certain data points, like number of convictions, can become proxies for race, a central obstacle to eradicating bias in these algorithms. The participants' false positive rate for black defendants was 37 percent, compared to 27 percent for white defendants. That roughly mirrored Compas's false positive rate of 40 percent for black defendants and 25 percent for white defendants. The researchers repeated the study with another 400 participants, this time providing them with racial data, and the results were largely the same.
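For readers who want to see what a false positive rate actually measures, here is a minimal sketch in Python. It is not the researchers' code, and the records are a handful of made-up examples rather than Broward County data; the field names (`group`, `predicted_reoffend`, `reoffended`) are assumptions chosen for illustration. The idea is simply to look only at people who did not reoffend and ask what share of them were flagged anyway, separately for each group.

```python
from collections import defaultdict


def false_positive_rate(records):
    """Share of people who did NOT reoffend but were predicted to reoffend."""
    did_not_reoffend = [r for r in records if not r["reoffended"]]
    if not did_not_reoffend:
        return float("nan")  # undefined when nobody in the group stayed clean
    wrongly_flagged = sum(1 for r in did_not_reoffend if r["predicted_reoffend"])
    return wrongly_flagged / len(did_not_reoffend)


def false_positive_rate_by_group(records, key="group"):
    """Compute the false positive rate separately for each value of `key`."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record)
    return {name: false_positive_rate(rows) for name, rows in grouped.items()}


# Made-up toy records for illustration only (hypothetical, not real defendants).
records = [
    {"group": "A", "predicted_reoffend": True, "reoffended": False},
    {"group": "A", "predicted_reoffend": False, "reoffended": False},
    {"group": "A", "predicted_reoffend": True, "reoffended": True},
    {"group": "B", "predicted_reoffend": False, "reoffended": False},
    {"group": "B", "predicted_reoffend": False, "reoffended": False},
    {"group": "B", "predicted_reoffend": True, "reoffended": False},
    {"group": "B", "predicted_reoffend": False, "reoffended": False},
]

rates = false_positive_rate_by_group(records)
for name, rate in sorted(rates.items()):
    print(f"group {name}: false positive rate = {rate:.2f}")
```

Two predictors can post similar overall accuracy and still differ on this measure from one group to the next, which is why the per-group breakdown matters.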
"Julia and I are sitting there thinking: How can this be?" Farid says. "How can it be that this software that is commercially available and being used broadly across the country has the same accuracy as mechanical turk users?"
To validate their findings, Farid and Dressel built their own algorithm and trained it on the Broward County data, including information on whether people did in fact reoffend. Then, they began testing how many data points the algorithm actually needed to retain the same level of accuracy. If they took away the defendant's sex or the type of crime the person was charged with, for instance, would it remain just as accurate?
They found that the algorithm really required only two data points to achieve 65 percent accuracy: the person's age, and the number of prior convictions. "Basically, if you're young and have a lot of convictions, you're high risk, and if you're old and have few priors, you're low risk," Farid says. Of course, this combination of clues also carries racial bias, because of the racial imbalance in convictions in the US.
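To make Farid's description concrete, here is a rough sketch of what a two-input risk score could look like. This is not the researchers' model and not Compas: the weights, the bias term, and the 0.5 cutoff below are invented purely to show the direction of the effect (younger and more priors pushes the score up, older and fewer priors pushes it down). A real model would learn such numbers from data.

```python
import math

# Hypothetical values, chosen only for illustration. They are NOT taken
# from the Dartmouth paper or from any deployed risk assessment tool.
AGE_WEIGHT = -0.05     # each additional year of age lowers the score
PRIORS_WEIGHT = 0.25   # each prior conviction raises the score
BIAS = 1.0


def risk_score(age, prior_convictions):
    """Map age and prior convictions to a score between 0 and 1."""
    z = BIAS + AGE_WEIGHT * age + PRIORS_WEIGHT * prior_convictions
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing function


def predict_reoffend(age, prior_convictions, threshold=0.5):
    """Flag someone as high risk when the score reaches the threshold."""
    return risk_score(age, prior_convictions) >= threshold


# Young with many priors versus older with none.
print(predict_reoffend(age=20, prior_convictions=6))
print(predict_reoffend(age=60, prior_convictions=0))
```

The point of the sketch is how little machinery is involved: two numbers in, one score out.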
That suggests that while these seductive and secretive tools claim to surgically pinpoint risk, they may actually be blunt instruments, no better at predicting crime than a bunch of strangers on the internet.
Equivant takes issue with the Dartmouth researchers' findings. In a statement, the company said the algorithm the researchers built suffered from "overfitting," meaning that it was trained so closely on the data that its measured accuracy could be artificially inflated. But Dressel notes that she and Farid specifically avoided that trap by training the algorithm on just 80 percent of the data, then running the tests on the other 20 percent. None of the samples they tested, in other words, had been seen by the algorithm during training.
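The 80/20 approach Dressel describes is a standard guard against overfitting, and it can be sketched in a few lines. The function below is an illustrative assumption, not the researchers' code: it shuffles a list of records with a fixed seed and holds out a fifth of them, so that anything measured on the held-out portion comes from cases the model never trained on. The integers stand in for defendant records.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) with no overlap."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(rows)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-ins for 1,000 defendant records (hypothetical placeholders).
rows = list(range(1000))
train, test = train_test_split(rows)

print(len(train), len(test))
print(set(train).isdisjoint(test))
```

Accuracy reported on the training portion can flatter a model; accuracy on the held-out portion is the fairer number.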
Despite its issues with the paper, Equivant also claims that it legitimizes its work. "Instead of being a criticism of the COMPAS assessment, [it] actually adds to a growing number of independent studies that have confirmed that COMPAS achieves good predictability and matches," the statement reads. Of course, "good predictability" is relative, Dressel says, especially in the context of bail and sentencing. "I think we should expect these tools to perform even better than just satisfactorily," she says.
The Dartmouth paper is far from the first to raise questions about this specific tool. According to Richard Berk, chair of the University of Pennsylvania's department of criminology who developed Philadelphia's probation and parole risk assessment tool, there are superior approaches on the market. Most, however, are being developed by academics, not private institutions that keep their technology under lock and key. "Any tool whose machinery I can't examine, I’m skeptical about," Berk says.
While Compas has been on the market since 2000 and has been used widely in states from Florida to Wisconsin, it's just one of dozens of risk assessments out there. The Dartmouth research doesn't necessarily apply to all of them, but it does invite further investigation into their relative accuracy.
Still, Berk acknowledges that no tool will ever be perfect or completely fair. It's unfair to keep someone behind bars who presents no danger to society. But it's also unfair to let someone out onto the streets who does. Which is worse? Which should the system prioritize? Those are policy questions, not technical ones, but they're nonetheless critical for the computer scientists developing and analyzing these tools to consider.
"The question is: What are the different kinds of unfairness? How does the model perform for each of them?" he says. "There are tradeoffs between them, and you cannot evaluate the fairness of an instrument unless you consider all of them."
Neither Farid nor Dressel believes that these algorithms are inherently bad or misleading. Their goal is simply to raise awareness about the accuracy—or lack thereof—of tools that promise superhuman insight into crime prediction, and to demand increased transparency into how they make those decisions.
“Imagine you’re a judge, and you have a commercial piece of software that says we have big data, and it says this person is high risk,” Farid says. “Now imagine I tell you I asked 10 people online the same question, and this is what they said. You’d weigh those things differently.” As it turns out, maybe you shouldn't.