Yay, multi-dimensional best fit!
June 2, 2016 7:48 AM

Uncovering Big Bias with Big Data, by David Colarusso - "What follows is the story of how I used those cases to discover what best predicts defendant outcomes: race or income. This post is not a summary of my findings, though you will find them in this article. It is a look behind the curtain of data science, a how to cast as case study. Yes, there will be a few equations. But you can safely skim over them without missing much. Just pay particular attention to the graphs." posted by the man of twists and turns (19 comments total) 24 users marked this as a favorite
 
I was all prepared to just give a knee-jerk "well, duh" but thought I'd at least rtfa first. I got as far as 'Virginia' before popping back to say WELL, DUUUHH.
(wait I think I pronounced that wrong, hold on)
WELLL, DUUUUHHHHHHH!
posted by sexyrobot at 8:22 AM on June 2, 2016 [3 favorites]


This all tells us for a black man in Virginia to get the same treatment as his Caucasian peer, he must earn an additional $90,000 a year.

Interesting, depressing, and totally not surprising conclusion.

I do have to say, though, that leading with a pic titled "Big Data and the Law" that consists of a sloppy collage of an oversized Data-from-Star-Trek standing next to the Supreme Court building is maybe a little on-the-nose.
posted by dersins at 8:27 AM on June 2, 2016 [3 favorites]


Nice piece! To satisfy people who want to know what the findings were:
This all tells us for a black man in Virginia to get the same treatment as his Caucasian peer, he must earn an additional $90,000 a year.

Similar amounts hold for American Indians and Hispanics, with the offset for Asians coming in at a little less than half as much.

The answer to our question seems to be that race-based bias is pretty big. It is also worth noting that being male isn’t helpful either.
posted by languagehat at 8:27 AM on June 2, 2016 [4 favorites]


Of possibly marginal relevance: comment from The Onion on race and justice in America.
posted by fredludd at 8:42 AM on June 2, 2016 [1 favorite]


There are two gaping holes in this analysis: (1) Method selection for non-linear data, and (2) interactions.

In the first case, although log-transformation is a reasonable first-pass strategy, it's not really appropriate when your data include zeros. A better approach would be to use something like a Poisson or negative binomial model, fitting the parameters using a GLM framework, since such a model can make full use of any zeros that appear in the data.
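A minimal sketch of that point, in plain Python rather than a stats package. The data are made up for illustration and are not from the article, and `can_log_transform` and `fit_poisson` are hypothetical helper names. A log transform cannot use the zero counts at all, while a Poisson model with a log link, fit here by Newton-Raphson, uses every row:

```python
import math

# Hypothetical toy data: x is a predictor, y is a count outcome with zeros.
x = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [0, 1, 0, 1, 2, 0, 3, 2, 4, 6, 5, 8]


def can_log_transform(values):
    """Return False if any value would make math.log fail (zero or negative)."""
    return all(v > 0 for v in values)


def fit_poisson(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson on the Poisson likelihood."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information: a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: (b0, b1) += inverse(information) * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


print("log transform usable on raw counts:", can_log_transform(y))
print("Poisson fit (b0, b1):", fit_poisson(x, y))
```

In real work you would reach for a GLM routine in a stats library and check for overdispersion before choosing between Poisson and negative binomial; the hand-rolled loop is only there to show that zeros need no special handling.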

In the second case, it's extremely likely that race, gender, and income interact with one another. Mathematically, this might take the form of an "income x race" variable, and including such a variable puts a "twist" in the function (making it look like a curved sheet rather than a flat plane). Insofar as these data display "multicollinearity," it's much more difficult than this blog post implies to parse which factor is larger, despite the size of the sample. Should an interaction exist, it can make an *enormous* difference. For example, it might be the case that racial differences in sentencing are *even more pronounced* among poor defendants than among the rich, but it could go the other way, such that it's among middle-class defendants that the racial divide is most pronounced. In either of these scenarios, the simplistic "black people need to make $90,000 more a year" pronouncement would turn out to simply be false for most people: How much more a black person would need to make would instead depend on their income.
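To make the "it depends on their income" point concrete, here is a small sketch with a linear model plus an income x race term. All coefficients are invented for illustration and are not estimates from the article's data; `predicted` and `income_offset` are hypothetical names. Without the interaction the income offset is one constant; with it, the offset changes with income:

```python
def predicted(income, race, b0, b_income, b_race, b_interaction=0.0):
    """Linear predictor with an optional income x race interaction term."""
    return b0 + b_income * income + b_race * race + b_interaction * income * race


def income_offset(income, b_income, b_race, b_interaction=0.0):
    """Extra income d so that predicted(income + d, race=1) equals
    predicted(income, race=0). Solved algebraically from the model above."""
    return (-b_race - b_interaction * income) / (b_income + b_interaction)


# Hypothetical coefficients: more income lowers the outcome, race=1 raises it.
B0, B_INCOME, B_RACE = 1.0, -0.5, 2.0

for income in (0, 10, 50):
    flat = income_offset(income, B_INCOME, B_RACE)
    twisted = income_offset(income, B_INCOME, B_RACE, b_interaction=0.1)
    print(income, flat, twisted)
```

The "flat" column never changes, which is what licenses a single headline number. The "twisted" column does, so under an interaction no single dollar figure describes everyone.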

Additionally, it's important to keep in mind that income is measured very indirectly here, being inferred from zip code. In other words: It is assumed that if a black person lives in a given zip code, their income is identical to that of a white person in the same zip code. This is probably a bad assumption, since the racial wage gap is very real, and any given zip code is likely to have a range of housing options at different price points. This may mean that the gap in income is *larger* than is estimated from these data, which will in turn shift more of the explanatory weight onto race. Put another way: Income is almost certainly an "attenuated predictor" (i.e. noise has been added that weakens its explanatory signal), whereas it's pretty unlikely that the court had any doubt regarding the defendant's race (if anything, race is probably being exaggerated rather than attenuated). This makes it much harder to get good estimates of both effects, since it will probably weaken one and exaggerate the other.
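A rough simulation of the "weaken one and exaggerate the other" effect. It assumes classical additive noise on income, which is a simplification of the zip-code proxy described above, and every number in it is invented. `ols_two` and `simulate_attenuation` are hypothetical helper names:

```python
import random


def ols_two(x1, x2, y):
    """OLS slopes for y ~ intercept + x1 + x2, via centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate_attenuation(n=5000, seed=0):
    """Return (income slope, group slope) for true income and for a noisy proxy."""
    rng = random.Random(seed)
    group = [rng.randint(0, 1) for _ in range(n)]
    # Assumption: group 1 has lower average income (standardized units).
    income = [-g + rng.gauss(0, 1) for g in group]
    # Assumed true model: income effect -1.0, group effect +0.5.
    outcome = [-1.0 * inc + 0.5 * g + rng.gauss(0, 1)
               for inc, g in zip(income, group)]
    # The analyst only sees income plus measurement noise.
    proxy = [inc + rng.gauss(0, 1) for inc in income]
    return ols_two(income, group, outcome), ols_two(proxy, group, outcome)


clean, noisy = simulate_attenuation()
print("true income used :", clean)
print("noisy proxy used :", noisy)
```

With the proxy, the income slope shrinks toward zero and the group slope grows, even though the data-generating process is identical in both fits.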
posted by belarius at 8:48 AM on June 2, 2016 [16 favorites]


This is a pretty poor model and I wish better conducted research on this topic went viral instead of this.

There are a number of serious problems:

1) model fit. The R-squared, which is a very poor indicator of model fit to begin with, is remarkably low. The model simply doesn't fit the data well.

2) controls. The controls are limited by what's in the court records. The chance of omitted variable bias is very real. Of course, the data is limited, but including other variables, like whether those charged had priors or not, would affect the outcomes.

3) collinearity. I imagine income and race are highly correlated in Virginia, this won't bias estimates necessarily but does make inferences problematic.
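One way to put a number on point 3 is the variance inflation factor. This is a sketch for the two-predictor case only, with made-up values rather than the Virginia data; `pearson_r` and `variance_inflation` are hypothetical helper names:

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def variance_inflation(r):
    """VIF for a model with exactly two predictors correlated at r."""
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor values, chosen only to be visibly correlated.
predictor_a = [1, 2, 3, 4, 5]
predictor_b = [1, 3, 2, 5, 4]

r = pearson_r(predictor_a, predictor_b)
print("correlation:", r)
print("variance inflation factor:", variance_inflation(r))
```

The coefficients stay unbiased, but each one's sampling variance is multiplied by the VIF, so standard errors grow by its square root. That is the sense in which inference gets harder as the predictors become more correlated.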

Obviously the causal relationship between race and judgment outcomes is not identified. The question we are really interested in (and the one the blog post is purporting to answer) is: if there are two people who are similar in many ways but differ only in their race, how much does this change the judgment outcome?

(Also, personally, I like tree based methods for stuff like this better).
posted by MisantropicPainforest at 8:49 AM on June 2, 2016 [5 favorites]


I thought the most telling part was the update.

Shortly after publication, a commenter took me up on the offer to dig into the data and noticed that I had neglected to clean some extraneous entries from the dataset (i.e., those entries with unidentified race and sex). It amounted to two lines of missing code the consequence of which was to inflate the coefficients associated with race and sex in the model. The coefficient for sex only changed slightly. However, those associated with race came down a good deal. After correcting for this error, the original observation that a black man had to earn an additional $500,000 to be on equal footing with his white peers was amended to reflect the fact that the model now puts the dollar amount closer to $90,000.

It's an object lesson in why it's unwise to take statistical analyses at face value, in the benefits of opening up data, and in the suspicion you should have for any analysis that hasn't been independently vetted. After all, you get a result with all the impressive looking statistical bells and whistles no matter how poorly you've formatted your data and your theory. Too few people double check an answer they like; kudos to this guy for being up front about a reasonable correction.

on preview: Yes, it's also clear this guy isn't a statistician.
posted by Across the pale parabola of joy at 8:50 AM on June 2, 2016 [8 favorites]


So are you guys saying that he fits right into the Danger Zone! on the Venn Diagram he quotes?
posted by clawsoon at 8:54 AM on June 2, 2016 [1 favorite]


The log transformations were suspect to me, and it feels like a lot of this hinges on that pretty big asterisk of assuming income based on the zipcode mean.

I was mostly surprised that the author was so good about couching their analysis along the way to avoid bad assumptions, then prominently concluded that "for a black man in Virginia to get the same treatment as his Caucasian peer, he must earn an additional $90,000 a year." Like, your research question and strategies do not begin to approach that type of concrete statement.
posted by Think_Long at 9:01 AM on June 2, 2016


A commenter also makes a pretty good point:

On that note, you may also find that your model would make more sense if you treated seriousness (or the crime itself) as a categorical variable, not as an ordinal variable with a linear/loglinear/exponential relationship as you essentially disproved in your own analysis.

It does seem odd to treat the degree of the charge as a quantitative variable.
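For anyone wondering what "categorical, not ordinal" means in practice, here is a minimal sketch of dummy (one-hot) coding. The charge labels are placeholders, not the actual categories in the court data, and `one_hot` is a hypothetical helper:

```python
def one_hot(values, baseline):
    """Dummy-code a categorical column, leaving out the baseline level.

    Returns the non-baseline levels (sorted) and one 0/1 row per input value.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Placeholder charge classes; an ordinal coding would map these to 1, 2, 3
# and force the steps between them to be equally spaced.
charges = ["class_a", "class_b", "class_c", "class_a"]

levels, rows = one_hot(charges, baseline="class_a")
print(levels)
for charge, row in zip(charges, rows):
    print(charge, row)
```

Each non-baseline level gets its own coefficient in the regression, so the model no longer assumes that moving from one class to the next always has the same effect.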
posted by Think_Long at 9:12 AM on June 2, 2016


3) collinearity. I imagine income and race are highly correlated in Virginia, this won't bias estimates necessarily but does make inferences problematic.


People deny racism by arguing that it's really just classism all the time -- but wasn't the whole point of this exercise to control for income and race correlations?

(To say nothing of the fact that the reason blacks are disproportionately more likely to be lower income than other races is of course... racism).
posted by pocketfullofrye at 9:32 AM on June 2, 2016


People deny racism by arguing that it's really just classism all the time

Well, white people do, anyway.
posted by dersins at 9:49 AM on June 2, 2016 [1 favorite]


#NOTALLWHITEPEOPLE
posted by dersins at 9:50 AM on June 2, 2016


I did this sort of quackery as an undergrad researcher. It always felt like confirming my own biases. You mean I could have gotten paid big money for making shit up... I mean data science? Missed my calling.
posted by Yowser at 10:18 AM on June 2, 2016 [1 favorite]



People deny racism by arguing that it's really just classism all the time -- but wasn't the whole point of this exercise to control for income and race correlations?


You can't adequately 'control' for a variable that is highly collinear with another independent variable and then call it a day. It doesn't work like that.
posted by MisantropicPainforest at 11:14 AM on June 2, 2016 [4 favorites]


I saw this a few days ago and the model did not seem plausible to me. I mean R-squared means something and it is telling you that your model sucks and you need to try a different model. That said, several times in the post he implores people to pick up the data and do more and better data analysis. Hopefully some people who are more versed in statistics can push this further.
posted by selenized at 12:48 PM on June 2, 2016


I mean R-squared means something and it is telling you that your model sucks and you need to try a different model.

Or just that you're looking at a noisy dependent variable. In some circumstances an r2 of 0.9 is bad, in other circumstances an r2 of 0.25 is pretty good, etc.
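A quick simulated illustration, with all numbers invented. The model is exactly right in both runs (the true slope is 2.0) and only the noise in the dependent variable changes. `slope_and_r2` and `simulate_r2` are hypothetical helper names:

```python
import random


def slope_and_r2(x, y):
    """Simple linear regression: return the fitted slope and R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def simulate_r2(noise_sd, n=2000, seed=1):
    """Generate y = 2x + noise and fit the correctly specified line."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [2.0 * a + rng.gauss(0, noise_sd) for a in x]
    return slope_and_r2(x, y)


quiet = simulate_r2(noise_sd=0.5)
noisy = simulate_r2(noise_sd=4.0)
print("low noise  (slope, r2):", quiet)
print("high noise (slope, r2):", noisy)
```

Both fits recover a slope near the true value, but the r2 drops sharply in the noisy run. A low r2 alone does not show that the functional form is wrong, though it also does nothing to show that it is right.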
posted by ROU_Xenophobe at 2:44 PM on June 2, 2016


I kinda thought that, with the number of starting variables, a principal components analysis might have been helpful in at least cluing the author in on appropriate variable reduction/selection. I used to do some nasty stepwise torture to my variables in the name of modeling...
posted by Nanukthedog at 8:27 PM on June 2, 2016


(Also, personally, I like tree based methods for stuff like this better).
posted by MisantropicPainforest


Of course you wood!
posted by I-Write-Essays at 10:48 AM on June 4, 2016




This thread has been archived and is closed to new comments