Fairness and Bias Reduction in Machine Learning
July 23, 2018 8:01 AM

As artificial intelligence begins to drive many important decisions (e.g. loans, college admissions, bail), the problem of biased AI has become increasingly prominent (previously, previously, previously). Recently researchers, including at Google and Microsoft, have started taking the problem of fairness seriously.

For a deeper dive: Eirini Malliaraki's Toward ethical, transparent and fair AI/ML: a critical reading list is a comprehensive set of resources.

Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) has hosted an annual conference on the subject since 2014, with open access papers posted online.
posted by jedicus (25 comments total) 47 users marked this as a favorite
 
Just finished a year-long cert in Data Science/Analytics. In "AI" (which it isn't - these machines are not "intelligent" in the self-aware sense) / Machine Learning, you combine your own analytical bias with statistics to build "black box" models that are improved predictors (better than the statistical 50/50 of blind guessing) for modeling real-world dynamics.

One of the issues here is that "black box" models are ones that are built without specific vision into their inner workings, and are often huge jumbles of machine generated decision/code that humans would barely understand if we opened the cover and looked into the model directly. Usually these models are just used with test and real world data, and their output/predictions compared iteratively to what really ended up happening. So we data analytics people would get a load of data, clean it up (an act of personal bias and skills right there - designing toward the desired solution injects personal bias into the initial load of data), set aside some testing data, select and train up a black box statistical model (another place in the process where our personal bias can get folded into the model) with the rest of the data, and test it with the testing data, and see how the model did. Then we might rinse and repeat, until we got an acceptable score (of reliability against testing data, and real world data) out of the model.
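For anyone who hasn't seen that loop written down, here is a minimal sketch of it in plain Python. Everything in it is made up for illustration: the synthetic data, the one-feature "model" (a single threshold) and the names make_data, train and accuracy are assumptions, not anything from a real project.

```python
import random

def make_data(n, seed=0):
    """Synthetic (feature, label) pairs: the label is 1 more often when the feature is large."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        y = 1 if rng.random() < 0.2 + 0.6 * x else 0
        rows.append((x, y))
    return rows

def accuracy(threshold, rows):
    """Fraction of rows where 'predict 1 if x >= threshold' matches the label."""
    hits = sum((x >= threshold) == (y == 1) for x, y in rows)
    return hits / len(rows)

def train(rows):
    """'Training' here is just picking the threshold with the best training accuracy."""
    candidates = [i / 20 for i in range(21)]
    return max(candidates, key=lambda t: accuracy(t, rows))

# Get a load of data, shuffle it, and set aside 20% for testing.
rows = make_data(2000)
random.Random(1).shuffle(rows)
split = int(0.8 * len(rows))
train_rows, test_rows = rows[:split], rows[split:]

# Train on the rest, then score against the held-out rows only.
threshold = train(train_rows)
test_accuracy = accuracy(threshold, test_rows)
print(f"threshold={threshold:.2f} test accuracy={test_accuracy:.3f}")
```

Note that the score says how often the model is right on held-out data and nothing about why, which is the point of the paragraph above.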

But at no time do we look into the actual model, because we wouldn't understand the infinitely branching machine statistical code anyway. So it's SUPER problematic, because even if the model is, e.g., 20% better at predicting something - by which I mean the model rates a 60% accuracy rather than the 50% of wild guessing, we often don't know WHY, and we don't know if our assumptions made while cleaning up the data or while training up the model include subconscious, unintentional or intentional bias from the arbitrary, racist, sexist, *ist society and our enculturation.

And this is not even considering that many professionals in the field p-hack and do other shitty, unethical stuff to make their work look better and unfairly promote themselves.
posted by kalessin at 8:18 AM on July 23, 2018 [17 favorites]


One of the issues here is that "black box" models are ones that are built without specific vision into their inner workings

Interpretability, explainability, and transparency of ML models is a big area of research. There was a conference on the subject last year.
Zhang and Zhu's Visual Interpretability for Deep Learning: a Survey is a good example of how to interpret deep learning computer vision models. Christoph Molnar has a free online book on interpretable ML.

we don't know if our assumptions made while cleaning up the data or while training up the model include subconscious, unintentional or intentional bias from the arbitrary, racist, sexist, *ist society and our enculturation.

Addressing this problem is the goal of some of Microsoft's research in this area (linked in the FPP):
Our method operates as a game between the classification algorithm and a “fairness enforcer.” The classification algorithm tries to come up with the most accurate classification rule on possibly-reweighted data, while the fairness enforcer checks the chosen fairness definition. ... Eventually, this process yields a classification rule that is fair, according to the fairness definition. Moreover, it is also the most accurate among all fair rules, provided the classification algorithm consistently tries to minimize the error on the reweighted data.

...

Our method works with many different definitions of fairness and only needs access to protected attributes, such as gender or race, during training, not when the classifier is deployed in an application. Because our method works as a “wrapper” around any existing classifier, it is easy to incorporate into existing machine learning systems.
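
To make the "checks the chosen fairness definition" step concrete, here is a rough sketch of one such check. It assumes demographic parity (equal rates of positive predictions across groups) as the chosen definition, and it is only the check, not Microsoft's reweighting game. The function names and the toy data are hypothetical.

```python
def selection_rates(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(predictions, groups):
    """Largest difference in selection rate between any two groups (0 means parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Toy predictions for eight people in two groups, "a" and "b".
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(predictions, groups)
gap = demographic_parity_gap(predictions, groups)
print(rates, gap)
```

The group labels are only consumed by the check, which fits the quoted claim that protected attributes are needed during training and not at deployment.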
posted by jedicus at 8:36 AM on July 23, 2018 [3 favorites]


People have cooked up ways of trying to measure fairness in black-box models. Unfortunately, as of right now everything seems to have significant problems. An interesting recent talk. Its summary slide is:
Summary
  1. Most proposed mathematical measures of fairness are poor proxies for detecting discrimination.
  2. Attempts to equalize these measures can itself lead to discriminatory or otherwise perverse decisions.
  3. The idea that there are trade-offs between different measures is largely illusory.
One interesting set of approaches actually relies on causal inference, see previous MetaFilter discussion. This approach is not without problems but has some advantages. In particular, a lot of the legal language around discrimination is in terms of counterfactuals and cause-effect reasoning.
posted by vogon_poet at 8:36 AM on July 23, 2018 [5 favorites]


My naive approach would be to hire and fire AI like we hire and fire people: make sure it's been to the right schools and got good grades, that it has good reviews from past professors and bosses, and that it performs well on hiring tests: give lots and lots of sample problems to solve and see if it meets our expectations.

And after you start using it, treat it as if you think it's fully aware and trying to cover up for itself, because it is created and owned and run by fully aware humans who may be fully aware that it/they are doing wrong and possibly being investigated. Continually audit it without it knowing it's being audited. Give it lots of test cases that it doesn't know are test cases. Encourage others to quietly submit test cases and analyze the results. The NAACP, for just one example, needs to keep a close eye on this shit. A loan officer who can process many housing loan applications per second merits some close watching.

Also, we are going to need AI to analyze AI, and more AI to analyze that AI, and so on. It's turtle graphics all the way down.
posted by pracowity at 8:52 AM on July 23, 2018


People have cooked up ways of trying to measure fairness in black-box models. Unfortunately, as of right now everything seems to have significant problems. An interesting recent talk.

That's a great addition! Here's a Washington Post article written by the authors of that talk, focusing on the COMPAS system for predicting whether someone is likely to reoffend while awaiting trial. The practical upshot is that a lot depends on what measure of fairness you use, and it may be mathematically impossible to satisfy multiple fairness measures simultaneously.
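
One way to see where the impossibility comes from is a relation that follows directly from the standard confusion-matrix definitions. The symbols are my own shorthand, not taken from the article: p is a group's base rate, PPV is the share of positive predictions that are correct, and FPR and FNR are the false positive and false negative rates.

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
```

If two groups have different base rates p but the same PPV, then outside of degenerate cases such as a perfect predictor they cannot also share both the same FPR and the same FNR.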
posted by jedicus at 8:53 AM on July 23, 2018 [2 favorites]


My naive approach would be to hire and fire AI like we hire and fire people: make sure it's been to the right schools and got good grades, that it has good reviews from past professors and bosses, and that it performs well on hiring tests: give lots and lots of sample problems to solve and see if it meets our expectations.

The problem here is an underlying assumption that consistency is the natural state. All of the monitoring you suggest makes the same assumption -- that if you verify that it is doing the right thing now, it will still be producing similar results five minutes from now.

In comparing the machines to humans there is a natural tendency to think in terms of a sliding scale -- after all, if a human is rated 70% today it is expected that they would be rated 60%, 70%, or 80% tomorrow. Things slide, but rarely that far.

Computers on the other hand can change from 70% to -1000% in a second, and in my experience produced machines often do so. Until we arrive at a solution for that you can audition and monitor all day and never get the reliability you're looking for.
posted by Tell Me No Lies at 9:08 AM on July 23, 2018 [10 favorites]


My naive approach would be to hire and fire AI like we hire and fire people: make sure it's been to the right schools and got good grades, that it has good reviews from past professors and bosses, and that it performs well on hiring tests

Wouldn't doing that guarantee that it's racist and sexist, though? Or at least classist? That process doesn't seem to work very well for producing humans without those characteristics. (Maybe that's the joke?)
posted by XMLicious at 9:15 AM on July 23, 2018 [4 favorites]


Honestly, I think people are worried about the "black box" part too much. What matters is being able to tell what bias the data you are putting in has, and have a very good range of tests to decide whether your output is giving you what you want.

In fact, having the algorithm be "human readable" just gives a false sense of security. The COMPAS case jedicus linked is the perfect example. The problem there is not in the black box algorithm, it is in the very fact that you are making decisions on a probability that someone is likely to reoffend.

Ranking defendants by their likelihood to reoffend is guaranteed to have a higher false positive rate - a greater percentage of people unjustly restricted - for any population that is more likely on average to reoffend. Their black box could be an omniscient entity telling them the true, exact probability every given individual has of reoffending, and you would still see that bias. Because statistics is devious and something seemingly straightforward like making a bail decision on how likely an individual is to reoffend is mathematically guaranteed to come down harder on people in disadvantaged populations. If you're part of any group that has a higher than average recidivism rate, be it your race, being born in the wrong part of town, or just having terrible family support, that method will mean that you are more likely to be denied bail even if you're going to end up being a model citizen after.
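
Here is a small worked example of that effect. It is a sketch with invented numbers: two hypothetical groups, each a mix of people with a true 20% or 60% chance of reoffending, an "omniscient" scorer that reports those exact probabilities, and bail denied at 50% or above. The groups, the shares and the threshold are all assumptions for illustration.

```python
def rates(risks, threshold=0.5):
    """risks: list of (true_probability, share_of_group) pairs.

    Returns (base_rate, false_positive_rate), where a false positive is a
    person who is flagged (probability >= threshold) but does not reoffend.
    """
    base_rate = sum(p * w for p, w in risks)
    non_reoffenders = sum((1 - p) * w for p, w in risks)
    flagged_non_reoffenders = sum((1 - p) * w for p, w in risks if p >= threshold)
    return base_rate, flagged_non_reoffenders / non_reoffenders

# Same two risk levels in both groups; only the mix differs.
group_low = [(0.2, 0.8), (0.6, 0.2)]   # 80% of the group is low-risk
group_high = [(0.2, 0.4), (0.6, 0.6)]  # 40% of the group is low-risk

base_low, fpr_low = rates(group_low)
base_high, fpr_high = rates(group_high)
print(f"low-rate group:  base rate {base_low:.2f}, false positive rate {fpr_low:.2f}")
print(f"high-rate group: base rate {base_high:.2f}, false positive rate {fpr_high:.2f}")
```

In this setup the scores are exactly right for every individual, yet a person who would not have reoffended is roughly four times as likely to be denied bail in the higher-rate group (3/7 versus 1/9).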

And that's not immediately obvious! It took me a little bit of thinking to work out and I am good at math. But the point remains, human readability of the method is no substitute for really good analysis of the decisions you are making, because statistics is hard, and many of the pitfalls are hidden in plain sight.
posted by Zalzidrax at 9:43 AM on July 23, 2018 [10 favorites]


Maybe I’m missing something obvious (in fact it’s almost certain that I am), but my concern with the black box elements as I understand them is that we don’t have any good way of saying why a correct answer is correct or an incorrect answer is incorrect, and that renders all answers a little suspect, because it’s never good enough to have the right answer if you didn’t arrive via the right method.

Am I totally off base?
posted by GenjiandProust at 10:13 AM on July 23, 2018


One statistical solution to the unpredictability would be to go Minority Report on it. Train up many machines on different data and have a final vote on their predictions.

Of course something similar has been done with weather predictions for years and it hasn't been exactly reliable...
posted by Tell Me No Lies at 10:15 AM on July 23, 2018


Hmmm.... I think it's a confirmation bias issue in a way. ML has allowed a lot of statisticians to use computing to tweak prediction models and improve them, generally, beyond the 50/50 break for a statistically wild guess. But because we're/they're throwing raw computing power at the predictive model problem, while we do somehow get a better-than-guessing predictor model, we don't always have great visibility into why.

But folks are betting on ML and increasing profit margins, etc., within contexts that are filled with randomness (e.g. running a business, investing in the stock market, bitcoin, and sports betting). And with folks out in the world who purely grow their own and others' wealth with speculative investing, people are more than willing to throw huge piles of dollars at a machine, even if they don't understand the inner workings of it, that can demonstrate skewing a 50/50 guess into even a 55/45 machine predictive algorithm. And often, Data Science types will come up with testing algorithms (which test these models with real world data where the results from that data are known) that claim 70/30 or even 85/15. With the volumes of speculation, skewing that 50/50 significantly is well worth it to the investors. Even for the possibility of the magic working again.
posted by kalessin at 10:21 AM on July 23, 2018


Model averaging and training using only some fraction of all data are valid approaches to decrease wildly incorrect predictions. The black box is not that much of a problem if we are not looking for causal explanations. But the feedback loop created by trying to predict crime seems like a very bad idea to me. People tend to assume the roles they are offered in life. If you treat people like potential criminals, they are more likely to commit crime.
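
A minimal sketch of the first sentence, assuming the simplest possible setup: several one-threshold "models" are each fit on a random 30% of made-up data and then combined by majority vote. The data, the stump model and the names fit_stump and vote are illustrative assumptions, not a recommendation of any particular library or method.

```python
import random

def fit_stump(rows):
    """Pick the threshold t in {0.00, 0.05, ..., 1.00} that best fits 'predict 1 if x >= t'."""
    candidates = [i / 20 for i in range(21)]
    def score(t):
        return sum((x >= t) == (y == 1) for x, y in rows)
    return max(candidates, key=score)

def vote(thresholds, x):
    """Majority vote of the stumps; a tie counts as 0."""
    ones = sum(x >= t for t in thresholds)
    return 1 if 2 * ones > len(thresholds) else 0

# Synthetic data: the label is 1 more often when the feature is large.
rng = random.Random(0)
data = []
for _ in range(1000):
    x = rng.random()
    data.append((x, 1 if rng.random() < 0.2 + 0.6 * x else 0))

# Each of the 11 models sees only a random 300 of the 1000 rows.
thresholds = [fit_stump(rng.sample(data, 300)) for _ in range(11)]
ensemble_predictions = [vote(thresholds, x) for x, _ in data]
print(sorted(thresholds))
```

A single stump fit on an unlucky subsample can land on an odd threshold; the vote only changes when more than half of the models agree, which is what damps the wild outliers.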
posted by ikalliom at 10:27 AM on July 23, 2018 [5 favorites]


Making moral machine learning decisions is like making moral machine gunning decisions. The underlying problem doesn't have a solution that's reachable by the tools in hand.

More arts & philosophy & history education, more and deeper statistical education, less IT and programming vocational training.
posted by Fraxas at 10:51 AM on July 23, 2018 [4 favorites]


It's beyond the black box issues -- not to say that those are not issues -- all the way to an interesting paradox, as Zalzidrax puts it. Let's say that we're classifying tacos and pitas to see whether they're truly delicious. We develop a model, but we don't tell it anything about whether it's grading tacos or pitas, we're just telling it measures like the amount of meat, the freshness of the vegetables, the spiciness, moldiness and so on. And that we come up with a pretty good model; 80% of the things it says are "yummy" are actually delicious, while only 20% of the things it says are "yucky" are delicious. This sure seems like a fair model, and under some definitions it is.

But let's suppose that we give it 400 pitas and 400 tacos to classify, and it classifies 200 of the tacos but only 100 of the pitas as "yummy". We know the accuracy from above; 80% of the judgments are correct. If you crunch the math, it winds up that the results from the tacos are as you might expect; 20% of not-delicious tacos are classified as yummy incorrectly and 20% of actually-delicious tacos are classified as yucky and thrown away unjustly.

However, the pitas are a different story. Because they were disproportionately classified as being yucky, the results are also disproportionate; only 8% of not-delicious pitas are classified as being yummy, but a whopping 43%* of actually-delicious pitas are incorrectly classified as yucky and tossed out. That sure doesn't seem fair, yet it's the mathematical result of the above apparently fair and accurate model.

All these actually delicious pitas more than twice as likely to be thrown away haven't done anything wrong; they just share something that is correlated with other pitas more than with other tacos -- perhaps having pork increases the yumminess in the model, but there are just way more tacos with pork than pitas.

And that's not even getting into the fact that something like the criminal recidivism rate we are rating our models against is itself a biased measure; it doesn't measure how many people go on to commit another crime, it measures how many people go on to be arrested/convicted of committing another crime, which is a very different and not racially neutral thing itself.

* Of the 100 pitas classified as yummy, 80% or 80 are actually delicious. Of the 300 pitas classified as yucky, 20% or 60 are actually delicious. So there are 140 delicious pitas, 60 or 43% of them are classified as yucky.
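
The same arithmetic as a short script, for anyone who wants to try other splits. It assumes, as in the example, that 80% of "yummy" labels and 20% of "yucky" labels are actually delicious; the function and parameter names are made up.

```python
def error_rates(n_yummy_label, n_yucky_label, ppv=0.8, false_omission=0.2):
    """Return (false_positive_rate, false_negative_rate) for one food group.

    ppv: share of items labelled yummy that are actually delicious.
    false_omission: share of items labelled yucky that are actually delicious.
    """
    delicious_kept = ppv * n_yummy_label
    delicious_tossed = false_omission * n_yucky_label
    bad_kept = n_yummy_label - delicious_kept
    bad_tossed = n_yucky_label - delicious_tossed
    fnr = delicious_tossed / (delicious_kept + delicious_tossed)
    fpr = bad_kept / (bad_kept + bad_tossed)
    return fpr, fnr

taco_fpr, taco_fnr = error_rates(200, 200)  # 200 labelled yummy, 200 labelled yucky
pita_fpr, pita_fnr = error_rates(100, 300)  # 100 labelled yummy, 300 labelled yucky

print(f"tacos: {taco_fpr:.0%} of bad ones kept, {taco_fnr:.0%} of delicious ones tossed")
print(f"pitas: {pita_fpr:.0%} of bad ones kept, {pita_fnr:.0%} of delicious ones tossed")
```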
posted by Homeboy Trouble at 10:59 AM on July 23, 2018 [15 favorites]


I think, especially on a site like this one, there is a tendency to imagine that the issue of fairness in AI is exclusively a people problem. Just a bunch of tech bros being stupid and thoughtless, shoving data through scikit-learn without trying to understand it. Indeed, that does happen!

But even well-intentioned people who have a deep understanding of the data, and excellent training in statistics, have trouble reasoning about the fairness properties of their models. The whole thing is fraught with paradoxes, counterintuitive conclusions, and unintended consequences.
posted by vogon_poet at 11:05 AM on July 23, 2018 [7 favorites]


GenjiandProust: My concern with the black box elements as I understand them is that we don’t have any good way of saying why a correct answer is correct or an incorrect answer is incorrect, and that renders all answers a little suspect, Because it’s never good enough to have the right answer if you didn’t arrive via the right method.

I think your concern is correct, to a point. Certainly the output of your ML black box is only as good as the training set it was trained on - as in the famous but possibly apocryphal story of the neural network that was taught to find Russian tanks in images but learnt to identify sunny vs cloudy days instead.

Until you know what biases were frozen in by your training set, you can't be sure what you're missing (or, on refresh while typing, how many delicious pitas you're throwing away).

There are ways around it - multiple competing models, fixed percentage human verification - but they are all expensive. And if there's one thing we know about corporate America, it is that they *love* to spend money and add labor instead of half-assing it.
posted by RedOrGreen at 11:07 AM on July 23, 2018 [3 favorites]


What if instead of trying to tweak the machines we accept the fact that the resulting data is going to have noticeable bias and address it in post processing? Basically let the Machine Learning algorithms do the heavy lifting while leaving a much smaller task that is much more suited for humans at the end of it.
posted by Tell Me No Lies at 11:30 AM on July 23, 2018


Just a bunch of tech bros being stupid and thoughtless, shoving data through scikit-learn without trying to understand it.

Except that really is the whole problem, because they're trying to skip past the hard part of figuring out what they actually have and go straight to collecting shitloads of money.

In the immortal and timeless words of Dr. Ian Malcolm: "Before you even knew what you had, you patented it and packaged it and slapped it on a plastic lunch box, and now you're selling it, you want to sell it."
posted by tobascodagama at 11:33 AM on July 23, 2018


We develop a model, but we don't tell it anything about whether it's grading tacos or pitas, we're just telling it measures like the amount of meat, the freshness of the vegetables, the spiciness, moldiness and so on.

And this is assuming you can actually measure all the components that result in deliciousness, because one will only get measurements of things that are measurable; some of this is the result of there being no sensors built or deployed to measure such things; some of this will be the result of there being no possibility of currently even building the sensors. Usually in these cases, researchers tend to just ignore that which cannot be measured. With matters of taste, you'll end up with tacos and pitas that are kinda yummy being labelled delicious and actually mind-blowing delicious stuff being completely left out, sort of like how Max Martin can't seem to understand why "incorrect" songwriting results in awesome songs.
posted by eustacescrubb at 12:59 PM on July 23, 2018 [5 favorites]


Favorite book on the subject: Weapons of Math Destruction.
posted by Peach at 1:01 PM on July 23, 2018 [2 favorites]


Max Martin can't seem to understand why "incorrect" songwriting results in awesome songs.

I recently attended a talk on this very subject at a computer science conference. The core of the talk was the theory (proposed by David Huron) that music appreciation is inherently statistical learning: we learn the rules of the music we're accultured to by repeated listening, and know what to expect in terms of rhythm, tonality and such. And secondly, a piece of music, to be satisfying, has to violate expectations just so much: if it is completely predictable, it is dull, but if it breaks too many of the statistical rules, the listener doesn't get it. The speaker, Chris Ford, stated it as “prediction delights your thinking-fast brain, while novelty keeps your thinking-slow brain interested”.

(The conference in question, Curry On, has a strong leaning towards functional programming, so the gist of the talk was that, while some people have applied dependent types in languages like Haskell and Idris to making music that violates rules unrepresentable, that is inherently a bad idea if the music is to be listened to.)
posted by acb at 4:10 PM on July 23, 2018 [4 favorites]


If our model is reifying our world to help us make more efficient decisions then it will track to the world we have created.

/Dr. Phil voice
posted by nikaspark at 4:29 PM on July 23, 2018 [1 favorite]


However, the pitas are a different story. Because they were disproportionately classified as being yucky, the results are also disproportionate; while only 8% of not-delicious pitas are classified as being yummy, but a whopping 43%* of actually-delicious pitas are incorrectly classified as yucky and tossed out.

My brain isn't working but this sounds really interesting... could you assist with the math? I can't quite get to the 43%. We assume every judgement the machine made is 80% accurate -

200 tacos classified as yummy - 160 are yummy, 40 are yucky
200 tacos classified as yucky - 160 are yucky, 40 are yummy

100 pitas classified as yummy - 80 are yummy, 20 are yucky
300 pitas classified as yucky - 240 are yucky, 60 are yummy

Total stats in reality, 200 yummy tacos, 200 yucky tacos, 140 yummy pitas, 260 yucky pitas.

The machine said 50% of tacos are yummy. This is correct, 50% of tacos are yummy.

The machine said only 25% of pitas are yummy. This is incorrect, 35% of pitas are yummy.
posted by xdvesper at 4:29 PM on July 23, 2018


A total of 200 tacos are yummy.
40 of the yummy tacos are classified yucky.
20 percent of actually yummy tacos are discarded as yucky.

140 pitas are yummy.
60 yummy pitas are classified yucky.
43 percent of actually yummy pitas are discarded as yucky.

So even if the machine is only wrong 20% of the time, yummy pitas are more than twice as likely as yummy tacos to be incorrectly thrown out. (And that's not getting into what in the data is causing these results, or whether there really are so few yummy pitas, or whether throwing out lunches in the first place is the ideal solution even for unpalatable ones...!)
posted by goblin-bee at 3:39 AM on July 24, 2018


TIL geoff hinton is at U of T because he didn't want DoD dollars.
posted by kliuless at 9:26 AM on July 29, 2018



