The Code We Can’t Control
January 15, 2015 8:31 AM

David Auerbach for Slate discusses the dangers of the algorithm-driven data collection and organization of Big Data in a review of law professor Frank Pasquale's book on the subject, The Black-Box Society.

Both raise several issues, from what responsibility should exist for algorithms that produce biased results, to the extreme and questionable niches that Big Data is filing people into in the pursuit of revenue through data collection, to the lack of transparency and accountability over the entire system.
posted by NoxAeternum (31 comments total) 44 users marked this as a favorite
 
If you ask an engineer, “Why did your program classify Person X as a potential terrorist?” the answer could be as simple as “X had used ‘sarin’ in an email,” or it could be as complicated and nonexplanatory as, “The sum total of signals tilted X out of the ‘non-terrorist’ bucket into the ‘terrorist’ bucket, but no one signal was decisive.” It’s the latter case that is becoming more common, as machine learning and the “training” of data create classification algorithms that do not behave in wholly predictable manners.
posted by infini at 9:45 AM on January 15, 2015 [3 favorites]
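
To make the second kind of answer concrete, here is a minimal Python sketch of a "sum of signals" score of the sort the quoted engineer describes. Every signal name, weight, and threshold below is invented for illustration; none of it comes from the article.

# Hypothetical signals scored for one person (all values made up).
signals = {
    "used_flagged_keyword": 0.9,   # e.g. "sarin" appeared in an email
    "visited_flagged_site": 0.6,
    "contact_overlap":      0.4,
    "unusual_travel":       0.3,
}
# Hypothetical weights learned from training data.
weights = {
    "used_flagged_keyword": 0.35,
    "visited_flagged_site": 0.25,
    "contact_overlap":      0.25,
    "unusual_travel":       0.15,
}
THRESHOLD = 0.5  # arbitrary cut-off between the two "buckets"

score = sum(weights[k] * signals[k] for k in signals)
bucket = "terrorist" if score > THRESHOLD else "non-terrorist"
print(round(score, 3), bucket)

The score ends up over the threshold even though no single weighted signal comes close to crossing it on its own, which is why "no one signal was decisive" is a truthful but unsatisfying explanation.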


It's not violation of due process, application of stereotypes, racism, etc. if the machine learned it from the data, right?
posted by jeffburdges at 9:48 AM on January 15, 2015 [2 favorites]


jeffburdges, who fed the data and designed the frame of reference for teaching it? Implicit assumptions and unquestioned biases are apparently being amplified, per the article. It's probably all legal.
posted by infini at 9:51 AM on January 15, 2015


I think a lot of people now have a sense that programs like this are classifying us, correctly or incorrectly, and that we're not as comfortably obscure as we were. This isn't surprising, and most people weren't really surprised that this was not only possible, but was happening. Why wouldn't it be happening?

What we need in order to understand this more clearly as a problem are convincing stories about the negative implications of this kind of data handling. Not only about people being possibly miscategorized as possible terrorists, but about how stuff like this can impact _most_ people in their everyday lives.

There needs to be a vivid picture of why this leads to X which leads to bad stuff happening that would affect Jill, Dave, or someone's Mom or child -- stuff most people would care about enough to get them to act. (Then also some clear way to act, of course). I'm talking about possible effects beyond being shown inappropriate advertisements, or feeling creeped out. We mostly get that feeling creeped out is legitimate, but we're not sure how to explain _why_ it's creepy -- what we're afraid of, in other words, in concrete terms.

The idea of classing people as terrorists is of course worth fighting against, but surely there are some more immediate effects to guard against (even if those effects are just logical follow-ons from classifying a bunch of people as possible terrorists). Just, can anyone spell those out?

Right now, most of the information on this that I've seen is convincing, yes, but not really connected to _effects_ in everyday life. In other words, I'm not sure how to persuade someone to care, although I do care myself.
posted by amtho at 9:57 AM on January 15, 2015 [4 favorites]


In other words, I think I'm looking for good stories about people (either factual, or, more likely, highly-plausible hypothetical stories).
posted by amtho at 9:58 AM on January 15, 2015


infini: If you ask an engineer, “Why did your program classify Person X as a potential terrorist?” the answer could be as simple as [1] “X had used ‘sarin’ in an email,” or it could be as complicated and nonexplanatory as, [2] “The sum total of signals tilted X out of the ‘non-terrorist’ bucket into the ‘terrorist’ bucket, but no one signal was decisive.”
If this were a person instead of a computer algorithm making the decision, we would label #1 as being "stereotyping, racist, or prejudiced"; #2 describes the sort of decision-making process a rational, unbiased person would have to use.

Weirdly, it's framed as if the trend toward #2 is a bad thing:
infini: It’s the latter case that is becoming more common, as machine learning and the “training” of data create classification algorithms that do not behave in wholly predictable manners.
I suspect this is a combination of a lack of in-depth understanding of what is happening, along with a feeling of helplessness that most people have around automated systems (cue all the people who won't trust GPS because it sometimes gets it wrong - never mind how often they themselves get it wrong).

Although, as an innocent victim of the system, I'd much prefer a clean answer a la #1 ("Sir, you're on the no-fly list because your name looks like the name of a terrorist - Sayyid Iambroom Epstein"), the latter is preferable overall (This expectant mother employed as a janitor googled "How to kill rats before the President visits my office"; despite containing the key words "How to kill the President", this doesn't rise to the level of a credible threat).
posted by IAmBroom at 10:10 AM on January 15, 2015 [1 favorite]


Just clarifying that IAmBroom is quoting my comment quoting the article, and not anything written by me. I fear the format structure may miscommunicate the source.
posted by infini at 10:12 AM on January 15, 2015


True that, but that quote was the entirety of your post. If you meant to disagree with it in any way, you forgot to do so.
posted by IAmBroom at 10:18 AM on January 15, 2015


You could have just referenced the OP directly. I was highlighting a matter of concern that is important to me: that inarguable algorithms can decide the fate of people, with the matter hanging on the balance of tipping scales and weighted ranking tables. Just as a scan of this conversation thread would. Hence the clarification.
posted by infini at 10:21 AM on January 15, 2015


Weirdly, it's framed as if the trend toward #2 is a bad thing:

I think the reason they call it a bad thing is that:
1. learning-based methods encode the racism that is present in the training data
2. in a way that may not be amenable to correction because it is inscrutable
posted by a snickering nuthatch at 10:22 AM on January 15, 2015 [13 favorites]
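
A toy illustration of point 1, built on synthetic data with scikit-learn (every feature name and number is an assumption made for the sketch, not a claim about any real system): a model trained on historically biased labels reproduces the bias even when the protected attribute itself is withheld, because correlated proxy features carry the same information, which is where point 2 starts to bite.

# Synthetic example: biased labels leak into the model through proxies.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)                  # protected attribute (never shown to the model)
zip_code = group + rng.normal(0, 0.3, n)       # proxy feature correlated with group
income = rng.normal(50 - 10 * group, 5, n)     # another correlated feature
# Historical labels are biased against group 1, independent of anything else:
approved = (rng.random(n) < np.where(group == 1, 0.2, 0.6)).astype(int)

X = np.column_stack([zip_code, income])        # group itself is excluded
model = LogisticRegression(max_iter=1_000).fit(X, approved)
pred = model.predict_proba(X)[:, 1]
print("mean predicted approval, group 0:", round(pred[group == 0].mean(), 2))
print("mean predicted approval, group 1:", round(pred[group == 1].mean(), 2))
# The gap persists even though "group" was never a feature.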


It's still early in the game, but I hold out hope (speculate) that algorithms will eventually be a Sixth Amendment issue, like Breathalyzer source code.
posted by rhizome at 10:23 AM on January 15, 2015 [2 favorites]


If this were a person instead of a computer algorithm making the decision, we would label #1 as being "stereotyping, racist, or prejudiced"; #2 describes the sort of decision-making process a rational, unbiased person would have to use.

Weirdly, it's framed as if the trend toward #2 is a bad thing:


That's because #2 is neither of the things you attribute to it. Just because you are using a defined ruleset to make a determination doesn't mean that ruleset is free of bias. And the argument that you are using such a system acts as obscuration, as we know nothing about the rules being employed.
posted by NoxAeternum at 10:27 AM on January 15, 2015 [3 favorites]


Algorithms don't kill people, people kill people (and write algorithms).
posted by tempestuoso at 10:41 AM on January 15, 2015


Every time I hear "big data", I think of the house of horrors - the "scientific" bombing of Vietnam, Mutually Assured Destruction, etc. - that mid-20th c. systems theory, the analog(ish) precursor of big data driven decision-making, brought us. The same bankrupt, suicidal Enlightenment thinking is behind both of them.
posted by ryanshepard at 10:50 AM on January 15, 2015 [2 favorites]


Every time I hear "big data", I think of...

You should just think "Marketing term made up to sell shit". There is nothing new under the sun with this, except shitloads of CFOs writing giant checks for some of this "Big Data" stuff that all the other companies are buying.
posted by sideshow at 11:03 AM on January 15, 2015 [2 favorites]


I liked this review and will have to check out the book. But it seems weird that, in reviewing a book one of whose central claims (presumably) is that a bad effect of algorithmic decision-making is that it obscures the exercise of power, Auerbach's take at times seems almost to adopt a kind of obscurantism about corporate power:
Yet Pasquale underestimates the degree to which even those on the inside can’t control the effects of their algorithms. As a software engineer at Google, I spent years looking at the problem from within, so it’s not surprising that I assign less agency and motive to megacorporations like Google, Facebook, and Apple. In dealing with real-life data, computers often fudge and even misinterpret, and the reason why any particular decision was made is less important than making sure the algorithm makes money overall. Who has the time to validate hundreds of millions of classifications? Where Pasquale tends to see such companies moving in lockstep with the profit motive, I can say firsthand just how confusing and confused even the internal operations of these companies can be.
I guess if this is just meant to amount to "it's complicated," to note that Pasquale oversimplifies corporate decision-making, then that's a fine corrective and of course a book review isn't required to do anything more than that. But ultimately, just noting that it's complicated can't substitute for the analysis of power that we're going to need here.
posted by RogerB at 11:09 AM on January 15, 2015 [1 favorite]


In case people think the hypothetical at the beginning of the article is really just hypothetical, I've seen exactly this scenario in industry before:

Our group trained a classifier to predict whether or not a given person was a credit risk. We used a technique called classification trees, which (simplified) finds the single variable that best splits people into two groups based on whether or not they turned out to be a credit risk, and then repeats the process for each of those two groups separately. This spits out a series of questions like "is the person male", "is the person's salary greater than 40k", etc. The root of the decision tree we actually trained was based on the question "is this person a member of <name of social network aimed at a specific ethnic group>". We ended up not telling the client about the classifier.

But suppose no one had looked at the classifier we trained; we never would have known how it was making decisions. With lots of classification algorithms, we wouldn't even have been able to interpret the results meaningfully (say, where "credit risk" means someone's data lies on one side of a hyperplane in an infinite-dimensional space, where distance is measured using the Gaussian radial basis function).

I don't know what the right general solution is. Saying "don't ever use black-box statistical techniques" seems like a strange moral imperative, but not being able to look inside black boxes sometimes leads to immoral results.
posted by The Notorious B.F.G. at 11:12 AM on January 15, 2015 [12 favorites]
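
For readers curious what that inspection step can look like, here is a hedged re-creation with synthetic data and invented feature names; it is not the actual classifier described above, just a sketch of how one might discover which question a trained tree asks first.

# Sketch: train a decision tree on made-up data and inspect its root split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 5_000
feature_names = ["member_of_ethnic_social_network", "salary_over_40k", "is_male"]
X = rng.integers(0, 2, size=(n, 3))
# Synthetic labels that correlate most strongly with the first column,
# standing in for the biased historical outcomes in the anecdote:
y = (0.7 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, n) > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("root split:", feature_names[tree.tree_.feature[0]])   # the question asked first
print(export_text(tree, feature_names=feature_names))        # the full rule set, readable

Had no one printed the tree, the troubling split would have shipped unnoticed, which is the point of the anecdote.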


Our group trained a classifier to predict whether or not a given person was a credit risk. We used a technique called classification trees, which (simplified) finds the single variable that best splits people into two groups based on whether or not they turned out to be a credit risk, and then repeats the process for each of those two groups separately.

It seems like decision trees would be promising candidates for dealing with this issue. At least you had the ability to inspect and understand the output, giving you the option of not telling the client about the racist classifier.

That's not to say that techniques based on do-a-bunch-of-linear-algebra are entirely opaque. But you have to look harder, and so there would have to be some will to exert that effort.
posted by a snickering nuthatch at 11:37 AM on January 15, 2015
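
One way to "look harder" at a less interpretable model, sketched here with synthetic data: permutation importance works on any fitted estimator, including an RBF-kernel SVM of the kind mentioned above, by measuring how much shuffling each feature hurts accuracy. The feature count and labels are invented for illustration.

# Sketch: probing an opaque model by permuting one feature at a time.
import numpy as np
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(2_000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # only the first two columns matter

svm = SVC(kernel="rbf").fit(X, y)
result = permutation_importance(svm, X, y, n_repeats=5, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: importance {score:.3f}")
# A suspiciously influential proxy feature would stand out here even though
# the decision surface itself is hard to read directly.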


*Sigh*

Millions of users every day willingly take personality, belief, and preference tests, just on Facebook alone. For a laugh I took a linguistics test, with questions designed to reveal regional dialect, and also where I grew up and my cultural traits. I sensed they were building an algorithm to establish truth of identity. The way I answered the questions, I was pegged as African American from the Midwest; back to the drawing board. I am sure this algorithm has sharpened up since. Since the testers knew who I am and my background, my taking the test helped the build. It should be airtight, now that a year has passed.

The collection of data about American schoolchildren, in school, by companies who were first in the data-mining business, really disturbs me. This, along with medical record keeping, becomes the cradle-to-grave creation of identity definition, by authors shifting and unknown. If the value to business is potential gain, or the uptake of promising servants to the system, or the blatant forwarding of some elite variant, well, that is what it is for, no matter what other cause it serves tangentially.

Cowardice and greed rule this world for now; behold the shifting price of oil and the rising Swiss franc, and other sudden-onset modifiers of the overall state of human reality. Established algorithms have caused the fortunes of Iran and Russia to drop and slowed the flow of extra capital in the ME. Who has that program in place?

I am deliberately worthless, should be invisible, still I open my mouth. "Die gedanken sent frei," I can misspell in German too, the collection of all this data is certainly the coward's or the voyeur's route. Our children are out on the target range.
posted by Oyéah at 11:58 AM on January 15, 2015 [5 favorites]


#2 describes the sort of decision-making process a rational, unbiased person would have to use.

What you've overlooked is that people write the algorithms that are used to construct the buckets. An analogy would be a car company responding to their cars blowing up by saying that the robots on the assembly line built the cars, so sue them instead. The algorithms are robots.
posted by rhizome at 12:13 PM on January 15, 2015


This is mostly correct but not entirely. In a lot of cases the algorithms are general-purpose and open-source, written by graduate students. It's the plugging of specific data into the algorithm that makes it a weapon of mass mis-classification. When cars blow up is the car company at fault or the manufacturer of the robot that built the car? Liability most often rests with the tool user, not the tool maker.
posted by simra at 12:43 PM on January 15, 2015 [2 favorites]


There's probably some sort of mineable joke in here about how GamerGate, which David Auerbach has been associating and associated with, is probably providing an absolute goldmine for databases looking for evidence of threatening behavior on social media to fit into their algorithms. It would be sort of amazing if the long-term impact of GamerGate was to wrench algorithmic threat detection away from targeting ethnic minorities...
posted by running order squabble fest at 1:00 PM on January 15, 2015 [2 favorites]


I think there is a 'garbage in, garbage out' problem, too. Someone has to pick a "correct" looking dataset to train these algorithms on. That's tricky business even if one is trained in statistics and trying quite hard. My guess is that most businesses aren't trying any harder than it takes to make money.
posted by Zalzidrax at 2:10 PM on January 15, 2015 [2 favorites]


For my daily serving of Oh God We Are Living In The Future, today I found out the book A Legal Theory for Autonomous Artificial Agents exists and is apparently applicable to things that actually exist right now. Good to know.
posted by eykal at 2:15 PM on January 15, 2015 [2 favorites]




What do they want to do about what they want to know about me?
posted by Oyéah at 7:29 PM on January 15, 2015




They can always take the personal approach, and upon waking the comments indicate they know even about your childhood, yeah the fentanyl wears off at different rates and if you keep a blank look they reveal a lot about what they just learned from you, that you don't remember volunteering, how or why you would have, and in what context. Not to get really dark here...
posted by Oyéah at 8:35 PM on January 15, 2015


You can't tell what features of the training data the algorithm will consider significant. Back in the '80s, I heard of an attempt to train a neural network to recognize camouflaged Russian tanks. It performed wonderfully on the test data, but failed miserably on new data. Eventually they figured out that the network had learned to recognize pictures of sunny days!
posted by monotreme at 10:09 PM on January 15, 2015
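
A toy version of that failure, with synthetic data and invented feature names (a sketch, not the original experiment): a spurious feature that happens to be perfectly correlated with the label in the original photo sets predicts nothing in newly collected data, so the apparent accuracy collapses.

# Sketch: a confounded training set teaches the model the wrong thing.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_data(n, confounded):
    has_tank = rng.integers(0, 2, n)
    # In the confounded batch, tank photos were all taken on cloudy days:
    sunny = 1 - has_tank if confounded else rng.integers(0, 2, n)
    texture = 0.3 * has_tank + rng.normal(0, 1.0, n)   # weak genuine signal
    return np.column_stack([sunny, texture]), has_tank

X_train, y_train = make_data(2_000, confounded=True)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy on more photos from the same batch:",
      round(model.score(*make_data(2_000, confounded=True)), 2))
print("accuracy on newly collected photos:",
      round(model.score(*make_data(2_000, confounded=False)), 2))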


Big Data is not a magic cure-all, and the danger of it is that people don't really understand what it means. All "big data" does (basically, cutting out a lot of clever technical stuff) is find significant correlations between things. As a result, it absolutely will racially stereotype people, because people in different racial groups tend to share characteristics due to historical and current biases (people of different ethnicities tend to have different average incomes, for example) and cultural norms.

The answer is to use the model as a guide, not as a cure-all, especially when big decisions are made. Someone should never be put on a no-fly list because an algorithm said so (and given the low prevalence of terrorists, I would be genuinely astonished if any model could predict that someone was a terrorist with particularly high probability), but big data could produce a list of people that a human being could investigate.

Although I think that last point is interesting. There is an idea here that algorithms can be cold, unfeeling and wrong. Which is true. But so can humans. Let's say my algorithm says Bob Jones has a 20% chance of being a terrorist, so agent Smith investigates Bob Jones. He reads his Facebook, checks out what books he's got from the library, and decides, yup, he could well be a terrorist! He pops Jones on the no-fly list, job done. The problem here isn't necessarily the process, it's the lack of accountability.
posted by Cannon Fodder at 1:17 AM on January 16, 2015 [1 favorite]
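
The parenthetical point about low prevalence can be made concrete with a quick base-rate calculation; all three numbers below are assumptions chosen for illustration, not figures from any real system.

# Base-rate sketch: even a very accurate classifier flags mostly innocent people
# when the thing it is looking for is extremely rare.
prevalence = 1e-5        # assume 1 in 100,000 people is actually a threat
sensitivity = 0.99       # assumed true-positive rate
false_positive = 0.01    # assumed false-positive rate

p_flagged = sensitivity * prevalence + false_positive * (1 - prevalence)
p_threat_given_flag = sensitivity * prevalence / p_flagged
print(f"P(actual threat | flagged) = {p_threat_given_flag:.4f}")   # roughly 0.001

With those assumed rates, roughly 999 out of every 1,000 flagged people are not threats, which is why handing the list to an accountable human investigator matters so much.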


The Cathedral of Computation, by Ian Bogost.
posted by twirlip at 3:03 PM on January 16, 2015




This thread has been archived and is closed to new comments