An Unprecedented Feat of Tedious and Repetitive Labor
June 21, 2023 10:45 AM   Subscribe

Much of the public response to language models like OpenAI’s ChatGPT has focused on all the jobs they appear poised to automate. But behind even the most impressive AI system are people — huge numbers of people labeling data to train it and clarifying data when it gets confused. Only the companies that can afford to buy this data can compete, and those that get it are highly motivated to keep it secret. The result is that, with few exceptions, little is known about the information shaping these systems’ behavior, and even less is known about the people doing the shaping. from AI Is a Lot of Work [Intelligencer; ungated]
posted by chavenet (21 comments total) 26 users marked this as a favorite
 
You might miss this if you believe AI is a brilliant, thinking machine. But if you pull back the curtain even a little, it looks more familiar, the latest iteration of a particularly Silicon Valley division of labor, in which the futuristic gleam of new technologies hides a sprawling manufacturing apparatus and the people who make it run. Duhaime reached back farther for a comparison, a digital version of the transition from craftsmen to industrial manufacturing: coherent processes broken into tasks and arrayed along assembly lines with some steps done by machines and some by humans but none resembling what came before.

What an interesting article, thanks for sharing. I've tried a bit of ChatGPT here and there, and it always seemed like much more work to get something usable, in terms of asking questions and refining, than if I just wrote the thing myself. I have to send a lot of unique but semi-similar messages, and the small customizations are often the worst part, but it has proven unable to tackle those. To say nothing of the fact that a lot of what we think of as automated is often not. Not exactly the same, but I'm reminded of the movie Kimi, and how media is showing a little bit behind the curtain.
posted by Carillon at 12:25 PM on June 21, 2023 [1 favorite]


It's a bit of an open secret in the burgeoning AI industry that many mainstream companies are doing this while trying to obfuscate the fact. Many of these are near unicorn status already. Here's some example marketing copy:

"We are proud to be trusted by leading companies to provide a data-centric, end-to-end solution to manage the entire ML lifecycle. Combining cutting-edge technology with operational excellence, we help teams develop the highest-quality datasets."
posted by thoughtful_jester at 1:06 PM on June 21, 2023 [1 favorite]


I already knew there were human hands mediating pretty much everything in AI/ML —that's why it will always be biased — but when I went searching for info on labeling the other day, a LOT of jobs came up. Made me wonder what they pay (since the funding hurricane is over that island these days), but I didn't check.
posted by rhizome at 1:35 PM on June 21, 2023 [4 favorites]


Huge swathes of Data Theft AND the digital equivalent of sweatshops?

Cyberpunk Dystopia all the way down.
posted by Faintdreams at 1:56 PM on June 21, 2023 [3 favorites]


I’m going to use this to create a dataset called, ‘ai is bullshit, please stop hyping it’
posted by The River Ivel at 2:24 PM on June 21, 2023 [1 favorite]


According to the article, the jobs pay anything from less than $1 per hour (image labelling in Kenya after you spend a long time on the (unpaid) training course only to see the job evaporate after a few mins of work) to $30-$50 per hour (being recruited as a native English speaker in the US to chat with ChatGPT and rate its responses). So like anything else it codifies already existing inequalities in pay.

There's an interesting bit towards the end too about Kenyan workers using VPNs to pretend to be Malaysian or Filipino workers, because the pay in those areas is better. Which given the recent unionization effort of African content moderators for AI just sounds like union busting on a massive scale.

One enterprising guy is using one AI to label images for another AI; are the people looking at model collapse worried about that?
posted by subdee at 2:32 PM on June 21, 2023 [8 favorites]


The closing anecdote makes me feel like we are at the end of the golden age of AI training, soon the data sources and feedback systems will get clogged and sclerotic. Generated data is always some weird smoothed version of the training data, so your messy and interesting tails are obliterated. It will just be an ouroboros of AI chewing up garbage and shoveling it into the next model.
posted by crossswords at 2:35 PM on June 21, 2023 [6 favorites]
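The "smoothed version of the training data" intuition above can be demonstrated with a toy simulation (a hypothetical sketch, not from the article): repeatedly fit a Gaussian to samples drawn from the previous generation's fit, and the fitted spread collapses, which is exactly the "tails obliterated" effect.

```python
import random
import statistics

def one_lineage(generations=200, n_samples=20, rng=None):
    """Fit a Gaussian to samples, resample from the fit, repeat.

    Each 'generation' is a model trained only on the previous
    model's output. Returns the final fitted standard deviation.
    """
    rng = rng or random.Random()
    mu, sigma = 0.0, 1.0  # the "real" data distribution
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.mean(samples)
        sigma = statistics.stdev(samples)  # next model is fit to generated data
    return sigma

rng = random.Random(0)
finals = [one_lineage(rng=rng) for _ in range(100)]
# Across lineages, the fitted spread typically shrinks far below the
# true sigma = 1: the model-of-a-model forgets the original tails.
print(statistics.median(finals))
```

The mean is an unbiased estimate at every step, yet the variance still drifts toward zero, because estimation noise compounds multiplicatively across generations. That is the ouroboros in miniature.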


PPS "millions" seemed like too many but just one company in the article has apparently 100,000 workers. Once you get to "billions" it seems like you'd really have a problem like the one Amazon has, where you exhaust the entire supply of workers who would do this for you.
posted by subdee at 2:35 PM on June 21, 2023 [1 favorite]


So now people can have AI write their term papers and articles for them, all on the backs of underpaid labor. Spectacular!
posted by grumpybear69 at 3:07 PM on June 21, 2023 [7 favorites]


The closing anecdote makes me feel like we are at the end of the golden age of AI training, soon the data sources and feedback systems will get clogged and sclerotic.

So far all I have experienced from AI search, Bing in particular, is regurgitated low quality content farm material in a chatbot form. So you don't even need the recursive AI training problem to get bad results. Training on the terrible keyword-spammy current state of the web pre-AI is just as bad.
posted by srboisvert at 3:52 PM on June 21, 2023 [2 favorites]


> image labelling in Kenya after you spend a long time on the (unpaid) training course only to see the job evaporate after a few mins of work

i'm not in Kenya but i tried out for a "content labeling" job a couple years ago that was "training" then an "exam", and maybe i'm just no good at it, but it sure felt like it was a trick to get you to work an hour or two for free and then dump you
posted by glonous keming at 3:58 PM on June 21, 2023 [5 favorites]


I read this earlier today and I just kept thinking - why? Like, why spend all this time and money having people correct chatbot language when you could just maybe pay those people to answer questions in chat apps? If the only way these models know the concept "shirt" is a million poorly paid people in the global South clicking "this is a shirt" a thousand times a day, when a real live human gets it right away even as styles, colors, and shapes change - how are the models ever going to get better at that? And if they can't, then what do we need them for?

And I kept linking it in my head to a piece I read a week or two ago - I completely forget where now - the central argument of which was that the CEOs of these "AI" companies are currently going around campaigning "for" regulation of their industry mainly because creating fear of it makes them powerful, which brings in more money - no matter the actual capabilities or limitations of their product. Like, maybe this is just the next crypto that seems potentially revolutionary until it turns out to be a massive con all along.
posted by dnash at 4:48 PM on June 21, 2023 [3 favorites]


Maybe?
posted by Ickster at 5:39 PM on June 21, 2023 [2 favorites]


though GPT-4 can generate complex and convincing prose, it can’t pick out which words are adjectives

Yes it can. It can tag parts of speech, though by no means 100% reliably.
posted by The Half Language Plant at 6:12 PM on June 21, 2023 [1 favorite]


Last week at the very tail end of the Stable Diffusion thread I wrote an incredibly long and detailed comment trying to track down where Stable Diffusion gets its training materials from across three research papers, which I’ll semi-TL;DR here:

Stability AI (the company behind Stable Diffusion, which is open source, btw) funded LAION - an AI non-profit - to assemble a training set. Now, OpenAI had released the model for CLIP - its image-text matching tool, used for things like scoring captions against images and flagging NSFW content - but not the 400 million image-text pairs it was trained on.

LAION reconstructed a stand-in for that training set via a highly novel and arguably entirely legal means: they grabbed hundreds of millions of image+alt-text pairs from CommonCrawl and used the released CLIP model to filter out pairs where the text didn't actually match the image, ending up with their own 400-million-pair set. Then they trained a CLIP replica on that set to see if the end result was functionally identical to OpenAI's.

Within a few percent on various benchmarks, it was. Later they assembled a 5.8 billion image set the same way (Stable Diffusion uses the English subset) and shaved those same benchmark gaps down to mostly fractions of a percent.

They effectively hoodwinked a near-identical set out from under OpenAI, without OpenAI having any real legal basis to object too strongly - especially because, if you take the time to read through OpenAI's released paper on CLIP, you'll find that the stated source of their 400M images was... "the Internet."

No further details, and the training set remains private to this day.

In one sense LAION are the bad guys: they're just taking everything they can lay hands on - effectively every image >5KB with >5 words of alt-text on the entire fucking Internet circa 2021 - and using that to train an automated artist without seeking permission from any of the human artists involved. In another sense they're very much the good guys, because every other group doing this sort of thing keeps its training set a firmly guarded secret. For as long as that is true, we will never be able to directly compare the results of different training sets, never empirically gauge the significance of various artists' works or evaluate and eliminate racial and other biases; LAION took the first practical step towards addressing this by putting an open set out there.

Not sure how I feel about all this, but I thought it was a pretty neat trick in any case.
posted by Ryvar at 7:00 PM on June 21, 2023 [3 favorites]
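The trick described above - recovering a near-copy of a model whose training data you can't see, just by querying it - is essentially distillation via pseudo-labeling. Here is a minimal, hypothetical sketch (a toy "teacher" and "student", nothing LAION- or CLIP-specific): label your own scraped data with the black-box model, then fit your own model to those labels.

```python
# Toy distillation-by-pseudo-labeling: the "teacher" is a black-box
# classifier whose training data we never see; we only get to query it.

def teacher(point):
    """Black-box model: labels a 2-D point by the sign of its x-coordinate."""
    return 1 if point[0] > 0 else 0

# "Scraped" unlabeled data: a deterministic grid with clusters on
# either side of x = 0 (standing in for images pulled off the web).
scraped = [(x / 10.0, y / 10.0)
           for x in range(-20, 21) if abs(x) >= 5
           for y in range(-10, 11)]

# Step 1: pseudo-label the scraped data by querying the teacher.
pseudo = [(p, teacher(p)) for p in scraped]

# Step 2: train a student on the pseudo-labels
# (here: a nearest-centroid classifier).
def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

c0 = centroid([p for p, lbl in pseudo if lbl == 0])
c1 = centroid([p for p, lbl in pseudo if lbl == 1])

def student(point):
    d0 = (point[0] - c0[0]) ** 2 + (point[1] - c0[1]) ** 2
    d1 = (point[0] - c1[0]) ** 2 + (point[1] - c1[1]) ** 2
    return 1 if d1 < d0 else 0

# The student matches the teacher on held-out points without ever
# having seen the teacher's original training set.
held_out = [(-1.5, 0.3), (0.8, -0.7), (1.9, 1.0), (-0.6, -0.9)]
agreement = sum(student(p) == teacher(p) for p in held_out) / len(held_out)
```

The design point is that nothing in the student depends on the teacher's secret training data - only on its answers, which is why a released model effectively leaks a usable substitute for its dataset.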


Model Collapse
posted by blue shadows at 11:43 PM on June 21, 2023 [1 favorite]


I wanna say the movie Multiplicity illustrates this, in a way.
posted by rhizome at 2:30 AM on June 22, 2023 [2 favorites]


AI is private control of knowledge. thats it.
posted by AlbertCalavicci at 4:17 AM on June 22, 2023 [4 favorites]


Why spend all this time and money having people correct chatbot language when you could just maybe pay those people to answer questions in chat apps?

-They do not want to pay human beings in the long term. I think when people talk about the labor behind AI training there is kind of a misunderstanding of what that means. For the most part this kind of labor is not needed to run the software that people are currently using, it is building the datasets to build the next version of the software. In the long term they will need some amount of human labor, but they are not going to be providing jobs for the people they are hiring for this initial push for long.

-The kinds of things these models are actually useful for would require a *lot* of human effort and specialized knowledge. People are not asking ChatGPT what a shirt is; they are asking it to summarize an English-language academic paper on textile manufacturing into their native language, to put together a list of research topics and resources to start learning about 16th century fashion, or to write an epic poem in the style of John Milton about their mom's ugly sweater - and right now there is something like 10 million of these questions a day.
posted by St. Sorryass at 4:52 AM on June 22, 2023 [1 favorite]


If the only way these models know the concept "shirt" is a million poorly paid people in the global South clicking "this is a shirt" a thousand times a day

That is, if I'm not mistaken, the whole idea behind image ReCAPTCHA. Except nobody gets paid and it is everyone, everywhere.
posted by grumpybear69 at 6:09 AM on June 22, 2023 [4 favorites]


"a digital version of the transition from craftsmen to industrial manufacturing"

And even industrial manufacturing requires a great deal of human labor, exploited and carefully hidden from consumer view though it may be. I'm thinking primarily about clothing manufacture, including but not limited to fast fashion, where the low price, ubiquity, and speed of turnover obscure the fact that nearly every piece of clothing we wear is made by human hands.

In a lot of ways, the profusion among the general public of the idea that "AI" is some magical machine that will make people redundant is an extension of our general disconnect from the real human labor that maintains the global economy. Current AI really manifests (unbodies) alienation from the means of production.
posted by radiogreentea at 8:23 AM on June 22, 2023 [2 favorites]




This thread has been archived and is closed to new comments