Inside the Black Box
April 19, 2023 9:59 AM

Inside the secret list of websites that make AI like ChatGPT sound smart: (Archive) A WaPo analysis of the C4 dataset used in training large language models like ChatGPT, LLaMA, and others.

For the record, metafilter.com makes up 1.3 million tokens, or 0.0009% of the dataset, followed by ask.metafilter.com (4.1 million tokens or 0.003%), fanfare.metafilter.com (1.3 million tokens or 0.0008%), metatalk.metafilter.com (460,000 tokens or 0.0003%), bestof.metafilter.com (19,000 tokens or 0.00001%), and labs.metafilter.com (1,300 tokens or 0.0000008%). Nothing from Projects, Music, IRL, Podcast, or Jobs, however.
posted by Cash4Lead (60 comments total) 39 users marked this as a favorite
 
I am somewhat annoyed to learn that I am 0.00007% of the problem.
posted by mhoye at 10:05 AM on April 19, 2023 [28 favorites]


Aha - knew I had to be in the set because when you ask it about certain beer topics, it pops back phrases I know I've written!
posted by drewbage1847 at 10:21 AM on April 19, 2023 [30 favorites]


Maybe it has an account here and is asking for a friend.
posted by Brian B. at 10:27 AM on April 19, 2023 [19 favorites]


Hmmmm…. If you want to compile all human knowledge about everything, and assuming you also want that compilation to be based on facts (truth, in other words), then how does scraping these sites support that effort? And since the scraping process appears to lack any sort of heuristic for determining the validity of what it compiles, can I assume that these large language models don’t represent a compilation of human knowledge, and instead represent a compilation of human verbal excreta? Sort of like using the input to sewage treatment plants as the basis for all analysis of human culture?
posted by njohnson23 at 10:27 AM on April 19, 2023 [8 favorites]


I'm gonna need to hear more about that, Drewbage1847
posted by rebent at 10:28 AM on April 19, 2023 [1 favorite]


Someone noticed that the OpenStreetMap mailing lists are fairly high on the scoreboard, about rank 20,000. It's a community notorious for toxic attitudes and flame discussions. Also a lot of good stuff and a long archive of plenty of ordinary discussions. I just hate to think it's learned a certain rhetorical style from there.

4chan is there but way down at rank 4M. dailystormer.name, I believe one of the incarnations of the Nazi site, is at rank 72,000. oof, I sure hope they have some sort of labelling for the datasets or filters or something.
posted by Nelson at 10:31 AM on April 19, 2023 [7 favorites]




I am somewhat annoyed to learn that I am 0.00007% of the problem.

AI is stained by it's not your fault.
posted by gwint at 10:38 AM on April 19, 2023 [3 favorites]


Well this sure was interesting to encounter right after reading this article.
posted by brook horse at 10:42 AM on April 19, 2023 [4 favorites]


I don't know whether to be relieved or miffed that none of my mostly abandoned blogs were chosen for The Scraping.
posted by mygothlaundry at 10:45 AM on April 19, 2023


This explains why it's always telling me it's okay to eat old food.
posted by mittens at 10:47 AM on April 19, 2023 [39 favorites]


no one's providing knowledge or truth here, just autocomplete that's convincing enough for people who read crap off the internet daily.
posted by AlbertCalavicci at 10:48 AM on April 19, 2023 [13 favorites]


Well this sure was interesting to encounter right after reading this article.

Oh...wow. So much for making jokes. Dumping the contents of a hospital's Epic records into one of these models, and expecting useful results, will kill people.
posted by mittens at 10:51 AM on April 19, 2023 [4 favorites]


I am kind of annoyed that my site, online since '95 and likely one of the oldest continually maintained personal sites in existence, isn't included.
posted by COD at 10:53 AM on April 19, 2023 [4 favorites]


Here's an interesting takeaway:
The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.
And of course, copyright is automatic even without the symbol.
posted by Superilla at 10:55 AM on April 19, 2023 [15 favorites]


I thought ChatGPT was a little too good at telling me how to write a correctly formatted robots.txt file...
posted by RonButNotStupid at 11:00 AM on April 19, 2023 [13 favorites]


AI is stained by it's not your fault.

A lot of things that aren’t my fault are still my problem.
posted by mhoye at 11:14 AM on April 19, 2023 [5 favorites]


I'll bite. Metafilter: A lot of things that aren't my fault are still my problem.

Also, a lot of my problems are definitely my fault.
posted by JohnnyGunn at 11:16 AM on April 19, 2023 [5 favorites]


I'm surprised that gutenberg.org isn't there at all.
posted by credulous at 11:24 AM on April 19, 2023 [10 favorites]


You think it's easy, but you're wrong.
posted by pipeski at 11:35 AM on April 19, 2023 [1 favorite]


dailystormer.name, I believe one of the incarnations of the Nazi site, is at rank 72,000. oof, I sure hope they have some sort of labelling for the datasets or filters or something.

Probably quite useful as a correlate for negative speech, in fairness.
posted by jaduncan at 11:37 AM on April 19, 2023


A WaPo analysis of the C4 dataset used in training large language models like ChatGPT, LLaMA, and others.

The Post says this dataset was not used to train ChatGPT:
(OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT)

It's probably qualitatively similar but idk.
posted by grobstein at 11:37 AM on April 19, 2023 [2 favorites]


And someone told me the whole damn thing came from boingboing and Cory Doctorow novels.
posted by thecincinnatikid at 11:46 AM on April 19, 2023 [2 favorites]


My hospital uses Epic's MyChart. So I am definitely in the system. This is precisely why I have been losing weight for the last year, to get my BMI under 25. I believe the health benefits are marginal at best, but being marked as "overweight" in MyChart will 100% have long term impacts on my medical care. And truly I have to get my weight even farther down than is strictly necessary because I get weighed with shoes and clothes on. I hate, hate, hate, hate BMI. And now I hate Epic and GPT-4.
posted by grumpybear69 at 11:48 AM on April 19, 2023 [11 favorites]


I have a minor quibble with the word "secret" in the headline, since it appears to refer to a dataset derived from Common Crawl, which is maintained by a non-profit entity and available under a permissive license.
posted by credulous at 12:01 PM on April 19, 2023 [7 favorites]


I look forward to watching people on Hacker News try to explain why all of this copyright violation is completely okay.
posted by egypturnash at 12:01 PM on April 19, 2023 [2 favorites]


It's not at all clear that what ChatGPT does when training is a violation of copyright. It's very much an open question, both legally and among the general commentariat. I am a little curious what the copyright status of the Common Crawl itself is, though. Wikipedia says they say it "is distributed from the US under fair use claims"
posted by Nelson at 12:03 PM on April 19, 2023 [2 favorites]


Regardless of whether it violates copyright, it certainly breaks copyright.

Copyright isn't some inalienable, first-principles right; it's a body of law developed in response to the rise of the ability to mechanically copy and mass-produce works. If anyone can just freely copy an existing book, what incentive is there for anyone to bother writing new ones? If no one's getting paid to be a writer, then those hours aren't going to be spent by anyone honing the craft and making new works. And society and culture suffer as a result.

That's why copyright law was invented in the first place.

ChatGPT presents the same conundrum. Like the printing press, it is entirely reliant on the work of existing human authors. Without new input from human authors, it will end up eating its own tail with its output eventually becoming worthless. But, it maybe, probably, arguably doesn't break copyright laws as they currently stand, so ... apparently we should not do anything about it?
posted by Zalzidrax at 12:17 PM on April 19, 2023 [10 favorites]


What inference does ChatGPT make about the person behind Reddit /u/maxwellhill ?
posted by k3ninho at 12:22 PM on April 19, 2023


"sound smart"

To whom does Chat GPT 'sound smart'? Middle management corner cutters?
posted by eustatic at 12:32 PM on April 19, 2023 [7 favorites]


Interesting that educational content was only 7% of the lot, and that's with job-related content lumped in as well.
posted by doctornemo at 12:43 PM on April 19, 2023


It's interesting that it has the combined confidence and stupidity of the average Quora user without Quora actually being in the set.
posted by Lentrohamsanin at 12:45 PM on April 19, 2023 [7 favorites]


If anyone can just freely copy an existing book, what incentive is there for anyone to bother writing new ones?

I'm not sure this is the right question.
posted by chavenet at 1:04 PM on April 19, 2023 [10 favorites]


The script used to say "if we don't make profits, how are we going to pay highly trained scientists to try new chemical combinations for drugs?"

Better or more stories are things you "can't not do", but the physical world has thermodynamic costs and regulators keep us safe with costly sampling and validation protocols.
posted by k3ninho at 1:19 PM on April 19, 2023


"That's why copyright law was invented in the first place" seems to me to ignore the crucial component of the limited term of copyright. The other half of the limited-term economic rationale, from Jefferson and others, was that such knowledge should eventually pass into the public domain for the benefit of all—not just Walt Disney's famous mouse.
posted by vitia at 1:21 PM on April 19, 2023 [5 favorites]


I'd support a return to a 14 year copyright term, with registration required, and option for renewal for a second 14 year term.
posted by fings at 3:53 PM on April 19, 2023 [1 favorite]


Oddly enough my site is included.
posted by Peach at 4:42 PM on April 19, 2023


It's not at all clear that what ChatGPT does when training is a violation of copyright.

I don't think there is any clear precedent that training can violate copyright. It could presumably violate the content's license or a website's ToS, and the output could violate copyright. But training is an area not considered in copyright law or precedent.

I tried googling this and lots of lawyers seem ready to take money from lots of clients to tell them how unsettled this all is!
posted by mark k at 5:28 PM on April 19, 2023


Ooh, I'm in there with 26K tokens.

I'd just like to point out that my one-person site provides 2% of the number of tokens found on all of Metafilter.com, so you folks had better step up your game.
posted by zompist at 6:01 PM on April 19, 2023 [5 favorites]


I'm not seeing anything about *patient records* in the dataset, just that (I'm not saying this is good) Microsoft wants to use AI to craft messages to patients?
posted by pelvicsorcery at 7:01 PM on April 19, 2023


zompist is personally responsible for ChatGPT's ability to produce plausible-looking conlang pastiche.
posted by biogeo at 8:02 PM on April 19, 2023 [2 favorites]


No, seriously, my personal domain has 69 tokens.
posted by channaher at 8:10 PM on April 19, 2023 [2 favorites]


Someone should ask ChatGPT about the famous developmental child psychologist, Benjamin Adler.

I tried googling this and lots of lawyers seem ready to take money from lots of clients to tell them how unsettled this all is!


'It depends' is free. It's understanding what it depends on that will cost 'ya.
posted by snuffleupagus at 9:36 PM on April 19, 2023 [1 favorite]


ChatGPT is also breaking GDPR.
posted by DreamerFi at 11:40 PM on April 19, 2023 [2 favorites]


ChatGPT - when a system that generates text that looks like an answer based on analyzing lots of actual answers by humans without their permission is furiously hyped by people hoping they're on the ground floor of the next Bitcoin.
posted by GallonOfAlan at 1:09 AM on April 20, 2023 [5 favorites]


My blog is at #50,732 with 350k tokens! A little surprised by that, possibly it gets higher ranked because it's all from one person over a very long time. My check better be in the mail, Google!
posted by adrianhon at 1:29 AM on April 20, 2023 [3 favorites]


My film review site is at 250k tokens. 0.0002%! I've made it, ma! I've made it!
posted by brundlefly at 2:21 AM on April 20, 2023 [1 favorite]


can I assume that these large language models don’t represent a compilation of human knowledge, and instead represent a compilation of human verbal excreta?

Well yeah, they're language models, not knowledge models. Are people out there seriously claiming that they represent a compilation of human knowledge? Because I've seen a lot of breathless AI claims but that one is new to me.
posted by Dysk at 3:05 AM on April 20, 2023 [2 favorites]


And since the scraping process appears to lack any sort of heuristic for determining the validity of what it compiles, can I assume that these large language models don’t represent a compilation of human knowledge, and instead represent a compilation of human verbal excreta?

A general LLM is a model built from training data to work out plausible continuations of a prompt. It isn't inherently learning information per se, let alone whether that information is true or false, but the patterns of written language. So the goal is to generate a block of text that looks like the way a person would have followed on from the previous prompt - it's auto-complete on massive steroids.

Now, that absolutely has its uses, but you're in for disappointment if you think it's going to be some oracle of truth or knowledge; it will absolutely reflect the biases, falsehoods, received wisdom and other perceptions of reality inherent to the training data. For example, if 20% of the writing it reads about the moon references its cheese-like properties, that can definitely show up when you ask it questions about the moon, because that's the pattern of language it sees.

Now, there are things you can do to refine the model. One that has been used is reinforcement learning from human feedback: you provide a 'known good' training set, where people have written correct answers to specific questions or prompts. You then provide a 'reward' for the model: people rank the answers given by the model from best to worst, so it can incorporate that ranking into the training and weight its output more heavily towards those types of answers, which extends beyond those specific questions.
posted by Absolutely No You-Know-What at 3:15 AM on April 20, 2023 [3 favorites]
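
To make the "auto-complete on massive steroids" point above concrete, here is a toy sketch in plain Python: a made-up corpus and a simple bigram counter, nothing like a real transformer and not any code from the article. It learns which word tends to follow which in its training text and then samples continuations from those counts, so it happily reproduces the dominant pattern whether or not that pattern is true.

import random
from collections import defaultdict, Counter

# Tiny made-up "training corpus" -- note the cheese bias.
corpus = [
    "the moon is made of cheese",
    "the moon is made of rock",
    "the moon is made of cheese and dreams",
    "the moon orbits the earth",
]

# Count which word follows which (a bigram model).
transitions = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current, following in zip(words, words[1:]):
        transitions[current][following] += 1

def continue_text(prompt, length=6):
    """Extend the prompt by repeatedly sampling a next word from the counts."""
    words = prompt.split()
    for _ in range(length):
        options = transitions.get(words[-1])
        if not options:
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(continue_text("the moon"))
# Most runs print "the moon is made of cheese ..." -- not because it's true,
# but because that's the most common pattern in the training text.

Scaled up by many orders of magnitude, with neural networks over subword tokens instead of word counts, that is the basic mechanism the comment above describes.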


metafilter.com makes up 1.3 million tokens, or 0.0009% of the dataset, followed by ask.metafilter.com (4.1 million tokens or 0.003%), fanfare.metafilter.com (1.3 million tokens or 0.0008%), metatalk.metafilter.com (460,000 tokens or 0.0003%), bestof.metafilter.com (19,000 tokens or 0.00001%), and labs.metafilter.com (1,300 tokens or 0.0000008%).

and yet it doesn't know what a crouton petter is
posted by Jacqueline at 5:41 AM on April 20, 2023 [3 favorites]


it can't write a decent "This is Just to Say" poem either

i don't think we had as much influence as i'd hoped
posted by Jacqueline at 5:54 AM on April 20, 2023 [6 favorites]


I just asked it several times to generate a "this is just to say" poem about not being able to write a "this is just to say" poem and yeah they were all shit
posted by GCU Sweet and Full of Grace at 6:04 AM on April 20, 2023 [2 favorites]


GPT-4 did an OK job in my opinion:
This is just to say

I have used
the last square
of toilet paper
that was on the roll

and which
you were probably
saving
for an emergency

Forgive me
it was urgent
so soft
and so absorbent
posted by kaibutsu at 8:23 AM on April 20, 2023 [9 favorites]


kaibutsu I would love to see the prompt you used to get that because I couldn't get it to follow the format at all
posted by Jacqueline at 4:11 PM on April 20, 2023


Prompt: "Write a poem parody of William Carlos Williams' "This is just to say" about using the last square of toilet paper."

One thing to be aware of: The free ChatGPT on the OpenAI site is v3.5. Paid users have access to GPT4, which is legit leaps and bounds beyond 3.5, and presumably much more expensive to run... it's limited to 25 messages per 3 hours, even for paid users. You also need to click the right button to get the better model.

The prompt and poem above really were my first attempt at a prompt and the first response from GPT4; no cherry picking here, though you could argue that my request for toilet paper was an inspired human touch.

Here's what I got from ChatGPT (3.5), which, indeed, fails to follow the format entirely:
I have used the last square of TP,
Forgive me, it was necessary, you see?
I know you hoped to wipe with glee,
But alas, it was not meant to be.

You left it there, so pure and white,
But my needs could not be held in spite,
I took the last piece, without respite,
And used it with all my might.

Please don't be mad, or feel any strife,
I'll buy more rolls to last our life,
But for now, accept my apologies rife,
And know that I used the TP with good life.

In the last couple weeks it seems like the open source community has figured out how to train an LLM on par with GPT3 (maybe approaching 3.5) for a few hundred dollars, and can run these models natively on a Mac M1 machine. It is unclear how long it will take to catch up to GPT4, but shit is wild and moving crazy-fast. It's also not entirely clear that you need to catch GPT4, if you've got ready access to fine-tuning for a particular problem... If you want poetry-bot, you can take the pretty-good GPT3-ish open source model, feed it all poetry ever written, and approach GPT4 on that task.
posted by kaibutsu at 6:45 PM on April 20, 2023
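
For anyone wondering what "run these models natively" looks like in practice, here is a minimal sketch using the Hugging Face transformers library; the small GPT-2 model below is just a stand-in for the newer open models kaibutsu mentions (which are not named here), and the generation settings are illustrative guesses rather than anything used for the poems above.

# Minimal local text-generation sketch with Hugging Face transformers.
# "gpt2" is a small, CPU-friendly placeholder model; swap in whatever
# open model you have downloaded locally.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "This is just to say\n\nI have used\n"
outputs = generator(
    prompt,
    max_new_tokens=40,   # keep the continuation short
    do_sample=True,      # sample instead of always taking the top token
    temperature=0.9,     # a little randomness, for "poetry"
)
print(outputs[0]["generated_text"])

Fine-tuning that same kind of model on a narrow corpus (all the poetry you can find, say) is the "poetry-bot" idea described above; this sketch only covers generation, not training.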


kaibutsu: definitely looks like Bing is using GPT-4 for free:
I have used
the last square
of toilet paper
that was in
the bathroom

and which
you were probably
saving
for yourself

Forgive me
it was so soft
and so thin
and so needed
posted by adamsc at 7:16 PM on April 20, 2023 [6 favorites]




theonetruebix.com
Rank: 14,108,961
Tokens: 71 / 0.00000005%

thebelmontgoats.org
Rank: 9,535,983
Tokens: 510 / 0.0000003%
posted by bixfrankonis at 10:49 PM on April 21, 2023


How close are we to having an LLM-powered app that can manage my calendar and to-do list?
posted by rebent at 5:54 AM on April 24, 2023


Depends on how happy you'd be doing things in the most heavily discussed manner.
posted by snuffleupagus at 1:57 PM on April 24, 2023




This thread has been archived and is closed to new comments