FineWeb: decanting the web
July 9, 2024 2:08 PM

Secrets of generative AI: the French-American startup Hugging Face recently made the most powerful corpus of texts for developing language models available on its open-source platform. The process for obtaining a quality dataset is explained in detail here. The tool developed to achieve this result is available here. How does it work? First, you download a huge set of 5.354 TB of web pages here. You remove all the porn, crossing your fingers that not too much remains. You trash all languages except English, because everyone speaks English. Your dataset has lost 50% of its weight. Eliminate all duplicates. Filter out the nonsense created to fool SEO scores. Finally, you rate the remaining texts and keep only the best. It's that simple. By what miracle do you accomplish this task, which would keep mankind busy for several years? By using artificial intelligence, of course! Data sets intended to feed language models are filtered by language models. How logical! Refine one or two more times and no soul should remain.
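The steps above (drop non-English pages, deduplicate, then keep only the highest-rated text) can be sketched as a toy pipeline. This is a minimal illustration, not FineWeb's actual code: the real pipeline is built on Hugging Face's datatrove library, uses a trained language classifier rather than a word-list check, fuzzy MinHash deduplication rather than exact hashing, and model-based quality filters rather than the lexical-diversity stand-in below. Every function here is an invented placeholder.

```python
import hashlib

def detect_language(text):
    # Placeholder: a real pipeline would use a trained classifier (e.g. fastText).
    # This toy version just looks for a few common English function words.
    english_markers = {"the", "and", "of", "to", "is"}
    return "en" if set(text.lower().split()) & english_markers else "other"

def quality_score(text):
    # Placeholder for a model-based quality rating. This crude proxy
    # penalizes very short documents and repeated-word SEO spam by
    # measuring lexical diversity (unique words / total words).
    words = text.split()
    if len(words) < 5:
        return 0.0
    return len(set(words)) / len(words)

def filter_corpus(pages, min_quality=0.5):
    """Keep English pages, drop exact duplicates, keep only high-quality text."""
    seen = set()
    kept = []
    for page in pages:
        if detect_language(page) != "en":
            continue  # step 1: trash all languages except English
        digest = hashlib.sha256(page.encode()).hexdigest()
        if digest in seen:
            continue  # step 2: eliminate duplicates
        seen.add(digest)
        if quality_score(page) >= min_quality:
            kept.append(page)  # step 3: rate the texts, keep only the best
    return kept
```

In the real FineWeb pipeline each of these stages is far heavier (deduplication alone is done per-crawl with MinHash over shingled text), but the shape — a cascade of filters that discards most of the raw crawl — is the same.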
posted by verylazyminer (12 comments total) 17 users marked this as a favorite
I love that every sample paper on that Common Crawl web data dump contains "This is some text inside a div block." underneath it. It seems so... fitting.
posted by BigHeartedGuy at 2:22 PM on July 9

Data sets intended to feed language models are filtered by language models. How logical!

This is model collapse with extra steps.
posted by mhoye at 2:26 PM on July 9 [4 favorites]

^F "copyright" - no results. Huh.
posted by zompist at 2:40 PM on July 9 [12 favorites]

We wondered if the strong performance of the last few crawls could be, in part, attributed to the presence of a larger quantity of synthetic data (data generated by LLMs) ... we find a steep increase of our proxy metric in recent crawls. While this simple test is not enough to conclude that ChatGPT completions and other synthetic data is improving the quality of the most recent crawl, it at the very least does not seem to drastically harm it.

This is pretty weird, that the increased presence of ChatGPT-generated web pages doesn't seem to be harming the performance on benchmarks. Maybe it's the nature of the benchmarks or the model size, but ... weird.
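The quoted passage mentions a "proxy metric" for synthetic data without saying what it is. One simple version of such a proxy — purely illustrative, assuming a telltale-phrase frequency approach rather than whatever FineWeb actually measured — would just track how often LLM-characteristic boilerplate shows up across a crawl:

```python
# Phrases characteristic of LLM output; this list is illustrative only.
TELLTALE_PHRASES = [
    "as an ai language model",
    "i cannot fulfill that request",
    "it is important to note that",
]

def synthetic_proxy(pages):
    """Fraction of pages containing at least one telltale phrase.

    A crude proxy for LLM-generated contamination in a crawl: a rising
    value across successive crawls would suggest growing synthetic content.
    """
    if not pages:
        return 0.0
    hits = sum(
        any(phrase in page.lower() for phrase in TELLTALE_PHRASES)
        for page in pages
    )
    return hits / len(pages)
```

Plotting this per crawl is how you would see the "steep increase in recent crawls" the quote describes, without needing to classify any individual page as definitely synthetic.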
posted by credulous at 2:54 PM on July 9 [1 favorite]

Of note in this is The Stack, currently a 6.4 TB dataset of publicly available code which they pulled from the Software Heritage archive. There's a way to opt out (creating an issue in their GitHub repository, waiting for them to get to it, then waiting for them to remove it from future versions of the dataset), but this relies on trusting that every company that downloads it follows the terms-of-use requirement to keep up to date with releases and not keep or use old revisions.

I've already seen this cause problems: its existence has led to a lot of distrust of theoretically disconnected OSS archival projects, and the claim underlying genAI training on code is that licensing doesn't matter or doesn't transfer, so there's no way to say "No really, I don't want you slurping my code up".

For as many people as I've seen decry patterns suggesting a move to Dark Forest-type patterns of behavior, there sure are a lot of people eager to race to bring that into existence, arguing that they have to grind the public commons into paste before someone else does.
posted by CrystalDave at 2:56 PM on July 9 [11 favorites]

Data sets intended to feed language models are filtered by language models. How logical!

This is more logical than it might sound. It is easier to recognize useful text than it is to create it.
posted by a faded photo of their beloved at 4:01 PM on July 9 [2 favorites]

You remove all the porn

Does anybody know what they’re doing with all the unused porn? Asking for a friend.
posted by Horace Rumpole at 5:07 PM on July 9 [11 favorites]

This is pretty weird, that the increased presence of ChatGPT-generated web pages doesn't seem to be harming the performance on benchmarks.

A lot of people seem to interpret the idea of “model collapse” to mean that any contamination with synthetic data will cause a loss in quality. I do not think that this is actually what research on the topic indicates, and it doesn’t even seem that implausible intuitively that the average quality of output produced by a model trained on a somewhat indiscriminate sample of online text and refined with RLHF etc. might in some ways be higher than that of… a somewhat indiscriminate sample of online text.

It also seems possible that it’s just over-representing stuff that’s part of the benchmarks, though.
posted by atoxyl at 5:53 PM on July 9 [1 favorite]

Not all training data is of equal quality and in many cases synthetic text generated by existing LLMs is of superior quality to what real humans write: a large number of people are blithering, bigoted idiots who can barely string together a coherent thought or sentence. Also, many intelligent people have periodic lapses in reasoning and say uncharacteristically idiotic things, or post drivel to Metafilter while using the 'Ryvar' account, or come to the right conclusions for the wrong reasons. Properly reviewed and filtered synthetic data can have reduced bias and be free of the worst idiocy, Ryvar, and private information belonging to actual humans. LLMs being mostly but not entirely incapable of novel reasoning means that any examples of "reasoning" found in synthetic data will either be cloned from the generating model's own training data, or likely of inferior quality or complexity, even in cases where the overall grammatical quality of the text is better.

All of which suggests that there is an ideal natural:synthetic training data ratio, which probably varies wildly based on intended purpose and benchmark used. Meaning that in modern mixture-of-experts models, the definition of optimal training data probably varies between component "experts" within the model. Related to all this, nVidia recently released their open source Nemotron-4 340B model specifically intended for synthetic data generation.

According to interviews with OpenAI researchers, a big emphasis for the upcoming generation of models is sample efficiency, and in some sense the current generation's emphasis on multi-modality could be interpreted as an expression of that: each bit of text-based training data is a lot more useful when there are correlating images and video to put some real teeth behind pure text like "apples are red." Humans train their neural networks on far fewer samples than LLMs, but all of it is multi-modal, contextualized, reactive, and continuously trained. Point being: we are all living proof that ML could be doing far more with far less, but that would require Capital to actually fucking wait, and I think we all know that ain't happening.

Speaking of which:
^F "copyright" - no results. Huh.

I wouldn't hold my breath, because the answer to "how much English did it take to get us this far?" was "all of it." And if anyone tried to do it ethically they lost their research funding because they inevitably produced greatly inferior results. Still, there are some tiny points of light on the horizon, namely the Open Model Initiative (the big new open source image generation community project) proposing that - while they're not going to stop using everybody's artwork or anything crazy like that - they're at least going to start taking measures against people being able to do things like dropping an artist's name in a prompt in order to blatantly rip off their style or even specific works of art.

...and yeah, I know, but: baby steps. At least people pushing this forward within the open source community are acknowledging there is a problem and committing to SOME kind of corrective measures.
posted by Ryvar at 6:18 PM on July 9 [3 favorites]

I've been very skeptical of what the Valley is hyping as artificial "intelligence". And I figured the trend pointed toward an Ouroboros effect where bullshit would bring down the whole house of cards hall of mirrors. But this passage
By what miracle do you accomplish this task, which would keep mankind busy for several years? By using artificial intelligence, of course!
makes me realize that the Ouroboros effect might be avoided so long as the more-specific filter precedes the more generalised generative step. Giving the bot a bullshit filter, it seems, might be the answer. Multi-staged, even!
posted by CookTing at 9:30 PM on July 9 [1 favorite]

patterns suggesting a move to Dark Forest-
ah, through the forest: 🌲🌳🌴 *goes dark*
posted by HearHere at 2:03 AM on July 10

As long as a model has information about human or animal anatomy, there will be porn of it.
posted by JustSayNoDawg at 2:45 PM on July 10
