Is the internet becoming an infinite – and useless – library?
November 29, 2024 3:47 PM
An 83-year-old short story by Borges portends a bleak future for the internet? A July 2024 paper published in Nature explored the consequences of training AI models on recursively generated data. It showed that “irreversible defects” can lead to “model collapse”. So how bad might this get? Fiction writers have explored some possibilities (via The Conversation)
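The Nature result the post links can be caricatured in a few lines (a toy sketch of my own, not the paper's actual experiment): fit a simple model to data, generate from it while favoring only its most likely outputs, refit to what was generated, and repeat. The tails vanish and the fitted spread collapses toward zero.

```python
import random
import statistics

# Toy "model collapse": each generation is a Gaussian fit to the
# previous generation's most *typical* outputs. Preferring
# high-probability samples discards the tails, so the estimated
# spread shrinks with every generation.
random.seed(42)
mu, sigma = 0.0, 1.0            # generation 0: fit to the "real" data
history = [sigma]
for generation in range(10):
    samples = [random.gauss(mu, sigma) for _ in range(2000)]
    kept = [x for x in samples if abs(x - mu) <= sigma]  # only "typical" outputs
    mu = statistics.mean(kept)   # refit on recursively generated data
    sigma = statistics.stdev(kept)
    history.append(sigma)

print(f"spread: generation 0 = {history[0]:.3f}, generation 10 = {history[-1]:.3f}")
```

Each pass multiplies the spread by roughly 0.54 (the standard deviation of a normal truncated at ±1σ), so ten generations leave almost nothing of the original distribution.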
I work as a deep learning researcher, although not in LLMs, and I've had similar "future shock" experiences when contemplating what's going on. For language models, it is profoundly interesting that so much of what we understand as "meaning" in a line of text is merely statistical correlation between words. As long as an ordered series of words has the right probabilistic relationship to each other, we will find meaning in it! We'll believe that something that thinks is behind it and is expressing ideas. It's like discovering that a single equation governs all "beautiful music," and as long as we sample that equation in the right way, we get infinite songs. (I mean, this is probably true too..)
posted by riotnrrd at 5:42 PM on November 29 [7 favorites]
Strong recommend: A Canticle for Leibowitz
posted by constraint at 5:48 PM on November 29 [8 favorites]
Is the internet becoming an infinite – and useless – library?
libraries are so much more than just the internet (though, yes, they also have that)
posted by HearHere at 6:07 PM on November 29 [6 favorites]
Not so much about the internet, but I've been more or less saying this for a while now. Specifically I've been talking with friends about the continued fall of Twitter.
My theory is that Twitter (and Facebook, to a lesser extent, so far) is becoming worse and worse in terms of signal-to-noise ratio. There's so much junk on Twitter - bots, ads, nazis, etc - that any useful information is harder and harder to find.
Eventually, this S/N ratio will become so bad that the website will become effectively useless. Think of 4Chan. Whatever it started out as, there's no point to looking for any useful information there, now.
The same could happen to the internet as a whole, but at this point in time, I kind of doubt it.
posted by Relay at 7:11 PM on November 29 [1 favorite]
For language models, it is profoundly interesting that so much of what we understand as "meaning" in a line of text is merely statistical correlation between words
I am not a deep learning researcher, but the phrasing "mere statistical correlation" strikes me as something of an oversimplification for effect? It's an unguided, nondeterministic mathematical process, but it's manifestly, well, a deeper model of the structure of language than a Markov generator.
But yeah the generality of these methods is crazy.
posted by atoxyl at 7:18 PM on November 29 [1 favorite]
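For anyone who hasn't met one, this is the crude baseline atoxyl means by a Markov generator: a bigram chain that knows nothing except which word followed which in its training text (the training sentence here is my own, invented for the example).

```python
import random
from collections import defaultdict

# Bigram Markov chain: record, for each word, the words that
# followed it; then generate by sampling those successors at random.
text = ("the library contains every book and every book contains "
        "every sentence and every sentence contains the library").split()

following = defaultdict(list)
for a, b in zip(text, text[1:]):
    following[a].append(b)

random.seed(1)
word, out = "the", ["the"]
for _ in range(12):
    word = random.choice(following[word])  # pick a successor by raw frequency
    out.append(word)

print(" ".join(out))  # sentence-shaped, but nothing is "meant"
```

Every word in this training sentence happens to have at least one recorded successor, so generation never dead-ends; a real corpus wouldn't be so tidy, which is one of many ways LLMs go far beyond this.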
Also see Spider Robinson's "Melancholy Elephants" for a taste of what happens when all of the music, etc. people can think of is already under copyright.
posted by johnabbe at 9:48 PM on November 29 [1 favorite]
riotnrrd: As long as an ordered series of words has the right probabilistic relationship to each other, we will find meaning in it! We'll believe that something that thinks is behind it and is expressing ideas.
Weirdly, what I’ve mostly observed is the opposite: people have become so distrustful of text on the internet that they assume that writing by real people is generated by “bots”.
posted by Kattullus at 12:06 AM on November 30 [5 favorites]
not only is every image understandable (rather than the random noise of Borges' Library)
Who is to say that any given sample of what we so cavalierly dismiss as random noise is not, in fact, the secret to Life, the Universe and Everything, encrypted with a key we don't yet know?
posted by flabdablet at 12:44 AM on November 30
Eventually, this S/N ratio will become so bad that the website will become effectively useless. Think of 4Chan. Whatever it started out as, there's no point to looking for any useful information there, now.
I think it already happened to the Internet as a whole. That's why we have Facebook and Twitter and Reddit and Amazon and a few others. Because the "internet as a whole" has been 90% garbage for years and those sites offer an island of relatively uncorrupted sanity. Or at least they used to.
The same could happen to the internet as a whole,
I mean, 10 years ago I shopped on many online sites, I browsed through blogs and RSS feeds, I discovered new things constantly. Now I assume any website I find is AI-generated slop, any storefront I find is a scam, and I spend 90% of my time on a small curated list of sites and communities I trust. (Hi Metafilter!)
I think the future of the Internet looks a lot like the past of the Internet. Back in 1995 I'd browse through categories of websites at Yahoo.com because a few people who ran the site had checked the links and decided whether they were worthy to be listed. Even back then I knew it only listed a fraction of the "whole internet" but it was useful.
I'm already wishing there was a site like that now because searching the whole Internet with Google is increasingly useless.
posted by mmoncur at 2:23 AM on November 30 [9 favorites]
mmoncur - I've noticed even in the last four years a big change in how I feel about online shopping. In 2020 I bought so much stuff (it was a different time, of course) to prepare for the birth of our firstborn, and had a positive experience. But recently I've been much more hesitant to purchase stuff online, it seems so much harder to find a genuine company to deal with. I'm looking at different products, of course (back then it was all baby products and that isn't where we are at as a family anymore) so it's difficult to make a true comparison.
posted by freethefeet at 3:41 AM on November 30 [2 favorites]
The article focuses on LLMs, but image models are particularly mind-boggling to me, in the "infinite library" sense.
It’s definitely been the rapid advancement of image generation that has seriously shaken me. Just the public-facing tools have improved so much in such a short period of time as to be beyond worrisome. Even a mere six months ago, it was relatively easy for a curious eye to spot the AI images, what with the various fiddly body bits, text, incorrect lighting, overlapping body parts, signage in the backgrounds, etc.
Today, though, things have improved immeasurably. There are still occasional instances of six-fingered hands, for instance, but they are becoming very few and very far between. The damned things are even showing signs of rendering text/signage/letterforms correctly.
And, as I said, this is just with the public tools. God knows what the various state-sponsored propaganda machines can churn out now.
posted by Thorzdad at 3:44 AM on November 30 [2 favorites]
Old heads may remember Eternal September, the idea that instead of Usenet/the Internet experiencing a substantial influx of new unacculturated members only at the beginning of the US academic year, commercial availability meant that new people were arriving at that pace forever.
We’re now at Infinite September, where the new arrivals can't be bargained with, can't be reasoned with, and absolutely will not stop.
posted by zamboni at 4:42 AM on November 30 [8 favorites]
First, that Borges short story linked in the article was fantastic and a worthy read. Thank you for sharing.
Second, I'm curious: what percentage of internet content is in English? There are many different ways to look at this, but the W3Techs data (via Wikipedia) suggests that around 50% of all content is in English, while the second-largest language, Spanish, sits at 6%. Though I'm not familiar with the latest research firsthand, I'm sure there's plenty of LLM work being done in other languages. Would those models face this recursive model collapse even sooner than English ones, with only 10% or 1% or even less of the available content to ingest?
Or is that impact somewhat lessened because the models are largely trained on English and then effectively use a translation layer in between to communicate back and forth with a local language? Genuinely curious.
posted by robot_jesus at 6:24 AM on November 30 [2 favorites]
When you ask a librarian about a particular book or author they don't spend 10 minutes trying to sell you something or recommending other books or authors before reluctantly giving you the information you wanted.
posted by tommasz at 6:57 AM on November 30 [12 favorites]
The article mentions the NYT and the like as credible sources, but we all just sat through two years of that paper sanewashing the Republican candidate(s). Misinformation can come in many forms, and the various papers' (likely management-driven) need to ask, "who are you gonna believe? Us or your lying eyes?" has to have contributed to the continued flight of each generation away from the old guard of media.
posted by Slackermagee at 7:04 AM on November 30 [2 favorites]
The Borges story doesn't seem like the best analogy because LLMs do a great job at producing written pieces that appear reasonable and lucid and not random…it's just that those pieces never contain anything compelling or original. It reminds me of when Jesus told his disciples, "You are the salt of the earth, but if that salt loses its taste, it is good for nothing but to be thrown out and trampled upon."
posted by jabah at 7:35 AM on November 30 [3 favorites]
this strikes me as pointing to the phrasing of “mere statistical correlation” being something of an oversimplification for effect?
Yes, I mean, a little, sure. My sense of wonder about it is a little silly, I guess. I'm reminded of how amazed people were in the 90s when artists were making realistic-looking mountain ranges and coastlines with fractals. In both cases, a mathematical model was found or generated that closely but not perfectly matches the underlying structure of a natural phenomenon. But in neither case does it actually illuminate the mechanisms of the real natural process. Fractals don't provide insight about erosion, and LLMs don't "understand" language.
posted by riotnrrd at 10:25 AM on November 30 [1 favorite]
I once asked ChatGPT for a list of titles by No Starch Press, and it started giving me actual titles, then titles about software that were not from No Starch Press, then titles about Zen:
34. The Art of Software Security Assessment
35. The Art of Software Testing
36. The Art of UNIX Programming
37. The Art of War for Computer Security
38. The Art of X-Ray Reading
39. The Art of Zen Meditation
40. The Art of Zen Practice
41. The Art of Zen Teaching
...and then it started getting weird:
45. The Art of Zen Zazen
46. The Art of Zen Zendo
47. The Art of Zen Zendo Practice
48. The Art of Zen Zendo Training
49. The Art of Zen Zendo Writing
50. The Art of Zen Zendo Yoga
51. The Art of Zen Zendo Zazen
52. The Art of Zen Zendo Zendo
53. The Art of Zen Zendo Zendo Practice
54. The Art of Zen Zendo Zendo Training
...and then it got very weird:
63. The Art of Zen Zendo Zendo Zendo Zazen
64. The Art of Zen Zendo Zendo Zendo Zendo
65. The Art of Zen Zendo Zendo Zendo Zendo Practice
65. The Art of Zen Zendo Zendo Zendo Zendo Training
...and ended with:
98. The Art of Zen Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo Yoga
99. The Art of Zen Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zazen
100. The Art of Zen Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo Zendo
I can't wait to read all these books.
posted by AlSweigart at 12:04 PM on November 30 [6 favorites]
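That runaway Zendo list looks a lot like what greedy decoding does whenever a model's most probable next word leads back to itself. A toy sketch with invented bigram counts (not ChatGPT's actual vocabulary or decoding mechanism):

```python
from collections import Counter

# Greedy decoding: always emit the single most likely next word.
# If the top successor of "Zendo" is "Zendo" itself, the output
# locks into a loop, much like the book list above.
bigram_counts = {
    "Art": Counter({"of": 10}),
    "of": Counter({"Zen": 6, "Software": 3}),
    "Zen": Counter({"Zendo": 5, "Meditation": 1}),
    "Zendo": Counter({"Zendo": 4, "Practice": 2}),  # self-loop dominates
}

word, out = "Art", ["Art"]
for _ in range(8):
    successors = bigram_counts.get(word)
    if not successors:
        break
    word = successors.most_common(1)[0][0]  # greedy choice
    out.append(word)

print("The " + " ".join(out))  # The Art of Zen Zendo Zendo Zendo ...
```

Real samplers add randomness (temperature) and repetition penalties precisely to break out of loops like this, which is part of why the failure mode has become rarer.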
I guess at some point Echo & the Bunnymen's lyrics got into the training data.
posted by polytope subirb enby-of-piano-dice at 12:25 PM on November 30 [2 favorites]
I'm not going to argue with either mmoncur or freethefeet ... Dunno, maybe I was just being optimistic about The Internet(tm).
posted by Relay at 7:23 PM on November 30 [1 favorite]
As long as an ordered series of words has the right probabilistic relationship to each other, we will find meaning in it! We'll believe that something that thinks is behind it and is expressing ideas.
Those two things are very different, though. A series of ordered words can carry semantic content; that’s independent (more or less) from how the series of words was generated. Believing that whatever is behind “thinks” and “is expressing ideas” is a choice we make, even if unconsciously.
posted by nickmark at 9:55 AM on December 1 [2 favorites]
The truth is out there –but so is every conceivable falsehood. And all of it is embedded in an inconceivably vast amount of gibberish... which sounds exactly like the Internet of today.
... why we have Facebook and Twitter and Reddit and Amazon and a few others. Because the "internet as a whole" has been 90% garbage for years and those sites offer an island of relatively uncorrupted sanity. Or at least they used to.
Absolutely, but before Facebook etc, we had blogs and the like that were, largely, collections of information curated by an actual human, along with hand-built infrastructure like webrings that served as connections to similar places. We could rely on that information because we could see who that human was and form a view as to how trustworthy they were.
This is where sites like MetaFilter started the movement away from individually curated material towards information curated by a group of people, but there was a sense of community and agreed standards that awarded such sites credibility. Then the advertisers stepped in and started shaping content and its presentation in a way that has ultimately killed any trust in information on the Internet by anyone with their critical thinking skills at least somewhat intact.
Anyone who trusts information on the Internet at first glance today is fooling themselves, because so much of it is generated for the sole purpose of driving people towards advertising. Add into this what is becoming a swampland of recursively generated nonsense, and the Internet is rapidly becoming worse than useless in large part, but huge numbers of people still hold onto the view that, if it's on the Internet, it must be true. Even worse, the belief that 'Google is your friend' and wouldn't lie to you is harmfully pervasive.
I don't know what the future of the Internet is, but I'm afraid the idea that only those who can afford to pay for it will be able to access anything resembling accurate information on pretty much anything seems like the most likely outcome.
posted by dg at 4:40 PM on December 1 [3 favorites]
This afternoon I was looking up the correct pressure for my tires. I punched in the model info and got a top generated response of 51 psi. I double-checked the sidewalls to find they're rated 51 kPa… NOT to be inflated over 40 psi. If I had trusted the AI, I could've died on the highway home.
The main reason I support Metafilter is because AskMefi has given me such valuable advice for a decade and a half (this and a previous profile)
posted by brachiopod at 6:19 PM on December 1 [3 favorites]
It's not available right now, but there was a demo that let you generate images in real time, literally refining the result as fast as you could type the prompt. Consider the basic parameters of the model:
- each image was 512x512, room for plenty of detail
- the maximum prompt length was 77 tokens, or a largish paragraph
- each image had a "seed" value between 0 and 9,999,999, with each discrete seed giving a completely different take on the prompt
I can't begin to calculate the upper limit on the number of possible human-readable prompts that can fit in 77 tokens -- and these models can turn even gibberish into something understandable -- but multiply even an (extremely conservative) estimate of a million possible prompts by 10 million seeds and it's clear that this model "contains", at minimum, literally tens of trillions of possible meaningful images -- all in a model file that's under 7 GB.
It feels like having nigh-infinite possibility crammed, TARDIS-like, into an impossibly small package. An art gallery the size of a small town museum with a collection orders of magnitude larger than all previous images combined. And yet despite that vast size, not only is every image understandable (rather than the random noise of Borges' Library), but you can instantly locate any image you want (or at least something close to it) just by describing it. It's a borderline miraculous concentration of visual and conceptual information in a ridiculously intuitive way, and yet ironically threatens to deluge the world of real images with an infinite ocean of unreal images that could make finding an authentic picture as difficult as finding an intelligible sentence in the Library. It's wild stuff and by far the biggest future shock I've ever felt.
posted by Rhaomi at 4:49 PM on November 29 [19 favorites]
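Rhaomi's arithmetic holds up: take the deliberately conservative guess of a million distinct prompts (the 10,000,000 seeds and 77-token limit are from the comment itself) and the lower bound is already ten trillion images.

```python
# Lower bound on distinct outputs from the demo model described above.
prompts = 1_000_000      # extremely conservative estimate of usable prompts
seeds = 10_000_000       # seed values 0 through 9,999,999
images = prompts * seeds

print(f"{images:,} possible images")  # 10,000,000,000,000 -- ten trillion
```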