Separating hyperplanes with Shoggoth Shalmaneser
September 26, 2023 11:59 PM

A jargon-free explanation of how AI large language models work - "Want to really understand large language models? Here's a gentle primer." [link-heavy FPP!]
Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. The human mind can’t envision a space with that many dimensions, but computers are perfectly capable of reasoning about them and producing useful results.
Generative AI exists because of the transformer [ungated] - "In order to grasp a word's meaning, work in our example, LLMs first observe it in context using enormous sets of training data, taking note of nearby words. These datasets are based on collating text published on the internet, with new LLMs trained using billions of words."
Eventually, we end up with a huge set of the words found alongside work in the training data, as well as those that weren’t found near it.

As the model processes this set of words, it produces a vector — or list of values — and adjusts it based on each word’s proximity to work in the training data. This vector is known as a word embedding.

A word embedding can have hundreds of values, each representing a different aspect of a word’s meaning. Just as you might describe a house by its characteristics — type, location, bedrooms, bathrooms, storeys — the values in an embedding quantify a word’s linguistic features.

The way these characteristics are derived means we don’t know exactly what each value represents, but words we expect to be used in comparable ways often have similar-looking embeddings.

A pair of words like sea and ocean, for example, may not be used in identical contexts (‘all at ocean’ isn't a direct substitute for ‘all at sea’), but their meanings are close to each other, and embeddings allow us to quantify that closeness.
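That closeness can be made concrete with cosine similarity between embedding vectors. Here's a minimal sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these numbers are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction,
    # values near 0 mean the vectors are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy embeddings; real ones come out of a trained model.
sea    = [0.9, 0.8, 0.1]
ocean  = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

print(cosine_similarity(sea, ocean))   # close to 1.0: similar meanings
print(cosine_similarity(sea, banana))  # much smaller: unrelated meanings
```

The same measure is what "embeddings allow us to quantify that closeness" refers to: similar contexts produce vectors pointing in similar directions.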
LLM now provides tools for working with embeddings - "Embeddings are a fascinating concept within the larger world of language models."[2]
I explained embeddings in my recent talk, Making Large Language Models work for you. The relevant section of the slides and transcript is here, or you can jump to that section on YouTube.

An embedding model lets you take a string of text—a word, sentence, paragraph or even a whole document—and turn that into an array of floating point numbers called an embedding vector.
A Hackers' Guide to Language Models - "Jeremy Howard's new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you're an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset."[3]

Transformers Revolutionized AI. What Will Replace Them? - "Transformer co-inventor Ashish Vaswani summed it up well: 'The transformer is a way to capture interaction very quickly all at once between different parts of any input. It's a general method that captures interactions between pieces in a sentence, or the notes in music, or pixels in an image, or parts of a protein. It can be purposed for any task.'"
But there is a more specific reason for transformers’ computational cost: the transformer architecture scales quadratically with sequence length. Put simply, this means that as the length of a sequence processed by a transformer (say, the number of words in a passage or the size of an image) increases by a given amount, the compute required increases by that amount squared, quickly growing enormous.

There is an intuitive reason for this quadratic scaling, and it is inherent to the transformer’s design.

Recall that attention makes it possible to understand relationships between words regardless of how far apart they are in a sequence. How does it do this? By comparing every single word in a sequence to every other word in that sequence. The consequence of this pairwise comparison is that as sequence length increases, the number of required computational steps grows quadratically rather than linearly. To give a concrete example, doubling sequence length from 32 tokens to 64 tokens does not merely double the computational cost for a transformer but rather quadruples it.
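The pairwise-comparison count behind that quadrupling can be checked directly. A toy sketch (counting comparisons only, not doing any real attention arithmetic):

```python
def attention_comparisons(seq_len):
    # Every token attends to every token (itself included): n * n pairs.
    return seq_len * seq_len

for n in (32, 64, 128):
    print(n, "tokens ->", attention_comparisons(n), "comparisons")

# Doubling the sequence length quadruples the comparison count:
assert attention_comparisons(64) == 4 * attention_comparisons(32)
```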

This quadratic scaling leads to a related drawback: transformers have a hard time handling very long sequences.

As sequences grow in length, feeding them into transformers eventually becomes intractable because memory and compute needs explode quadratically. Consider, for example, processing entire textbooks (with millions of tokens) or entire genomes (with billions of tokens).

Increasing the maximum sequence length that a model can be fed at one time, known as the model’s “context window,” is an active area of research for large language models today. The context window for the base GPT-4 model is 8,000 tokens. A few months ago, OpenAI released a souped-up version of GPT-4 with a 32,000-token context window. OpenAI competitor Anthropic then upped the ante, recently announcing a new model with a 100,000-token context window.

This arms race will no doubt continue. Yet there are limits to how big OpenAI, Anthropic or any other company can make its models’ context windows if they stick with the transformer architecture...

One of the most intriguing new architectures in the S4 [sub-quadratic] family is Hyena, published a few months ago by a powerhouse team that includes [Chris] Ré and Yoshua Bengio.

In place of attention, Hyena uses two other operations: long convolutions and element-wise multiplication.

Convolutions are one of the oldest existing methods in machine learning, first conceived of by Yann LeCun back in the 1980s. Hyena’s fresh take on this venerable architecture is to stretch and vary the size of the convolution filter based on the sequence length in order to boost computational efficiency.
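As a rough illustration of those two operations (this is not Hyena itself: the real architecture learns its filters, makes them as long as the sequence, and stacks several such stages), a single convolve-then-gate step might look like:

```python
def conv1d(signal, kernel):
    # Same-length 1D convolution in pure Python, zero-padded at the start.
    n, k = len(signal), len(kernel)
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(k):
            if i - j >= 0:
                acc += signal[i - j] * kernel[j]
        out.append(acc)
    return out

def gate(a, b):
    # Element-wise multiplication of two equal-length sequences.
    return [x * y for x, y in zip(a, b)]

x = [1.0, 2.0, 3.0, 4.0]
# In Hyena the filter would be learned and as long as the sequence;
# these values are invented for illustration.
long_filter = [0.5, 0.25, 0.125, 0.0625]

mixed = conv1d(x, long_filter)   # the "long convolution" step
gated = gate(mixed, x)           # the element-wise "gating" step
print(gated)
```

The appeal is that convolutions can be computed with FFTs in sub-quadratic time, which is where the efficiency gain over attention comes from.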
Tai-Danae Bradley: Modeling Language with Tensor Networks - "This talk features the latter. I'll share an elementary passage from classical probability to quantum probability and use it to describe a tensor network model for language."[5,6,7,8]
  • Machine Learning, Statistical Inference and Induction - "'Machine learning' is the AI label: how do we make a machine that can find and learn the regularities in a data set? ... The connection to neuroscience and cognitive science is plain: how on Earth do human beings, and other critters, actually learn?"
  • Learning Theory (Formal, Computational or Statistical) - "We have a bunch of inputs and outputs, and an unknown relationship between the two... A learning algorithm takes in a set of inputs and outputs, its data, and produces a hypothesis... The key notion is that of a probably approximately correct learning algorithm --- one where, if we supply enough data, we can get a hypothesis with an arbitrarily small error, with a probability arbitrarily close to one."
  • 'Attention', 'Transformers', in Neural Network 'Large Language Models' - "The fact that attention is just a kind of kernel smoothing takes nothing away from the incredibly impressive engineering accomplishment of making the blessed thing work. A large, able and confident group of people pushed kernel-based methods for years in machine learning, and nobody achieved anything like the feats which modern large language models have demonstrated. The reason I put effort into understanding these machines and papers is precisely because the results are impressive! To see that a key step is, after all, something we'd been doing for decades is humbling. (What else are we missing about tools we think we understand?)"
also btw... Rethinking the Luddites in the Age of A.I. [ungated] - "Brian Merchant's new book, 'Blood in the Machine,' argues that Luddism stood not against technology per se but for the rights of workers in the face of automation."[12]

Unliving systems that run on human beings - "Markets and states can have enormous collective benefits, but they surely seem inimical to individuals who lose their jobs to economic change or get entangled in the suckered coils of bureaucratic decisions. As Hayek proclaims, and as Scott deplores, these vast machineries are simply incapable of caring if they crush the powerless or devour the virtuous. Nor is their crushing weight distributed evenly. It is in this sense that LLMs are shoggoths. Like markets and bureaucracies, they represent something vast and incomprehensible that would break our minds if we beheld its full immensity."[13,14,15]
posted by kliuless (28 comments total) 99 users marked this as a favorite
just to connect some more dots in the database...
One of Asher’s innovations — or more precisely one of his companies’ innovations — was what is now known as the LexID. My LexID, I learned, is 000874529875. This unique string of digits is a kind of shadow Social Security number, one of many such “persistent identifiers,” as they are called, that have been issued not by the government but by data companies like Acxiom, Oracle, Thomson Reuters, TransUnion — or, in this case, LexisNexis.

My LexID was created sometime in the early 2000s in Asher’s computer room in South Florida, as many still are, and without my consent it began quietly stalking me. One early data point on me would have been my name; another, my parents’ address in Oregon. From my birth certificate or my driver’s license or my teenage fishing license — and from the fact that the three confirmed one another — it could get my sex and my date of birth. At the time, it would have been able to collect the address of the college I attended, Swarthmore, which was small and expensive, and it would have found my first full-time employer, the National Geographic Society, quickly amassing more than enough data to let someone — back then, a human someone — infer quite a bit more about me and my future prospects.

When I opened my first credit card, it got information from that; when I rented an apartment in New York City, it got information from that; when I bought a cheap car and drove across the country, it got information from that; when I got a speeding ticket, it got that; and when I secured a mortgage and bought my first house in Seattle, it got that.

Two decades after its creation, my LexID and its equivalents in the marketing world have connected tens of thousands of data points to me. They reveal that I stay up late and that I like to bicycle and that my grandparents are all dead and that I’ve underperformed my earning potential and that I’m not very active on social media and that I now have a wife and kids, who, if they don’t already have LexIDs, soon will.

Persistent identifiers let algorithms map in milliseconds a network of people I’ve met, lived near or interacted with online or off, and they show the trajectory of my life — up, down and sideways. They help health systems assess my living conditions, impacting what kind of care I get from my doctor. They affect how much I pay for car insurance. They help determine what kind of credit cards I have. They influence what ads I see and how long I wait on hold when I call a customer-service line. They allow computers inside police departments, intelligence agencies, hospitals, banks, insurance companies, political parties and marketing firms to understand personal behavior and, increasingly, as artificial intelligence and machine learning expand into every corner of society, to predict and exploit it.
oh and re: Blood in the Machine, "Into the Meat" :P
posted by kliuless at 12:34 AM on September 27 [16 favorites]

Nice post! I haven't looked at all the links but the main one agrees with my understanding of how GPT works.

I'll add Andrej Karpathy's video on how to build a nano-scale version of GPT using PyTorch. It's pretty technical and I found some parts hard to grasp, but with a basic understanding of Python I was able to get it running and generating fake Shakespeare. It is part of a series that covers neural networks and LLMs more broadly.
posted by justkevin at 12:39 AM on September 27 [2 favorites]

yeah, iirc, using letters instead of tokens?*
posted by kliuless at 12:56 AM on September 27

Yes, as I recall in one of the earlier videos of the series you create a visualizer for the vector space and predictably "vowel/consonant" was a learned dimension.

Even though it lacks the "feature depth" of word-sized tokens, it was able to learn the patterns of Shakespeare surprisingly well:

Initial, random model:
!.TIQoywDXZjskHHBw? yXVGjNvvA?i&o'LNAoCYwNZ'UiRPYvUtzg,ZGUMnFilij;bB rntjcF&VLT:fz.ImYciXnYtmb m:,YjjjEFSZoAx.hDkotw;;n',rotPelTyZ:: MJoJJyR,oeyJLOYLWp;JomucrJu'onnkfovtnZo sn-b!eTdO

After 1000 iterations:
Sous sheacw? youlj vever.
Thll tablentiente, you bele cous hemocind that the sill noto.

Tote hous beist yous arestay, werep Leppaceteriuconnk oven shong ied ard senvevi

After 20000 iterations:
The heart by, that not greans; by your my browees to the swing that succiear shoween in the indegelant.

TORKENBUSWARD, This thooke moniur, 'stay?

By think offfer lighones prevai
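The character-level tokenization behind output like this is simple to sketch: the vocabulary is just the set of characters seen in the training text. (This mirrors the approach in Karpathy's video, though the code here is a generic reconstruction, not his.)

```python
text = "To be, or not to be"

# Build the vocabulary from the characters actually present.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> string

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text  # encoding round-trips exactly
print(len(chars), "distinct characters in the vocabulary")
```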

posted by justkevin at 1:28 AM on September 27 [2 favorites]

Stunning post, kliuless, thank you.
posted by signal at 4:19 AM on September 27 [10 favorites]

By think offfer lighones prevai

I can't say how much I like that this model hasn't yet figured out that "lighones" isn't a word, but has figured out that "CLAUDIO" is one of the more likely words to go before a line of dialogue.
posted by Jeanne at 5:58 AM on September 27 [4 favorites]

maximum anti-eponysterical, dude.
posted by skippyhacker at 6:09 AM on September 27 [1 favorite]

Great post, much reading queued up, thank you indeed kliuless...

Some interesting perspective on the scaling problem(s) in section 13 of this outline / marketing / propagandising of the thoughts of some of the actors currently at the coalface. The rest of the article may also be of side interest to readers of this post, though it's not as directly relevant to the detail kliuless has covered.

(I wanted to build a more technical AI / LLM post around that link and some other stuff, but both my max context-window and the quality of my underlying model have been well and truly smashed!)
posted by protorp at 6:55 AM on September 27 [1 favorite]

Too bad that most of the things these technologies & techniques will be used for will prove to have awful purposes.
posted by Insert Clever Name Here at 8:26 AM on September 27

That first article was an excellent explainer. I've been thinking of all "AI" as what I (very limitedly) understand as a GAN, where one program generates a response to a query, then another program evaluates that response and the process is repeated until it meets a certain threshold. Explaining things as matrices of vectors which are continually tagged and refined makes it slightly more comprehensible.
posted by slogger at 8:29 AM on September 27

I had no idea that nobody knew how LLMs worked. I thought it was just really hard to understand. Jesus!

Having read "At the Mountains of Madness," I think LLMs may be one of the clearest examples of creating the Torment Nexus that I have ever seen, if not the example.
posted by Countess Elena at 9:16 AM on September 27 [3 favorites]

Many, many, many moons ago I was working for a noted believer in alien abduction on a website that was supposed to be the nexus of all information. My task was to scrape and auto-categorize news articles. At the time, the state of the art in that department was an algorithm called Latent Semantic Indexing. Dude didn't want to pay the licensing fee for a fully vetted implementation and asked me to implement it from a whitepaper in 3 weeks. I quickly contacted my PhD comp sci / EE friend who made chips for Intel and asked if he thought that was a reasonable or possible request. He laughed heartily, and then said "no." In the meantime I began to think about how I would devise such a system, and settled upon an n-dimensional matrix where words and phrases would be plotted and then assigned rankings based on their distance from one another - something calculated using a multi-dimensional equivalent of the Pythagorean theorem. I bought a book about matrix algebra, quickly realized that I didn't have the math background to even read the symbols in the book, and decided to convince dude to just pay the licensing fee. He did.
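That "multi-dimensional equivalent of the Pythagorean theorem" is just Euclidean distance, which generalizes to any number of dimensions. A sketch with invented coordinates:

```python
import math

def euclidean_distance(a, b):
    # Pythagorean theorem generalized to n dimensions: square root of the
    # sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented 5-dimensional "word positions", purely for illustration.
cat = [0.2, 0.8, 0.1, 0.5, 0.3]
dog = [0.3, 0.7, 0.2, 0.5, 0.4]
car = [0.9, 0.1, 0.8, 0.2, 0.9]

print(euclidean_distance(cat, dog))  # small: nearby words
print(euclidean_distance(cat, car))  # large: unrelated words
```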

It is fascinating to me to see that the multidimensional matrix concept undergirds LLMs.
posted by grumpybear69 at 12:31 PM on September 27 [3 favorites]

Tim's explanation for Ars is a great start, though "jargon-free" is maybe slightly inaccurate. Thanks for this post; I am considered the AI expert at my job, but while I have a general grasp of things, I often struggle to explain the technical stuff properly (since I don't actually know or understand it), so it's great to have some rabbit holes to send people down if they really get interested in some aspect or another.
posted by BlackLeotardFront at 12:54 PM on September 27

Here's Baldur Bjarnason with a good explanation of how chat-based Large Language Models replicate the mechanisms of a psychic’s con.
posted by tovarisch at 3:09 PM on September 27 [1 favorite]

tovarisch, if credulity towards LLMs is based on audience self-selection and "wanting to believe" then how come it works on people who don't realize they are reading something generated by an AI? For example a teacher marking a student's ChatGPT-generated homework. In the absence of those self-delusion mechanisms that Bjarnason describes, the illusion should be broken, and yet they often fall for it.
posted by L.P. Hatecraft at 3:37 PM on September 27

Great post, kliuless, thank you.

Re: chat-based LLM-backed models, and applications of LLMs more generally...

It's helpful to always bear in mind that most of the end-user facing systems that we see, like ChatGPT or Bard, are themselves the result of applying a very large amount of reinforcement learning from human feedback (RLHF) on top of the "raw" base model. In the case of these persuasive chatbots, they have received millions of rounds of feedback to guide them towards persuasive and polite-but-confident outputs. The underlying models are capable of many other tasks besides that particular style of language-based turn-taking (which is one reason why Manning, Lei, Liang, et al at Stanford are preferring to call them "Foundation Models").
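Roughly, that feedback step works like this: humans rank pairs of model outputs, a reward model is trained to agree with those rankings, and the LLM is then tuned to score highly under it. A toy sketch of a Bradley-Terry-style preference probability, a standard formulation in RLHF papers (all the numbers and names here are illustrative, not any lab's actual pipeline):

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    # Bradley-Terry model: probability a human prefers the "chosen"
    # response, given scalar scores from a reward model.
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))

# Invented reward-model scores for two candidate replies.
polite_confident = 2.1
curt_dismissive = -0.4

p = preference_probability(polite_confident, curt_dismissive)
print(f"P(human prefers the polite reply) = {p:.3f}")
# Training adjusts the reward model so these probabilities match the human
# labels, and then the LLM is tuned to maximize the resulting reward.
```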
posted by graphweaver at 4:07 PM on September 27 [2 favorites]

I would love a similar explainer for image generators like Stable Diffusion.
posted by aspo at 5:45 PM on September 27

It's been a while since I watched them, but the computerphile videos on diffusion image generation I recall being pretty comprehensible.
posted by justkevin at 7:05 PM on September 27 [1 favorite]

For example a teacher marking a student's ChatGPT-generated homework. In the absence of those self-delusion mechanisms that Bjarnason describes, the illusion should be broken, and yet they often fall for it.

I don't think that the ONLY way credulity toward LLMs manifests is via audience self-selection etc., but I do think that's one path toward garnering buy-in. In the case of students using it to help them write essays and the like, it's just that ChatGPT is sometimes a quite talented bullshit artist that, unlike a lazy student, has read not just the Cliff's Notes, but the actual play/book/whatever and also a shit-ton of responses to it. Teachers sometimes get duped by the type of student who can spin a glance at the notes into a convincing spiel of BS, so why should they suddenly be immune to the same trick performed by a better bullshitter which did actually read the assignment?
posted by axiom at 9:19 PM on September 27 [2 favorites]

It has always been the case that computers did not understand what they were doing: it was axiomatic that they just did symbol manipulation. With some of these vast LLMs it becomes possible for the first time to wonder plausibly whether something that deserves to be called understanding is happening.

The problem is, we’re asking whether the machine is doing a thing the human mind does. But we don’t know in sufficiently clear terms what the thing is, we don’t know how the human mind does it, and we don’t know how LLMs might be doing it either, nor whether that is the same way, nor what criteria would determine whether it is the same way.

I can say with some certainty though, that my brain is definitely not generating understanding in this case.
posted by Phanx at 3:29 AM on September 28

"Write me an essay on Shakespeare" = "go and trawl all the human-generated content you stole and put together a string of alphanumeric characters that mathematically looks like an answer while at no point displaying any comprehension of anything due to the fact that you're not actually intelligent despite the hype."
posted by GallonOfAlan at 4:28 AM on September 28 [3 favorites]

I too use an n-dimensional matrix to process words and language.

It just so happens than n=1 (and occasionally 2)
posted by blue_beetle at 5:52 AM on September 28 [1 favorite]

There is ongoing research that is maybe shedding some light on how humans do this, and the parallels with LLM internal operations are suggestive.

(In both cases it's a very, very long way from making a best guess at what looks like an answer based on prior exposure to content - as the headline link in this post explains pretty clearly.)

This research on cortical columns is being led by people (notably Jeff Hawkins of PalmPilot fame) who have moved to neuroscience from tech... so as ever, caveat sceptor... and has been ongoing for the past couple of decades with influence going in both ways between the fields.

Hawkins's book A Thousand Brains - review here by Bill Gates - is a very readable inroad to the neuroscience / AI subject, with plenty of general background explanation as well as his pet theory itself.

A brief but more technical dive into the theory and its potential implications is here on Towards Data Science.
posted by protorp at 8:23 AM on September 28

Melanie Mitchell's Artificial Intelligence: A Guide for Thinking Humans (2019) is extremely good.
posted by neuron at 8:43 AM on September 28

Got a name for my next metal band.
posted by ocschwar at 4:30 PM on September 28

Now I just need to feed each of these links into the LLM content-summarization pipeline I'm building so it can tell me what it all means!

Seriously though this post is a gold-mine and I will be sharing it extensively.
posted by daHIFI at 8:05 PM on September 28

Baldur Bjarnason has an eBook "The Intelligence Illusion: a practical guide to the business risks of Generative AI" which is linked from the article linked above by tovarisch. I'm interested in reading more in-depth critiques of LLMs so decided to buy it and read it over the last day or so. Unfortunately I'm a little disappointed, as he gets some very basic facts wrong. For example in his introductory glossary he defines a neural network as:
Neural Network: A data structure based on early to mid-twentieth century ideas on how the brain might work. You feed training data into an algorithm that builds a mathematical model of a network of "neurons", parameter by parameter, where each parameter is supposed to be functionally equivalent to a neuron in an animal brain. In a biological neural network each neuron is one of the most complex living cells in the body, filled with multiple interconnected complex chemical systems, and communicating with its neighbours using a variety of neurotransmitters. Conversely, the "neuron" of an artificial neural network is a single number.
I'm not an expert in AI, but this isn't correct. The parameters in a neural network/"model" are the weights of the connections between the "neurons", not the neurons themselves. He doubles down on this later on when he compares the number of neurons in a bumblebee brain with the parameters in GPT-4. A bee brain only has a million neurons, but a billion synapses, so his comparisons are all off by three orders of magnitude. Now to be fair, bumblebee brains are still really impressive and they still do a lot with far fewer connections than GPT-4 while using a microscopic fraction of the energy. But that's just sloppy.
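The distinction is easy to quantify: in a fully-connected network, parameters count the connections between layers (plus biases), not the units. A quick sketch with an invented small network:

```python
def count_units_and_parameters(layer_sizes):
    # "Neurons" are the units in each layer. Parameters are the weights on
    # connections between adjacent layers, plus one bias per non-input unit.
    units = sum(layer_sizes)
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return units, weights + biases

# A made-up network: 100 inputs, 1000 hidden units, 10 outputs.
units, params = count_units_and_parameters([100, 1000, 10])
print(units, "units vs", params, "parameters")
```

Even in this tiny example the parameter count dwarfs the unit count, which is why comparing GPT-4's parameters to a brain's neurons (rather than its synapses) is apples to oranges.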

I get the impression he doesn't really care about those kinds of details anyway, because his argument boils down to "computers can't think because they're just machines". But if the functionalist theory of mind is correct, a thinking computer isn't out of the question. As to whether an LLM (now or in the future) could be an example of "multiple realization", that depends on the nitty-gritty functional details, so it would be awesome if the conversation around LLMs graduated from "it's just a fancy autocomplete" to include more of those details, which makes this a great post.
posted by L.P. Hatecraft at 2:48 AM on September 29

> Too bad that most of the things these technologies & techniques will be used for will prove to have awful purposes.

like databases? (or computers, engines, atomic energy, fire, flight...)
posted by kliuless at 7:42 AM on September 29 [1 favorite]


This thread has been archived and is closed to new comments