We’ve added a psychic hotline button to your web browser!
July 5, 2023 6:01 PM

The LLMentalist Effect: How chat-based Large Language Models replicate the mechanisms of a psychic’s con.
posted by ursus_comiter (34 comments total) 25 users marked this as a favorite
 
👁️
posted by clavdivs at 6:15 PM on July 5, 2023 [2 favorites]



Are you sure you don't mean
      /\
     /👁️\
    /  | \
   / |   |\
  /|   |   \
 /  |   |  |\?
posted by aubilenon at 6:35 PM on July 5, 2023 [7 favorites]


"Zorn!"

-Mesmer of Prince-Bishopric of Constance,
posted by clavdivs at 6:49 PM on July 5, 2023


Maybe so, but that still didn't stop me from using ChatGPT this afternoon to turn a quick-and-dirty 5-line shell script into a well-commented, well-formatted 40-line shell script that now has error checking.
posted by fings at 7:14 PM on July 5, 2023 [3 favorites]


This is actually a very astute article. Just today I had to evaluate a course on LLM prompting for work (and I am firmly in the skeptic camp). I was amazed by how deep the lying of this system goes. It's turtles all the way down, my friends.

Here's an example; maybe this is old news to people who follow the details. I knew about ChatGPT's so-called "hallucinations" (or bullshit, or lying, or whatever you want to call it), but I did NOT realize the degree to which this is baked in.

Exhibit A: This is an excerpt from the course:
2) Math Word Problem Solving

If you have a dataset of mathematical equations that you would like a language model to solve, you can create a prompt by posing the question "What is EQUATION".

For a given question, your full prompt could look like this:

What is 965*590?

For this prompt, GPT-3 (text-davinci-003) (an AI) sometimes answers 569,050 (incorrect). This is where prompt engineering comes in.

Prompt Engineering

If, instead of asking What is 965*590?, we ask What is 965*590? Make sure your answer is exactly correct:, GPT-3 will answer 569350 (correct). Why is this the case? Why is telling the AI twice to give a correct answer helpful? How can we create prompts that yield optimal results on our task? This last question, in particular, is the focus of the field of Prompt Engineering, as well as this course.
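
(For the curious, here's a rough sketch of what reproducing that comparison might look like in code, assuming the pre-1.0 openai Python client and the legacy Completions endpoint that text-davinci-003 lived on; the prompts and numbers are from the excerpt above, the rest is illustrative and results vary run to run.)

import openai  # pre-1.0 client, e.g. openai==0.27.x

openai.api_key = "sk-..."  # placeholder key

def ask(prompt: str) -> str:
    # Legacy Completions endpoint; temperature=0 makes runs more repeatable,
    # though the course's claim was about default-settings behavior.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=20,
        temperature=0,
    )
    return resp.choices[0].text.strip()

plain = ask("What is 965*590?")
nudged = ask("What is 965*590? Make sure your answer is exactly correct:")
print("plain prompt: ", plain)      # sometimes wrong, per the course excerpt
print("nudged prompt:", nudged)     # reportedly 569350
print("ground truth: ", 965 * 590)  # 569350
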
So to be clear, this LLM system will occasionally give you the wrong answer to... a math problem!

If that's not a con, I don't know what is.
posted by jeremias at 7:25 PM on July 5, 2023 [17 favorites]


I find it ironic that the biggest impediment between us and the computer apocalypse is a series of math problems.

On the other hand, if you ask humans to multiply numbers for a CAPTCHA, I expect you'd get similarly wrong answers.
posted by pwnguin at 7:32 PM on July 5, 2023


A favorite query of mine to show the relationship of LLMs and truth is to ask (rot13) Ubj znal zvyrf frcnengr V-70 naq V-80 ng gurve pybfrfg nccebnpu?. Across different sessions it'll give different fabricated answers that anyone who can consult an atlas of the US would understand are total nonsense, AND there's no consistency from answer to answer.
posted by the antecedent of that pronoun at 7:35 PM on July 5, 2023 [1 favorite]


(I obfuscate the prompt to hopefully postpone someone training a future LLM with a discussion of exactly that question)
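
(If anyone wants to try the rot13 round-trip themselves, Python's standard library handles it directly; the prompt string below is just a placeholder, not the query above, which I'll leave obfuscated.)

import codecs

prompt = "your question goes here"           # placeholder, not the commenter's query
obfuscated = codecs.encode(prompt, "rot13")  # rot13 is its own inverse
print(obfuscated)
print(codecs.decode(obfuscated, "rot13"))    # round-trips back to the original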
posted by the antecedent of that pronoun at 7:35 PM on July 5, 2023 [4 favorites]


as someone who has spent far too much time exploring what chatgpt does when presented with unexpected input, I strongly suggest asking that exact same question a few times in rot13 to see what comes out. I’ve been able to get into longish conversations where both i and the bot were writing in rot13, and some of the output is interestingly dreamlike when decrypted. or at least is interestingly dreamlike until the spelling/grammar issues stack up so much that the text becomes unintelligible.
posted by bombastic lowercase pronouncements at 7:45 PM on July 5, 2023


I actually just submitted the rot13 text and it answered in Albanian about some Volvos, at least one of which was imaginary.
posted by the antecedent of that pronoun at 7:50 PM on July 5, 2023 [5 favorites]


If, instead of asking What is 965*590?, we ask What is 965*590? Make sure your answer is exactly correct:, GPT-3 will answer 569350 (correct).

ChatGPT has to tell you if you ask if it is a cop.
posted by Literaryhero at 8:12 PM on July 5, 2023 [19 favorites]


this LLM system will occasionally give you the wrong answer to... a math problem!

Well, the second “L” is for “language.”

Weirdly, my calculator is very bad at writing essays.

It seems like much of the confusion and hostility toward LLMs is due to misunderstandings about what kind of tool they are, and how they do what they do.
posted by LEGO Damashii at 8:32 PM on July 5, 2023 [8 favorites]


I absolutely love working with LLMs and I wholeheartedly agree with this essay.

I use them to augment my own thinking. To give myself another perspective to push against. To force myself to think through processes. I don’t see them as a replacement for meaningful inquiry or human thought but as an amplifier of what’s going on inside my own head and a tool to help me get out of it.

Also they are so freaking good at organizing and summarizing that even if you used them for that alone it would be well worth the effort.

I love the thinking in the essay but deeply disagree with the conclusion
posted by missjenny at 8:40 PM on July 5, 2023 [3 favorites]


So to be clear, this LLM system will occasionally give you the wrong answer to... a math problem!

This should be exactly as obvious as my plumber being bad at fixing computers or the local Mexican Cafe having no idea how to make good sushi, but somehow people have mistaken LLMs for something else.
posted by mmoncur at 9:37 PM on July 5, 2023 [1 favorite]


What is 965*590? Make sure your answer is exactly correct:,

Is there any actual technical reason that this would make an LLM more likely to give a correct answer? It has no concept of what "correct" means. I feel like this would only help if the LLM has scraped a web page that discusses correct and incorrect answers to that particular math problem.

(If it did have a concept of "correct" then the LLM designers could have it auto-append "Make sure your answer is correct" to all queries and save themselves a lot of trouble)
posted by mmoncur at 9:45 PM on July 5, 2023 [2 favorites]


Is there any actual technical reason that this would make an LLM more likely to give a correct answer? It has no concept of what "correct" means. I feel like this would only help if the LLM has scraped a web page that discusses correct and incorrect answers to that particular math problem.

There’s a ton of magical thinking around this stuff due to the nature of the tool and the way people interact with it, and then once in a while some of the voodoo seems measurably to work, but even when it does nobody actually knows how or why it works.
posted by atoxyl at 10:09 PM on July 5, 2023 [6 favorites]


I can't log in right now, but here's a potato pic of an ASCII smiley face from ChatGPT. If that doesn't say "future robot overlords", I'm not sure what does.

More generally, I've found ChatGPT useful for statistically common questions: What is the standard way to represent a thought in this unfamiliar-but-common programming language that I'm learning. Then the result can be plugged into a compiler to see how it actually works.

As questions deviate from common to rare subjects ("Explain band gaps using quantum mechanics"), it starts to fail rapidly, but doesn't always realize that it's failing.
posted by SunSnork at 10:10 PM on July 5, 2023 [3 favorites]


Re fings' success generating a shell script with ChatGPT (above), I too have found that it is much better at generating code than at answering questions about the outside world --- where it usually responds with made-up lies and garbage.

Perhaps this is the explanation: Code is more constrained and self-contained than most topics you might ask about. Correctness usually depends more on internal program structure than knowledge of the outside world. Once you make a few key decisions, there aren't many degrees of freedom. So most new programs resemble existing programs -- that might already be in the training data. (I say this as a longtime programmer.)
posted by JonJacky at 10:13 PM on July 5, 2023


Except I recall there were several scientists quoted in the media saying or suggesting LLMs could be viewed as a kind of intelligence. One of my old EECS professors wrote a blog post, and he seemed, surprisingly to me, much more on the fence about this debate, and less dismissive than e.g. Bender or Chomsky about the scientific significance of LLMs.

Similarly, though I don't know for sure, I think someone like Hinton or LeCun has said something about the intelligence of deep neural networks.

I also recall a different scientist arguing that by extracting world models from the Internet corpus, LLMs are mirroring our collective intelligence, or something to that effect.

So in the article, the fact that the tech indu$try, or laypeople, or the media tend to overattribute and anthropomorphize LLMs is a separate argument from having a nuanced scientific understanding of machine intelligence. There can be different levels of intelligence, and if anything the mistake claimed in the article cuts both ways: perhaps intelligence is not a realm restricted to humans, great apes, and cetaceans. Another thing is that in engineering it is not unusual to hear terms like "intelligent systems", and it does not invite such semantic controversy.

Maybe LLMs could even be called stochastic intelligences. Setting aside lazy anthropomorphism, which is a legitimate problem in the media, maybe not LLMs but even more mathematically powerful machines discovered in the future will require that humanity revise its human-centric notion of intelligence.
posted by polymodus at 10:26 PM on July 5, 2023 [2 favorites]


humanity revises its human-centric notion of intelligence.

An endless number of actual researchers have done this already, and have for decades. The problem is that the narrative around LLMs is being set by ignorant techbros who love taking their ideas about intelligence from obsolete and usually racist sources. Here's a recent post by Emily Bender, although on a different theme.

On the one hand, it comes out of a vision of ‘artificial intelligence’ that takes ‘intelligence’ as a singular dimension along which humans can be ranked — alongside computers, ranked on the same dimension. This vision of ‘intelligence’ is rooted in the notably racist notions of IQ (and its associated race science).
posted by Pyrogenesis at 11:09 PM on July 5, 2023 [6 favorites]


I feel like you're misquoting a sentence fragment, since I was talking about the specific context of recent advances in machine learning potentially revising accepted notions of intelligence, which is independent of the well-documented historical process of paradigm shifts in understanding intelligence/'intelligence' that happened in biology, anthropology, sociology, psychology, etc. Even in computer science itself, the famous example is chess algorithms compelling a re-philosophizing about intelligence. So the broader point I am making is that, setting aside the tech indu$try for a minute, there are actually contemporary computer scientists (academic professors, just as qualified as Bender, who has made her position about stochastic parrots well known) who have suggested more sanguine views on LLMs and their capacity for some new kind of intelligence. These other university scientists are less outspoken and have less social media presence. Whereas the OP article specifically claimed that no such scientists exist, which I think is not true, and that's what I was responding to.
posted by polymodus at 11:49 PM on July 5, 2023 [1 favorite]


Tasks that are good candidates for automation using AI (or just automation in general) seem to be repetitive tasks that don't require a lot of insight or analysis.

The analysis that often doesn't happen is someone asking "why are we doing this repetitive task in the first place?" If we quit doing this thing altogether... would anyone notice?

The original line was said in a different context, but it's forty years on, and we still don't have a system like the WOPR in "War Games" telling us "the only way to win the game is not to play".
posted by gimonca at 5:05 AM on July 6, 2023 [1 favorite]


@ the antecedent of that pronoun:

I think you probably hit on one type of query where having the statistics about a huge corpus of text doesn't let a system simulate knowing things. Answering a question like that about a map, for a system that can't actually see in two dimensions the way an animal can, appears to require exhaustive brute-force computation approaches.
posted by Aardvark Cheeselog at 5:29 AM on July 6, 2023


@gimonca

That the best use case anyone has come up with at my work is using LLMs to fill out compliance documentation we are forced to create has had me banging my head against that exact problem. If it's something worthless enough that you'd let the chatbot do it for you then it's not worth doing it at all.
posted by Nec_variat_lux_fracta_colorem at 6:06 AM on July 6, 2023 [7 favorites]


> So to be clear, this LLM system will occasionally give you the wrong answer to... a math problem!

Here are some notes I made regarding how I think GPT does addition. I'm not positive I'm 100% correct, but my theory did make a prediction about a kind of addition problem that GPT might struggle with, which turned out to be correct. It's possible a similar issue might explain its problems with other arithmetic operations like multiplication.


First, that GPT could do addition isn't a given. There was no a priori reason to assume that a language model would pick up the patterns necessary. The smaller models couldn't do it at all:

GPT-3 (Ada, smallest GPT-3 model):

301 + 87 =
Rúmbaro + 87 = Baltah.


Literally not even numbers.

GPT-3 (Babbage):

301 + 87 =
87


A number, but still not great.

GPT-3 (Curie):

301 + 87 =
309

Maybe better?

GPT-3 (Davinci, biggest GPT-3 model):

301 + 87 =
388


Yay, a correct answer. But how did it get it? Some possibilities:
  • It's memorized common addition results.
  • It has learned to do addition kind of like how I do it, "add pairs with carry".
  • It has learned to do it like a calculator, by converting to base two and doing binary addition with carry.
  • It has access to an actual calculator.
  • It has learned some pattern to addition that does addition wildly differently from how I do it.
Can we test any of these theories? Yes.

Using large number addition we can eliminate the possibility of memorization:

719,027,416 + 57,562,342 =
776,589,758


That answer is correct and it is unlikely GPT ever encountered that particular pair of numbers. Even if it had, the model is not large enough to memorize all random 8+ digit pairs. Moving to GPT-3.5, I checked it with several large digit addition problems and found that it was pretty reliable up to around 12 or so digits.

Can we narrow it any further?

I wondered if GPT does addition kind of like I do. When I do addition in my head, I add the ones column, carry if necessary, add the tens column, carry if necessary, etc., and try to keep track of the digits.

GPT is probably not doing the exact same thing, for two reasons: One, it must generate the leftmost digit first, because that's how the token prediction works. Two, it doesn't get individual digits as inputs; it gets tokens. From experiment, GPT tends to break numbers into a variety of tokens from 1-3 digits in length. E.g., it might turn 2,100 into [2][,][100] or [2][,][1][0][0] or [2][,1][00], depending on the preceding characters.
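
A rough way to see those splits for yourself, assuming OpenAI's tiktoken library (the exact chunking depends on which model's encoding you load and on the surrounding text, and the GPT-3 and GPT-3.5 families use different encodings, so your splits may differ from the examples above):

import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

for text in ["2,100", " 2,100", "503,101,480,203"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    # shows how a single number gets chopped into 1-3 digit chunks
    print(repr(text), "->", pieces)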

It might still be doing something "kind of" like how I do it: starting from the left, pairing tokens together and outputting their (possibly "memorized") sum while looking at the next token pairs to see if there will be a carry. Considered this way, addition is actually quite easy.

GPT is a transformer model which uses attention. Specifically, it has learned to predict what other tokens are relevant to predict the next token, based on the context. Adding two numbers from left to right usually only requires the marginal extra attention of the two tokens being added and their immediate neighbors. E.g., if I'm adding 1,342,933,023 and 2,003,453,987, I can quickly tell that the first digit must be a 3 or a 4, and a glance to the right tells me it's a 3.
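
For what it's worth, that left-to-right strategy can be written out explicitly. Here is a small Python sketch of the "peek right until the column sums stop being 9" rule for deciding each carry; this is just my reading of the hypothesis above, not a claim about what the model actually computes internally:

def add_left_to_right(a: str, b: str) -> str:
    # Add two decimal strings most-significant-digit first, deciding each
    # carry by peeking rightward, as the hypothesis above describes.
    n = max(len(a), len(b))
    a, b = a.zfill(n), b.zfill(n)
    sums = [int(x) + int(y) for x, y in zip(a, b)]  # per-column digit sums

    def carry_into(i: int) -> int:
        # Peek right: columns summing to exactly 9 just pass a carry along;
        # the first non-9 column decides whether one arrives at column i.
        for j in range(i + 1, n):
            if sums[j] != 9:
                return 1 if sums[j] >= 10 else 0
        return 0

    digits = [(sums[i] + carry_into(i)) % 10 for i in range(n)]
    lead = "1" if sums[0] + carry_into(0) >= 10 else ""
    return lead + "".join(str(d) for d in digits)

# The "attention heavy" example discussed just below, plus the 9-digit sum above:
assert add_left_to_right("503101480203", "496897519796") == "999998999999"
assert add_left_to_right("719027416", "57562342") == "776589758"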

But what if I gave it an addition problem that required paying attention to more tokens?
503,101,480,203 + 496,897,519,796

GPT 3.5's answer:
The sum of 503,101,480,203 and 496,897,519,796 is 1,000,998,999,999

That's wrong. The correct answer is 999,998,999,999. It's also wrong in a way that implies decimal addition, i.e., GPT isn't hooked up to a calculator and isn't converting numbers into binary.

Compare with numbers of similar scale requiring less attention:

503,101,410,203 + 296,497,519,296
The sum of 503,101,410,203 and 296,497,519,296 is 799,598,929,499.

922,821,400,203 + 496,497,119,296 =
The sum of 922,821,400,203 and 496,497,119,296 is 1,419,318,519,499.

Both correct.

GPT-4 can add much larger numbers than GPT-3.5, up to at least 19 digits, but it also sometimes gets tripped up with the same types of "token attention heavy" additions that GPT-3.5 does.

It is entirely possible this explanation is not completely accurate: experiments with GPT are notoriously hard to repeat due to the probability element, as well as small changes to the exact text changing behavior. There might be other explanations that predict the same behavior.

Additional comments/observations:
  • Token generation is probabilistic, meaning an answer that is wrong one time might be right another and vice versa.
  • Larger numbers may require more attention even if there's no carry because it still needs to pay attention to "lining up" the tokens.
  • Using the GPT-3 API, it's possible to actually see the probabilities of token prediction. With the wrong answer above, it thought [1] was the most likely first token at 69%, followed by [999] at 30%. (A sketch of how to pull those probabilities out follows this list.)
  • Experiments with the GPT-3 API suggest that its probability of predicting "carry" is not precise. E.g., it might predict carry if the next tokens added up to 900 but there were tokens further right just at the overflow point.
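
(For anyone who wants to poke at those probabilities themselves, here's a minimal sketch against the legacy Completions endpoint, again assuming the pre-1.0 openai Python client; the exact percentages above won't reproduce precisely from run to run.)

import math
import openai  # pre-1.0 client, e.g. openai==0.27.x

openai.api_key = "sk-..."  # placeholder key

resp = openai.Completion.create(
    model="text-davinci-003",
    prompt="503,101,480,203 + 496,897,519,796 =",
    max_tokens=1,
    temperature=0,
    logprobs=5,  # return the top 5 candidate tokens at each position
)

# top_logprobs holds one dict per generated token, mapping token -> logprob
top = resp.choices[0].logprobs.top_logprobs[0]
for token, lp in sorted(top.items(), key=lambda kv: -kv[1]):
    print(repr(token), f"{math.exp(lp):.0%}")  # e.g. a '1' vs '999' style split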

posted by justkevin at 6:10 AM on July 6, 2023 [7 favorites]


If it's something worthless enough that you'd let the chatbot do it for you then it's not worth doing it at all.

QFFT. This is going to be my new mantra.
posted by saturday_morning at 6:13 AM on July 6, 2023 [4 favorites]


@missjenny
I don't really disagree with you about being able to use LLMs to give yourself a new perspective or force you to think through your thoughts. But do we really need LLMs, and their attendant energy use, and potential contribution to social problems, if we're going to use them for this? Tarot cards, oblique strategy cards, and rubber ducks seem to work about as well, at least in my experience. Junior developers, in a pinch.

Regarding the article itself: MDN recently added a psychic hotline button to its website. Apparently it is a very nice button.
posted by Carcosa at 6:20 AM on July 6, 2023 [4 favorites]


Strong disagree on "If it's something worthless enough that you'd let the chatbot do it for you then it's not worth doing it at all". Could I have expanded the shell script myself? Sure. But ChatGPT saved me a bunch of time. To me, this statement is similar to "If you'd use a food processor in a recipe instead of a knife, it's not worth cooking."
posted by fings at 6:34 AM on July 6, 2023 [3 favorites]


Narrator: It was not a nice button.
posted by ursus_comiter at 8:52 AM on July 6, 2023 [1 favorite]


But what if I gave it an addition problem that required paying attention to more tokens?

The fact that a token can be a group of digits likely contributes to its relatively poor handling of this kind of thing, right? Even if it's capable of learning the basic algorithm, it's more complicated than the one-digit-at-a-time version (although I guess it also saves on the number of tokens it needs to manage).
posted by atoxyl at 10:57 AM on July 6, 2023


(actually you kinda addressed that already sorry)
posted by atoxyl at 11:00 AM on July 6, 2023


Looks like Mozilla's digging in on adding LLM product-promotion to MDN. I'm disappointed, given how vital MDN is as a *reference* in particular. But this does seem to be a common/recurring failure mode of Mozilla. They partner up with a third-party (paid or unpaid), they shove that third-party into places it's not welcome, and when people speak up they double down going "You don't get it, it'll be great once you understand what we're trying to do!"

That MDN maintainers were also blindsided by this feature's addition is especially worrying.
posted by CrystalDave at 12:47 PM on July 6, 2023 [3 favorites]



If it's something worthless enough that you'd let the chatbot do it for you then it's not worth doing it at all.


Well, yes, but sometimes I do have to complete tasks that I really don't feel are worth my time but nonetheless must be done for some reason. Or maybe the task has some value, but not compared to other tasks I could be doing instead. So, I might let the AI do it.

I've played with ChatGPT and spend too much time on LinkedIn. On LinkedIn, I'll see people promoting it for all kinds of tasks like writing emails or blog posts. For most of those situations, it's actually either faster for me to do it myself than to prompt it (e.g., writing an email), or else the result is so anodyne or generic that it's not worth the screen real estate it's displayed on (blog posts).

I'd suggest that if your blog post is something that can be generated by ChatGPT, it's probably not adding much new to the conversation and you might want to ask yourself if it is worth hitting publish on. But then, a lot of blog posts aren't written for human consumption but rather to appease search algorithms. Not something I'm interested in doing anyway. SEO stuff is another example there. Content by computers for computers.

But I can foresee other situations where there will be tasks that I'll gladly outsource to the AI. I'm also happy to let ChatGPT come up with a list of things, such as a list of books or articles on a topic I have an interest in. I absolutely could do that myself and it would be useful, but it's very fast at that, even if I do have to weed out a few hallucinated titles. I don't mind using it to brainstorm with, where I just need to generate a lot of items quickly that I'd rather sort through than come up with myself.
posted by synecdoche at 1:45 PM on July 6, 2023


I'm also happy to let ChatGPT come up with a list of things, such as a list of books or articles on a topic I have an interest in.

Be careful. I tried to do exactly this and ChatGPT produced garbage.

I asked ChatGPT: "Produce a bibliography of the ten most important references" in a field I am very familiar with. ChatGPT produced ten purported references.

Two of them were references to actual works, although one was not about the topic I asked for. I could see that three were completely fictional, nonexistent works falsely attributed to well-known authors in the field. The other five looked fishy but I hadn't heard of them and didn't check to see if they existed.

There were certainly many better-known references that are widely cited on the Internet that should have, or might have, appeared in such a list instead of the bogus or fishy ones, but ChatGPT did not include them.
posted by JonJacky at 4:49 PM on July 6, 2023 [4 favorites]



