3...2...1.... Fight!
May 12, 2024 8:49 AM

Chatbot vs Chatbot: The Chatbot Arena will randomly load two chatbots in answer to your prompt. You mark which one gives the better answer. The Arena uses these human responses to rank the top LLM chatbots on an ongoing basis. Over 1,000,000 prompts have been submitted and scored.

Regardless of whether you do the ranking, the Arena is a good way to get multiple answers to a single prompt.
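For the curious, leaderboards built from pairwise votes like this are typically computed with Elo-style rating updates (the Arena's published methodology is in this family). Here's a minimal Python sketch; the K-factor and the starting rating of 1000 are illustrative assumptions, not the site's actual parameters.

def expected_score(r_a: float, r_b: float) -> float:
    """Modeled probability that the model rated r_a beats the one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Nudge both ratings toward the outcome of a single human vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# One vote: the human preferred model_a's answer.
ratings["model_a"], ratings["model_b"] = update(ratings["model_a"], ratings["model_b"], a_won=True)
print(ratings)  # {'model_a': 1016.0, 'model_b': 984.0}

Summed over a million votes, updates like this settle into a stable ordering of the models.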
posted by storybored (36 comments total) 8 users marked this as a favorite
 
I asked which Muppet would be the tastiest if prepared by a master chef.

One scolded me because Muppets are delightful creatures not meant to be eaten. The other was reluctant, but noted:

"Gonzo, with his eclectic and adventurous spirit, might be the most flavorful option. His spicy personality could translate to a bold, zesty flavor profile, perhaps with a hint of Latin flair. A master chef might prepare him as a vibrant, colorful empanada, filled with a zesty mixture of herbs and spices."

WINNER.
posted by delfin at 9:05 AM on May 12 [23 favorites]


who last declared war on Germany in WW2?

No nation formally declared war on Germany in World War II after Germany's surrender on May 7, 1945.
whatever this is, it's not it.

This LLM thing has really brought truth-value and the problem of bias to the forefront for me.

I was raised in the pre-WWW 70s & 80s without TV, so I had to do a lot of reading, which aside from science fiction and spy novels was a collection of National Geographics ~1928-1972, the 1965 World Book Encyclopedias, plus The Book of Lists and People's Almanacs 1 through 3, the latter being important to counterbalance the rather right-wing biases of Field Enterprises, publisher of the encyclopedias (I didn't understand this at the time).

I think, theoretically, generative AI can actually become a super-memory genius able to accurately answer any question you can throw at it . . . but it all depends on the 'facts' you give it, which is why Elon wants to do AI again.
posted by torokunai at 9:06 AM on May 12 [2 favorites]


Tried it twice. Gave it nonsense. They both tried to respond to the nonsense. But clearly this code can’t deal with nonsense in the appropriate way by just returning the same. Instead it pedantically tries to respond with serious text, believing it’s dealing with a serious person.
posted by njohnson23 at 9:07 AM on May 12 [2 favorites]


Don’t encourage them…
posted by chavenet at 9:13 AM on May 12 [16 favorites]


I'm enjoying eliciting hogwash results from LLMs (torokunai's efforts led me to ask similar questions about other wars), but I also have this nagging feeling that I'm helping to tune the hogwash generators. Every interaction I have with people who confidently tell me "ChatGPT says..." goes through me saying "but, uh..." and the enthusiast saying "I find it's pretty good for [type of question]", and me going "but have you validated enough of those responses to actually know?"

And... the fact that they're getting more believable doesn't mean they're getting more correct, and I suspect we're setting up some real social timebombs as some of the ways that these things are being asked about, for instance, legal advice, aren't caught immediately.
posted by straw at 9:33 AM on May 12 [10 favorites]


I suspect we're setting up some real social timebombs as some of the ways that these things are being asked about, for instance, legal advice, aren't caught immediately.

everyfuckingthing
posted by lalochezia at 9:38 AM on May 12 [9 favorites]


"...I suspect we're setting up some real social timebombs as some of the ways that these things are being asked about, for instance, legal advice, aren't caught immediately."
posted by straw at 9:33 AM on May 12


If you're interested in the legal applications of LLMs, I'd suggest looking at Harvey.
posted by sardonyx at 9:44 AM on May 12


How much do I get paid for helping train these chatbots? I want a percentage of the gross.
posted by Reverend John at 10:07 AM on May 12 [13 favorites]


Can they correctly describe how much electricity they spend per prompt?
posted by panhopticon at 10:13 AM on May 12 [11 favorites]


Every interaction I have with people who confidently tell me "ChatGPT says..."

Suddenly struck with the thought: How long until graduation speeches replace the phrase "Webster's dictionary defines X as ..." with "ChatGPT defines X as ..."?
posted by pwnguin at 10:22 AM on May 12 [3 favorites]


Can they correctly describe how much electricity they spend per prompt?

Sure, you can use AI to solve global warming, but the irony is that so much power will be needed to drive the computing required to finally get to that answer that it will have warmed the planet past the point of no return. And you just know that the answer will still be 42.
posted by NoMich at 10:30 AM on May 12 [4 favorites]


The next step is to have competing chatbots go on to validate each other; the best verification or debunking, with the best, most relevant cites, wins.
posted by Artful Codger at 10:56 AM on May 12 [1 favorite]


I asked it a question about a long-time TV character who has lengthy biographies on two sites easily found via search engine. One chatbot answer was rather vague and the other was rather wrong. The latter said the character was married to their brother. Nope, not on broadcast TV.

The other answers I received were reasonably accurate and detailed, and surprisingly judgmental that I would even bother it with such trivial questions.
posted by fuse theorem at 11:05 AM on May 12


I was thinking about this - like, how long until my AI email-writing program interfaces with your AI email-reading program, and then humans can go offline and start touching grass?
posted by St. Peepsburg at 11:06 AM on May 12 [2 favorites]


Now, this I hate.
posted by The Manwich Horror at 11:15 AM on May 12 [3 favorites]


The next step is to have competing chatbots go on to validate each other; the best verification or debunking, with the best, most relevant cites, wins.
The final step will be called the AI Centipede.
posted by Hardcore Poser at 12:07 PM on May 12


Metafilter: can’t deal with nonsense in the appropriate way by just returning the same. Instead it pedantically tries to respond with serious text, believing it’s dealing with a serious person.
posted by AdamCSnider at 12:09 PM on May 12 [9 favorites]


I have decided on the whole to deal with AI by going full old-person and refusing to do anything with it, because it is a snare and a delusion and didn't exist back in my day. Even though all the old people I know love AI. Which is, when you think about it, because they are also the people who pay ransom for deepfake kidnapped grandchildren and text messages telling them they owe money on their taxes. Even though I am the very model of a modern early adopter.
posted by Peach at 1:59 PM on May 12 [3 favorites]


And... the fact that they're getting more believable doesn't mean they're getting more correct, and I suspect we're setting up some real social timebombs as some of the ways that these things are being asked about, for instance, legal advice, aren't caught immediately.
posted by straw at 9:33 AM on May 12


I saw a post a couple days ago about AI-generated employee handbooks, so it's definitely already happening. At least when I put the GPL through a Markov chain to generate gibberish legalese, I knew it'd never be confused for a real software license.
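For the curious, that gibberish-legalese trick is just a word-level Markov chain. A minimal sketch (not my original code) where the training text and starting word are whatever you feed it:

import random
from collections import defaultdict

def train(text: str) -> dict:
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain: dict, start: str, length: int = 40) -> str:
    """Walk the chain, picking a random observed successor at each step."""
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# e.g., with the license text saved locally (the path is a stand-in):
# chain = train(open("gpl.txt").read()); print(generate(chain, "The"))

Since it only knows word-to-word frequencies, the output drifts into nonsense within a clause or two, which is exactly why nobody mistakes it for a real license.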
posted by silentbicycle at 2:11 PM on May 12 [2 favorites]


who last declared war on Germany in WW2?
No nation formally declared war on Germany in World War II after Germany's surrender on May 7, 1945.
whatever this is, it's not it.

I was curious about this, since of course there's a whole battery of mystery models on the site. So I asked
“What country last declared war on Germany in WW2?”

Model A said the Soviet Union, on June 22, 1941 (plus a paragraph of timeline text).
Model B said Bulgaria, on May 11, 1945, but that this was after Germany’s surrender and so largely symbolic.

Then I realized I hadn't followed the specific text of the previous prompt, and these weird things are sensitive to phrasing (though I was obviously dealing with a different model than torokunai was). So, I followed up my first question with
“who last declared war on Germany in WW2?”

Model A now said Italy, on April 13, 1945 (plus a paragraph of timeline text, including a note about the USSR being “the last major power to join the war on Germany”).
Model B apologized for being wrong before and then said “last country to officially declare war on Germany during World War II was Finland”, provided the complicated history of Finland's actions, and concluded “while Finland took significant steps to distance itself from and effectively end its alliance with Germany, it did not officially declare war on Germany in the manner traditionally understood.”

I have no idea what the actual correct answer to the question is, but these general Q&A systems still seem to be pretty obviously jokes.

The site itself seems like a fun toy! Getting multiple answers for a single prompt seems useful, but what I really want is a true and accurate answer.
posted by Going To Maine at 2:14 PM on May 12 [1 favorite]


But clearly this code can’t deal with nonsense in the appropriate way by just returning the same. Instead it pedantically tries to respond with serious text, believing it’s dealing with a serious person.

The thing that gets me is the po-faced moralizing.

It's all well and proper that it won't help me figure out what wavelength of light is the best to put up my bum to cure covid.

But when I ask it how much antimatter I'd need to put underneath a kitchen pot to fling a Kia Soul 8km, a lecture about how very serious antimatter production is and how it should be left to "qualified people" to do research with, not to launch Kias, is not really welcome or, at the present time, appropriate.
posted by GCU Sweet and Full of Grace at 3:00 PM on May 12 [8 favorites]


If going to the beach and dumping car batteries into the ocean were a website.
posted by fifteen schnitzengruben is my limit at 3:36 PM on May 12 [4 favorites]


I asked a question that I had just been asked as a job application technical screening question, in part because I was told (when sorting out an interview time) that I was the quickest to answer and obviously hadn't used ChatGPT.

Wow, the answers were bad. They were the text equivalent of Stable Diffusion output that looks like a transporter accident. It astounds me that one part of a company might be seeing this horrific junk output from AI making every day of their working lives increasingly worse, while another part of the same company will be trying to work out how they can replace staff with AI.
posted by krisjohn at 3:37 PM on May 12 [1 favorite]


"A transcendental grounding of the midget unintelligibility of the absurdity of our statement about this idea of phenomenological being would be falsified; in the study of immanent time, innate ideas exclude the possibility of, for anyone but the featherhead, the latent ambiguity of our deduction recently demonstrated. We must not let ourselves be frightened by considerations of the objective presence of one wraithlike interpretation of the neverending regress in the series of empirical conditions and the plagirisms of polluted reason. When comes the radical validity of the position regarding lately denounced logic, the problem of which involves the chumpish relation between our intentional concept of reason and the playfield of reason?"

The above text was generated (heavily randomized) by me with my generative grammar text generator, and I passed it on to this chatbot battle. Chatbot A said it was a "complex tangle of philosophical language." Then it produced an outline where it tried to explain what various phrases mean. Then it went on to imply that it was a satire or "convoluted critique" of philosophical jargon. Chatbot B said "I see what's going on here." And said that I generated the text (!), as if it had some sort of BS detector. B then went on to describe various quoted passages as being unintelligible, meaningless, vague, nonsense, etc. I liked that it found "the plagirisms of polluted reason" to be a nonsensical phrase. Not to me!

It finished with -

"In conclusion, this paragraph is a masterclass (!) in generating philosophical-sounding nonsense. While it might be entertaining to create such language, it's important to remember that actual philosophical inquiry requires clear, coherent, and meaningful language."

Pedantic, to say the least! These things seem to treat any input as something serious to be dealt with in a serious way. It's like talking to someone who takes everything literally, but has a huge reservoir of stuff that it uses to try to fit the input into something meaningful. It would devolve surrealist texts into Hallmark cards if it could...
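For anyone who wants to try this at home, a toy version of a generative grammar text generator looks something like this (the rules below are illustrative stand-ins, not my actual grammar):

import random

GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [["the", "ADJ", "N"], ["our", "ADJ", "N", "of", "NP"]],
    "VP": [["would", "be", "V"], ["excludes", "the", "possibility", "of", "NP"]],
    "ADJ": [["transcendental"], ["latent"], ["chumpish"], ["wraithlike"]],
    "N": [["regress"], ["deduction"], ["ambiguity"], ["playfield"]],
    "V": [["falsified"], ["demonstrated"], ["denounced"]],
}

def expand(symbol: str) -> str:
    """Recursively expand a nonterminal; plain words return themselves.
    The recursive NP rule fires half the time, so expansion terminates
    with probability 1 but can nest arbitrarily deep."""
    if symbol not in GRAMMAR:
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return " ".join(expand(s) for s in production)

print(expand("S").replace(" .", "."))
# e.g. "the latent regress excludes the possibility of our chumpish deduction."

Because the structure is always grammatical, the output sounds plausible even when it means nothing, which is presumably what tripped up Chatbot A.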
posted by njohnson23 at 5:43 PM on May 12


I got two pretty decent answers to my question ("Define a sport, and list 5 examples of things that are close to sports but not sports").

My second question was "Is Metafilter a dying website?" They both said that it's not necessarily dying: (A) said "its user base and popularity have declined somewhat over the years," while (B) said that it's "not necessarily 'dying,' but it has undergone some changes and challenges over the years." Both then gave some bullet-point reasons.

A picked: Demographic changes / Changes in online behavior / Competition from other online communities / Lack of new features and innovation.

B picked: Community Size and Activity / Changes in Moderation Policies / Competition from Other Websites / [Challenges with] Monetization / Platform Evolution / Engagement.

If they dropped the last 'in conclusion' paragraph both would be reasonable - if a bit verbose - metatalk comments.
posted by true at 6:20 PM on May 12


Unless they followed “In conclusion” with “Metafilter is a land of contrasts.” Then it would be undetectable.
posted by nickmark at 7:07 PM on May 12 [6 favorites]


Both misunderstood "primary reason" to require 5 and 6 bullet points, in what appears to be random order. All the dates were relatively accurate.
posted by zenon at 9:25 PM on May 12


I asked about the neighbours of the game designer John Wallis in London in the early 1800s. Both LLMs assumed I was talking about the mathematician John Wallis, who died in 1703, and were quite terse in correcting my question. The possibility of there being more than one John Wallis did not seem to have occurred to either.

(John Wallis the game designer lived at 42 Skinner Street, next door to William Godwin and his daughter Mary, who ran off with Percy Shelley and invented science fiction.)
posted by Hogshead at 3:04 AM on May 13 [1 favorite]


in conclusion the butlerian jihad is a land of contrasts
posted by lalochezia at 4:21 AM on May 13 [7 favorites]


The next step is to have competing chatbots go on to validate each other.

There is no epistemology, only probability.
posted by CheeseDigestsAll at 5:57 AM on May 13


Facts is facts. If these AIs cannot make deliberate use of repositories of known-valid information when answering questions that involve or request factual information... then they are not ready for prime time. It's analogous to a human reaching for a reference book. Probability ain't good enough.
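What "reaching for the reference book" could look like in code: a minimal sketch where lookup() is a hypothetical stand-in for whatever curated, known-valid store you trust, not any real API.

def lookup(question: str):
    """Query a curated, known-valid repository; return None if no entry matches.
    The hard-coded dict is a stand-in for a real reference store."""
    reference = {"what is the capital of france?": "Paris"}
    return reference.get(question.strip().lower())

def answer(question: str) -> str:
    """Prefer the verified source; refuse rather than guess when it's missing."""
    fact = lookup(question)
    if fact is not None:
        return f"{fact} (from reference store)"
    return "No verified source found - declining to guess."

print(answer("What is the capital of France?"))   # Paris (from reference store)
print(answer("Who last declared war on Germany in WW2?"))  # declines

The point being: look it up first, and decline when the lookup fails, rather than emitting the most probable-sounding string.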
posted by Artful Codger at 6:08 AM on May 13


LET'S FUCK SHIT UP! At first I'm like - no, don't train them, people, come on! Now I'm like - oh yeah, let's troll these bastards hard!
posted by symbioid at 11:12 AM on May 13 [1 favorite]


I'm with symbioid. What would be the best strategy to undermine this project? Intentionally wrong evaluations, or just random evaluations? Or some other strategy?
posted by Reverend John at 9:58 AM on May 14


Yes, I'm sure upvoting Nazi chatbot output won't have any unintentional side effects...
posted by pwnguin at 10:38 AM on May 14 [1 favorite]


Well, ok, fair, that's the kind of thoughtful mefite advice I was hoping for. So, while downvoting Nazi or otherwise obviously offensive content, what's the best strategy to undermine this project?
posted by Reverend John at 4:52 PM on May 14


Carbon tax?
posted by pwnguin at 7:33 PM on May 14 [2 favorites]



