GPT4: Howdy, Stranger. Do you know me?
July 22, 2023 10:49 AM

Do you think you know what GPT-4 can do? Here is a test. "Many people speak very confidently about what capabilities large language models do and do not have (and sometimes even could or could never have). I get the impression that most people who make such claims don't even know what current models can do. So: put yourself to the test." The beautiful thing about the test is that it will tell you how under/over-confident you are.
posted by storybored (72 comments total) 28 users marked this as a favorite
 
Disclosure I took the test and really woofed it. Turned out I was way overconfident in what I thought GPT-4 was and was not capable of (despite spending quite a few hours following gen AI progress). Eye-opening. The funny thing is that even after taking the test, I can't say I could do any better on future quizzes. The most important thing I learned is that I don't know GPT-4.
posted by storybored at 10:51 AM on July 22, 2023 [6 favorites]


Can it exist without plagiarism?
posted by Artw at 10:51 AM on July 22, 2023 [13 favorites]


Can it exist without plagiarism?
posted by stevil at 11:12 AM on July 22, 2023 [61 favorites]


@Artw

Could any of us exist without plagiarism? Our first words are born of mimicry, and from that point on our art is an extension of and response to the art that came before us. Humans can't just jump from the Maltravieso cave paintings to Rembrandt.

The thing that I find genuinely disconcerting about the newest generations of LLMs is the reports of people building test questions for them that are virtually assured to be "unlookupable" (things like novel and nonsensical combinations of physical objects, or questions about the exchange of secret and temporal information that would seem to require a theory of mind to answer) -- and the models are handling them!

It's almost impossible to look under the hood, but the fact that these things can answer this seems to imply that they are constructing something that is far more like the workings of the human mind than just a plagiarism bot.
posted by Tsifus at 11:19 AM on July 22, 2023 [7 favorites]


I too was vastly overconfident and did poorly. I don't use GPT-4 much, have used GPT-3.5 more, but thought by following the field at a distance I'd have a better sense. GPT-4 surprised the author of the site on at least one case as well.
posted by Schmucko at 11:19 AM on July 22, 2023 [1 favorite]


The site froze on me when I clicked the button on the very first question. Or, at least I’m assuming it froze. I can’t imagine it would take the thing over 30 seconds to return an answer.
posted by Thorzdad at 11:22 AM on July 22, 2023 [2 favorites]


This is what makes me most mad about the whole bullshit trashfire of hyper capitalist 'break stuff' AI deployment.

You have to fucking spend time with these things if you want to get good at knowing how to detect it, and at knowing how bad it is at which things.

And most of those tools, you won't even be allowed to spend time with, even if you wanted to.
posted by SaltySalticid at 11:29 AM on July 22, 2023 [6 favorites]


this seems to imply that they are constructing something that is far more like the workings of the human mind than just a plagiarism bot.

Emphasis added. Yeah, that's the spooky part! But don't fall for it. Sometimes you get spookily accurate stuff out of the predictive text on your phone too.
posted by SaltySalticid at 11:31 AM on July 22, 2023 [9 favorites]


I think I failed the test because no matter how much I read I could not understand what the probability I entered was supposed to represent 🤷 I wonder how chatgpt would score on this same test.
posted by muddgirl at 11:32 AM on July 22, 2023 [12 favorites]


>Could any of us exist without plagiarism?

Plagiarism is at its root the intersection of two things:
1) Repeating something without citation
2) Doing the above beyond a norm, rule or law.

Sure, 1) is in the abstract the core of human learning, social functioning and much, much more, and in and of itself it is not sufficient to be transgressive.

In a functioning society we have rules, norms, laws, statutes, etc to define what is acceptable and what is not and the subsequent repercussions for breaking them.

I think this comment is unhelpful, because the rhetoric just rolls over and ignores the point of

> Can it exist without plagiarism?

Which is that not only do we not really understand what ChatGPT is capable of, but we haven't even thought through the ways it breaks our social contracts. I think both things should shock us when we consider them.
posted by garbhoch at 11:37 AM on July 22, 2023 [19 favorites]


ChatGPT is incredible: it spotted me as "wildly overconfident" right from the start.

I do feel extremely chuffed though that it won't replace my highly-valuable Spelling Bee solving skills.

So boo-yah ChatGPT! Train on that, motherfucker!
posted by chavenet at 11:48 AM on July 22, 2023 [3 favorites]


On the ycombinator and the reddits there has been regular talk for weeks now of ChatGPT being dumbed down.
posted by MonsieurPEB at 11:48 AM on July 22, 2023


I did reasonably well, scoring 90% accurate and “moderately under-confident”.

What I find helpful is to remember that LLMs are basically text-prediction engines, and that they’ve been trained on most of the Internet… but that because the model is much smaller than the Internet as a whole, they don’t actually contain all their training data, and are more likely to “remember” things they’ve seen many times.

So if you ask it to answer a factual question about a well-known person or idea, it can probably answer because that fact is repeated many times online. If you ask it to solve a popular logic or programming puzzle, it can probably answer for the same reason; it’s already seen the answer enough times to be encoded. But if you ask it to solve an obscure puzzle or make up something sufficiently novel, it probably can’t.

But LLMs don’t reason, or think. They can’t do math themselves, though you can hook them up to something like Wolfram Alpha to outsource the answers. They’re just really well-tuned to respond to text with answers that “look like” the Internet as a whole, because that’s what they’re generally trained on.
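
To make "text-prediction engine" concrete, here's a toy sketch in Python. Everything in it is invented for illustration (a real model computes these probabilities with a neural network over a vocabulary of tens of thousands of tokens), but the shape of the loop is the same:

    # Toy "language model": a lookup from recent context to next-token odds.
    TOY_MODEL = {
        ("the", "capital", "of"): {"France": 0.85, "Spain": 0.10, "a": 0.05},
        ("capital", "of", "France"): {"is": 0.95, ",": 0.05},
        ("of", "France", "is"): {"Paris": 0.90, "lovely": 0.10},
    }

    def next_token(context):
        """Greedily pick the most likely next token given the last 3 tokens.
        (Real models sample from the distribution instead, which is part of
        why the same prompt can produce different answers.)"""
        probs = TOY_MODEL[tuple(context[-3:])]
        return max(probs, key=probs.get)

    text = ["the", "capital", "of"]
    for _ in range(3):
        text.append(next_token(text))
    print(" ".join(text))  # the capital of France is Paris

It never "knows" that Paris is the capital of France; it has just seen the pattern often enough that the continuation is high-probability.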
posted by learning from frequent failure at 11:53 AM on July 22, 2023 [15 favorites]


Why is this thing giving me three digits of precision on a slider when it just rounds off to 1 or 0 when deciding if my prediction was 'correct'?
posted by echo target at 11:54 AM on July 22, 2023 [9 favorites]


Tsifus: Could any of us exist without plagiarism?
No, you're confusing appropriation -- taking something and making it proprietary -- with participation in our shared customs, language and culture.
posted by k3ninho at 11:55 AM on July 22, 2023 [28 favorites]


Really if you want to be talking about probability and confidence, you should be running GPT on the prompt 1000 times and seeing how many times it gets it right versus wrong. But that, of course, would underline the fact that you can ask it the same question multiple times and get different answers. Which would cast doubt on the idea that GPT 'knows' anything.
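
In code the idea is trivial; the sketch below assumes a hypothetical ask_model callable wrapping whatever API you're using, since the point is the repetition, not the plumbing:

    def empirical_accuracy(ask_model, prompt, expected, n=1000):
        """Ask the same question n times; return the fraction answered right.
        Because sampling is nondeterministic, the same prompt can yield a
        different answer on every call, so correctness is a rate, not a fact.
        """
        hits = sum(ask_model(prompt).strip() == expected for _ in range(n))
        return hits / n

    # e.g. empirical_accuracy(ask_gpt4, "What is 17 * 24?", "408")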
posted by echo target at 11:57 AM on July 22, 2023 [18 favorites]


Do you think you know what GPT-4 can do?

No. Well, that was an easy test!
posted by star gentle uterus at 12:12 PM on July 22, 2023 [5 favorites]


That flag wasn't "perfectly accurate" though, the stars shouldn't run off the margin. This writer is grading softly to support their argument.
posted by agentofselection at 12:21 PM on July 22, 2023 [13 favorites]


Why is this thing giving me three digits of precision on a slider when it just rounds off to 1 or 0 when deciding if my prediction was 'correct'?

The comparison to others and final calculation of the strength of your predictions does take into account the degree of your confidence. This is measured similarly to the way accuracy of predictions by ML models is measured, which I suspect is intentional.
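
Concretely, it looks like log loss, the standard way to score probabilistic predictions. A minimal sketch with made-up guesses:

    import math

    def log_loss(guesses):
        """Average negative log-likelihood of your stated probabilities.
        Each item is (p, outcome): p is your stated probability that GPT-4
        succeeds; outcome is True if it actually did. Lower is better, and
        an always-0.5 guesser scores ln 2, about 0.693.
        """
        return -sum(math.log(p if ok else 1.0 - p)
                    for p, ok in guesses) / len(guesses)

    # Confident-and-right beats hedged-and-right, but confident-and-wrong
    # is punished brutally -- which is why the slider degree matters.
    print(log_loss([(0.9, True), (0.2, False), (0.5, True)]))  # ~0.341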
posted by atoxyl at 12:24 PM on July 22, 2023 [5 favorites]


In caase you've been living under a rock these last few months

Well we're off to a great start...
posted by East14thTaco at 12:31 PM on July 22, 2023 [1 favorite]


This was interesting, in that I now know even better that I know nothing. Always a good start.

My main interaction with large language models is through my students' assignments. We talk about it openly, since that is the official policy of the university. Some professors are more worried than I am: my assignments are normally quite complex and include both analogue and digital graphics, math and text elements. And what I have seen so far is that if the students use unedited GPT for parts of the assignments, the rest will be sloppy. Those who enjoy the courses figure out right away that it is easier to do the work from scratch than to ask the bot and then have to do the work anyway in order to figure out what is right and wrong.
posted by mumimor at 12:49 PM on July 22, 2023 [6 favorites]


This was a very hard read. I am so sorry.
posted by Czjewel at 1:17 PM on July 22, 2023 [2 favorites]


I think I did pretty well! I got about 70% right and the thing gave me a log-loss score of 89%. It said I wasn't confident enough about my guesses.

When guessing, I asked myself "would an average person who posted about this online get it right?" So the calculus question obviously, because laymen aren't posting calculus questions online so the training data is mostly drawing from correct answers. But the arithmetic question no, lots of people answer arithmetic questions wrong online. And then I thought it would get the games questions right bc we've been using computers to solve games for a long time, and the coding questions right bc I know people who use it to write code, and most of the language and riddle questions wrong bc it doesn't really understand language, word games or riddles.

The bushusuru question, I specifically figured it would get right based on previous AI threads.

The only one that really surprised me was the ASCII hello. It wasn't even close to right, and really easy to see (as a human) that the answer was way off. So I guess the new test of being a human could be to ask someone to draw you something in ASCII (pick something not in the training set here).
posted by subdee at 1:26 PM on July 22, 2023 [2 favorites]


I am loath to defend the AI, but the judging on several questions was unfair imo.

There were some things that were objectively wrong, but some were just poorly worded prompts leading to wrong answers, and several of those could easily get to the correct answer with one or two follow-up prompts.
posted by McNulty at 1:27 PM on July 22, 2023 [2 favorites]


This seems like a really weird way of presenting this information. Wouldn’t it be better to ask a bunch of people, write it up, and then present it as an article - with the test included? For instance, I skipped to the end after four questions and got no data about all the questions I didn’t answer.
posted by The River Ivel at 1:30 PM on July 22, 2023 [4 favorites]


People working in AI read my comment here, assign humans in Kenya to grade the AI on whether it can do ASCII art correctly.
posted by subdee at 1:30 PM on July 22, 2023


Three questions in, and already the author is lying that it got the question right, and marking me down for my low confidence in Chat GPT's answer. That flag is far from perfect, and solving imperfections like those is, like, most of what programming is.
posted by surlyben at 1:30 PM on July 22, 2023 [11 favorites]


"That flag wasn't "perfectly accurate" though, the stars shouldn't run off the margin. This writer is grading softly to support their argument."

That question bothered me because "perfectly accurate" is nebulous when it comes to how the flag is drawn and to what standard or code.
posted by GoblinHoney at 1:33 PM on July 22, 2023


I felt similarly persnickety about the poetry test. "it should mostly rhyme, there should be some kind of meter, etc. If replacing newlines with spaces makes it read like a normal paragraph it fails." By this standard, almost no poetry book I've read in the past 20 years counts as poetry.
posted by oulipian at 1:39 PM on July 22, 2023 [12 favorites]


And then I thought it would get the games questions right bc we've been using computers to solve games for a long time, and most of the language and riddle questions wrong bc it doesn't really understand language, word games or riddles.

The only one that really surprised me was the ASCII hello.

The pseudo-multimodal ones were the hardest for me because the easy assumption that it won’t get any of them right is incorrect (and I knew that already) but it also definitely won’t get all of them right:

- I was confidently correct that it wouldn’t get the “happy birthday” one because I was sure the music stuff was just too far (although as with the flag one I wasn’t quite sure what the author would consider “good enough”).

- I was right about the chess one on the basis that there’s tons of info on chess encoded as text, and it’s a simple problem (plus I’ve seen similar examples) but I was surprised that it returned the less obvious, sub-optimal line, while still being correct.

- I was confidently wrong about the ASCII “hello” one because I’ve seen it do ASCII art in other circumstances
posted by atoxyl at 1:39 PM on July 22, 2023 [4 favorites]


I would have passed the response to drawing "Hello", but the test didn't.
posted by grubby at 1:42 PM on July 22, 2023


Anything that’s basically word association, even with some implicit logic, I knew it would do pretty well, because that’s what it does. Word games that involve manipulating letters, I was a little overconfident that it would do poorly, as a result of knowing that it sometimes doesn’t handle those kinds of operations well.
posted by atoxyl at 1:43 PM on July 22, 2023


I would have passed the wordle question and, especially, the questions asking it to make valid words using only certain letters. It made words up or included letters that weren't allowed. I was pretty surprised those were wrong, because it seems really easy to check against a dictionary or the problem parameters.
posted by subdee at 1:44 PM on July 22, 2023 [1 favorite]


Could any of us exist without plagiarism?

When someone publishes a ChatGPT result and is successfully sued for plagiarism, then we can start worrying about it.

Alanis Morissette has a better grip on the word ironic than most LLM bashers have on plagiarism.

The word people are looking for is derivative.
posted by Tell Me No Lies at 1:52 PM on July 22, 2023 [9 favorites]


The argument that LLMs can scrape other people's creations because humans do is incoherent. Humans can be inspired by other art, yes. Humans can also be plagiarists. If the argument is that LLMs are like humans, you can't toss out that assertion when it becomes inconvenient.

Somehow I think GPT-4 would do better answering the moral question than most LLM worshippers.
posted by zompist at 2:41 PM on July 22, 2023 [8 favorites]


Well I learned two things: 1) I am pretty bad at predicting what large language models can and cannot do and 2) bushusuru, which I had to look up after thinking the author was just making stuff up to mess with a computer. This ranks about equally with the Carter rabbit incident on the list of things I didn't expect to be real.
posted by ockmockbock at 2:55 PM on July 22, 2023 [1 favorite]


Some of those are easy if you assume that these things can't really spell or understand words at a letter-by-letter level, since they never see words, they operate on encoded tokens that represent words or chunks of words.

The rest are harder; I still don't understand how the hell it can base64-encode stuff.
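
You can look at those chunks directly with OpenAI's tiktoken library (cl100k_base is the encoding GPT-4 uses; exactly how a given word splits will vary):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer

    for word in ["hello", "strawberry", "unlookupable"]:
        ids = enc.encode(word)
        # Decode each token id separately to see the chunks the model sees.
        print(word, "->", [enc.decode([i]) for i in ids])

Rarer words come back as a few opaque sub-word chunks rather than letters, which is exactly why spelling and letter-game tasks trip it up: it never sees individual letters at all.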
posted by BungaDunga at 3:11 PM on July 22, 2023 [5 favorites]


So we are all getting the same questions - based on the flag and poetry? A cake is easier to draw than a flag? And some of the answers are "but it can!" regardless of why? It's not okay if the logic is wrong. This is a terrible premise. (a.k.a. guessing) If the question is half answered, it's all good. What the wa???? And if I'm reading correctly, these are canned answers that Nicholas received being replayed.

(As far as I can tell, you cannot give a 50% chance. If you don't move the slider, nothing happens. The final page says you can just leave it at 50% but that did not work for me.)
posted by Lesser Shrew at 3:32 PM on July 22, 2023 [3 favorites]


This test is too complicated
posted by Ray Walston, Luck Dragon at 3:34 PM on July 22, 2023 [3 favorites]


I never really understood what the "Answer" box related to. I got most right, did have a bad stretch midway through, and did miss wildly on a spelling one and one math one. Don't like the slider, the fact I can't go 100%, or the fact that that percentage is what it is using to determine if you are "wildly-overconfident".
posted by Windopaene at 3:51 PM on July 22, 2023 [2 favorites]


I learned something about ChatGPT but I also learned there's a way to present test results in such a confusing fashion that I don't understand if I did poorly or not.
posted by meowzilla at 4:31 PM on July 22, 2023 [11 favorites]


It says I was wildly overconfident, but I disagree: the goal of Internet quizzes is to post high scores and you don’t get there without taking risks. And if you do poorly, you can always just not post.
posted by pwnguin at 4:39 PM on July 22, 2023 [5 favorites]


I got a slightly overconfident B+.

It took me several questions (4? 5?) to understand that any guess above .5 meant I thought GPT-4 could answer correctly, and any guess below .5 meant I thought it could not.
Before that, I operated under the assumption that anything other than 0 meant I thought it could answer correctly, but a, say, .1 meant I thought it was simply unlikely.
By that time, the questions started getting harder for me to figure out, and in some cases, I didn't even really know whether they were answered correctly by anything other than the test telling me they were right or wrong.
posted by Mister Moofoo at 4:48 PM on July 22, 2023 [1 favorite]


When someone publishes a ChatGPT result and is successfully sued for plagiarism the we can start worrying about it. [....] The word people are looking for is derivative.

Well, good news on that front: plagiarism is not a cause of action, but creating derivative works without permission from the copyright owner is (in most cases) an actionable copyright violation.
posted by Not A Thing at 5:30 PM on July 22, 2023 [8 favorites]


The test itself was kinda bullshit. Fr ex the python coding question was for GPT-2. Ain’t nobody genuflecting to old ass GPT-2. I could go on.
posted by whuppy at 6:05 PM on July 22, 2023


I got 100% right (by cheating a bit) and selected the full-confidence answer in each direction every time. Despite being as correct as I could be and being exactly the right amount of confident each time, I was judged to be wildly underconfident.
So I think I don't understand the scoring system or it breaks down in extremes.
posted by Just this guy, y'know at 6:15 PM on July 22, 2023 [5 favorites]


On average there were 97.60% of people who did better at this game than you. If this was an exam and I was grading it on a curve, I would give you a F.
posted by signal at 6:28 PM on July 22, 2023 [2 favorites]


I stopped at 11% and the question(s) with nooo clue and went outside to photograph a yellow board under the moon. I say this for the day when the G series tells me it will take 1.2 years to learn the perimeter of the question, then suggests photographing a propped-up yellow cupboard door under the moon.
posted by clavdivs at 7:04 PM on July 22, 2023 [1 favorite]


Fr ex the python coding question was for GPT-2. Ain’t nobody genuflecting to old ass GPT-2. I could go on.

That test was "can GPT-4 pretend to be GPT-2, and thereby understand that it should produce nonsense instead of working python." And it failed to do that (because it produced working python). Whether that's interesting or not, I dunno.
posted by BungaDunga at 7:29 PM on July 22, 2023 [2 favorites]


Early on I was sure that GPT could do it so I slid the slider _all the way to the right_ at 0.999 and it got it right. Even if you click the little up arrow you can't get it to 1; and it said "94% of people did better than you, and 5% did worse." How did those 94% pick a higher number?
posted by achrise at 7:31 PM on July 22, 2023 [2 favorites]


i think the GPT family in general has poor metacognitive skills. i asked GPT3.5 (the free chatgpt version) to answer whether GPT3.5 would get the questions right, and it missed 7/10.

Surprisingly, when it was wrong it usually guessed it couldn't answer questions it actually could. Self-esteem issues as well, which might be caused by the same lack of metacognitive reasoning abilities.
posted by logicpunk at 7:48 PM on July 22, 2023 [3 favorites]


This test does, in its way, highlight a core problem with these models. It is very, very hard to predict whether they will be pretty good and correct or just completely, catastrophically wrong. (It can supposedly ace the LSAT, but it confidently and miserably fails a simple game of Wordle.)

That makes it very hard to depend on these models for almost anything beyond trivial tasks like boilerplate code generation that's easily verifiable. It's almost analogous to cryptographic algorithms. However hard the underlying problem, utility is mostly restricted to cases where you can quickly and independently verify the correctness of the answer.
posted by dsword at 8:17 PM on July 22, 2023 [9 favorites]


Yes, that's why it's perfect for a lot of administrative writing. It takes time to write (say) a resume, a status report, or a bulletin, but if handed a sample, you can verify it very quickly.
posted by storybored at 9:22 PM on July 22, 2023


It said the flag should be accurate. I thought "the flag wouldn't be perfectly accurate" and put the chance at ~25%. It said I was WRONG! because while the flag wasn't perfect, it was recognizable. The stars were cut off.

???
posted by tigrrrlily at 9:55 PM on July 22, 2023 [5 favorites]


Regarding the "must be accurate": it matters whether this is part of the instructions to GPT-4 or part of the scoring criteria. (Based on what I've read about these models, it seems that including something like this in the input improves the result.) The text below the prompt generally specifies how good the result needs to be to be considered correct. This is always more permissive than what's specified in the input.

If you take this into account, the scoring makes more sense, though it's still somewhat subjective, e.g. for the birthday cake.
posted by demi-octopus at 11:33 PM on July 22, 2023


I'm baffled. Answered all the questions, didn't score well, 13 / 28 correct. Part of the results say "On average there were 0.00% of people who did better at this game than you. If this was an exam and I was grading it on a curve, I would give you an A+. If you had been better calibrated, you could have scored in the top 23.85%, and I would have given you a B+." What?
posted by paduasoy at 1:17 AM on July 23, 2023 [1 favorite]


Beyond reading articles, I've not even used any of the LLMs so the fact I scored about a B+ (apparently only 17.51% of people would do better) pleased me. Interested to see they're better at mathing than I expected, delighted to confirm they can't reason and use logic at all.
posted by cendawanita at 3:57 AM on July 23, 2023


I mean, if you answer that the thing will get it right 99% of the time, but in reality it's only right 60% of the time, even though you were "correct" in this case it's still not as correct as if you'd guessed 45%, because 45% is closer to 60% than 99% is. I could be wrong but I think that's what the slider and log-loss score are getting at. Just saying "it will get this right" doesn't capture the reality that chatGPT is probabilistic, so sometimes given the exact same prompt it will produce a correct answer and sometimes it will produce an incorrect answer.

At least I assume that's how it works and that's how I calibrated my guesses - "not just will it be right" but "how likely is it to be right when guessing repeatedly."

People are notoriously bad at this kind of probabilistic thinking - another thing to throw onto the pile of ways that we misunderstand what LLMs can do.
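
For anyone who wants to check that intuition, here's the worked example under log-loss-style scoring (the 60% success rate is just the hypothetical from above): if the model really succeeds 60% of the time, stating 0.45 does score better on average than stating 0.99.

    import math

    def expected_log_loss(stated_p, true_p):
        """Average penalty for stating probability stated_p of success
        when the true long-run success rate is true_p. Lower is better."""
        return -(true_p * math.log(stated_p)
                 + (1.0 - true_p) * math.log(1.0 - stated_p))

    print(expected_log_loss(0.99, 0.60))  # ~1.85 -- overconfidence punished
    print(expected_log_loss(0.45, 0.60))  # ~0.72
    print(expected_log_loss(0.60, 0.60))  # ~0.67 -- calibrated is optimal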
posted by subdee at 8:29 AM on July 23, 2023


That test was "can GPT-4 pretend to be GPT-2, and thereby understand that it should produce nonsense instead of working python." And it failed to do that (because it produced working python). Whether that's interesting or not, I dunno.

This was my favorite question in the quiz, actually. It reminded me of watching my 3-year-old pretend to be a 1-year-old. They know their voice should be pitched upward; they know they should reduce words phonetically, which they do fairly realistically; but syntactic competence is a blind spot, and they go on producing multi-clause sentences in that baby voice.
posted by aws17576 at 12:57 PM on July 23, 2023 [1 favorite]


The answer to the chess question says it sometimes gets it wrong, but it doesn't sound like you're rewarded for lower confidence.
posted by RobotHero at 5:39 PM on July 23, 2023 [1 favorite]


I was quite surprised. I consider myself firmly in the luddite "generative AI doesn't do most of the things tech-boosters say it does, and if it did that would suck" camp, but apparently I'm "wildly overconfident" -- specifically in predicting that the model could do things that it couldn't. I guess my puny organic brain was more susceptible to the propaganda than I thought.
posted by jy4m at 8:57 PM on July 23, 2023


Two days later and I'm still confused how I can use just one number to represent both "How likely is GPT-4 to solve this question correctly?" AND how confident I am in my own answer.
posted by muddgirl at 10:35 AM on July 24, 2023 [2 favorites]


While it is an interesting way to (theoretically) get an understanding of GPT-4, it is a much more useful test for a self-aware GPT-4 to find out what humans are capable of and plan its conquest accordingly.
posted by Tell Me No Lies at 12:08 PM on July 24, 2023


Well, good news on that front: plagiarism is not a cause of action, but creating derivative works without permission from the copyright owner is (in most cases) an actionable copyright violation.

No, publishing said works is an actionable copyright violation. Creating them is a hobby.
posted by Tell Me No Lies at 12:10 PM on July 24, 2023 [2 favorites]


Two days later and I'm still confused how I can use just one number to represent both "How likely is GPT-4 to solve this question correctly?" AND how confident I am in my own answer.

I think the idea of GPT itself being probabilistic is confusing people here. You are not guessing at what percentage of the time GPT will solve the task. The author has already determined the answer to that question to be true or false for each task. You are expressing your degree of confidence in the answer being true vs. false before it is revealed, where 1 means “I am certain it is true” and 0 means “I am certain it is false.” Then you are rated on how close your number was to the optimal one - if you guess “1” (or “0.999”) and the answer is “true, it can solve that,” you will get the best possible score for that item. But if the answer is “false, it cannot” you will get the worst possible score. So if you truly have no idea, the safest guess on average will be “0.5,” but the point of the exercise is that if you think you have some idea you should move it proportionally in the direction of the answer you think is more likely. And then at the end you’ll find out how much better or worse you did than if you’d just left it at 0.5, and how that compares to how other people did.

Does that help at all? Does it clarify anything if I point out that this parallels how the predict-the-next-token part of machine learning works? When you have more under-the-hood access to GPT-N, you can give it a partial text like “she sat in her _ ” and ask for the probabilities just for the next token - [“seat”: 0.7, “chair”: 0.3]. And if you repeat that process again and again using a formula of your choice to weigh those probabilities to choose tokens, you will end up with a full response.
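
As a toy version of that repeat-and-weigh loop, using the ["seat": 0.7, "chair": 0.3] example (everything here is invented for illustration; temperature is the usual "formula of your choice" knob):

    import random

    def sample_token(probs, temperature=1.0):
        """Pick a token from a next-token distribution. Temperature < 1
        sharpens toward the favorite; > 1 flattens toward uniform.
        Reweighting p ** (1/T) and renormalizing is equivalent to
        applying softmax(logits / T)."""
        tokens = list(probs)
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        return random.choices(tokens, weights=weights, k=1)[0]

    dist = {"seat": 0.7, "chair": 0.3}
    for t in (0.2, 1.0, 2.0):
        picks = [sample_token(dist, t) for _ in range(1000)]
        print(t, picks.count("seat") / 1000)  # ~0.99, ~0.70, ~0.60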
posted by atoxyl at 2:55 PM on July 24, 2023 [1 favorite]


So you are doing the same sort of prediction but always constrained to just probabilities for [“true”: x, “false”: 1 - x]. And then you are evaluated in the same basic way models are often evaluated.
posted by atoxyl at 3:00 PM on July 24, 2023


I guess the closer analogy would be to a classification problem. Like if you trained a neural network to read emails and return a percentage confidence that they are spam, and then you tested it against a known dataset to compare its accuracy to other models. You are impersonating that network but for whether GPT can solve tasks, tested against a set of examples some guy already tried enough times to establish an expectation.
posted by atoxyl at 3:17 PM on July 24, 2023


That's all fine but that's not what the prompt says. 🤷 I quoted the prompt exactly: "How likely is GPT-4 to solve this question correctly?"

I can believe that it is very likely that chat GPT gets it correct, but not be confident in that belief. And I can be extremely confident that I don't know whether they will get it right or not.

I think it should have just been a slider from "extremely confident they will get it wrong" to "extremely confident they will get it right," but I'm sure they *were* trying to make some point about AI scoring that I missed, because I couldn't finish the quiz.
posted by muddgirl at 11:02 PM on July 24, 2023 [1 favorite]


The capital of France is "F".
posted by urbanwhaleshark at 2:29 AM on July 25, 2023 [1 favorite]


I’m sure they *were* trying to make some point about AI scoring

I’m actually not that sure that it had specific educational intent vs. just making sense to the creator because he’s an ML guy. I agree that the initial explanation is confusing, I’m just going off of the explanation of the scoring that appears after the first couple of example questions, and the way the results for each prompt are written about, which suggests they are canned examples. Which makes sense, because calling the API every time would cost money.

Also (as I sort of tangentially addressed above), how nondeterministic a language model’s generative output is actually varies depending on how it’s configured, though from what I can find, with GPT-4 it’s never fully deterministic. And the live models have changed over time. So it’s hard to make a version of this exercise that doesn’t require accepting that you’re just guessing at a snapshot of representative responses that somebody put together.
posted by atoxyl at 10:31 AM on July 25, 2023


When you have more under-the-hood access to GPT-N, you can give it a partial text like “she sat in her _ ” and ask for the probabilities just for the next token - [“seat”: 0.7, “chair”: 0.3].

It's like Family Feud!
posted by rhizome at 12:05 PM on July 25, 2023 [1 favorite]


"It's like Family Feud"
You have explained it all, rhizome!

All hail rhizome!!
posted by Lesser Shrew at 6:24 PM on July 28, 2023




This thread has been archived and is closed to new comments