Is that enough to account for all human bias?
November 26, 2024 12:47 PM Subscribe
This blog post is a bit ... different. Normally, a blog post is a static document, and the direction of communication is from the screen to the user. But this piece requires you to interact. You're not just reading the content; you'll actually change the story of the blog post as you interact with it. This makes the piece a bit more experimental, but hopefully, also much more interesting! from An Inverse Turing Test
Keeps resetting. I must be a bot.
posted by toodleydoodley at 12:53 PM on November 26 [1 favorite]
Oh god I did not pass.
posted by mittens at 1:06 PM on November 26 [1 favorite]
01100001010!
posted by y2karl at 1:10 PM on November 26 [1 favorite]
I failed. But I can't figure out if that means I'm a human or a bot.
posted by If only I had a penguin... at 1:14 PM on November 26 [2 favorites]
First, let me say: this is a neat article & I appreciated reading it. It illustrates a lot about what it means to generate uniform random numbers, or at any rate to try to determine whether a sequence IS from a uniform random number generator.
I was curious, though, whether this would do any good at telling chatgpt from a person. So I created a program that could make fake key inputs into the web page, and solicited chatgpt to give me a long sequence of T & H. This could then be entered into the web page and we can answer the question of whether this technology is good at distinguishing a human from an LLM instructed to produce a "random sequence".
(At first, chatgpt generated and then executed a Python program; but when I told it I wanted it to generate the letters directly it obeyed)
tl;dr: Much like a human is expected to, the chatgpt sequence failed the majority of the statistical tests:
In our case we have 515 heads and 428 tails… At the moment 104 tests fail while 18 tests pass
I don't know how closely this resembles a "human distribution", whether it mimics the distribution I included in my prompt ('Make a random sequence of T & H (e.g., TTHTHTTHHHTHTHTHTH...). Make a minimum of 1000.'), or what.
In any case, it's simultaneously interesting & unsurprising that the LLM's weights managed to produce "distinctly non-uniform" random numbers.
posted by the antecedent of that pronoun at 1:47 PM on November 26 [2 favorites]
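The page's exact battery isn't reproduced here, but the simplest of these checks, the monobit (frequency) test, is easy to sketch in Python. This follows the normal-approximation form from NIST SP 800-22, used here as an assumed stand-in for whatever the page runs; the 515/428 split is the count quoted above:

```python
import math

def monobit_pvalue(seq: str) -> float:
    """Two-sided monobit test: are heads and tails balanced?
    Normal approximation as in NIST SP 800-22."""
    n = len(seq)
    # Map H -> +1, T -> -1 and sum the sequence.
    s = sum(1 if c == "H" else -1 for c in seq)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

# The counts quoted above: 515 heads, 428 tails (order doesn't
# matter for this particular test).
p = monobit_pvalue("H" * 515 + "T" * 428)
print(p)  # well below the usual 0.05 cut-off, so the imbalance is flagged
```

A perfectly balanced sequence (e.g. `"HT" * 500`) gives a p-value of 1.0, since the +1/-1 sum is exactly zero.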
I managed to get pretty close to randomness, not by trying directly but by (spoiler?) aiming for the line between the two buttons and making my hands vibrate slightly. (On my phone, in case that wasn't clear.)
posted by demi-octopus at 2:11 PM on November 26 [2 favorites]
Not sure what it says about Perl's random number generation that typing in the output of:
perl -e 'for (0..100) { print ((rand(2) > 1) ? 1 : 0)} print "\n";'
was not a slam dunk for robot...
posted by straw at 3:17 PM on November 26 [1 favorite]
That was a roller coaster. I got nearly 50-50 on the first level of the analysis, but that rapidly degenerated as the level went up, so I'm proud to report that I'm not a robot, and anyone who thinks I am can bite my shiny metal a** -- whoops.
posted by BCMagee at 3:19 PM on November 26 [1 favorite]
I passed all the tests up to size 3, then failed half and passed half the size 4 tests. Beep boop.
posted by dsword at 4:00 PM on November 26 [1 favorite]
To put this in context: The blog post in this FPP introduces the monobit test and the runs test of the K2 criterion from BSI for evaluating the randomness of pseudorandom number generators.
posted by The genius who rejected Anno's budget proposal. at 4:30 PM on November 26 [2 favorites]
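For reference, the runs test mentioned there can be sketched alongside the monobit test. This uses the NIST SP 800-22 formulation as an assumed stand-in for the BSI version (the two are close in spirit): count the number of runs and compare it to what a fair coin would produce.

```python
import math

def runs_pvalue(bits: str) -> float:
    """Runs test: does the sequence switch between 0 and 1 a
    plausible number of times? NIST SP 800-22 formulation."""
    n = len(bits)
    pi = bits.count("1") / n  # proportion of ones
    # Prerequisite frequency check from SP 800-22: if the sequence
    # is too unbalanced, the runs test is not applicable.
    if abs(pi - 0.5) >= 2 / math.sqrt(n):
        return 0.0
    # Number of runs = 1 + number of adjacent unequal pairs.
    v = 1 + sum(bits[i] != bits[i + 1] for i in range(n - 1))
    num = abs(v - 2 * n * pi * (1 - pi))
    den = 2 * math.sqrt(2 * n) * pi * (1 - pi)
    return math.erfc(num / den)

print(runs_pvalue("01" * 50) < 0.01)            # True: alternating every bit, too many runs
print(runs_pvalue("0" * 50 + "1" * 50) < 0.01)  # True: a single switch, far too few runs
```

Both extremes fail even though each is perfectly balanced, which is exactly why the monobit test alone isn't enough.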
I passed more than failed, but didn't pass as much as a random bot. Based on this, I conclude that means "cyborg" and am ok with that.
posted by otherchaz at 5:35 PM on November 26 [1 favorite]
Thinking about the 120 tests they run for samples of length 100, what is the distribution of R = [#_Tests_Passed / #_Tests] for samples of length 100 that are either generated by a machine (pseudo-random) or generated by a real-world random process? Is passing all of the tests unusual?
posted by Jonathan Livengood at 5:46 PM on November 26 [1 favorite]
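One crude way to get at that distribution is Monte Carlo, under the strong simplifying assumption that the battery is 120 independent tests, each passing a truly random input with probability 0.95 (both the count and the independence are assumptions, not the page's actual design):

```python
import random

def simulate_pass_counts(m_tests=120, trials=10_000, alpha=0.05, seed=1):
    """Distribution of #tests passed, assuming each test independently
    rejects a truly random input with probability alpha."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        passed = sum(rng.random() >= alpha for _ in range(m_tests))
        counts.append(passed)
    return counts

counts = simulate_pass_counts()
mean = sum(counts) / len(counts)
perfect = sum(c == m == 120 for c, m in zip(counts, [120] * len(counts))) / len(counts)
print(f"mean passed: {mean:.1f}")        # about 114 of 120
print(f"all 120 passed: {perfect:.4f}")  # roughly 0.95**120, i.e. about 0.002
```

Under independence, a perfect score would be rare, so the fact that the page's own Random button apparently hits it fairly often suggests the real tests are heavily correlated.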
My gut says yes, passing 120 tests would be unusual: If each test is at P=.05 and independent (which is not quite true, because for instance a failure of [0,0] is probably correlated with failure of [0,0,0,0]), then 1/20 of tests should fail, or around 6/120 tests. I'm not sure whether this page is using P=.05 or some other cut-off; there's a mention of the 90% quantile at one point.
Also I noticed that tests are doubly repeated, like [0,0,1,1][0,0,1,1], so maybe there are actually only 60 tests of 2 through 5 bits long? 2**5 + 2**4 + 2**3 + 2**2 == 60.
posted by the antecedent of that pronoun at 6:19 PM on November 26 [2 favorites]
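The arithmetic in that comment checks out directly:

```python
# 1/20 of 120 independent tests expected to fail at P = .05:
print(120 * 0.05)  # 6.0

# Number of distinct bit patterns of lengths 2 through 5:
print(sum(2 ** k for k in range(2, 6)))  # 60
```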
I already knew I was a bot because it's so difficult for me to solve a captcha :( It's just so hard to find all the bicycles :( :(
posted by potrzebie at 8:27 PM on November 26 [2 favorites]
Is passing all of the tests unusual?
I got a randomly generated sample of 100 points to pass all tests after about 5 clicks.
posted by grog at 5:26 AM on November 27 [2 favorites]
Just looking empirically a bit, I recorded the results of forty clicks of their Random 100x. I observed a sample with median number passed of 109, mode of 108, and range of 88 to 120. I was very surprised to see the number passed fall below 90 for the machine! I did see a few 120's: in fact, 10% of my sample. But anyway, it seems to me that if you generated a sequence that passed 104 or 112 of their tests, they couldn't reliably distinguish you from a machine on that basis. What might be interesting now would be comparing the distribution for the machine to a distribution for humans (and maybe even better to take distributions for individual humans). My conjecture is that humans are very, very unlikely -- much less likely than machines are -- to generate sequences that pass all of the tests, but I could be wrong.
With respect to this point -- "Also I noticed that tests are doubly repeated, like [0,0,1,1][0,0,1,1] so maybe there are actually only 60 tests of 2 through 5 bits long? 2**5 + 2**4 + 2**3 + 2**2 == 60." -- it looks like every number of passed-tests I observed was even, so yes, there is definitely at least one symmetry in the tests and the number of "real" tests is smaller than 120.
posted by Jonathan Livengood at 8:09 AM on November 27 [1 favorite]
posted by grumpybear69 at 12:52 PM on November 26 [5 favorites]