# Nate Silver accuses polling firm of fraud

September 27, 2009 6:51 PM Subscribe

42.7 percent of all statistics are made up: After Strategic Visions refused to share the methodology behind some of their polling, Nate Silver of fivethirtyeight analyzed the firm's poling results and found evidence of fraud. Strategic Visions responds to The Hill. More amusingly, Nate went on to look at an even more questionable study by the same company claiming that only 23 percent of Oklahoma students know that George Washington is the first president.

For good measure, Silver compared the Strategic Visions data with a control set. Continuing ~~research~~ obsessive inquisition revealed that the Atlanta address on Strategic Visions's website is not the company's actual address.

It's a statistics firm called Strategic Visions. That alone is pretty hinky.

posted by Sys Rq at 7:03 PM on September 27, 2009 [4 favorites]

*Johnson said the series of events has caused a panic at his firm and that employees are fearful for their safety. He said they’ve been getting harassing text messages, and that their receptionist was verbally accosted in their parking lot by someone who referenced Silver’s blog.*

If this is ever proven true I will disable my account and eat a ten pound bag of hot fresh horseshit.

posted by Optimus Chyme at 7:04 PM on September 27, 2009 [36 favorites]

Agreed, Sys Rq. It might as well be called "The Results You Want®"

posted by rokusan at 7:05 PM on September 27, 2009 [1 favorite]

If there is one group with a proven history of violence, it is bloggers who like to read about esoteric statistics.

posted by DU at 7:06 PM on September 27, 2009 [49 favorites]

In the first analysis, Silver assumes a uniform distribution of last digits and then shows that the digits are, in fact, not uniform. I don't understand why Benford's Law, which states that leading digits are *not* uniform, couldn't also apply to last digits. Anybody know?

posted by twoleftfeet at 7:08 PM on September 27, 2009

Also, the title is wrong. Silver has **not** accused them of fraud yet. He's shown that there's something that needs explaining and that one of the explanations could well be fraud.

posted by DU at 7:09 PM on September 27, 2009

twoleftfeet>

*In the first analysis, Silver assumes a uniform distribution of last digits and then shows that the digits are, in fact, not uniform. I don't understand why Benford's Law, which states that leading digits are not uniform, couldn't also apply to last digits. Anybody know?*

Benford's law *does* apply to the last digit, but the distribution isn't the same as the log(1 + 1/d) of the first digit. Once you're all the way to the last digit, the distribution is going to be pretty close to random. I refer you to the wikipedia page, specifically to the section titled "Generalization to digits beyond the first".

posted by UrineSoakedRube at 7:13 PM on September 27, 2009 [2 favorites]
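For anyone who wants to check the point about later digits numerically, here is a minimal Python sketch. It assumes the data follow Benford's law in the first place (which may not hold for poll percentages), and the function name `benford_digit_probs` is made up for this illustration. It uses log10(1 + 1/d) for the leading digit and the standard generalization that sums over every possible prefix for later positions.

```python
import math

def benford_digit_probs(position):
    """Probability of each digit at the given position (1 = leading digit)
    for data that follow Benford's law."""
    if position == 1:
        # Leading digit: P(d) = log10(1 + 1/d) for d = 1..9 (0 cannot lead).
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    # Later digits: sum over every possible prefix k of the preceding digits.
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi))
        for d in range(10)
    }

for pos in (1, 2, 3):
    probs = benford_digit_probs(pos)
    print(pos, {d: round(p, 4) for d, p in probs.items()})
```

Under these assumptions the leading digit is 1 about 30% of the time, the second digit only ranges from roughly 12% (for 0) down to roughly 8.5% (for 9), and by the third digit every value is within a fraction of a percentage point of 10%.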

I have very high respect for Nate Silver's statistical chops, but he hasn't yet convinced me that we're looking at compelling evidence of fraud here. Here's my blog post about it.

Executive summary: it's hard to draw trustworthy information about fraud from data like this unless you have a pretty solid picture of what the data is supposed to look like in the absence of fraud.

Mark Blumenthal at Pollster has a similarly cautious take.

posted by escabeche at 7:13 PM on September 27, 2009

Benford's law says that the leading digit is 1 about 30% of the time; this has to do with our numbering system (for instance, from the numbers 1 - 20, 1 is the leading digit on 11 of them). In other words, you'd expect a log distribution of the leading digit, but an even distribution of the final digit.

The whole thing points up to the critical need for higher-quality sources of entropy when faking data.

posted by jenkinsEar at 7:14 PM on September 27, 2009

*If there is one group with a proven history of violence, it is bloggers who like to read about esoteric statistics.*

I'm sorry, but WHAT THE FUCK have fantasy baseball players done to you??

Oh, you mean.... Oh.

posted by orthogonality at 7:15 PM on September 27, 2009 [3 favorites]

Once again, the terrifying Liberal Statistician Stormtroopers strike fear into the hearts of god-fearing Americans.

*Who will save us from the terror!?!?*

posted by Avenger at 7:18 PM on September 27, 2009 [1 favorite]

*WHAT THE FUCK have fantasy baseball players done to you?*

For the record, that's also Nate Silver's fault.

posted by rokusan at 7:18 PM on September 27, 2009 [7 favorites]

*Who will save us from the terror!?*

Disaster-relief supplies inside. In case of emergency, solve for *q*.

posted by rokusan at 7:19 PM on September 27, 2009 [6 favorites]

An important point to note is that the questionable pollster is Strategic Vision, LLC, not Strategic Vision, Inc. which has been around for 37 years and is considered a reputable pollster.

posted by thebestsophist at 7:23 PM on September 27, 2009 [6 favorites]

I knew something was up with that Oklahoma student poll. Being a graduate of Oklahoma public high schools, I know many of them aren't great, but just counting the Honor's Societies, IB/AP students, Academic Bowl, etc. members would throw these poll results off.

posted by fishmasta at 7:26 PM on September 27, 2009

And of course I make a typo trying to prove them wrong. Such is life....

posted by fishmasta at 7:27 PM on September 27, 2009

If you took all the college students who fell asleep during lectures and laid them end to end, they'd be a lot more comfortable.

posted by plinth at 7:28 PM on September 27, 2009 [20 favorites]

*If this is ever proven true I will disable my account and eat a ten pound bag of hot fresh horseshit.*

Clever way to not have to do a follow-up.

posted by moonshine at 7:29 PM on September 27, 2009

Love reading this stuff, and I'm inclined from the data (and the instant litigiousness) to assume that Strategic Vision **are** cooking the books. A deeper look at who all has hired them for polling will be interesting.

I am reminded too, of something I've been meaning to ask some statistics folks – besides Nate Silver's stuff, what are some other good statistics-type blogs that do these sorts of analyses?

posted by barnacles at 7:30 PM on September 27, 2009

Also, do people ever really believe statistics? They say 78.5 percent don't.

posted by moonshine at 7:30 PM on September 27, 2009 [1 favorite]

*An important point to note is that the questionable pollster is Strategic Vision, LLC, not Strategic Vision, Inc. which has been around for 37 years and is considered a reputable pollster.*

But Strategic Vision, LLC has a .biz address! That means they are a reputable business!

posted by grouse at 7:30 PM on September 27, 2009 [2 favorites]

David Johnson, CEO, has a twitter account. It's like reading the transcript of an ELIZA chatbot with the "reflexive Republican spin" dial twisted way to the right.

posted by logicpunk at 7:34 PM on September 27, 2009 [2 favorites]

*besides Nate Silver's stuff, what are some other good statistics-type blogs that do these sorts of analyses?*

Pollster is a great group blog on election polling, featuring lots of people with years of experience in the field.

For general statistics blogging -- sometimes technical, but often not -- you can't beat Columbia statistics professor Andrew Gelman's Statistical Modeling, Causal Inference, and Social Science.

posted by escabeche at 7:35 PM on September 27, 2009 [2 favorites]

Speaking of frauds, I view Washington's presidency as something of a scam. Come on, the guy was an independent yet got in 100% of the electoral college backing him? And he had the gall to get everyone drunk on Barbados Rum at his inauguration, so they'd forget the details later! President Washington? Psh! More like King George the 1st!

posted by filthy light thief at 7:36 PM on September 27, 2009 [1 favorite]

*besides Nate Silver's stuff, what are some other good statistics-type blogs that do these sorts of analyses?*

Well, the Australian equivalent of 538 is The Poll Bludger, although he looks into US elections and polling when not much is happening down under.

It occurs to me that an interesting way to test this further would be to translate the numbers into other number systems (binary, ternary etc. etc.), repeat the analysis, and calculate a RMS error for each number system. The error should peak quite clearly, one would assume, in the decimal system.

posted by Jimbob at 7:39 PM on September 27, 2009 [2 favorites]

*Once you're all the way to the last digit, the distribution is going to be pretty close to random.*

But if I'm correctly reading this sentence from the first article - *Silver tested the last digit in every percentage polled in all of Strategic Vision’s surveys, which he suggested should be evenly distributed among all 10 digits* - then he's only got two digits most of the time (if they are digits in a percentage.) So not yet uniform, right?

posted by twoleftfeet at 7:39 PM on September 27, 2009

It isn't the analysis of the numbers that proves it, it's the refusal to turn over the data.

Always, always, always . . .

it's the coverup, not the crime.

Because once they decide to cover up, every move to hide something just shows its outline. If I have a bank account I didn't want anyone to know about, the transfer out of that account is traceable. If I want to hide that my polling numbers are rigged, I put numbers I think are "random" in there. You know where the money is and that the books are cooked by the very moves that are made. The more things that have to be hidden, the more moves they have to make and the more you can know about the thing they are trying to hide.

This is how karma works.

posted by Ironmouth at 7:42 PM on September 27, 2009 [7 favorites]

twoleftfeet>

*But if I'm correctly reading this sentence from the first article - Silver tested the last digit in every percentage polled in all of Strategic Vision’s surveys, which he suggested should be evenly distributed among all 10 digits - then he's only got two digits most of the time (if they are digits in a percentage.) So not yet uniform, right?*

True, but you're not going to see the prevalence of 8's that Silver saw in Strategic Vision LLC's data:

Silver tested the last digit in every percentage polled in all of Strategic Vision’s surveys, which he suggested should be evenly distributed among all 10 digits.

But he found that, for instance, ‘8’ appeared at the end of a number in 60 percent more numbers than ‘1.’ He said it was “an incredible fluke – millions to one against.”

The suggestion, made patently but couched as to avoid legal trouble, is that someone invented the numbers and, for whatever reason, used ‘8’ as a last digit in an inordinate amount of numbers.

posted by UrineSoakedRube at 7:44 PM on September 27, 2009
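To make the "evenly distributed among all 10 digits" check concrete, here is a rough Python sketch of a Pearson chi-square test against a uniform expectation. This is not necessarily the calculation Silver ran; the function name and the counts are invented for illustration and are not the real Strategic Vision figures. The test also assumes the digits are independent observations, which is an open question for poll results.

```python
def chi_square_uniform(digit_counts):
    """Pearson chi-square statistic for 10 last-digit counts against a
    uniform expectation (each digit expected total/10 times)."""
    total = sum(digit_counts)
    expected = total / len(digit_counts)
    return sum((c - expected) ** 2 / expected for c in digit_counts)

# Hypothetical counts for digits 0..9, invented for illustration only.
# These are NOT Silver's or Strategic Vision's actual figures.
made_up_counts = [100, 60, 100, 100, 100, 100, 100, 100, 150, 90]

statistic = chi_square_uniform(made_up_counts)
# With 10 categories there are 9 degrees of freedom; the 5% critical
# value of the chi-square distribution for 9 df is about 16.92.
CRITICAL_5_PERCENT_DF9 = 16.92
print(round(statistic, 2), statistic > CRITICAL_5_PERCENT_DF9)
```

A statistic well above the critical value says the counts are unlikely under the uniform assumption. It does not say why, so fraud remains only one of the possible explanations.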

*David Johnson, CEO, has a twitter account. It's like reading the transcript of an ELIZA chatbot with the "reflexive Republican spin" dial twisted way to the right.*

I started reading his twitter account and all of a sudden my chest tightened and I started grinding my teeth. I haven't felt like this since the Bush administration.

posted by farishta at 7:45 PM on September 27, 2009 [2 favorites]

Benford's Law only applies to stats with a power law distribution. There is no particular reason to think that it applies here, especially in trailing digits.

*for instance, from the numbers 1 - 20, 1 is the leading digit on 11 of them*

That is a terrible example, and has nothing to do with Benford's law. If you choose the numbers 1-100, 1 is the leading digit on 12 of them. Benford's law is based on the fact that numbers obeying a power law cluster more tightly in the low ranges than the high ranges.

posted by unSane at 7:50 PM on September 27, 2009 [4 favorites]

My interpretation: I expected the last digits to be completely random, or to at least have an explainable bias.

It's a meta-study, so maybe it's looking at responses to political questions; and, well, I guess some questions might only give two options, and some might give a free-for-all response that was boiled down to fitting up to, oh, maybe eight or ten categories.

Now a questionnaire that asks me to choose from ten choices is just inane. Most questions are going to code into somewhere between 2 and four or five options or categories. So if we do a meta-study of the distribution of last digits, I guess I don't expect the distribution to be completely random.

I guess I'd expect to see a rise toward the left. I guess I should expect any meta-study, in raw data form, to show bias for 1-3, and against 7-10. If you're learning why Nate is right like I am, then I think you'll be pleased as I was to see that Benford's Law looks just like I might have expected it to: this sloped distribution of final digits. Well, *duh.* We figured that out pretty much ourselves. Let's call it *Our "You're Gonna Slope to the Left" Rule.*

There's an unintentional political jab there. I'm sorry. Again, it is true: reality biases left.

posted by five fresh fish at 7:50 PM on September 27, 2009 [1 favorite]

Of course, I could be entirely out to lunch. Can a person who actually Knows Things confirm that what I seem to figure would be shown in looking at last digits, is what happens? I was surprised to see how smooth the distribution is, but it seems to make common sense.

posted by five fresh fish at 7:52 PM on September 27, 2009

I like what Nate is doing here, but I don't understand why he uses both trailing digits in a given poll. In his example, if a poll said had Obama 58, Clinton 32, he'd put both 2 and 8 into his dataset. Since the two options need to add up to 100 (less undecideds, which are up to a few percent depending). The data you get now is not going to be independent, since if you have that 2, you *have* to have an 8 for the last digit to sum up correctly. Or maybe if there are a few undecideds, you'll get a 2 and a 7. Either way, if one digit is low, then the other digit will necessarily be high because of this and you see that in his comparison to Quinnipiac where the low numbers skew one way and the high numbers skew the other. You also see it to a much lesser degree in the Quinnipiac data.

If he would just pick one of the two results per poll (chosen at random somehow, perhaps, though it shouldn't matter) he would have a much nicer looking data set that won't have uncontrolled correlations in it. And he'd still have 2500 or so points, which would be more than sufficient to do a more sophisticated analysis than he could here.

posted by Schismatic at 8:00 PM on September 27, 2009 [2 favorites]

Um, let's say Obama 58 and Clinton 42 in that example. I'll now slink off and hope that gaffe doesn't detract from the rest of my point.

posted by Schismatic at 8:02 PM on September 27, 2009
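The dependence Schismatic describes can be shown in a few lines of Python. This is a simplified sketch that assumes a two-candidate poll reported in whole percentages, with the two shares plus the undecideds adding up to exactly 100; the function name `partner_last_digit` is made up for the illustration.

```python
def partner_last_digit(share_a, undecided):
    """In a two-candidate poll, return the last digit of candidate B's share,
    assuming shares are whole percentages and A + B + undecided = 100."""
    share_b = 100 - undecided - share_a
    return share_b % 10

# The corrected example from the comment: Obama 58, Clinton 42, no undecideds.
print(partner_last_digit(58, 0))   # 2
# With 1% undecided, a share ending in 2 is paired with one ending in 7.
print(partner_last_digit(52, 1))   # 7
```

For a fixed undecided share, one trailing digit fully determines the other, so the two digits from a single poll are not independent draws. How much this matters in practice depends on how widely the undecided share varies from poll to poll.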

*Benford's law says that the leading digit is 1 about 30% of the time; this has to do with our numbering system (for instance, from the numbers 1 - 20, 1 is the leading digit on 11 of them).*

And of the numbers 20-99, 1 is the leading digit of **none** of them. That has nothing to do with Benford's law.

posted by axiom at 8:09 PM on September 27, 2009

*If you took all the college students who fell asleep during lectures and laid them*

...you'd be tired and sore.

posted by ROU_Xenophobe at 8:26 PM on September 27, 2009 [2 favorites]

*Also, the title is wrong. Silver has not accused them of fraud yet. He's shown that there's something that needs explaining and that one of the explanations could well be fraud.*

Um, yes he has:

It seems quite strongly possible, nevertheless, that the students polled for this survey don't exist anywhere in Oklahoma but instead on a hard drive somewhere in Atlanta. This is a valuable exercise undertaken by the OCPA. But they owe it to the hardworking students of Oklahoma to make sure that their contractor, Strategic Vision, didn't flunk its own citizenship test.

It sounds like the case with the student knowledge poll was far more obviously fake.

posted by delmoi at 8:33 PM on September 27, 2009 [1 favorite]

After more reading, I think I was probably laughably wrong.

Math is hard!

posted by five fresh fish at 8:37 PM on September 27, 2009

*It sounds like the case with the student knowledge poll was far more obviously fake.*

Well, yeah. 0% of the 1000 surveyed students got more than 7 of the questions right? As a product of Oklahoma public schools, that's well-nigh impossible IF the 1000 students were selected randomly.

With the caveat that I graduated from high school almost 20 years ago, I just can't see how you could not find a fraction of kids who would know all 10 answers. It would essentially require every student to sleep through every civics, history, and social studies class from elementary school on. It would also essentially require a lack of smart kids in Oklahoma -- and yet there were 36 National Merit Scholars in Tulsa area schools this year, which represent what, 1% of the "smart kids" in metro Tulsa? That's 3600 kids -- and that's roughly 20% of the metro Tulsa high school population.

I don't see how they can come up with this without either running into some version of the "cell phone problem" (i.e. not being able to reach the smart white kids because they didn't have cell numbers, instead reaching poor kids on land lines), or they massaged the numbers something crazy. Did they only call poor black neighborhoods in north Tulsa while avoiding rich white neighborhoods in south Tulsa and the suburbs?

This is either the crappiest survey ever run, or it's intentionally skewed to favor their clients, or these 1000 high school students never existed.

posted by dw at 9:11 PM on September 27, 2009

*If this is ever proven true I will disable my account and eat a ten pound bag of hot fresh horseshit.* -- posted by Optimus Chyme

The problem with most MeFi detective threads is that there's no reward at the end.

This could be better.

posted by rokusan at 9:16 PM on September 27, 2009

*Strategic Vision, LLC, not Strategic Vision, Inc. which has been around for 37 years and is considered a reputable pollster.*

And they're even in the same industry?

Wow, someone really needs a trademark lawyer.

posted by rokusan at 9:18 PM on September 27, 2009 [6 favorites]

*But he found that, for instance, ‘8’ appeared at the end of a number in 60 percent more numbers than ‘1.’ He said it was “an incredible fluke – millions to one against.”*

Maybe Strategic Vision only polls Chinese Americans?

posted by orthogonality at 9:51 PM on September 27, 2009

The number 8 of course stands for either Heil or Hitler.

posted by Artw at 10:11 PM on September 27, 2009 [2 favorites]

Should the first link in the post not be to this?

posted by chorltonmeateater at 10:28 PM on September 27, 2009 [1 favorite]

*Should the first link in the post not be to this?*

Damnit, thank you. Contact forming now.

posted by The Devil Tesla at 10:54 PM on September 27, 2009

*That is a terrible example, and has nothing to do with Benford's law. If you choose the numbers 1-100, 1 is the leading digit on 12 of them. Benford's law is based on the fact that numbers obeying a power law cluster more tightly in the low ranges than the high ranges.*

House numbering and finite street sizes is the best way to describe Benford's Law to a n00b. In my humble opinion.

posted by uncanny hengeman at 11:36 PM on September 27, 2009

*Nate Silver of fivethirtyeight analyzed the firm's poling results and found evidence of fraud.*

What a pack of punts.

posted by pompomtom at 12:02 AM on September 28, 2009

Can anyone vouch for Silver's method given Schismatic's concern? I don't know where he's getting his ideas on trailing digits. There've got to be polling statisticians in Metafilter somewhere. The student knowledge one seems to be a slam dunk. But I see no reason why the trailing digits should be random. I mean, it seems very "common sense," and may be what Silver usually sees, but those are pretty weak by a mathematician's standard.

posted by FuManchu at 1:33 AM on September 28, 2009

Mark Twain (or whoever):

**"There are three kinds of lies: lies, damned lies, and statistics."**

Another favorite of mine: "Studies have shown" ... that phrase means **bupkiss**.

**"Argument from authority or appeal to authority is a logical fallacy** ... The fallacy only arises when it is claimed or implied that the authority is infallible in principle and can hence be exempted from criticism."

People who get __paid__ to gather data to generate statistics or make studies are seldom infallible in principle. Everyone who does "studies" is liable to be biased... even people with the best intentions and impeccable credentials.

Let alone the owner of Clever Hans.

posted by Twang at 3:14 AM on September 28, 2009

*But I see no reason why the trailing digits should be random.*

All I know is that a similar metastudy was performed on scientific papers (thousands of them), and the results of that study showed a trend towards some digits appearing more often than they should, which many found concerning.

Anyway, think of this. Those numbers that are reported by the pollster are the result of sampling, and have been rounded to only a few digits. The reality, if you take a **census** of an entire population of approximately 100,000 people, might come out something like this:

Obama: 51831

Clinton: 41880

Or maybe:

Obama: 47032

Clinton: 39617

Or maybe even:

Obama: 60048

Clinton: 34783

In this case, we can see that the most significant digit, marking tens-of-thousands, is very likely to be, say, a 5 a 4 or a 3. Because the population is close to evenly divided. Maybe there'll be a 6 in there. Very unlikely to start with a 9 or a 1, because there is some external factor that's making a real difference to this digit. On the other hand, the least significant digit on the right-hand side - what's driving this? Nothing. Chaos. We can expect these digits to be random. The same should be true if you do a sampling from this population, and convert it into percentages.

**Unless** those numbers aren't the result of chaos, but are the result of some other process, for example, a human making them up. For example, here's a histogram of the distribution of all the digits I just made up above. Clearly my brain isn't very random, and I like 8s and 0s quite a bit.

Okay that's probably not very good by a mathematician's standard. But that's why statistics is mathematics' weird, demented, twisted uncle.

The concerns with Nate's method are fair, though - pairing the 8's with 2's doesn't make sense in 2-horse races although selecting one of them at random instead shouldn't make a difference to the final distribution you get. The main problem, I feel, is the lack of precision in the poll reporting - as others have said, you really need to do this on numbers like 48.31, rather than straight 48, but that's down to the data he had available to him, I assume.

posted by Jimbob at 3:43 AM on September 28, 2009

Oooh man, my histogram above is screwed, because for some reason R put "0" and "1" into the same bin. Curse you. Here's the proper one.

posted by Jimbob at 3:50 AM on September 28, 2009
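The linked histograms did not survive, but the digit tally can be recomputed from the six census figures Jimbob made up in his 3:43 AM comment. The Python sketch below is only an illustration (the original was done in R); it counts each digit as a discrete category, which sidesteps the kind of binning problem he mentions.

```python
from collections import Counter

# The six census figures made up in Jimbob's 3:43 AM comment.
made_up_totals = [51831, 41880, 47032, 39617, 60048, 34783]

# Count every digit by treating it as a category, not a number on an axis.
digit_counts = Counter(ch for n in made_up_totals for ch in str(n))

for digit in "0123456789":
    count = digit_counts[digit]
    print(digit, "#" * count, count)
```

With only 30 digits the tally is far too small to prove anything, but it does show 8 and 3 tied as the most frequent digits, with 0, 1 and 4 close behind.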

barnacles -

*besides Nate Silver's stuff, what are some other good statistics-type blogs that do these sorts of analyses?*

The BBC and Open University* produce a radio programme called "More or less: behind the stats" which basically exists to examine statistics and figures seen in that week's news, then investigate where they came from and whether they actually stand up to analysis (SPOILER: almost never).

Entertainingly presented and, as someone who barely has the maths skills to type in a phone number, I find the stats explanations to be very clear. A trained statistician might find it a bit simplistic, but I think it's great. Podcast, Website/archive.

*a venerable and fairly well-respected correspondence / distance learning university in the UK.

posted by metaBugs at 3:51 AM on September 28, 2009 [2 favorites]

*On the other hand, the least significant digit on the right-hand side - what's driving this? Nothing. Chaos. We can expect these digits to be random.*

That's what I'm questioning, and I haven't seen anything but assertion that it should be so.

I was finally able to read the full page at Silver's blog, and a lot of his knowledgeable commenters pointed out this problem. There are possibly multiple colinearities in those numbers. They are dependent on the total population size, the number of options in the given question, and the denominator used for percentages. They could also be dependent on the types of questions and regions they poll (since I would imagine 40 vs 60 questions are more common than 10 vs 90).

I think it's possible, I just haven't seen any halfway decent proof of such.

posted by FuManchu at 4:08 AM on September 28, 2009

*I like what Nate is doing here, but I don't understand why he uses both trailing digits in a given poll. In his example, if a poll said had Obama 58, Clinton 32, he'd put both 2 and 8 into his dataset. Since the two options need to add up to 100 (less undecideds, which are up to a few percent depending). The data you get now is not going to be independent, since if you have that 2, you have to have an 8 for the last digit to sum up correctly. Or maybe if there are a few undecideds, you'll get a 2 and a 7.*

Except that in political horserace polls, your undecideds are much higher than 1%. Check out this sampling of polls from the Virginia governor's race: the "undecided" number is anywhere from 2% to 20%. There's no reason to pair up 8's and 2's, or 7's and 3's, or any other numbers, because they won't be paired up more frequently than any other trailing digits.

posted by EarBucket at 5:19 AM on September 28, 2009

*House numbering and finite street sizes is the best way to describe Benford's Law to a n00b. In my humble opinion.*

Again this has nothing to do with Benford's Law. House numbers are linearly distributed but Benford's law applies to stats obeying a power series. In other words, it applies to a series N if log(N) is linearly distributed. For example, city sizes.

Wikipedia has a clear explanation:

To say that a quantity is "growing exponentially" is just another way of saying that its doubling time is constant. If the quantity takes a year to double, then after one more year, it has doubled again. Thus it will be four times its original value at the end of the second year, eight times its original value at the end of the third year, and so on. Suppose we start the timer when a quantity that is doubling every year has reached the value of 100. Its value will have a leading digit of 1 for the entire first year. During the second year, its value will have a leading digit of 2 for a little over seven months, and 3 for the remaining five. During the third year, the leading digit will pass through 4, 5, 6, and 7, spending less and less time with each succeeding digit. Fairly early in the fourth year, the leading digits will pass through 8 and 9. Then the quantity's value will have reached 1000, and the process starts again. From this example, it's easy to see that if you sampled the quantity's value at uniformly distributed random times throughout those years, you're more likely to have measured it when the value of its leading digit was 1, and successively less likely to have measured it when the value was moving through increasingly higher leading digits.

posted by unSane at 5:25 AM on September 28, 2009
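
The arithmetic in the quoted explanation can be checked with a short sketch. It assumes a quantity that doubles every 12 months and tracks one full cycle from 100 to 1000. The function names are illustrative only.

```python
import math

def months_with_leading_digit(d, doubling_months=12.0):
    """Months a steadily doubling quantity spends with leading digit d.

    Growing from d*10^k to (d+1)*10^k takes log2((d+1)/d) doubling periods.
    """
    return doubling_months * math.log2((d + 1) / d)

def share_of_cycle(d):
    """Fraction of one full 100 -> 1000 cycle spent on leading digit d."""
    total = sum(months_with_leading_digit(k) for k in range(1, 10))
    return months_with_leading_digit(d) / total

for d in range(1, 10):
    print(d, round(months_with_leading_digit(d), 2), round(share_of_cycle(d), 3))
```

The share of time spent on digit `d` works out to `log10(1 + 1/d)`, which is the Benford frequency for that digit. This is why sampling the quantity at random times favours low leading digits.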

*Except that in political horserace polls, your undecideds are much higher than 1%. Check out this sampling of polls from the Virginia governor's race: the "undecided" number is anywhere from 2% to 20%. There's no reason to pair up 8's and 2's, or 7's and 3's, or any other numbers, because they won't be paired up more frequently than any other trailing digits.*

That's a fair point, but I don't think it changes my problem with this too much. The last digits need to add up to 100-%undecideds. Unless the undecideds are uniformly distributed (which seems very unlikely), you would wind up with some sort of correlation in the data such that the last two digits would sum to some numbers more than others. While 8s and 2s may not be paired up, you would almost certainly see more 7 and 1 pairs than 7 and 9 pairs. Like I said, you do see this in the plot of last digit distribution on both Strategic Visions and Quinnipiac, where high numbers and low numbers skew in opposite directions.

I also think that the quantitative approach to falsifying the OK data is looking in the wrong place. The intuitive reason that it would be impossible for not even one student to score higher than 7 is probably the right approach. Extreme values (the highest points in a data set, in this case) behave differently than the average value of the data, and I've never seen a real data set this big without some significant outliers even if it were randomly distributed. The statistics of these should definitely be looked at on their own, because the average values can come from a number of processes and are inherently smoothed over.

posted by Schismatic at 6:45 AM on September 28, 2009
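
A minimal sketch of the pairing argument above, assuming a two-candidate race where the second candidate's share is `100 - leader - undecided`. Both undecided distributions below are invented for illustration, as are the function and variable names. They are not taken from any real poll.

```python
from collections import Counter
from fractions import Fraction

def last_digit_pairs(undecided_weights, leader_shares=range(30, 61)):
    """Exact joint distribution of (last digit of A, last digit of B).

    undecided_weights maps an undecided percentage to a relative weight;
    candidate A's share is taken as uniform over leader_shares (an assumption).
    """
    shares = list(leader_shares)
    total = sum(undecided_weights.values()) * len(shares)
    joint = Counter()
    for a in shares:
        for u, w in undecided_weights.items():
            b = 100 - a - u
            joint[(a % 10, b % 10)] += Fraction(w, total)
    return joint

# Hypothetical: undecideds cluster at low values.
skewed = {2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
# Hypothetical: undecideds spread evenly over a full run of ten values.
even = {u: 1 for u in range(2, 12)}

for name, dist in (("skewed", skewed), ("even", even)):
    joint = last_digit_pairs(dist)
    print(name, "P(7,1) =", joint[(7, 1)], " P(7,9) =", joint[(7, 9)])
```

Under these assumptions the two trailing digits are tied together only through the last digit of the undecided share. When that digit is evenly spread, every pairing is equally likely. When it is not, some pairings show up more often than others.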

Another way to understand Benford's law is to think of it as the logical consequence of the fact that units of measurements are arbitrary. The distribution of 1s, 2s, 3s etc as leading digits should be the same regardless of the particular units we are using. Naively we might suppose that all digits are equally likely to appear as leading digits if we measure in whatever units we want (centimetres, inches, finger widths, bees dicks etc). However it is easy to see that this can't be the case. If it was, then measuring in units half as long, all the 5s, 6s, 7s, 8s and 9s would become 1s (five of the nine digits, so more than half) and the rest would be divided amongst the other 8 digits, so the distribution would not be independent of the unit used. It turns out that the only distribution where you can arbitrarily multiply or divide each of the measurements (reflecting different measurement units) and still end up with the same distribution is the power distribution specified by Benford's law.

In the case of surveys I'm not sure why Benford's law should be expected to hold. After all you aren't dealing with measurements that can vary over an arbitrary range but usually percentages that are constrained between 0 and 100. Also you can expect the results to be around the middle of the field, for the simple reason that people design survey questions such that there is significant disagreement between people. "Do you think cannibalism is awesome?" is not a typical question, since it will yield boring results like 1% yes/99% no. More typical questions (e.g. "do you approve of politician X/policy Y?") are designed to yield a split between 20/80 to 50/50.

posted by L.P. Hatecraft at 6:53 AM on September 28, 2009
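
The unit-change argument can be sketched numerically. The assumption here, which is mine rather than the commenter's, is that "all leading digits equally likely" means the significand is spread uniformly over [1, 10), while Benford's law corresponds to a significand whose logarithm is uniform. All function names are illustrative.

```python
import math

def uniform_cdf(x):
    """CDF of a significand spread uniformly over [1, 10)."""
    return (min(max(x, 1.0), 10.0) - 1.0) / 9.0

def benford_cdf(x):
    """CDF of a significand whose log10 is uniform over [0, 1)."""
    return math.log10(min(max(x, 1.0), 10.0))

def leading_digits_after_scaling(cdf, factor):
    """Leading-digit probabilities after multiplying every value by factor.

    Assumes 1 <= factor < 10, so a scaled significand lands in [1, 100).
    """
    probs = {}
    for d in range(1, 10):
        p = 0.0
        for k in (0, 1):  # scaled value falls in [d*10^k, (d+1)*10^k)
            lo = d * 10 ** k / factor
            hi = (d + 1) * 10 ** k / factor
            p += cdf(hi) - cdf(lo)
        probs[d] = p
    return probs

# Halving the unit doubles every measurement.
print("uniform, doubled:", leading_digits_after_scaling(uniform_cdf, 2))
print("benford, doubled:", leading_digits_after_scaling(benford_cdf, 2))
```

Under these assumptions the uniform case piles up on a leading 1 after doubling, while the Benford case returns `log10(1 + 1/d)` for every digit, unchanged by the rescaling.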

*Also, the title is wrong. Silver has not accused them of fraud yet. He's shown that there's something that needs explaining and that one of the explanations could well be fraud.*

Um, yes he has:

It seems quite strongly possible, nevertheless, that the students polled for this survey don't exist anywhere in Oklahoma but instead on a hard drive somewhere in Atlanta. This is a valuable exercise undertaken by the OCPA. But they owe it to the hardworking students of Oklahoma to make sure that their contractor, Strategic Vision, didn't flunk its own citizenship test.

Just wanted to point out that "It seems quite strongly possible" does not an accusation of fraud make. An accusation of fraud reads like this: "THEY COMMITTED FRAUD."

posted by Ironmouth at 7:00 AM on September 28, 2009 [1 favorite]

Thank you for posting this. I don't know anything about statistical math but that blog entry about the OK High School students is the most fascinating thing I've read today. The comments add some pretty important points-- such as if the poll was open-ended why didn't a single Oklahoma student answer the question, "What is the law of the land?" with "The Ten Commandments."

Unfortunately the use of false statistics to muddy the waters serves a twofold purpose-- it gives the underlying agenda more ammunition and at the same time it weakens the impact of real statistical analysis. Which political party in America needs to be propped up by fake numbers?

Unfortunately there isn't much that can be done to counter these made-up statistics, particularly in these days of pseudo-journalism. FOX news and USA Today will run with the shocking "Only 23% of Oklahoma High School students know that George Washington was our first President!" and there won't be any investigation behind the numbers presented. Viewers and readers won't question the numbers because "It was on the TV news/ In the newspaper." The memes "public High Schools are a failure" and "kids today are stupid" will once again be reinforced.

posted by Secret Life of Gravy at 7:30 AM on September 28, 2009

The Oklahoma student survey is 110% phony. Scroll down to the bar graphs. They show the various responses the students gave. For example, "What are the two major political parties in the United States?" 43%: Democrat and Republican; 11%, Communist and Republican; 46% Don't Know. There was no "Other Answers" column.

Or, "Who Wrote the Declaration of Independence?" 24% Abraham Lincoln; 19% George Washington; 14% Thomas Jefferson; 7% Barack Obama; 2% Michael Jackson; 34% Don't Know. Again, in spite of wild alternative answers, no "Other Answers" column. This is supposedly 1000 students surveyed. While you can find a couple of kids who might snark the poll and claim Michael Jackson, you would not find 20 kids who would all come up with the same ridiculous answer - and not come up with another equally snarky answer (or one more relevant to their generation).

This isn't just making something up. This is doing it half-assedly.

posted by dances_with_sneetches at 7:58 AM on September 28, 2009 [3 favorites]

Anything with the word "Strategic" in it is usually full of shit.

posted by stormpooper at 8:19 AM on September 28, 2009 [1 favorite]

*For example, "What are the two major political parties in the United States?" 43%: Democrat and Republican;*

And even this is wrong. The name of the party is 'Democratic', not 'Democrat'.

posted by deadmessenger at 8:55 AM on September 28, 2009

That's like some special little thing that conservatives do to be dicks. Not entirely sure what they are trying to signify by it other than their own dickishness, but there's probably some deep symbolic meaning if you follow all the forwarded email conspiracy theories.

posted by Artw at 9:12 AM on September 28, 2009

From the CEO's Twitter feed: "Are we sure that [Obama] is really President? LOL".

I'm pretty sure there was a poll last November that answered that question pretty decisively.

posted by mkultra at 9:56 AM on September 28, 2009 [7 favorites]

If you believe the polls, out of 1,000 Oklahoman High Schoolers, none could score higher than 6/10 on a basic citizenship test. Bullshit.

posted by nestor_makhno at 11:13 AM on September 28, 2009

*Mark Twain (or whoever): "There are three kinds of lies: lies, damned lies, and statistics."*

That quote is also used by Jim Galloway at the Atlanta Journal-Constitution in his article yesterday on this brouhaha. As it turns out, a commenter at AJC points out that the phrase is attributed to Benjamin Disraeli...a fact that Twain himself pointed out in *Chapters from My Autobiography*: "Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: 'There are three kinds of lies: lies, damned lies, and statistics.'"

posted by ericb at 12:07 PM on September 28, 2009

I love that everyone in this thread wants to give the Oklahoma school children the benefit of the doubt. MeFi threads are so often tilted the other way when it comes to assuming things about red-state smarts.

*Or, "Who Wrote the Declaration of Independence?" 24% Abraham Lincoln; 19% George Washington; 14% Thomas Jefferson; 7% Barack Obama; 2% Michael Jackson; 34% Don't Know. Again, in spite of wild alternative answers, no "Other Answers" column.*

Couldn't these have been multiple choice questions?

posted by damehex at 4:37 PM on September 28, 2009

I think there's a flaw in Nate's argument about the Oklahoma high school poll. He simulates responses from 50,000 students assuming each one has the same likelihood of getting each question right as the Strategic Visions poll indicates--so 23% for the question about George Washington, 26% for the one about the Bill of Rights, etc.--with the additional assumption that the results for each question are independent of each other, which, as he says, is not a reasonable assumption. The distribution of correct answers he gets matches up with the poll, so he concludes that they must have been generated with that same assumption, but that's not necessarily true.

The marginal distribution of the percentage of people answering each question correctly isn't changed by assuming correlations between the questions. In math lingo, what Nate's done is create a vector of Bernoulli-distributed random variables (1 or 0) and then estimated the average vector by sampling. But by linearity of the expected value, the average vector is just the vector of averages, i.e., the list of input percentages. So he could have assumed the results were perfectly correlated and he still should have gotten the same results.

I agree that there's something fishy about these polls (really, nobody out of 1000 students could get more than 7/10 right?), but this isn't actually evidence of anything. What would be more interesting would be to compare the distributions of *pairs* of correct answers, etc., with the data from the poll. If it was in fact a real poll, then that data should exist, right?

posted by albrecht at 11:25 AM on September 29, 2009
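
A minimal simulation sketch of the comment above. The first four probabilities are percentages quoted in this thread (23%, 26%, 14%, 43%). The remaining six are placeholders, not the poll's figures. The "correlated" model, where one latent draw per student decides every answer, is an assumption chosen as an extreme case. All names are illustrative.

```python
import random

# First four values are quoted in the thread; the 0.30 entries are placeholders.
HYPOTHETICAL_P = [0.23, 0.26, 0.14, 0.43, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30]

def simulate(p, n_students, correlated, seed=0):
    """Return one row of 0/1 answers per simulated student."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_students):
        if correlated:
            u = rng.random()  # one shared draw: fully dependent answers
            rows.append([1 if u < pi else 0 for pi in p])
        else:
            rows.append([1 if rng.random() < pi else 0 for pi in p])
    return rows

def per_question_means(rows):
    """Fraction of students answering each question correctly."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def pair_rate(rows, i, j):
    """Fraction of students answering both question i and question j correctly."""
    return sum(1 for r in rows if r[i] and r[j]) / len(rows)

independent = simulate(HYPOTHETICAL_P, 20000, correlated=False)
dependent = simulate(HYPOTHETICAL_P, 20000, correlated=True)
print("means, independent:", [round(m, 2) for m in per_question_means(independent)])
print("means, dependent:  ", [round(m, 2) for m in per_question_means(dependent)])
print("both of Q1 and Q2 right:", pair_rate(independent, 0, 1), pair_rate(dependent, 0, 1))
```

The per-question averages come out about the same in both models, which is the linearity point. The rate at which a pair of questions is answered correctly together differs sharply, which is why pairwise data would be more telling.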

Silver is missing out on an important secondary function of the polling and statistics industry. In many published studies, trailing digits are altered by members of secret societies and/or the intelligence community as a way to pass encrypted messages to agents in the field. Should I or should I not try to shoot an elected official today? Well, I go to the Strategic Visions, LLC poll released yesterday and run the big string of 8's and 2's through my decoder ring, and usually - but not always - come up with the answer 'no.'

It's actually a vast improvement on the previous system, which involved a whole dictionary of codewords and pass-phrases that would be woven into obituaries and personal ads. The upsurge of useless numbers in published materials over the last few decades has really made secret public communication far easier.

posted by kaibutsu at 11:27 AM on September 29, 2009 [1 favorite]

Hi David,

I'm writing you for two reasons.

Firstly, I wanted to make sure that you had some decent contact information for me. Was really looking forward to the lawsuit and figured you might be having trouble getting in touch. I'm at this e-mail address -- xxxxxxxx@gmail.com, or at xxx-xxx-xxxx.

Secondly, I wanted to provide you an opportunity to clear the air about one singular fact. What call center(s) have you used to conduct your public-facing polling? For every call center that you're willing to publicly disclose, up to a maximum of 5, I will donate $538 to Children's Healthcare of Atlanta (http://www.choa.org/).

Best wishes,

Nate Silver

posted by Rhaomi at 10:11 PM on September 29, 2009 [2 favorites]

*And they're even in the same industry? Wow, someone really needs a trademark lawyer.*

"A different polling firm that shares a similar name -- Strategic Vision Inc. of San Diego, which has no connection to the embattled Atlanta firm -- has already suffered some blowback from the flap.

'We’ve had a number of people very confused because one of the things we’re known for is impeccable data,' said Strategic Vision Inc. President Alexander Edwards. 'We have absolutely, positively no relationship whatsoever.'

It’s not entirely clear what the long-term implications will be for Atlanta-based Strategic Vision."

posted by ericb at 8:18 AM on September 30, 2009

On further review, I take back my objection. Nate's plotting the *distribution* of the number of correct answers, which does depend on the joint probabilities of a student getting various questions correct.

posted by albrecht at 12:35 PM on September 30, 2009
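
The retraction can be illustrated with an exact calculation rather than a simulation. As a sketch only: four of the probabilities below are percentages quoted in this thread and the 0.30 entries are placeholders. The fully dependent model, where one latent draw per student decides every answer, is an assumed extreme case. The function names are made up.

```python
# 0.23, 0.26, 0.14, 0.43 are quoted in the thread; the 0.30 entries are placeholders.
HYPOTHETICAL_P = [0.23, 0.26, 0.14, 0.43, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30]

def score_distribution_independent(p):
    """P(score = k) when every question is answered independently."""
    dist = [1.0]
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - pi)   # this question wrong
            new[k + 1] += prob * pi     # this question right
        dist = new
    return dist

def score_distribution_dependent(p):
    """P(score = k) when one latent draw U decides all answers (right iff U < p_i)."""
    # P(score >= k) equals the k-th largest probability.
    tail = [1.0] + sorted(p, reverse=True) + [0.0]
    return [tail[k] - tail[k + 1] for k in range(len(p) + 1)]

def mean_score(dist):
    return sum(k * prob for k, prob in enumerate(dist))

for name, fn in (("independent", score_distribution_independent),
                 ("dependent", score_distribution_dependent)):
    dist = fn(HYPOTHETICAL_P)
    above_7 = sum(dist[8:])
    print(name, "mean:", round(mean_score(dist), 3),
          "P(score > 7):", above_7,
          "P(no one of 1000 above 7):", (1 - above_7) ** 1000)
```

Both models share the same average score, yet the chance of any student scoring above 7 differs by orders of magnitude. The shape of the score histogram therefore does carry information about how the answers were generated.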

This thread has been archived and is closed to new comments

posted by DU at 7:02 PM on September 27, 2009