"I hope this is all just incompetence."
October 14, 2011 8:49 AM

The statistical error that neuroscience researchers make at least half the time. Ben Goldacre of Bad Science explains this mistake, which was made in about half of 157 academic neuroscience papers in which there was an opportunity to make it. The culprit doesn't seem to be any specific journal, since the sample included five different neuroscience journals.
posted by John Cohen (71 comments total) 50 users marked this as a favorite
 
Zowie.

And, having once based my master's thesis on the results of not one but two independent and duplicitous sets of researchers (took me over a year to realize it)... my money's on researchers publishing intentionally misleading results.
posted by IAmBroom at 9:00 AM on October 14, 2011


This is slightly difficult, and it will take 400 words of pain. At the end, you will understand an important aspect of statistics better than half the professional university academics currently publishing in the field of neuroscience.

I still don't understand. Maybe some more words would help.
posted by swift at 9:01 AM on October 14, 2011 [1 favorite]


More words: In order to tell if two statistical results are statistically different, you have to do a third statistical test. The error is neuroscientists not doing that third test. They just look at the first two results and say "yep, they look different to me."
posted by yeolcoatl at 9:07 AM on October 14, 2011 [10 favorites]
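
A minimal sketch of that third test in Python (using scipy), with invented effect sizes and standard errors standing in for the two per-group results:

    import math
    from scipy import stats

    # Invented numbers: the measured effect of the chemical in each group,
    # summarized as an estimate and a standard error.
    effect_mutant, se_mutant = -30.0, 12.0   # looks "significant" on its own
    effect_normal, se_normal = -15.0, 12.0   # looks "not significant" on its own

    # The two tests the papers do report: each effect against zero.
    for name, eff, se in [("mutant", effect_mutant, se_mutant),
                          ("normal", effect_normal, se_normal)]:
        z = eff / se
        p = 2 * stats.norm.sf(abs(z))
        print(f"{name} vs. zero: z = {z:.2f}, p = {p:.3f}")

    # The missing third test: is the DIFFERENCE between the two effects significant?
    diff = effect_mutant - effect_normal
    se_diff = math.sqrt(se_mutant**2 + se_normal**2)   # independent groups
    z_diff = diff / se_diff
    p_diff = 2 * stats.norm.sf(abs(z_diff))
    print(f"mutant vs. normal: z = {z_diff:.2f}, p = {p_diff:.3f}")
    # With these numbers the first comparison is significant, the second isn't,
    # and the difference between the two groups isn't either.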


swift: Suppose that you get a stock tip that a stock - XYZ - will go up by 10%. Stocks tend to swing +/-5% about now, so if XYZ does go up by 10%, you can be pretty sure the stock tip is good - and maybe you should pay attention to the source!

At the end of the period, XYZ goes up by 10% - success! Except, another stock you picked entirely at random, as a "control", went up by 5%. XYZ went up 5% because it was a smart pick - or maybe it's just random variability in a market that was rising 5% anyway. So, you still don't know if the guy selling you stock tips is full of it or not.

An inexact metaphor, but somewhat the same idea.
posted by IAmBroom at 9:08 AM on October 14, 2011 [4 favorites]


Oops:

MAYBE XYZ went up 5% higher than the control stock because it was a smart pick - or maybe it's just random variability in a market that was rising 5% anyway. So, you still don't know if the guy selling you stock tips is full of it or not.
posted by IAmBroom at 9:10 AM on October 14, 2011


So, are they basically saying that in these reports the researchers didn't perform pair-wise testing? Like a t-test?

If so, it's striking (and horrifying) that reviewers did not pick up on that, especially in prominent journals.
posted by ssmug at 9:10 AM on October 14, 2011 [1 favorite]


ssmug: no, it's not saying that. It's saying the researchers did two t-tests, but then did not use the correct methods to compare the results of those tests.
posted by yeolcoatl at 9:12 AM on October 14, 2011 [1 favorite]


When you drop the chemical on the mutant mice nerve cells, their firing rate drops, by 30%, say. With the number of mice you have (in your imaginary experiment) this difference is statistically significant, which means it is unlikely to be due to chance. That’s a useful finding which you can maybe publish. When you drop the chemical on the normal mice nerve cells, there is a bit of a drop in firing rate, but not as much – let’s say the drop is 15% – and this smaller drop doesn’t reach statistical significance.

Halfway through reading this paragraph, I knew where he was going. Is this really so common? It seems shockingly obvious to me.

Now, I should have prefaced this by saying I'm no genius. I'm neither a scientist nor a statistician. I usually don't know who the killer is until Columbo/Murder She Wrote/Velma has already pulled off the mask. The fact that this seems so clear to me means that people much smarter than I should routinely watch for it.
posted by Terminal Verbosity at 9:13 AM on October 14, 2011 [1 favorite]


I'll admit that I peek at this table like, all the time, but at least I use it.
posted by theodolite at 9:15 AM on October 14, 2011 [33 favorites]


I wholeheartedly agree that more statisticians are needed in science, both in constructing experiments and in reviewing papers. I'm not trained in that area, but we have a statistician on the project I am on, and her job is to say "No, you can't say that with this data." That's key to good science.
posted by demiurge at 9:19 AM on October 14, 2011 [5 favorites]


A few more (highly simplified) words:

Say you have a drug that changes people's hair color. You test it on blondes and brunettes. Here's what you find:

- 15% of the blondes who are given the drug are turned brunette.
- 14% of the brunettes who are given the drug are turned blonde.

Now, these two numbers seem to indicate that the drug is producing an effect. But it could also be the result of random chance. So you do a statistical test to see whether the effect that you see is the result of random chance or not - that is, you test to see if the results are "statistically significant." You find the following:

- The drug had a (barely) statistically significant effect on the blondes.
- The drug did not have a statistically significant effect on the brunettes.

Now here is the important part. You can conclude that the drug produces an effect in blondes, and you can say that it failed to produce an effect in brunettes. But you cannot say that blondes and brunettes responded to the drug differently.

Why? Remember the original responses: 15% for blondes, and 14% for brunettes. The important thing is to determine whether the difference between these two figures (1%) is itself significant. The fact that you independently labeled the results from blondes "effective" and from brunettes "ineffective" is entirely separate from analyzing the difference between the two.
posted by googly at 9:20 AM on October 14, 2011 [14 favorites]
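
A rough sketch of googly's point in Python, with invented group sizes: the claim "blondes and brunettes responded differently" needs its own test on the 15% vs. 14% difference, and with numbers this close it won't come out significant.

    from scipy.stats import chi2_contingency

    # Invented counts: 200 people per group.
    blondes_changed, blondes_total = 30, 200      # 15% responded
    brunettes_changed, brunettes_total = 28, 200  # 14% responded

    # 2x2 table: rows = hair color, columns = changed / did not change.
    table = [
        [blondes_changed,   blondes_total - blondes_changed],
        [brunettes_changed, brunettes_total - brunettes_changed],
    ]

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"blondes vs. brunettes: chi2 = {chi2:.3f}, p = {p:.3f}")
    # p is nowhere near 0.05, so there's no evidence the two groups respond
    # differently -- even if one group's effect was "significant" on its own.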


Isn't it comparable to a rounding error? They are rounding down the control group's results before doing the comparison.

The programming equivalent is that you should keep all values as floating point and if necessary convert to an integer at the very end.
posted by bhnyc at 9:21 AM on October 14, 2011


So what you're saying is, the zombies look like they're coming towards my house, but they might be passing through.
posted by phaedon at 9:22 AM on October 14, 2011 [1 favorite]


Summary: There are different differences: there are differences in the way things are different. Then there are different similarities: that is to say there are differences in the ways things are similar. But there are also similar similarities— the similarities in things that are similar that we only think are different.
posted by weapons-grade pandemonium at 9:24 AM on October 14, 2011 [8 favorites]


I still don't understand. Maybe some more words would help.

I think the article is a little unclear in that they didn't really lay out the numbers very well beforehand.

Say we've got two samples, A and B. For our sample size and subject matter, any detected result of exposure to chemical X needs to be above 20 Units (U) to be statistically significant.

We expose A to X, and we get a result of 30U. That's statistically significant.

We expose B to X, and we get a result of 15U. That isn't statistically significant.

But note that the result for A is less than 20U from the result for B. This means that while we did have a statistically significant finding for A, the difference between A and B is not statistically significant.

This significantly diminishes the value of our findings, as the math only supports a relatively unambitious claim. We want to say that A and B are different, but the math won't let us, as the difference in reaction between A and B is too small. We're left with a relatively uninteresting finding, namely that something seems to happen with A and X, but we aren't sure how much that matters.

Careers tend not to be made out of this kind of finding, and the author is implying that because scientists have an interest in careers, they're making more ambitious claims than their studies actually permit.
posted by valkyryn at 9:25 AM on October 14, 2011 [10 favorites]
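
The same thing with (simulated) raw data instead of summary numbers -- a sketch of what valkyryn's example looks like as three t-tests, with invented measurements:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Invented measurements in "Units" for samples A and B after exposure to X.
    A = rng.normal(30, 40, 25)   # true mean 30U, noisy, n = 25
    B = rng.normal(15, 40, 25)   # true mean 15U, same noise and n

    print("A vs. baseline:", stats.ttest_1samp(A, 0.0).pvalue)
    print("B vs. baseline:", stats.ttest_1samp(B, 0.0).pvalue)

    # The test the flawed papers skip -- A directly against B:
    print("A vs. B:       ", stats.ttest_ind(A, B, equal_var=False).pvalue)
    # With numbers like these, A typically clears the significance bar, B
    # typically doesn't, and the A-vs-B comparison typically doesn't either --
    # so "A and B are different" isn't supported.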


With the number of mice you have (in your imaginary experiment) this difference is statistically significant, which means it is unlikely to be due to chance.

Ugh, that's totally wrong. Sorry, Ben...

Isn't it comparable to a rounding error? They are rounding down the control groups results before doing the comparison.

No. See googly's explanation, which is correct. The problem is that there is a threshold for "statistical significance" that we consider "enough evidence" to say that there is a difference. So, one comparison may be "not enough evidence" and another comparison can yield "enough evidence", but that doesn't mean that there's actually good evidence that these two comparisons are different.

The main difference here is evidence for a difference versus differences in levels of evidence. What the researchers have is the latter, but claim the former.
posted by Philosopher Dirtbike at 9:26 AM on October 14, 2011 [3 favorites]


I think I get it.

Let's say you're conducting an experiment to see if coins come up heads more often in a strong magnetic field. You discover that pennies come up heads 60% of the time in a strong magnetic field. You did enough flips that the odds of this being due to chance are 1 in 105. You discover that nickles come up heads 58% of the time in a strong magnetic field. You did enough flips that the odds of this being due to chance are 1 in 95.

If your criterion for "statistically significant" is "less than 1 in 100 probability of being due to random chance," then pennies meet this criterion but nickles do not. But you cannot be 99% sure that pennies and nickles behave differently.
posted by justkevin at 9:26 AM on October 14, 2011


For those that find the explanation unclear, I think part of the confusion may be that the example used numbers that don't exaggerate the issue enough.

A common practice in science is to set a "significance" threshold of say 90% or 95%. We'll stick with 95%. (It's really an arbitrary choice.) What this means basically is the following. You perform an experiment to see if "something" has an effect, and you get a result. Now, suppose that your "something" actually had no effect, but your results just occurred by chance given the uncertainties in your setup. That is, imagine repeating your experiment a hundred times; how many times would you expect results like the ones you got to show up just by chance? If it's less than 5 out of the 100 (5% = 100%-95%), then you've crossed the threshold and your results are "significant." If there's more than a 5% chance the results were due to chance, well, that's not "significant."

But now imagine you do two experiments and calculate the probability that each result would arise by chance independently. For one you get 4.9999999999999%, for the other, 5.0000000000001%. According to the rules of the game, you conclude that one result is "significant." The other is not. This does not mean there is any evidence of a difference between your two results. The difference between them is completely trivial.
posted by dsword at 9:29 AM on October 14, 2011 [3 favorites]


But note that the result for A is less than 20U from the result for B. This means that while we did have a statistically significant finding for A, the difference between A and B is not statistically significant.

Click! /brain

Thank you.
posted by swift at 9:30 AM on October 14, 2011


If I had a nickel for every time somebody says nickle...
posted by weapons-grade pandemonium at 9:30 AM on October 14, 2011 [1 favorite]


Statistics is very difficult. There is a fairly well known blogger (Shalizi) who is a statistics professor who says as one of his disclaimers that he has never actually taken a statistics class. He goes on at length about how terrible the average scientist is at statistics. (And the median scientist is even worse! That's my own prejudice, not a quotation from him.) Most scientists I have attended to closely do not even know the exact definition of statistical significance at the 95% level.

Goldacre sounds like Captain Obvious to me.
posted by bukvich at 9:31 AM on October 14, 2011 [2 favorites]


Expect me to ignore the latest

[pick one] blueberries/chocolate/laptops
[pick one] cause/inflame/cure
[pick one] cancer/arthritis/hemorrhoids

articles with even more vehemence.
posted by Terminal Verbosity at 9:36 AM on October 14, 2011 [2 favorites]


This happens a lot in linguistics papers, especially corpus linguistics. Another common mistake is running chi squares on data that has been converted to percentages, rather than inputting raw counts (when you convert to percentages, you're basically saying n=100 for all rows that you input).

Here's the link to Shalizi's stats blog, Three Toed Sloth.
posted by iamkimiam at 9:38 AM on October 14, 2011 [2 favorites]


Suppose that you get a stock tip that a stock - XYZ - will go up by 10%. Stocks tend to swing +/-5% about now, so if XYZ does go up by 10%, you can be pretty sure the stock tip is good - and maybe you should pay attention to the source!

At the end of the period, XYZ goes up by 10% - success! Except, another stock you picked entirely at random, as a "control", went up by 5%. XYZ went up 5% because it was a smart pick - or maybe it's just random variability in a market that was rising 5% anyway.


That all sounds right (and I'm no statistics expert, so I'm hardly the definitive word on this), but there's another thing the article is pointing out: it's nonsense to say -- on the one hand -- "Oh, the stock going up from 0 to 5% was an insignificant fluctuation," but then turn around and say -- on the other hand -- "The stock that increased by 10% is doing significantly better than the stock that went up a mere 5%." Both of those involve a difference of 5% (the baseline of zero vs. 5%, and 5% vs. 10%). So it's blatantly inconsistent to call one of those 5% gaps significant while calling another insignificant.
posted by John Cohen at 9:38 AM on October 14, 2011


Valkyryn wins the "Alton Brown Memorial Prize for Helping Boneheads Get It", and he did it in fewer words than the article's author.
posted by Slap*Happy at 9:38 AM on October 14, 2011


I should clarify that the reason the chi-square error shows up frequently in corpus linguistics work is that corpus data, specifically word frequency counts, are often displayed and compared in tables as parts per million. You need to go back to the word-count-from-the-total-corpus to accurately compare the significance of any two or more tokens.
posted by iamkimiam at 9:42 AM on October 14, 2011 [1 favorite]
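
A small illustration of why that matters, with invented counts (using scipy's chi-square test): running the test on raw counts and on per-million frequencies gives different answers, because the normalized version throws away how big each corpus actually was.

    from scipy.stats import chi2_contingency

    # Invented data: one token's frequency in two corpora of different sizes.
    a_hits, a_total = 300, 1_000_000
    b_hits, b_total = 150, 400_000

    # Correct: chi-square on the raw counts (hits vs. non-hits per corpus).
    raw = [[a_hits, a_total - a_hits],
           [b_hits, b_total - b_hits]]
    chi2_raw, p_raw, _, _ = chi2_contingency(raw)
    print("raw counts:           p =", round(p_raw, 4))

    # Wrong: converting to per-million frequencies and treating those as counts,
    # which pretends both corpora contain exactly one million words.
    a_pm = round(a_hits / a_total * 1_000_000)   # 300 per million
    b_pm = round(b_hits / b_total * 1_000_000)   # 375 per million
    wrong = [[a_pm, 1_000_000 - a_pm],
             [b_pm, 1_000_000 - b_pm]]
    chi2_w, p_w, _, _ = chi2_contingency(wrong)
    print("per-million 'counts': p =", round(p_w, 4))
    # The p-values differ because the real sample sizes have been discarded.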



Now, I should have prefaced this by saying I'm no genius. I'm neither a scientist nor a statistician.


This is actually part of the problem. In order to do fMRI research you need an IT dude to run the servers, a physicist to solve field issues, a computer scientist to automate data processing, a statistician to analyze the data, a psychologist to screen the subjects and set up the test, and a neuroscientist to interpret the results.

And if your lab or study can't afford or doesn't have access to one of those pieces, the science will suffer. This is a big problem, because those fields are disparate and arcane by themselves. And that's just this one little tiny field.

You see this pop up in industry where a dude who needs a program but can't write it works with a programmer who can write it, but doesn't have the expertise to understand the problems. This is how you end up with M.D.s inventing calculus.

We've reached a stage where our ability to record and generate data has really outstripped our abilities to analyze it. There will be growing pains while this gets sorted out.
posted by Pogo_Fuzzybutt at 9:43 AM on October 14, 2011 [12 favorites]


I went to a particle physics conference a couple years ago. Statistics is crucial to our understanding of physics, both because of the inherently statistical nature of the quantum world and because all science basically relies on stats these days.

Anyways, a guy gave a talk on statistical methods in physics. He told us that if you understand the statistics you're using, you're ahead of 80% of your colleagues -- he basically claimed that 80% of particle physics researchers blindly use statistical tests that may or may not be appropriate. He pushed the idea that every collaboration needs a "statistics officer" just like it needs a safety officer. Someone whose job it is to constantly be on the lookout for shoddy statistics.

This meshed with my experience doing my research. Whenever I had a stats question that I would go to my peers with, I would generally just get a blank look and a "well, I can show you how I did it" response. We would rely on previous work, even if it didn't necessarily fit with what we were doing. None of us really understood the fundamentals. This went for post-docs and some faculty, too.

I guess my point is that if you want to be on the fast track to being successful in pretty much any discipline, take all the stats courses you can get and worry about the "science" stuff as it comes up. Science is easy. Math is easy. Statistics is hard.
posted by auto-correct at 9:53 AM on October 14, 2011 [2 favorites]


I totally hear you Pogo, but the upside is that interdisciplinary collaboration sometimes leads to awesome innovations and cool new ways of thinking about and doing things!
posted by iamkimiam at 9:53 AM on October 14, 2011 [1 favorite]


I, too, am nothing like a scientist, and yet found this to be a completely obvious thing. It's really frightening that so many papers didn't do this.
posted by marginaliana at 9:59 AM on October 14, 2011


Aren't these scientists also committing the sin of claiming to have proved the null hypothesis? In valkyryn's terms, they go from
We expose B to X, and we get a result of 15U. That isn't statistically significant.
to
B isn't affected by X.
Can they do that with traditional Hypothesis Testing?
posted by benito.strauss at 10:00 AM on October 14, 2011


Not only that, I bet if you ask 100 non-statistician MDs, PhDs, et al., to explain what a p-value is, maybe 10% will explain it correctly. And we're talking about the most fundamental, most commonly used statistic in any research paper. Frankly, I don't think even that many will get it right.

Admittedly, until I went into research I also had the usual misinterpretation of it, too...
posted by ssmug at 10:19 AM on October 14, 2011


Two drugs are tested, A and B. The test setup is such that any change greater than 15% is considered statistically significant. Drug A shows a 20% increase, which is considered effective. Drug B shows a 10% increase, which is inconclusive. Can you say that drug A works better than drug B? Well, since the difference between their results is 20% vs 10%, which means on average only 10% of cases worked better with A than with B, you can't make that claim with statistical significance.
Is that understanding correct?
(I'm not a statistician, but I am a statistic)
posted by rocket88 at 10:29 AM on October 14, 2011


Oh man, pogo, you're speaking my language. What happens now is that you have half of those people and they double up on tasks. I have to tell people that I'm working with that I know about image processing and visualization and even though setting up networking and managing databases uses computers, I'm not the best person for the job. The other problem is, although cross-disciplinary research is really important, it doesn't usually turn out that all of the disciplines involved have something 'publishable' to contribute. I think this is especially true for the statisticians. PhD-level statisticians don't want to spend all their time doing routine stuff when they could be writing papers on complex tests. I run into the same problem myself with computer science. There's a lot that needs to be done, but my field is going to say 'It's already been done.'
posted by demiurge at 10:39 AM on October 14, 2011 [1 favorite]


my money's on researchers publishing intentionally misleading results

My money's on researchers publishing analyses that happen to jibe with the narrative they are constructing from a lot of data and downstream analyses. From seeing what my boss and his colleagues do, I'd surmise that biology is now about 99% data and experiment management, and 1% actual science (or some ratio that comes close to this).

If you're a PI, I'd guess this is not really about deliberate maliciousness, but simply getting something ready to publish that appears to cover all bases in a way that will pass reviewers, so that you can justify keeping the lab running.

If you've ever collaborated on a paper with more than a few authors, getting said paper out the door is a managerial process, as much as a scientific one. And if this is such a common problem, it puts a spotlight on the failures of reviewers who do not catch this error, as much as the authors.
posted by Blazecock Pileon at 11:09 AM on October 14, 2011


Here's a graph that was helpfully provided in a comment on my blog by a reader named LemmusLemmus. I'm not going to link to his original comment (since I posted the FPP), but here's his explanation of the graph:
The result for condition A is significantly different from zero (i.e., it "is significant"), while the result for condition B is not (the confidence interval for A does not include zero, but the confidence interval for B does). However, the two confidence intervals overlap, which means that the results for conditions A and B are not significantly different from each other.
posted by John Cohen at 11:12 AM on October 14, 2011 [3 favorites]
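
A small numeric version of that graph (invented estimates and standard errors), plus the direct test on the difference, which is what actually settles the question:

    import math
    from scipy import stats

    # Invented results for conditions A and B: estimate and standard error.
    a_hat, a_se = 2.5, 1.0   # A: 95% CI excludes zero -> "significant"
    b_hat, b_se = 1.0, 1.0   # B: 95% CI includes zero -> "not significant"

    z = stats.norm.ppf(0.975)   # about 1.96 for a 95% CI
    print("CI for A:", (a_hat - z * a_se, a_hat + z * a_se))   # ~(0.54, 4.46)
    print("CI for B:", (b_hat - z * b_se, b_hat + z * b_se))   # ~(-0.96, 2.96)

    # The comparison that matters: test the difference A - B directly.
    diff = a_hat - b_hat
    se_diff = math.sqrt(a_se**2 + b_se**2)
    p_diff = 2 * stats.norm.sf(abs(diff) / se_diff)
    print("A vs. B: p =", round(p_diff, 3))   # about 0.29 -- not significant
    # (As noted further down the thread, overlapping CIs don't by themselves
    # prove "no significant difference"; non-overlap is sufficient but not
    # necessary. The test on the difference is the honest check.)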


Say you have a drug that changes people's hair color. You test it on blondes and brunettes. Here's what you find:

- 15% of the blondes who are given the drug are turned brunette.
- 14% of the brunettes who are given the drug are turned blonde.

Now, these two numbers seem to indicate that the drug is producing an effect. But it could also be the result of random chance. So you do a statistical test to see whether the effect that you see is the result of random chance or not - that is, you test to see if the results are "statistically significant." You find the following:

- The drug had a (barely) statistically significant effect on the blondes.
- The drug did not have a statistically significant effect on the brunettes.

Now here is the important part. You can conclude that the drug produces an effect in blondes, and you can say that it failed to produce an effect in brunettes. But you cannot say that blondes and brunettes responded to the drug differently.


Whoa! I agree except for the bolded part, which is wrong. This type of hypothesis testing only lets you reject the null hypothesis (that there was no effect) at a certain level of significance. It does not let you reject the alternative hypothesis (that there was an effect). All you can say is that a possible effect in brunettes was not detectable within error or random variation, or equivalently that you couldn't reject the possibility -- at the desired threshold of confidence -- that there was no effect for brunettes.
posted by Mapes at 11:14 AM on October 14, 2011 [4 favorites]


yeolcoatl: "More words: In order to tell if two statistical results are statistically different, you have to do a third statistical test. The error is neuroscience not doing that third test. They just look at the first two results and say "yep, they look different to me.""

Well, in at least one thing I am familiar with anecdotally, someone was told to do a lot of extra statistics to see if it was significant, even though the error bars on the two data sets were nowhere near overlapping. I have linked the actual paper to someone I know who can pull it down and see what the original says vs. this summary and subsequent interpretation.
posted by MrLint at 11:23 AM on October 14, 2011


Not only that, I bet if you ask 100 non-statistician MDs, PhDs, et al., to explain what a p-value is, maybe 10% will explain it correctly. And we're talking about the most fundamental, most commonly used statistic in any research paper. Frankly, I don't think even that many will get it right.

People still argue about this. Note all the stat phil people arguing even more in the discussion! For example, Fisher and Neyman disagreed over exactly what the heck a p-value is. I would accept "a measure of evidence against some null hypothesis" as correct.
posted by a robot made out of meat at 11:25 AM on October 14, 2011


The result for condition A is significantly different from zero (i.e., it "is significant"), while the result for condition B is not (the confidence interval for A does not include zero, but the confidence interval for B does). However, the two confidence intervals overlap, which means that the results for conditions A and B are not significantly different from each other.

A quick box plot can provide visual indication of this. I would be surprised if a lab didn't do this or a quantitative measurement, however.
posted by Blazecock Pileon at 11:26 AM on October 14, 2011


Well, in at least one thing I am familiar with anecdotally, someone was told to do a lot of extra statistics to see if it was significant, even though the error bars on the two data sets were nowhere near overlapping. I have linked the actual paper to someone I know who can pull it down and see what the original says vs. this summary and subsequent interpretation.

It's the other way. Non-overlapping independent CIs are sufficient but not necessary.
posted by a robot made out of meat at 11:28 AM on October 14, 2011


There is a fairly well known blogger (Shalizi) who is a statistics professor who says as one of his disclaimers that he has never actually taken a statistics class. He goes on at length about how terrible the average scientist is at statistics.

Which is to say that the sort of empiricism labelled the "scientific method" in middle school has very little to do with how science is actually practiced.

It's not clear to me that neuroscience would be at all improved if all those researchers actually did the statistics correctly, since no one actually reads papers, they just pick out the conclusions and the experiments and decide whether they fit into their own world-view or not based on ineffable personal opinions.
posted by ennui.bz at 11:28 AM on October 14, 2011


mapes: This type of hypothesis testing only lets you reject the null hypothesis (that there was no effect) at a certain level of significance. It does not let you reject the alternative hypothesis

Right!?! And this is an another, different mistake they are making, even before they combine it with the significant effect.

In symbols, they are saying
A and B => we conclude C
Goldacre says you can't conclude C. You're saying they don't even have B.
posted by benito.strauss at 11:31 AM on October 14, 2011 [1 favorite]


It's the other way. Non-overlapping independent CIs are sufficient but not necessary.

Oh I know it is. I guess I wasn't clear in my meaning. If they don't overlap they are significantly different. The issue was that a reviewer wanted some other statistics done because he wasn't convinced it was different *enough*. Which leads to my confusion, isn't it the point that if the CIs don't overlap you are kinda already done.
posted by MrLint at 11:44 AM on October 14, 2011


A quick box plot can provide visual indication of this. I would be surprised if a lab didn't do this or a quantitative measurement, however.

I'm not sure what you mean. As the study found, this is overlooked in academic neuroscience papers half the time.
posted by John Cohen at 11:58 AM on October 14, 2011


This seems glaringly obvious to me. In my head right now I am thinking "this is what happens when people do a couple of paired t tests on before/after measures, rather than a repeated-measures ANOVA with post-tests to compare differences between groups adjusted for pooled variance." You compare group differences FIRST. You look at interactions between group and treatment next. THEN you look at within-group changes. And guess what: If the first two come up empty, the within-group changes are meaningless. No effect of group, no effect of treatment, no interaction, all means no statistically significant result even if the within-group measure comes up p < 0.0000 to infinity.

People do this kind of thing all the time. It drives me nuts. It's wrong. I have colleagues who don't see what the problem is. I know researchers who specifically pick some post-tests because they are less stringent than others. Why would you want to intentionally bias your results towards being erroneous? Why would you publish data that you KNOW is going to get refuted? It just makes you look bad. I will never be known as a rock-star scientist, but I surely don't want to be known as a guy who does sloppy work that can't be reproduced. If my interpretations are off, so be it - but I don't ever want to publish data that turn out to be wrong because of the way I analyzed it.

I have a staff statistician in an office less than 30 yards from my desk. When I'm in doubt, I ask him how to proceed. Not everyone is so lucky, of course. But I bet the vast majority of academic researchers work in an institution that has a statistics department. Back in grad school, when I hit a sticky statistical analysis during one study, I called the stats department and did a consultation with one of the professors there. When I published, I was confident the stats were kosher.

If you have such resources available, you ought to be using them.
posted by caution live frogs at 12:06 PM on October 14, 2011 [1 favorite]
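
A stripped-down sketch of where that group-by-treatment interaction test lives, using statsmodels on simulated data (for simplicity this treats before/after as a between-subjects factor rather than a true repeated measure):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 20
    # Simulated scores: both groups improve by the same amount, so the
    # interaction (the "difference of differences") should not be significant.
    df = pd.DataFrame({
        "group": ["treated"] * (2 * n) + ["control"] * (2 * n),
        "time":  (["before"] * n + ["after"] * n) * 2,
        "y": np.concatenate([
            rng.normal(10, 3, n), rng.normal(12, 3, n),   # treated: +2
            rng.normal(10, 3, n), rng.normal(12, 3, n),   # control: +2
        ]),
    })

    model = smf.ols("y ~ C(group) * C(time)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))
    # The C(group):C(time) row is the interaction -- the test that licenses
    # (or forbids) the claim that the groups responded differently.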


People still argue about this. Note all the stat phil people arguing even more in the discussion! For example, Fisher and Neyman disagreed over exactly what the heck a p-value is. I would accept "a measure of evidence against some null hypothesis" as correct.

What I meant to imply is that your average MD would say that p<0.05 means that there is only 5% chance that the null hypothesis is true or that there is only a 5% chance the result is due to chance.
posted by ssmug at 12:11 PM on October 14, 2011 [1 favorite]


Sounds like things haven't changed too much, unfortunately. A woeful gap in my scientific education was statistical analysis of data. As in zero classes in statistics in graduate school, and one elective in college which was more about the mathematical theory and pretty useless as far as applying it to real data. And these were name-brand schools!

As datasets become larger and experimental effects become subtler, statistical analysis becomes more important. I still feel sort of handicapped by my lack of statistical ability - I'm pretty much at the "Error bars overlap? Yes/No" level, which is OK up to a point but pretty limiting for many types of data.

I think rigorous statistical analysis should be added to the curriculum for experimental sciences, starting in college and continuing in grad school. Once you're out of school it's incredibly hard to get this kind of training - companies don't like to pay for it, it takes time away from your main work, even keeping a statistics consultant around is too expensive for a lot of places*. Even within academia, it can be hard to collaborate with people on different campuses. I was at a medical school and the math/statistics department was clear across town, nobody had any connections or personal contacts there, and our different worlds never collided.

To address the FPP title, I would reply "Never attribute to malice what can adequately be explained by incompetence," since there's a lot of us underedumacated boffins out there.

*I worked at one company that had a real live statistician for analyzing clinical trial data, but they were always totally swamped and under strict orders not to waste their time looking at research data. But hiring a stats consultant was redundant since they already had a statistician in-house ...
posted by Quietgal at 12:27 PM on October 14, 2011


I have a staff statistician in an office less than 30 yards from my desk. When I'm in doubt, I ask him how to proceed. Not everyone is so lucky, of course. But I bet the vast majority of academic researchers work in an institution that has a statistics department.

Statisticians vary, but my impression of statistics-department statisticians is that they tend to be so divorced from the actual process of scientific / empirical inference that I would question how often their advice is, even when correct, useful to the applied researcher. If nothing else, often the right answer will really be to try to find a different strategy for supporting your theory.

Can be, sure. But my first reaction to a tricky stats problem as part of an actual empirical investigation would probably not be to contact a statistician, just like my first reaction to a particularly difficult game-theoretic modeling question would not be to find a game theory expert in the math department, no matter how freaky-genius (s)he is about game theory.
posted by ROU_Xenophobe at 12:28 PM on October 14, 2011


Suppose that you get a stock tip that a stock - XYZ - will go up by 10%. Stocks tend to swing +/-5% about now, so if XYZ does go up by 10%, you can be pretty sure the stock tip is good - and maybe you should pay attention to the source!

At the end of the period, XYZ goes up by 10% - success! Except, another stock you picked entirely at random, as a "control", went up by 5%. XYZ went up 5% because it was a smart pick - or maybe it's just random variability in a market that was rising 5% anyway.

That all sounds right (and I'm no statistics expert, so I'm hardly the definitive word on this), but there's another thing the article is pointing out: it's nonsense to say -- on the one hand -- "Oh, the stock going up from 0 to 5% was an insignificant fluctuation," but then turn around and say -- on the other hand -- "The stock that increased by 10% is doing significantly better than the stock that went up a mere 5%." Both of those involve a difference of 5% (the baseline of zero vs. 5%, and 5% vs. 10%). So it's blatantly inconsistent to call one of those 5% gaps significant while calling another insignificant.

John Cohen - YES! You got it! In my example, the stock-tip-seller wants you to believe that the 10% growth was the important part - but you, as a smarter-than-the-average-scientific-paper-submitter, note that it only really did 5% better than the control, which ISN'T significant.

And it's blatantly inconsistent for those researchers to say "Hypothesis X was significantly validated, although the control group was insignificantly different from the X group."
posted by IAmBroom at 12:37 PM on October 14, 2011


The issue was that a reviewer wanted some other statistics done because he wasn't convinced it was different *enough*. Which leads to my confusion, isn't it the point that if the CIs don't overlap you are kinda already done.

I'm sure the context matters, and maybe you're right in the context of the paper you're talking about, but in principle I don't see any problem with worrying about effect size separately from statistical significance, since they are completely different things.

It's totally plausible that a certain treatment could be shown to increase lifespans in a rigorous, statistically-significant way. It's also totally plausible that the increase in lifespan might be only 1 day, that is, not a large effect size. You probably wouldn't want to spend a lot of money on this treatment even if it was statistically significant.
posted by secretseasons at 12:40 PM on October 14, 2011


He goes on at length about how terrible the average scientist is at statistics.

It probably doesn't help that when I went to high school, maths was split into "calculus" and "maths with statistics", a split that persisted into university, with the implication that calculus was for "real mathematicians" and scientists, while stats was the soft option. Similarly, calc was a pre-req for many hard science courses, but I can't recall any requiring stats.

Perhaps ironically, I've just finished reading John Gribbin's very enjoyable The Fellowship, which discusses the importance of mathematics as the final component in moving away from Aristotelian "received wisdom and thinking hard is all you need to do" philosophy to what we call science.

It's not clear to me that neuroscience would be at all improved if all those researchers actually did the statistics correctly, since no one actually reads papers, they just pick out the conclusions and the experiments and decide whether they fit into their own world-view or not based on ineffable personal opinions.

I look forward to seeing your rigorous paper demonstrating your claims.
posted by rodgerd at 12:42 PM on October 14, 2011 [2 favorites]


He goes on at length about how terrible the average scientist is at statistics.

If they're lucky, the average physical scientist had some stats in first year. Many though have never had a formal education in stats. Calculus, discrete math, group theory, yes, even computational math, but stats, no. The life and health sciences people usually get a term or two of applied stats, but many still don't.

I spent my university stats course using calculus to prove the statistical relationships, but no time at all learning how to apply them. My knowledge of applied stats is largely self-taught, and on-going. It's one of the holes in my (and my colleagues') knowledge that I'm frankly terrified of, as it's so important to how we do our work.

I just came back from a conference with a special session on forensics where one of the papers from a leading expert advocated a method with a known weakness for Type II errors. This weakness is not obscure, nor unpublished. It's in many textbooks. And yet this method is regulated for use in court to establish criminal liability and civil penalties in more than a dozen countries. The expert was advocating its use in my country. Indeed, his presentation of a case study ended with a type II error, rejecting a forensic match where the null hypothesis should have been disproved (and would have been if he hadn't been comparing individual t-tests).

I'm really not surprised by the article, sadly. This is one area where the experimental sciences of all types need to get our collective acts together in the worst way.
posted by bonehead at 1:03 PM on October 14, 2011 [3 favorites]


there are three kinds of liars: liars, damn liars, and the guy who sold me this "rolex"
posted by This, of course, alludes to you at 1:06 PM on October 14, 2011


There is a fairly well known blogger (Shalizi) who is a statistics professor who says as one of his disclaimers that he has never actually taken a statistics class. He goes on at length about how terrible the average scientist is at statistics.
Which is to say that the sort of empiricism labelled the "scientific method" in middle school has very little to do with how science is actually practiced.


Physics & stats derail: I am a PhD physicist who has wandered, by way of complex systems, into computational biology. Like many other physicists (including Cosma, whose trajectory mine resembles -- I'm also stats faculty now), my academic physics training did not include any stats coursework. In my observation, physics seems exceptional amongst the sciences in this respect: while curricula in the life & social sciences typically include some basic stats instruction (brief and light though it may be), physics curricula often don't.

But I would argue that this has less to do with "how science is actually practiced" than with physics in particular: the fact that mathematics is an unreasonably effective tool for physics; the huge amounts of higher math that must, because of its unreasonable effectiveness, necessarily take space in the curriculum; a culture that prizes the elegance of theory sometimes more than the empiricism of experiment; &c. The scientific method in the life sciences is more reliant on stats than derivation, so their curricula aren't quite as notably stats-light as physics.

Yet even in stats-heavier fields, stuff like this still happens, and I'd agree w/ Cosma's assertion wrt scientists & stats. I just reviewed an attempt at a "survival" analysis in which the authors threw out censored observations without even thinking that perhaps this might affect the stats; I routinely see articles where genetic sequence data are treated as continuous variables; the authors of this paper [doi:10.1016/j.ygeno.2008.05.014] standardized each measurement in two different data sets to zero-mean and unit-variance, and then tested for differences between the data sets to excitedly report that there were none; ....


Long story short, what all this says to me is that there's a lot of statistical crank-turning that is uninformed by an understanding of what the crank is doing; people aren't coming away from their schooling with a good statistical intuition that tells them when something might be off. I wish I knew how to combat this pedagogically, because it's a huge problem -- not just for scientists, but for anyone who participates in a democracy or a market economy.
posted by Westringia F. at 1:12 PM on October 14, 2011 [9 favorites]


Mapes, you're right. That phrase was the result of my oversimplifying things a bit too much - somehow my brain came up with "you can say that it failed to produce an effect in brunettes," instead of "you can say that you failed to find a statistically significant effect in brunettes." Thanks for pointing it out.
posted by googly at 1:37 PM on October 14, 2011


The key is to understand the significance of insignificance. A result is insignificant if it fails to be predictive of future tests. In other words, if you ran the exact same experiment ten more times, the difference between the two distributions would vary enough to "flip the winner".

That's what's going on here. There might be an effect against both distributions, but comparing which effect is reliably stronger is something the data may not be significant enough to support.
posted by effugas at 1:54 PM on October 14, 2011


.... so you're saying that LAPTOPS CURE HEMORRHOIDS!?
posted by d1rge at 2:12 PM on October 14, 2011


A statistical equivalent to the Force Concept Inventory [pdf link] might be useful.
posted by benzenedream at 3:21 PM on October 14, 2011


Philosopher Dirtbike: I don't understand the distinction you're making there. bhnyc's interpretation is that people are incorrectly rounding/reducing/quantizing the data (to a 1-bit "significant or not") before comparing the results of different tests, when instead they should be comparing the full observed distributions to each other (or at least some statistics of those observations that are richer than the 1-bit significance results). To me, that sounds exactly like the error googly, dsword, John Cohen, etc describe (or for that matter Ben Goldacre's example in the link). Okay, it's not really rounding or quantization since you're reducing the dimension of the data by a bunch, but conceptually, the error is similar: the results of the individual significance tests no longer contain enough information to compare them with each other in the way that people are trying to do, and an insignificant difference is magnified into significance because the two inputs fall on opposite sides of an arbitrary boundary (the "rejects-null p=.05" boundary, which might not even be related to the "rejects a=b with p=.05" boundary you're interested in).
posted by hattifattener at 3:41 PM on October 14, 2011 [1 favorite]


But researchers only make this error half the time. That's small fry, when 100% of researchers don't understand "correlation isn't causeation!!1!" in 100% of studies that happen to challenge the worldview of random-person-on-the-internet.
posted by -harlequin- at 3:51 PM on October 14, 2011 [2 favorites]


Mathematics is defined as what you get when you analyze any given system of axioms to the point of proving non-obvious stuff that follows from those axioms. When the axioms in question are physical laws, the stuff you prove will describe non-obvious consequences of those physical laws, and if you prove stuff you haven't observed yet, then congratulations, you've used your theory to make a prediction. That's what theories are supposed to do.

To say that math is "unreasonably effective" at physics is tautological. The math that physicists use was specifically developed for the purpose of finding that kind of non-obvious "connection between mathematics and physics". If you want math that isn't effective in that way, you can find it.
posted by LogicalDash at 4:20 PM on October 14, 2011 [2 favorites]


One major problem with this is that one thing that can easily cause insignificant p values is insufficient sample size. Increasing sample size allows you to show statistical significance when you have a smaller experimental effect. So, if you ran your experiment on, say, 200 experimental subjects and 20 controls, you might detect an effect on the subjects but not on the controls simply because you had a larger sample size for the subjects.

Increased variance is another thing that can increase your p and decrease your ability to find statistical significance, so if you had two groups and one of them varied much more than the other, that could also cause this problem.
posted by Mitrovarr at 5:11 PM on October 14, 2011
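
A quick simulation of Mitrovarr's point (invented numbers): give two groups the exact same true effect but very different sample sizes, and the big group "reaches significance" while the small one doesn't -- even though the groups themselves don't differ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect = 0.4            # identical true effect in both groups
    n_big, n_small = 200, 20     # very different sample sizes

    big = rng.normal(true_effect, 1.0, n_big)
    small = rng.normal(true_effect, 1.0, n_small)

    # Each group tested against zero on its own:
    print("big vs. 0:    ", stats.ttest_1samp(big, 0.0).pvalue)    # usually < 0.05
    print("small vs. 0:  ", stats.ttest_1samp(small, 0.0).pvalue)  # often > 0.05

    # ...but the direct comparison shows no group difference:
    print("big vs. small:", stats.ttest_ind(big, small, equal_var=False).pvalue)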


LogicalDash, that mish-mash you wrote is not even wrong. Physical laws are not axioms. Axioms are highly abstract. Mathematics is a formal system. Anyone who works with formal systems long enough begins to perceive that they are powerful but sharply limited. You can enumerate (place in one-to-one correspondence with the natural numbers) the theorems of an axiomatic mathematical system. The world is a physical system. We model it with real numbers, which are not enumerable. No formal system can encompass it.

Some of the best minds in physics consider the unreasonable effectiveness of mathematics an enigma. At least read Wigner's paper. And don't kid yourself that the underpinnings of a naturalistic scientific world-view are unquestionably solid.
posted by Crabby Appleton at 5:25 PM on October 14, 2011 [1 favorite]


"...no one actually reads papers, they just pick out the conclusions and the experiments and decide whether they fit into their own world-view of not based on ineffable personal opinions."

No, you want Climate Change Denial - Room 112a, just down the hall on your right.
posted by sneebler at 6:28 PM on October 14, 2011 [1 favorite]


LogicalDash:To say that math is "unreasonably effective" at physics is tautological. The math that physicists use was specifically developed for the purpose of finding that kind of non-obvious "connection between mathematics and physics". If you want math that isn't effective in that way, you can find it.

Hamming's response to Wigner's essay (cited by Crabby Appleton) considers this argument in particular and finds it unpersuasive.
posted by pseudonick at 6:46 PM on October 14, 2011 [2 favorites]


demiurge: We have a statistician on the project I am on, and her job is to say "No, you can't say that with this data." That's key to good science.

Yes, that's good, and yet.... who keeps the statistician honest? (I'm imagining the Principal Investigator wining and dining the Statistician at a very expensive restaurant, because the PI desperately needs a paper published. If that doesn't work, maybe a week's vacation in Hawaii will.) What would it take to bribe an otherwise honest statistician when the project is in dire straits?
posted by exphysicist345 at 6:53 PM on October 14, 2011


Well, if outright bribery and fraud are options, it would probably be easier and cheaper to just fake data that are more statistically-significant.
posted by secretseasons at 8:13 PM on October 14, 2011



It's not clear to me that neuroscience would be at all improved if all those researchers actually did the statistics correctly, since no one actually reads papers, they just pick out the conclusions and the experiments and decide whether they fit into their own world-view or not based on ineffable personal opinions.


hey, i'm all about cynicism, but this kind of characterization is just wrong. sure, science is about selling your research, and there are examples of researchers who have over-stepped the bounds. however, in general scientists have much to lose if they recklessly stray into territories of research that may be nothing more than pipe dreams. accepting wholesale papers with which one agrees and ignoring those with which one doesn't--to do either--would continually endanger one's credibility within the profession, and hence risk one's career.

the scientific method is not perfect but it does function (and the highly-critical, scientific paper cited above is proof of this)
posted by DavidandConquer at 9:06 PM on October 14, 2011


I only have a basic knowledge of statistics, but I find it immensely useful in doing my job. What's the best resource for learning about statistics at this level, so I could understand what a p-value is and avoid making subtle errors?
posted by heathkit at 6:45 PM on October 15, 2011


An interesting, quasi-related article from the Guardian about Bayes Theorem and Forensics: A formula for justice
posted by iamkimiam at 6:44 AM on October 17, 2011




This thread has been archived and is closed to new comments