The Tyranny of the P-Value
March 18, 2010 10:20 AM

Significantly what?...Or how our most common statistical methods really weren't meant to be used that way and why that study result is likely spurious. Since mefites like to argue about stats, here's some background for us all (and I'm not talking correlation vs causation)!
posted by mandymanwasregistered (51 comments total) 46 users marked this as a favorite
 
Misunderstanding/misusing statistics gets my grar up in a way that few other things do.
posted by desjardins at 10:24 AM on March 18, 2010


199% of statistical claims are false.
posted by It's Raining Florence Henderson at 10:32 AM on March 18, 2010 [1 favorite]


I work with statistics a lot, but I don't have much of an academic background in it. All of my knowledge comes from a few classes I've taken in the last couple years and on-the-job learning. I consider myself a stats n00b.

So it's a bit horrifying when I find myself working with people who really ought to know their stuff, who want to use statistical test results in shockingly wrong ways. I'd say that happens about once a month.
posted by gurple at 10:33 AM on March 18, 2010


I use statistics to test everything. For instance, even though I don't use birth control, I feel safe, because statistically it's not likely that my girlfriend will have an unplanned pregnancy, given her age and educational background.
posted by Astro Zombie at 10:37 AM on March 18, 2010 [12 favorites]




No mention of the widespread misuse of statistical control and ANCOVA? Bah. Read this and get back to me:

Misunderstanding analysis of covariance
posted by emilyd22222 at 10:45 AM on March 18, 2010 [4 favorites]


Some of the people I work with take stats seriously, and most of the outlets (conferences/journals) now require reporting of effect sizes. So while we find a solitary p very problematic, the majority of people I work with have graduate degrees in various subjects and somehow think that their basic stats class is all they need to make Awesome Discoveries. I have 2-3 meetings per week where I am just fixing things to prevent their release into the wild.

It reminds me of a paper I read about the increased use of factor analysis in dissertations and academic publications. While it is now So Darn Easy to do it, my guess is that it is being done very very poorly. Don't even get me started down that road.
posted by cgk at 10:51 AM on March 18, 2010


...and Greg Miller on how to explain the problems with statistical control to others:

"One can say something like: 5th and 6th graders differ in age, height, and experience with soccer. We might wonder which of those things matters the most in their soccer ability. It's easy to imagine that age BY ITSELF doesn't matter, whereas height is probably a help. But age and height tend to go together at that age. How can we tell which matters? Can we even separate them?

If you're brave, then ask your listener... Does this question even make sense: how tall would 6th graders be, on average, if they were only the age of 5th graders? The question doesn't really make sense, because the average 6th grader simply isn't a 5th grader, couldn't be one - has already finished 5th grade! And then deliver the punchline: some things, even though we have different names for them, are so intertwined that we really can't fully separate their influence."
posted by emilyd22222 at 10:52 AM on March 18, 2010 [2 favorites]
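
A minimal sketch (Python, with made-up numbers) of the point behind Miller's soccer example: when two predictors are as entangled as age and height, a regression that includes both can barely pin down either coefficient, because the data contain almost no cases that separate them.

    # Toy illustration of the age/height problem: correlated predictors make it
    # hard to separate their effects. All names and numbers are invented.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    age = rng.normal(11.5, 0.6, n)                  # years
    height = 90 + 5.5 * age + rng.normal(0, 2, n)   # cm, tightly tied to age
    skill = 0.5 * height + rng.normal(0, 5, n)      # "soccer ability", driven by height only

    def ols_se(X, y):
        """Return OLS coefficients and their standard errors."""
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (len(y) - X.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        return beta, se

    # Height alone: the coefficient is estimated precisely.
    print(ols_se(height.reshape(-1, 1), skill))
    # Age and height together: both standard errors balloon, because the data
    # contain almost no "6th-grader height at 5th-grader age" cases.
    print(ols_se(np.column_stack([age, height]), skill))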


Even more basically, a lot of people (including various trusted professionals) don't seem to understand what a statistic means:

“Determining the best treatment for a particular patient is fundamentally different from determining which treatment is best on average,” physicians David Kent and Rodney Hayward wrote in American Scientist in 2007. “Reporting a single number gives the misleading impression that the treatment-effect is a property of the drug rather than of the interaction between the drug and the complex risk-benefit profile of a particular group of patients.”

If a medication is 99.99% likely to cure you and .01% likely to kill you, that is cold comfort if you are that lucky .01%....
posted by GenjiandProust at 10:55 AM on March 18, 2010


When I work with stats I always keep the following in mind: really unlikely stuff happens all the time. Seems weird, but it's saved my statistical bacon a time or two.
posted by MarshallPoe at 10:55 AM on March 18, 2010 [2 favorites]


I gave this talk at MIT in January in response to the widespread lack of error bar labeling and misinterpretation of p-values in the research reports I see. Many intro statistics classes pay only lip service to the reality of type I and type II errors (i.e., finding significance when the result was due to chance, or failing to find significance when an actual effect exists, respectively). MIT has a nice class (7.57, Quantitative Tools in Biology, which I'm sitting in on now) that does a great job of covering and preventing false discovery issues in contexts where it really matters, such as gene expression correlations.
posted by Mapes at 10:57 AM on March 18, 2010 [14 favorites]
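
For anyone who'd rather see the two error types than memorize the definitions, here is a small simulation sketch (Python with NumPy/SciPy; the sample size and effect size are arbitrary choices, not anything from the talk):

    # Simulating type I and type II errors with the usual alpha = 0.05
    # two-sample t-test. Effect size and n are arbitrary illustrations.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    trials, n, alpha = 10_000, 20, 0.05

    # Type I: no real effect, but we "find" one by chance.
    false_pos = sum(
        ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(trials)
    )

    # Type II: a real (small) effect that an underpowered test misses.
    false_neg = sum(
        ttest_ind(rng.normal(0, 1, n), rng.normal(0.3, 1, n)).pvalue >= alpha
        for _ in range(trials)
    )

    print(f"type I rate  ~ {false_pos / trials:.3f}")   # roughly alpha
    print(f"type II rate ~ {false_neg / trials:.3f}")   # large, because power is low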


So what's the solution? Since the problem has been reported "for decades" (according to the article) and hasn't been fixed, obviously scientists are not willing or able to do it. Maybe journals should start sending all papers to a (staff?) statistician and reject those that misuse the tools?
posted by DU at 10:58 AM on March 18, 2010


One solution is this: restating the logical result of the statistical analysis. This can be summed up by a line in the OP link: “Such results should be considered more as hypothesis formulating than as hypothesis testing.” In other words, the safest way of applying the results of an experiment is forming new questions to test in future experiments, waiting for repeated or combined results before stating real-world-applicable findings of fact.
posted by buzzv at 11:05 AM on March 18, 2010


Maybe journals should start sending all papers to a (staff?) statistician and reject those that misuse the tools?

That would be a fantastic improvement, at least in my subfield. But it would be prohibitively expensive, I think, in the absence of some kind of government support/mandate.

For a couple journals I can think of, anyway, it's often all they can do to track down two referees who understand the subject matter well enough and are willing to spend the time marking up a paper for free.
posted by gurple at 11:06 AM on March 18, 2010


But it would be prohibitively expensive, I think, in the absence of some kind of government support/mandate.

Journals are already providing an expensive service to ensure right answers; why is running something past a stats person that much more so?
posted by DU at 11:11 AM on March 18, 2010


Also, not every distribution is normal. I mean, some are goddamn freaky.
posted by GuyZero at 11:12 AM on March 18, 2010


I think about this kind of error every time I read some horrible newspaper article saying "scientists confirm people who live on the odd side of the street are more likely to contract elephantiasis!" Or whatever. Absurd correlations that have no rational reason to be causations get reported as fact.

I think the solution is to turn statistics into more of a service business. There are way too many non-statisticians firing up R, importing some datasets, and conjuring up correlations. Either hone the workman-like use of statistics so any dumb-ass physical scientist can't get it wrong, or have real live statisticians working on the research.
posted by Nelson at 11:18 AM on March 18, 2010


I gave this talk at MIT in January in response to widespread lack of error bar labeling and misinterpretation of p-values in the research reports I see.

I highly recommend that anyone who encounters hypothesis testing or error bars in their work should take a look at Mapes's talk.
posted by grouse at 11:20 AM on March 18, 2010 [1 favorite]


not every distribution is normal

This is a huge source of (experimental) problems in my experience. Even raising this as an issue, though, gets you blank looks. "Whadda mean I can't use an average? How do I get my standard deviation then?"

Sure, there are non-normal statistics, but just about every stats course (for physical sciences, at least) starts with "assuming a bell-shaped curve of results..." in the first lecture and goes from there. P-values mean squat for a bi-modal distribution.
posted by bonehead at 11:23 AM on March 18, 2010


From the article (emphasis mine): But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.

Well, no, it's not SUPPOSED to prove anything, and anyone who says that statistics prove anything is misinformed or lying. Statistics doesn't deal in absolute truths.
posted by desjardins at 11:35 AM on March 18, 2010


just about every stats course starts with "assuming a bell-shaped curve of results..."

Reminds me of the joke about physicists and "assume the horses are perfect spheres..."
posted by GuyZero at 11:36 AM on March 18, 2010


Kent Brockman: Mr. Simpson, how do you respond to the charges that petty vandalism such as graffiti is down eighty percent, while heavy sack-beatings are up a shocking nine hundred percent?
Homer Simpson: Aw, people can come up with statistics to prove anything, Kent. Forfty percent of all people know that.
posted by blue_beetle at 11:47 AM on March 18, 2010


The writer talks about an important issue, but I find it off-putting how he seems to insinuate more than once that statistics itself is somehow corrupted.
During the past century, though, a mutant form of math has deflected science’s heart from the modes of calculation that had long served so faithfully. Science was seduced by statistics, the math rooted in the same principles that guarantee profits for Las Vegas casinos.
So statistics is a 'mutant form' of math, and what else will you tie it to but gambling? Not meteorology or optical character recognition. Las Vegas.
Supposedly, the proper use of statistics makes relying on scientific results a safe bet. But in practice, widespread misuse of statistical methods makes science more like a crapshoot.
The use of 'but' to connect these sentences implies that misuse of statistics renders proper use of statistics also a 'crapshoot'. This, of course, is babble.
posted by Anything at 12:00 PM on March 18, 2010 [3 favorites]


I liked, in particular, the dig at social scientists at the end of the article. I was under the impression that many, many experiments in medicine are not ever repeated, or even tried again.
posted by OmieWise at 12:02 PM on March 18, 2010


P-values mean squat for a bi-modal distribution.

What? They mean the same thing. The probability of observing a whatever, or more extreme, if the data-generating process implied by the null were true.

This one time I needed a null distribution of committee median ideology, where the distributions of individual legislator ideologies are strongly bimodal and variously fucked up. So I grabbed 100,000 (pseudo-)random committees of the appropriate size(s) and stored their median ideologies. Some of the null densities were unimodal, some multimodal. Generating a p-value was then a simple matter of asking "how many simulated committees were as extreme as the observed committee?"

There you have it. P-value from bi-modal distributions.
posted by ROU_Xenophobe at 12:11 PM on March 18, 2010 [3 favorites]
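
For the curious, a minimal sketch of that procedure in Python, with invented ideology scores and an invented observed committee (the +1s in the p-value are the usual small correction for simulated nulls):

    # Empirical p-value from a simulated null distribution of committee medians.
    # The "chamber" and observed median below are fabricated for illustration.
    import numpy as np

    rng = np.random.default_rng(2)

    # Strongly bimodal chamber: two parties clustered around -1 and +1.
    chamber = np.concatenate([rng.normal(-1, 0.2, 55), rng.normal(1, 0.2, 45)])

    committee_size = 5
    observed_median = 0.9        # hypothetical observed committee median

    sims = np.array([
        np.median(rng.choice(chamber, committee_size, replace=False))
        for _ in range(100_000)
    ])

    # One-sided empirical p-value: how often does a random committee look at
    # least as extreme as the one we observed?
    p = (np.sum(sims >= observed_median) + 1) / (len(sims) + 1)
    print(f"empirical p = {p:.4f}")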


It reminds me of a paper I read about the increased use of factor analysis in dissertations and academic publications. While it is now So Darn Easy to do it, my guess is that it is being done very very poorly. Don't even get me started down that road.


Well, the fact that SPSS doesn't even allow for both EFA and CFA is part of the problem.
posted by k8t at 12:25 PM on March 18, 2010


More like "... SPSS ... is part of the problem."
posted by proj at 12:26 PM on March 18, 2010


Mapes' talk deserves a FPP of its own.
posted by k8t at 12:27 PM on March 18, 2010


Proj, touche.
posted by k8t at 12:27 PM on March 18, 2010


Sure, you can get a p-value for any distribution by working out the equivalent error function, but you can't just bung the numbers into a calculator which assumes a normal distribution and a Gaussian mean; you actually have to think about what you're doing.

You explicitly did the integration (by MC) and calculated the median by hand; that's a very different thing. Were your medians normally distributed?
posted by bonehead at 12:33 PM on March 18, 2010


It's a good article and makes some good points. I especially liked the point about people confusing effect size and statistical significance. I once attended a talk by a postdoc who was doing some kind of genome-wide study (I forget what exactly). He had identified a gene that was highly statistically significantly associated with his trait of interest (at p = 10^-6 or so) but it only accounted for 3% of the total variance in the trait. I was pretty unimpressed by the fact that it only accounted for 3% of the variance, but he kept harping on how statistically significant it was....
posted by pombe at 12:56 PM on March 18, 2010
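
A quick sketch of how those two things come apart (Python; the numbers are fabricated): with a big enough sample, a predictor explaining only ~3% of the variance still produces an astronomically small p-value.

    # Statistical significance vs effect size: a tiny effect plus a large n
    # yields a microscopic p-value. All data here are simulated.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(3)
    n = 5000
    gene = rng.normal(size=n)
    trait = 0.173 * gene + np.sqrt(1 - 0.173**2) * rng.normal(size=n)  # r^2 ~ 0.03

    r, p = pearsonr(gene, trait)
    print(f"variance explained = {r**2:.3f}, p = {p:.1e}")
    # "Highly significant" -- and still only about 3% of the variance.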


I'm an ecologist and all I've ever used are non-standard or non-parametric stats, because that is what you do in my particular field. If I had a standard distribution, I wouldn't know what to do with it, not to mention all the zeros. It's not that hard, so the chicanery that goes on to create "normality" has always confused me. And made me highly suspicious of any study that doesn't explicitly discuss what they did to the data during analysis.
posted by fshgrl at 1:01 PM on March 18, 2010


People still use SPSS for factor analysis? Why?
posted by fshgrl at 1:05 PM on March 18, 2010


fshgrl, I think that people *try* to do FA with SPSS cuz it is all that they know. I'd also imagine that some people don't have access to, or knowledge of, other stats programs.
posted by k8t at 1:11 PM on March 18, 2010


you can't just bung the numbers into a calculator which assumes a normal distribution and a Gaussian mean

From your lips to my students' ears.

Were your medians normally distributed?

Sometimes, sometimes not. I'd have to go look, and I never did a formal analysis of it, but IIRC null densities of big committees from unpolarized settings tended to be normal or at least unimodal, small committees from polarized settings went the other way. So like a 5-person committee from the IA House would be strongly bimodal.
posted by ROU_Xenophobe at 1:15 PM on March 18, 2010


Deirdre McCloskey and Steve Ziliak have been railing on similar themes for a while. Thanks for the links!
posted by stratastar at 3:08 PM on March 18, 2010


I work in biology research. My statistics training consisted of something like four one-hour lectures during my undergraduate course, and an (unassessed) afternoon course during the first week of my PhD. I'm doing my best to learn as much as I can by buying stats textbooks and teaching myself; I've learned just enough to be freaked out by how little I know and, worse, how little my colleagues and superiors know.

The sad truth is that a lot of biologists have ended up here because they/we love science but can't handle the maths required for physics or advanced chemistry. There are plenty of exceptions, but overall there seems to be a strong culture of treating statistical tests as "black-box" tools, i.e. plugging the data in and accepting the answer, without really appreciating how the test actually works. When it comes to publication, peer review catches some of this, but nowhere near all of it. This inevitably leads to some nonsense (or, at least, unsupported) claims being published. Also, real effects are presumably being missed by experimenters who haven't bothered -- or known how -- to do the proper power calculations when designing the experiment.

One of the strongest fields I've encountered is bioinformatics, which deals with the analysis of huge data sets like gene expression assays (thousands or tens of thousands of data points per experiment). Good bioinformaticians are hard to find, and tend to be statisticians who've moved into biology, not the other way around. I'm not in a position to give an informed assessment of their work, but they often serve as unofficial department statisticians and the looks of pain on their faces when they're dealing with the rest of us speak volumes.
posted by metaBugs at 4:13 PM on March 18, 2010 [4 favorites]


Maybe journals should start sending all papers to a (staff?) statistician and reject those that misuse the tools?

Some do.
posted by Mental Wimp at 4:30 PM on March 18, 2010


It would be major progress if we could get news journalists to simply understand the difference between mean and median. For example, "average income" in many contexts is grossly misleading.
posted by JackFlash at 4:44 PM on March 18, 2010
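
A three-line illustration of the mean-vs-median point, with made-up incomes:

    # One outlier drags the mean far from the typical value; the median barely moves.
    import numpy as np

    incomes = np.array([28, 32, 35, 41, 47, 52, 58, 4000])  # in $1000s; one CEO
    print(f"mean   = {incomes.mean():.0f}k")     # ~537k, the reported "average income"
    print(f"median = {np.median(incomes):.0f}k") # 44k, closer to what most people earn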


I'm doing my best to learn as much as I can by buying stats textbooks and teaching myself; I've learned just enough to be freaked out by how little I know and, worse, how little my colleagues and superiors know.

Good on you. The best biology these days is done by teams with statisticians and biologists on them.

Good bioinformaticians are hard to find, and tend to be statisticians who've moved into biology, not the other way around.

On the other hand, R.A. Fisher started out a geneticist. But my favorite story is about the profession of the discoverer of Student's t-distribution and the t-test, William S. Gosset: he was a chemist for Guinness.
posted by Mental Wimp at 4:46 PM on March 18, 2010


In economics (my area), both statistical significance and magnitudes are necessary conditions for anything that is reported. But they are considerably less important than solving identification problems. If you have a credible identification strategy, then you can even find "no results". (P
posted by scunning at 6:39 PM on March 18, 2010


He had identified a gene that was highly statistically significantly associated with his trait of interest (at p = 10^-6 or so) but it only accounted for 3% of the total variance in the trait. I was pretty unimpressed by the fact that it only accounted for 3% of the variance, but he kept harping on how statistically significant it was....

I'm collaborating with a colleague who has found several SNP with highly significant effects. We're having a similar discussion because he makes a big deal in the manuscript about the P-values, but doesn't discuss the effect sizes. And he has some of the biology backwards. *sigh*
posted by wintermind at 7:09 PM on March 18, 2010


Two stories:
1.) When an ultrasound was performed on my son, they detected a marker for Down's syndrome (a thick nuchal fold), which they said increased the probability that he had Down's by several orders of magnitude to... 1/200. We were told this meant that he was at high risk for Down's, so we should schedule an amnio. You want to know what 'high risk' is? When the risk of the disorder is greater than the risk of miscarriage due to the amnio. I pointed out that, should we follow through with an amnio, the odds were substantially greater that we would lose a baby without the disorder than one with it. (>99% of babies in our situation are healthy. The rate of amnio-induced miscarriage would have to be _substantially_ higher for children with Down's than for those without for it not to be the case that most amnio miscarriages in our situation would be healthy babies.) The 'genetic counselor' who was trying to sell us on the amnio had no idea what I was talking about. (As an aside, we didn't go with the amnio. We knew we weren't going to decide to terminate the pregnancy either way.)

2.) Soon after my daughter was born, I received a phone call from a nurse managing a study on juvenile diabetes, who informed me that, based on genetic testing done when my daughter was born, she has an increased risk of diabetes (~3%). As part of trying to understand what this all meant, I asked when her risk would fall off.
Answer: "She will always have a risk of 3%"
Me: "But certainly the probability of developing *juvenile* diabetes is higher in childhood. If the lifetime risk is 3% and, say, 90% of all cases of type 1 diabetes manifest before 15, then the risk of developing diabetes after the age of 15 must be lower than 3%. I just want to know what that curve looks like."

The nurse put me in contact with the PI. He too had no idea what I was talking about.

Way too many people use stats in medicine who don't know what they are talking about and they can easily scare the shit out of you.
posted by lucasks at 7:33 PM on March 18, 2010
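
For anyone who wants to check the arithmetic in both stories, a back-of-the-envelope sketch (Python); note that the amnio miscarriage rate below is an assumed illustrative figure, not a number from the comment:

    # Rates are either those quoted in the comment or loudly invented.
    p_downs = 1 / 200          # post-ultrasound risk quoted to the parents
    p_amnio_loss = 1 / 300     # ASSUMED procedure-related miscarriage rate

    # Story 1: among similarly situated pregnancies lost after an amnio, how many
    # were healthy (assuming the loss rate doesn't depend on Down's status)?
    healthy_losses = (1 - p_downs) * p_amnio_loss
    downs_losses = p_downs * p_amnio_loss
    print(f"healthy losses per 10,000 amnios: {10_000 * healthy_losses:.1f}")
    print(f"Down's losses per 10,000 amnios:  {10_000 * downs_losses:.2f}")
    # ~33 vs ~0.17: almost every amnio-related loss is a baby without the disorder.

    # Story 2: if lifetime type 1 risk is 3% and 90% of cases appear before 15,
    # the risk remaining after age 15 is only
    print(f"risk after 15: {0.03 * (1 - 0.90):.3%}")   # 0.300%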


I came here specifically looking for this. A friend sent me the article today, and while I agree with the points the author is making, I couldn't stand the framing of the topic and the writing. I decided to write out all of the things I would like to say if I had seen this on Metafilter. The entire rant is too long to put here, and some of it has been said already above (I'll sneak a little bit in as a response to Anything). Some of the rant is below.

-------begin rant--------

For better or for worse, science has long been married to mathematics. Generally it has been for the better.

Ok. Go on.

Especially since the days of Galileo and Newton, math has nurtured science. Rigorous mathematical methods have secured science’s fidelity to fact and conferred a timeless reliability to its findings.

Shit, that sounds pretty good!

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation.

Wow! How did you find out about this secret?

Experts in the math of probability and statistics are well aware of these problems and have for decades expressed concern about them in major journals.

That doesn't sound very secret.

Ioannidis claimed to prove that more than half of published findings are false, but his analysis came under fire for statistical shortcomings of its own.

I think that only lends credibility to Ioannidis, since, as we know, statistics is "a mutant form of math [that] has deflected science’s heart".

... and so on
-------end rant--------
posted by Someday Bum at 7:39 PM on March 18, 2010 [1 favorite]


The punchline: The article makes it all sound like a soap opera. Science "has long been married to mathematics". Math has been good and "has nurtured" him? her? (math as mother figure?). But statistics "has deflected science’s heart". Oh yes! Science has been "seduced by statistics" and the two are now having a "love affair" which has already "spawned countless illegitimate" offspring! And we haven't even gotten to Science's "dirtiest secret"...
posted by Someday Bum at 7:39 PM on March 18, 2010


I took a class from a very renowned statistician once who said that he was often on students' dissertation committees (our university at the time required one person on your committee from outside your department, and since a lot of students think statistics is a nice subject to have under your belt, being a professor of statistics basically guarantees that you're going to be on a lot of random committees across the sciences and social sciences). He said that it was an incredibly depressing job because he felt it was his academic duty to honestly critique the uses of statistics in these dissertations, and he said that more often than not (yes, more often than not) there would be some glaringly wrong use of a statistical test, or a completely bogus claim of correlation made, or error bars that were just flat out wrong, or whatever. And he'd have to call it out.

Sometimes it was just a matter of recreating a few figures to correct the error bars. But sometimes it was a matter of invalidating years of a student's work because when the math was done right, no conclusions could be drawn. He told us that he had gotten to the point where he absolutely dreaded being tapped for a committee, because he knew he'd ruined kids' lives.

Statistics is harsh, folks.
posted by little light-giver at 9:21 PM on March 18, 2010 [4 favorites]


I was at the AAAS panel referred to at the end of this article.

One of the best speakers was Stanley Young from the National Institute of Statistical Sciences, whose outrage at the misuse of statistics was palpable. His talk blew my mind.

Ars Technica summarized the session well in this article: We're so good at medical studies, we get most of them wrong.

Excerpt:

Young provided the best measure of where the field stands. In a survey of the recent literature, he found that 95 percent of the results of observational studies on human health had failed replication when tested using a rigorous, double blind trial. So, how do we fix this?

The consensus seems to be that we simply can't rely on the researchers to do it. As Shaffer noted, experimentalists who produce the raw data want it to generate results, and the statisticians do what they can to help them find them. The problems with this are well recognized within the statistics community, but they're loath to engage in the sort of self-criticism that could make a difference. (The attitude, as Young described it, is "We're both living in glass houses, we both have bricks.")

Shaffer described how there were tools (the "family-wise error rate") that were once used for large studies, but they were so stringent that researchers couldn't use them and claim much in the way of positive results. The statistics community started working on developing alternatives about 15 years ago but, despite a few promising ideas, none of them gained significant traction within the community.

Both Moolgavkar and Young argued that the impetus for change had to come from funding agencies and the journals in which the results are published. These are the only groups that are in a position to force some corrections, such as compelling researchers to share both data sets and the code for statistical models.

Moolgavkar also made a forceful argument that journal editors and reviewers needed to hold studies to a minimal standard of biological plausibility. Focusing on studies of the health risks posed by particulates, he described studies that indicated the particulates in a city were as harmful as smoking 40 cigarettes daily; another concluded that particulates had a significant protective effect when it comes to cardiovascular disease. "Nobody is going to tell you that, for your health, you should go out and run behind a diesel bus," Moolgavkar said. "How did this get past the reviewers?"

posted by storybored at 1:44 PM on March 19, 2010


Fun article, good discussion. I've been looking at polisci grad programs, and because my undergrad didn't require a stats class, I'm going to have to take one, and all this stuff is pretty fascinating (I'm sure it will be less appealing when I actually have to worry about it myself).
posted by klangklangston at 12:22 PM on March 20, 2010


Shaffer described how there were tools (the "family-wise error rate") that were once used for large studies, but they were so stringent that researchers couldn't use them and claim much in the way of positive results. The statistics community started working on developing alternatives about 15 years ago but, despite a few promising ideas, none of them gained significant traction within the community.

In genomics, it is so easy to ask questions that involve testing multiple hypotheses that something to deal with this problem is essential. In this community, the use of false discovery rate and q-value have gained great traction (q-value : false discovery rate :: p-value : false positive rate).
posted by grouse at 9:46 AM on March 21, 2010
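
In case a concrete contrast helps, here is a small sketch (Python/NumPy, simulated p-values) of Bonferroni-style family-wise control next to the Benjamini-Hochberg FDR procedure; the mix of nulls and signals is invented:

    # Family-wise error control (Bonferroni) vs false discovery rate control
    # (Benjamini-Hochberg) on a simulated mix of nulls and modest real effects.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    m, n_signal, alpha = 10_000, 200, 0.05

    null_p = rng.uniform(size=m - n_signal)            # true nulls
    signal_p = norm.sf(rng.normal(3, 1, n_signal))     # real but modest effects
    pvals = np.concatenate([null_p, signal_p])

    # Bonferroni: control the chance of even one false positive.
    bonf_hits = int(np.sum(pvals < alpha / m))

    # Benjamini-Hochberg step-up: control the expected fraction of false
    # discoveries among whatever we call significant.
    ranked = np.sort(pvals)
    below = ranked <= alpha * np.arange(1, m + 1) / m
    idx = np.nonzero(below)[0]
    bh_hits = 0 if idx.size == 0 else int(idx[-1] + 1)

    print(f"Bonferroni discoveries: {bonf_hits}")   # only the very strongest signals
    print(f"BH (FDR) discoveries:   {bh_hits}")     # many more, with a bounded FDR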


Glad to see you well enough to compute, kk. Where are you looking?
posted by ROU_Xenophobe at 6:29 AM on March 22, 2010


Mostly at UCs, because I'm in state, but none of them seem all that focused on theory, which is kinda what interests me. I'm trying to track down an old prof who introduced me to Oakeshott's writing, just to see what he'd recommend.
posted by klangklangston at 8:28 AM on March 23, 2010



