

A critical moment in statistics
April 11, 2011 12:56 PM

Statistical hypothesis testing with a p-value of less than 0.05 is often used as a gold standard in science, and is required by peer reviewers and journals when stating results. Some statisticians argue that this indicates a cult of significance testing using a frequentist statistical framework that is counterintuitive and misunderstood by many scientists. Biostatisticians have argued that the (over)use of p-values comes from "the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result" and identify several other problems with significance testing. XKCD demonstrates how misunderstandings of the nature of the p-value, failure to adjust for multiple comparisons, and the file drawer problem result in likely spurious conclusions being published in the scientific literature and then being distorted further in the popular press. You can simulate a similar situation yourself. John Ioannidis uses problems with significance testing and other statistical concerns to argue, controversially, that "most published research findings are false." Will the use of Bayes factors replace classical hypothesis testing and p-values? Will something else?
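
For anyone who wants to try the multiple-comparisons point without the linked simulator, here is a minimal sketch. It assumes 20 independent tests (one per jelly bean colour, as in the comic) in which the null hypothesis is true every time, so each p-value is uniform on [0, 1]. The function names are made up for illustration.

```python
import random


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_false_alarm_rate(n_tests: int, alpha: float = 0.05,
                              n_trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity.

    Under a true null hypothesis a p-value is uniformly distributed,
    so each test is modelled as a single uniform random draw.
    """
    rng = random.Random(seed)
    alarms = 0
    for _ in range(n_trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            alarms += 1
    return alarms / n_trials


if __name__ == "__main__":
    print("analytic :", round(familywise_error_rate(20), 4))
    print("simulated:", round(simulate_false_alarm_rate(20), 4))
```

With 20 colours the chance of at least one "significant" result is roughly 64%, even though nothing is going on.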
posted by grouse (45 comments total) 138 users marked this as a favorite

 
As a statistics nerd in a field outside the general sciences, this is an AWESOME post.


... but minecraft!
posted by strixus at 1:01 PM on April 11, 2011 [7 favorites]


I enjoyed that XKCD a lot. I enjoy most XKCDs that aren't fixated on the author's achingly awkward stance toward women.

In my line of work we often tend to calculate q-values as a correction for multiple hypothesis testing. They're kind of fun, but they're not perfect and certainly not a universal solution.
posted by gurple at 1:01 PM on April 11, 2011 [4 favorites]


I can't be the only one who wants to see a duel fought over Bayesian vs. Frequentist interpretations. Are you listening, Heidelberg?
posted by atrazine at 1:02 PM on April 11, 2011 [1 favorite]


Half the problem (at least on the science-journalism side of things) is that the statistical meaning of "significant" has no resemblance to the English word "significant," which unfortunately has the same spelling and pronunciation.
posted by theodolite at 1:07 PM on April 11, 2011 [6 favorites]


See also John Kruschke's Open Letter and sundry other publications, including a textbook combining the MetaFilter favorites of "R" and "puppies".
posted by knile at 1:10 PM on April 11, 2011 [3 favorites]


Daryl Bem's use of p-values to show 'proof' of retroactive human psychic powers is another good example of misuse. There's a great refutation (pdf) of this showing how to use Bayesian models to correctly interpret the evidence.

My question for Bayesian testing in the sciences is: how do we assign prior probabilities? The author of the above refutation uses a value of 0.00000000000000000001 as his estimation of the chance that psychic powers exist, and finds that Bem's p < 0.05 result updates his posterior probability to 0.00000000000000000019. With the same logic, we could assign a prior probability of 0.0000000001 that the luminiferous aether doesn't exist, and say the Michelson-Morley experiment isn't convincing at all because it only updated our probability to 0.00001. Bem's definitely a quack, but pulling prior probabilities like that implies that it's almost literally impossible to convince you that your belief is wrong, which isn't a strong foundation for science.
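
To make the arithmetic concrete, here is a small sketch that back-calculates how strong the evidence must have been in each case. The Bayes factors it prints are inferred from the prior/posterior pairs quoted above; they are not taken from the refutation itself, and the function names are hypothetical.

```python
def to_odds(probability: float) -> float:
    """Convert a probability to odds in favour."""
    return probability / (1.0 - probability)


def implied_bayes_factor(prior: float, posterior: float) -> float:
    """Bayes factor that would move `prior` to `posterior`.

    Bayes' rule in odds form: posterior odds = prior odds * Bayes factor.
    """
    return to_odds(posterior) / to_odds(prior)


if __name__ == "__main__":
    # Psychic powers: prior 1e-20, posterior 1.9e-19 (numbers quoted above).
    print("psi   :", implied_bayes_factor(1e-20, 1.9e-19))
    # Hypothetical aether example: prior 1e-10, posterior 1e-5.
    print("aether:", implied_bayes_factor(1e-10, 1e-5))
```

The first case corresponds to evidence worth a factor of about 19; the second to a factor of about 100,000. The tiny posterior comes from the tiny prior, not from weak evidence.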
posted by 0xFCAF at 1:14 PM on April 11, 2011 [9 favorites]


I'm just going to drop this in here, my favorite graph in all of science
posted by Blasdelb at 1:16 PM on April 11, 2011 [26 favorites]


In his article "The Earth Is Round (p<.05)" Jacob Cohen (the seminal figure in power analysis) proposed renaming "null hypothesis significance testing" to "Statistical Hypothesis Inference Testing," so that it would have an appropriate acronym from his perspective.
posted by jasper411 at 1:19 PM on April 11, 2011 [3 favorites]


This topic seems terribly important to me, particularly as we enter a new era of science that can easily comb through giant datasets for patterns. Back when I was a student experimental scientist the problem was always that statistics is hard, and boring, and you don't want to understand it. So you feed your data into the magic p-value machine and it comes out <0.05 and voila! you're done! Important scientific papers that rely on statistical analysis really should have a proper statistician as a co-author, not just someone who pasted some spreadsheet data into SPSS.
posted by Nelson at 1:20 PM on April 11, 2011 [6 favorites]


With the same logic, we could assign a prior probability of 0.0000000001 that the luminiferous aether doesn't exist, and say the Michelson-Morley experiment isn't convincing at all because it only updated our probability to 0.00001.

I'm not sure I follow this. There were many replications of the Michelson-Morley experiment. Are you suggesting that each experiment be considered on its own? Because Bayesians would cumulate them.
posted by Mental Wimp at 1:25 PM on April 11, 2011 [2 favorites]


This topic seems terribly important to me, particularly as we enter a new era of science that can easily comb through giant datasets for patterns.

There are many frequentist and non-frequentist solutions proposed for this. One is to control the "false-discovery rate," the expected proportion of false positives among the results declared significant, similar to the way a p-value threshold controls type-I error.
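
As an illustration, here is a minimal sketch of one widely used false-discovery-rate procedure, Benjamini-Hochberg. The comment above does not name a specific method, so the choice of Benjamini-Hochberg is an assumption, and the p-values in the example are invented.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling the FDR at level q.

    Sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * q, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank

    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


if __name__ == "__main__":
    # Invented p-values for ten tests; five fall below the usual 0.05.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("uncorrected hits:", sum(p < 0.05 for p in pvals))
    print("BH hits         :", sum(benjamini_hochberg(pvals)))
```

In this made-up example five tests pass the uncorrected 0.05 threshold, but only the two smallest p-values survive the correction.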

But this is, in fact, a big problem with big science. The dynamic tension between brute force methods and targeted, prior-knowledge-based methods revolves around exactly this problem. I, personally, am more sympathetic to the targeted approach, as I believe it is more likely to extend our knowledge than combing through random findings that are most likely nothing.
posted by Mental Wimp at 1:28 PM on April 11, 2011 [1 favorite]


bayesian reasoning is great if you have good priors, but if you don't i really don't see it as any better than existing frequentist methods. What it comes down to is this: statistics are just a good, consistent way of telling you what you know/have observed. It's never going to tell you anything more than what you have observed, and should not be regarded as anything more than a way of describing the data you have collected (which is still a powerful tool, but it's not a means of deriving 'truth', at all.)
posted by pucklermuskau at 1:38 PM on April 11, 2011 [4 favorites]


Right, but good methods help solve these problems, at least in social science.

First, we have hypothesis testing. If a researcher is sufficiently open and clear about the nature of their hypotheses, and they are not doing a data-mining expedition, then frequentist methods make a lot more sense. Even better if a researcher can use experiments or quasi-experiments (using a change in the law or some other external event) to make sure they know what they are measuring.

Second, any good (social) scientist is going to do robustness checks (resampling, using other methods or other measures) that should show if a result indicates something that is persistent, or just a statistical fluke.

Third, a researcher should have a sense of what their data means, which is why simply using large data sets results in little intuition. Explanations need to be tied to reality, not just statistics.

Finally, good research considers effect size, not just statistical significance. Small effects should not be taken as seriously as large ones.
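
A quick sketch of why effect size and significance are different things. It assumes a two-sample z-test with a known, common standard deviation and equal group sizes, which is a simplification. All of the numbers are invented.

```python
import math
from typing import Tuple


def effect_and_p_value(mean_a: float, mean_b: float, sd: float,
                       n_per_group: int) -> Tuple[float, float]:
    """Return (standardized effect size d, two-sided p-value).

    d is the mean difference in units of the standard deviation.
    The z statistic is d * sqrt(n / 2) for two equal groups of size n.
    """
    d = (mean_a - mean_b) / sd
    z = d * math.sqrt(n_per_group / 2.0)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return d, p


if __name__ == "__main__":
    # Tiny effect, enormous sample: highly "significant".
    print(effect_and_p_value(100.2, 100.0, 10.0, 100_000))
    # Large effect, small sample: misses the 0.05 cutoff.
    print(effect_and_p_value(108.0, 100.0, 10.0, 10))
```

The first study finds a difference of 0.02 standard deviations with p far below 0.001; the second finds a difference of 0.8 standard deviations with p around 0.07.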

Obviously, fakers gonna fake and data miners gonna mine, and some people never learn statistics, but it is possible to do really high quality work with p-values, if you want to do so.
posted by blahblahblah at 2:07 PM on April 11, 2011 [7 favorites]


Obligatory FSM link.
added bonus: values decrease in the X-axis
posted by Xoebe at 2:16 PM on April 11, 2011


My background is in education and music and I'm currently immersing myself in statistics for my PhD, so this is extremely relevant and interesting. In my field, there's rampant misuse of statistics, especially when it comes to data mining, leaving off effect sizes, and not correcting for multiple comparisons. My adviser is basically as conservative as it comes with number play, and we go through articles in our field over and over to find things that are wrong, fudged, or done poorly. It's rather eye-opening to see so much research that has so many false assumptions based on poor data manipulation.

Although my PhD work is in music education, my related field is research design and statistics. After reading all of the poor research, I felt like I could make a change within the field. We'll see if that comes to pass, because let's be honest, I'm just a first year PhD student.

Like I said - I'm a musician and a teacher who's getting into the research side. I don't have a dog in this fight yet, but I guess I'll just keep reading until I do.
posted by SNWidget at 2:57 PM on April 11, 2011 [1 favorite]


I've been using Bayesian methods for my own data analysis for about 10 years now, but most of my undergraduate teaching focuses on classical frequentist methods. When I first made the switch to Bayes, I thought to myself "this is revolutionary, it's going to change everything". And it might, I suppose. Bayesian methods do seem less prone to pathologies, on the whole. But to be honest, in my experience only a tiny fraction of statistical mistakes that I find in papers (both undergrad and academic) tend to be caused by the flaws in frequentist methods. The vast majority arise because the user doesn't understand the tool that they're applying: and so they're testing the wrong hypotheses and misinterpreting the results. A large scale switch to Bayesian methods won't fix this: my suspicion is that, at present, we don't see as many egregious screw ups with Bayesian statistics only because Bayesian scientists are a self-selected group of extremely statistically savvy users. If forced to become frequentists they wouldn't be making a lot of mistakes either. Maybe I'm just feeling especially old and cynical this morning, but I really think a lot of people are hoping Bayes will be a magic bullet. In the end, I reckon there's nothing for it except to have a lot more stats classes.
posted by mixing at 3:17 PM on April 11, 2011 [17 favorites]


Mixing, you mean there's nothing for it except to have a lot more good stats classes, right? I found my undergrad stats classes to be tedious, and it wasn't until later that I learned that the tedium wasn't the fault of the material, but the instruction.
posted by Fraxas at 3:40 PM on April 11, 2011 [1 favorite]


I thought the most important factor in research was the WP-factor, WP standing for Who's Paying (for it).
posted by oneswellfoop at 3:48 PM on April 11, 2011 [1 favorite]


Someone want to post a quick comparison between Bayesian methods and Frequentist methods?
posted by effugas at 4:20 PM on April 11, 2011


R... and PUPPIES?! MINDSPLOSION!!!!!
posted by stratastar at 4:27 PM on April 11, 2011


The Bayesian approach calculates a posterior probability, or what statistician R. A. Fisher called the "inverse probability," by multiplying a "normalized" likelihood by the prior probability of the null hypothesis, following Bayes' theorem:

P(θ|X) = P(X|θ) P(θ) / P(X)

Where θ refers to determining the parameters (mean, variance, etc.) of a null hypothesis, given the empirical, or observed data set ("X").

In English, what's the probability that what you're looking at has a certain characteristic of interest, given the data set?

As a basic example, Bayes' theorem allows us to state the probability of the fairness of a given coin pulled from a bag of coins, both before and after any tosses, if we make reliable assumptions about the fairness of the coins within the bag, prior to taking any out. That prior assumption is where P(θ) comes in.
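
Here is a rough sketch of that coin example. The make-up of the bag is an assumption chosen purely for illustration: 90% of coins are fair and 10% are biased to land heads 90% of the time.

```python
def posterior_fair(heads: int, tosses: int, prior_fair: float = 0.9,
                   p_heads_fair: float = 0.5,
                   p_heads_biased: float = 0.9) -> float:
    """P(coin is fair | observed tosses) via Bayes' theorem.

    The binomial coefficient is the same under both hypotheses,
    so it cancels and is left out of the likelihoods.
    """
    tails = tosses - heads
    like_fair = p_heads_fair ** heads * (1.0 - p_heads_fair) ** tails
    like_biased = p_heads_biased ** heads * (1.0 - p_heads_biased) ** tails
    # P(X): total probability of the data across both kinds of coin.
    evidence = prior_fair * like_fair + (1.0 - prior_fair) * like_biased
    return prior_fair * like_fair / evidence


if __name__ == "__main__":
    print("before any tosses :", round(posterior_fair(0, 0), 4))
    print("after 5 heads of 5:", round(posterior_fair(5, 5), 4))
```

Before any tosses the answer is just the prior, 0.9. After five heads in a row the probability that the coin is fair drops to about 0.32.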

A frequent complaint about the Bayesian approach is that it makes subjective, sometimes contentious assumptions about the prior nature of the data being observed. The source of this philosophical contention in the use of Bayesian over classical frequentist testing stems from where (or, perhaps, with whom) these prior probability distributions originate.

So-called "non-informative" priors, such as the Jeffreys' prior, express vague or general information about the parameter, to try to make as few "subjective" assumptions about the data as possible.

Informative priors use previous experience or information to set parameter values in the prior: e.g., a simple guess of the expected temperature at noon tomorrow could be calculated from today's temperature at noon, plus or minus normal, day-to-day variance in observed noon-time temperatures.

So-called "conjugate priors" are priors chosen so that the posterior distribution takes the same functional form as the prior (a beta prior combined with binomial data yields a beta posterior, for example). This keeps the update analytically simple, and they are therefore often used in analysis of empirical data.
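
A minimal sketch of a conjugate update, using the standard beta-binomial pairing for a coin's probability of heads. The starting point here is the Jeffreys prior for a binomial proportion, Beta(0.5, 0.5); the toss counts are invented.

```python
from typing import Tuple


def beta_binomial_update(alpha: float, beta: float,
                         heads: int, tails: int) -> Tuple[float, float]:
    """Posterior Beta parameters after observing coin tosses.

    With a Beta(alpha, beta) prior and binomial data the posterior is
    Beta(alpha + heads, beta + tails): same family, new parameters.
    """
    return alpha + heads, beta + tails


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    prior = (0.5, 0.5)  # Jeffreys prior for a binomial proportion
    posterior = beta_binomial_update(*prior, heads=7, tails=3)
    print("posterior parameters:", posterior)
    print("posterior mean      :", round(beta_mean(*posterior), 4))
```

Seven heads and three tails turn Beta(0.5, 0.5) into Beta(7.5, 3.5), with a posterior mean of about 0.68.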

The "empirical Bayes" approach was introduced by Herbert Robbins as a way to infer how accident-prone someone is, given the observed fractions of accidents already suffered by the larger population. The objectivity of this testing stems from using observed, "empirical" data to generate informative priors used in Bayesian inference.

Many empirical Bayesian techniques have been applied in various areas within the field of systems biology, which are data-rich and analysis-poor. In particular, Brad Efron at Stanford is one of the big names in this field, and wrote a fun, straightforward paper that bridges Bayesian and Frequentist modes of thinking. Other papers of his on the subject of empirical Bayesian testing can be found here.
posted by Blazecock Pileon at 4:29 PM on April 11, 2011 [16 favorites]


> Mixing, you mean there's nothing for it except to have a lot more good stats classes, right?

Oh yes. A million times yes. But that's a rant for a different thread.
posted by mixing at 4:47 PM on April 11, 2011


Here's another post concerning this.

I think about this every time we begin "hypothesis testing" in an introductory stats course, and I try to mention to students how overly simplified the process appears. But then, I end up bashing so much math done in classes up to that point (no straight lines in nature, no easy integrals in practice, etc.). I often don't feel like I'm selling the right product. How do folks here think that an introductory statistics course should address significance tests? What should we be focusing on?
posted by klausman at 4:53 PM on April 11, 2011


Will the use of Bayes factors replace classical hypothesis testing and p-values? Will something else?

I don't know! Quit asking me!
posted by TwelveTwo at 4:59 PM on April 11, 2011 [2 favorites]


An Intuitive Explanation of Bayes' Theorem by Eliezer S. Yudkowsky.

An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem by Luke Muehlhauser.
posted by DataPacRat at 5:00 PM on April 11, 2011 [3 favorites]


effugas: One of the best layman-ish explanations I've seen is the paper "Profiting from prior information in Bayesian analyses of ecological data" which, even if you ignore the (minimal) maths and have no understanding of ecology, does a fairly good job of explaining where & how Bayesian stats differs from more familiar frequentist methods.
posted by Pinback at 6:07 PM on April 11, 2011 [1 favorite]


It is a heinous bummer that more people aren't up on their stats, but the fact of the matter is, it's about 1000 times easier to get a paper published with that magic p even if the effect is minuscule. Got a robust effect and a confidence interval you can live with? Good luck publishing it if your hypothesis test is significant at .051+.

"But wait," you (or Rosnow & Rosenthal, back in '89) say, "Surely, God loves the .06 nearly as much as the .05."

God? Probably yes. Reviewers who don't understand any statistical procedure more nuanced than a t-test? Unfortunately, probably not. (There are more of them than any of us would like to admit).
posted by solipsophistocracy at 6:19 PM on April 11, 2011 [1 favorite]


Mixing, you mean there's nothing for it except to have a lot more good stats classes, right?

I think that at a certain point of complexity, the nuances of avoiding statistical pitfalls can just be too damn hard for every person involved in research to be expected to be able to steer clear of them.

The math is hard, and one or two classes taken during a graduate degree isn't typically enough to impart the careful thinking required to produce valid results. Certainly, better education is part of the answer, but (as mixing hints) perhaps the best solution is to employ expert statisticians to "certify" statistical results. I'm imagining something akin to 'professionals in the scientific method' - perhaps employed by journals, perhaps separate.

This might smack of elitism, but statistics doesn't seem to be like calculus where you can study for a few terms and solve all of a broad, applicable class of problems. Each real world situation requires lots of careful study, with a mind honed by thinking about these issues.
posted by milestogo at 6:28 PM on April 11, 2011


Here's the link you wanted for and then being distorted further in the popular press
posted by straight at 6:29 PM on April 11, 2011 [1 favorite]


Please don't twist Munroe's sublime sense of humor into yet another anti-intellectual pile of dog turd.

Read the alt text on that comic - "'So, uh, we did the green study again and got no link. It was probably a--' 'RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!'"

Even he makes fun of how the likes of Jenny McCarthy abuse false positives. Most importantly, a study means nothing unless someone else can reproduce it. And not just one reproduction - If one person gets similar results but 100 refute it, you consider the findings refuted. Yes, it may well warrant further study if you think some ambiguity in the methodology led to different outcomes, but we don't just start dealing cards and write a paper when someone deals the ace of spades (p<.02)
posted by pla at 6:42 PM on April 11, 2011


If one person gets similar results but 100 refute it, you consider the findings refuted

How many people do you think would spend the time (and how many organizations the money) to reproduce multi-year jellybean* studies? This isn't measuring the charge of the electron, where every interested scientist runs back to their lab to set up the experiment. A lot of the research being discussed isn't easily reproducible for financial reasons, or because it just might not be interesting enough for people to spend their time confirming.
posted by milestogo at 6:47 PM on April 11, 2011


*lots of stand-ins for jellybeans here. Studies that aim to show causal links in long term causes and effects across broad populations.
posted by milestogo at 6:49 PM on April 11, 2011


milestogo : How many people do you think would spend the time (and how many organizations the money) to reproduce multi-year jellybean* studies?

You jest? You've just described the dream-career of many academics. 30 years of funding, with no publish-or-perish threat to their tenure, to study jelly beans? Sweet (no pun intended)!

Now, the fact that the Ric Romero may somehow get a copy of the original write-up and send the public on a mad rampage against green jellybeans has little bearing on the scientific community. You can read about trivial-but-long-term things proven or disproven (sorry - "rejecting or accepting the null hypothesis") on a weekly basis in every major journal on the market.

Just because most people don't care about the mating habits of the three-toed sloth doesn't mean you don't, somewhere, have a dozen biologists trying to find a way to strap electrodes to the poor bastards' genitalia at any given time.
posted by pla at 7:11 PM on April 11, 2011


You can read about trivial-but-long-term things proven or disproven...on a weekly basis in every major journal on the market.

In Ioannidis' paper "Contradicted and Initially Stronger Effects in Highly Cited Clinical Research" he claims that "Of the 45 highly cited studies with efficiency claims...11 (24%) had remained largely unchallenged" (page 3 of the PDF, middle column). The table later in the article gives specifics. Sorry I'm on my phone and can't be more clear, but those are the kinds of things I was thinking of when I wrote my previous comment.

Just because someone publishes a paper claiming something doesn't mean someone else will try to reproduce it, three-toed sloths notwithstanding (I agree they are awesome). There is always some least interesting study.
posted by milestogo at 7:40 PM on April 11, 2011


As we all heard last week (with the Tevatron particle), 5 sigmas is what it usually takes in physics. Is there any reason just setting the traditional p-value to 5 sigmas wouldn't clear up the problem of false positives (at least for usual science, if not the genome-wide data-mining stuff) without need for any complicated bayesianism? True, 99% of social science and psychology would go poof, but that seems like a small price to pay for confidence. Or is it? What's so bad about 99% of published stuff being likely wrong, as long as it gets righter over time? (For non-medical, non-life-threatening research, that is.)
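
For a sense of scale, here is a small sketch that converts a sigma level into a p-value, assuming a normally distributed test statistic. Physics results are usually quoted as one-sided tail areas, while the usual 0.05 is typically two-sided; both are shown.

```python
import math


def one_sided_p(sigma: float) -> float:
    """Upper-tail area of a standard normal beyond `sigma` standard deviations."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def two_sided_p(sigma: float) -> float:
    """Area in both tails beyond +/- `sigma` standard deviations."""
    return 2.0 * one_sided_p(sigma)


if __name__ == "__main__":
    print("5 sigma, one-sided   :", one_sided_p(5.0))
    print("1.96 sigma, two-sided:", two_sided_p(1.96))
```

Five sigma corresponds to a one-sided p-value of roughly 3 in 10 million, compared with 1 in 20 for the conventional threshold at about 1.96 sigma.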
posted by chortly at 8:02 PM on April 11, 2011


Will the use of Bayes factors replace classical hypothesis testing and p-values?

I'm 95% sure it won't.
posted by storybored at 9:10 PM on April 11, 2011 [5 favorites]


Is there any reason just setting the traditional p-value to 5 sigmas wouldn't clear up the problem of false positives?

It would mean testing a prospective medication on 10,000 human subjects. And then having to pass the costs onto the patient ...
posted by sebastienbailard at 10:15 PM on April 11, 2011


chortly : True, 99% of social science and psychology would go poof, but that seems like a small price to pay for confidence.

I think most of us will agree that the "soft" sciences have drastically less rigor than particle physics. You can smash two particles together under the same conditions and get the same outcome; You can't stick two people in a room together and predict, with confidence, how they will behave toward each other based on what color they wore that day.

Unfortunately, the soft sciences also have the annoying potential to answer some of our deepest questions about our existence. Saying that group X tends to stay married 4.7 years longer than group Y, and do so because of increased "happiness", means a lot to the average Joe. The mass of the Higgs Boson, not so much.
posted by pla at 3:41 AM on April 12, 2011


True, 99% of social science and psychology would go poof

No reason it should. Getting t/z statistics beyond 5 is hardly rare, even with small datasets. I just got a t statistic of almost 20 with an N of 25. It was for something more or less blindingly obvious that for inexplicable reasons hadn't seen a formal test yet, but still.

I think most of us will agree that the "soft" sciences have drastically less rigor than particle physics. You can smash two particles together under the same conditions and get the same outcome; You can't stick two people in a room together and predict, with confidence, how they will behave toward each other based on what color they wore that day.

Even if the first were true, the second part of your statement wouldn't be a good example of that. Where social science is nonrigorous is where it has slapdash, crappy theories of what's going on. The fact that social science doesn't deal with mere lumps of dumb matter smacking into each other or just sitting there gravitating doesn't make it nonrigorous, just very difficult and unlikely to lead to precise predictions.
posted by ROU_Xenophobe at 6:46 AM on April 12, 2011 [2 favorites]


All that for an xkcd post?
posted by AndrewKemendo at 6:49 AM on April 12, 2011


With the same logic, we could assign a prior probability of 0.0000000001 that the luminiferous aether doesn't exist, and say the Michelson-Morley experiment isn't convincing at all because it only updated our probability to 0.00001.
This corresponds to a likelihood ratio p(MM|~LE)/p(MM|LE) of (1e5-1e-5)/(1-1e-5).

Working through the arithmetic, the first time that experiment was independently replicated it would update our probability again, not to 0.00002, but to approximately .5. The next such replication would update us to approximately .99999, and with just four completely independent experiments we get to .9999999999.

So yeah, lousy priors can be a problem, but Bayesian updating grows even tiny probabilities geometrically, since each independent replication multiplies the odds by the same likelihood ratio, so (as long as you never have a prior of 0 or 1) you can still get to the truth surprisingly fast.
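
The arithmetic above can be checked with a short sketch. It assumes the experiments are fully independent and each carries the same likelihood ratio. That ratio, (1e5-1e-5)/(1-1e-5), simplifies to exactly 100001. The function name is made up for illustration.

```python
def posterior_after(prior: float, likelihood_ratio: float,
                    n_experiments: int) -> float:
    """Posterior probability after n independent, equally strong experiments.

    In odds form, each experiment multiplies the odds by the likelihood ratio.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** n_experiments
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, posterior_after(1e-10, 100001.0, n))
```

Starting from a prior of 1e-10, the posterior goes to about 0.00001, then about 0.5, then about 0.99999, then about 0.9999999999.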
posted by roystgnr at 8:44 AM on April 12, 2011 [2 favorites]


Important scientific papers that rely on statistical analysis really should have a proper statistician as a co-author, not just someone who pasted some spreadsheet data into SPSS.

I a little bit hate SPSS. I'm in a marketing research class right now and we're using SPSS. Every time I ask for an explanation of the math behind what we're doing, I'm told 'Don't worry about it, just follow the steps.'

So I'm having to do all this extra research on the side to understand what we're doing, and it's not even my primary focus. But I hate hate when people tell me just to follow the steps.

From the reactions of my classmates whenever I ask for additional detail, I'm the only one who cares about it, though. Good thing most of them aren't going into research.
posted by winna at 9:32 AM on April 12, 2011 [1 favorite]


I think most of us will agree that the "soft" sciences have drastically less rigor than particle physics.

This may be true of particle physics, but in psychology (which, I feel obligated to point out, is a discipline under whose broad umbrella research ranges from molecular to cultural levels of analysis), the 'softer' end of the spectrum often has far more rigorous statistical requirements for publication.

For example, I read a great paper in which the researchers took single cell recordings from humans while they were playing a game (that's pretty much like Crazy Taxi) in order to study spatial navigation. The stats were pretty minimal. They reported means and standard deviations for each of the conditions, and did a half-assed chi-square. You tend to be able to get away with stuff like this when you can make an argument like "we COUNTED the damn spikes, ok? You can SEE it."

To get a paper into JPSP (Journal of Personality and Social Psychology, probably the preeminent one in these fields), it is almost a prerequisite these days that your manuscript include a mediation analysis.

The less abstract the variables you're trying to measure, the more you can get away with minimal statistical testing. I'm not saying I think that this is the way things should be, but from my perspective, it's the way things are.
posted by solipsophistocracy at 1:02 PM on April 12, 2011 [1 favorite]


I might (belatedly; sorry) add that I was speaking a bit facetiously in my earlier post. I guess my point was just that if you're super-worried about false positives, you arguably don't need bayes, just a higher sigma level. But in social science and medicine, false negatives are a serious problem too: you don't want to miss out on an important social or medical intervention that might save lives because you are excessively worried about the danger of statistical mis-steps. Bayes doesn't really solve this; you just have to accept that you'll find a certain number of false positives but that the cost -- in the form of unnecessary medicines and social policies and misled future researchers -- is less than the benefit of saving lives that a more cautious approach might have missed by years or decades. I suppose in the end it matters what exactly the false positive rate in publication is: if it's 99.99%, that's a problem, but if it's merely 90%, that might be worth it for the true positives (and their social benefits) that are being turned up.
posted by chortly at 7:15 PM on April 12, 2011


Did you know there's a direct correlation between the decline of spirograph use and the rise of gang activity? Think about it!
posted by mrzer0 at 12:02 PM on April 15, 2011




This thread has been archived and is closed to new comments