Keeping the search for significance at Bayes
August 19, 2015 6:38 PM

From FiveThirtyEight, a three-part series on p-values, retractions, and the importance of experimental failure and nuanced interpretation: Science Isn't Broken.
posted by Lutoslawski (8 comments total) 25 users marked this as a favorite
 
This is really good.
posted by intermod at 7:34 PM on August 19, 2015 [2 favorites]


The premise of this article is good, but the example of linear regression (that's the method used for the p-value about the political parties) is a bit unfair. In my field (genetics), peer reviewers know that linear regressions readily turn up significant p-values, and just showing a significant p-value on a linear regression is definitely not a publishable result. I'm not sure if it's like this in economics, but if anything I'd guess people are more strict. In theory, 5% of datasets with no real effect should still give a significant result. In reality, it's probably more like 10 or 15%, which is really bad. But it's definitely not 60%, as the first example would lead you to believe.

The other part that's a bit off in that example is that (again, in my field) just showing p < 0.05 isn't really publishable. Thing is, if your result is real, you'll first see a p-value of 0.05, but as you keep collecting data it'll just get lower and lower and lower. So, just to be safe (and because it's kind of nice to have an excuse to keep doing the same thing!), most researchers won't settle for p < 0.05, but will keep going until they have something orders of magnitude lower. It's also just a practical thing: you can spend one month collecting more data now and make sure the result is real, or you can spend years building on this data, always with the nagging fear that it isn't true.

But yeah, due to p-hacking, the probability of a dataset with no real effect giving a significant result is way higher than 0.05, which is a huge problem.
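
If anyone wants to see these claims in numbers, here's a throwaway simulation (Python, purely invented data, not from any real study): a modest real effect whose p-value keeps dropping as the sample grows, and pure noise that clears p < 0.05 about 5% of the time when you test it once, or around 40% of the time if you quietly test ten outcomes and keep the best one.

    # Sketch only: invented data, numpy/scipy assumed installed.
    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(1)

    # 1) A real (smallish) effect: p tends to fall as the sample grows.
    #    Any single run will wobble, but the trend is downhill.
    for n in (50, 200, 800, 3200):
        x = rng.normal(size=n)
        y = 0.15 * x + rng.normal(size=n)      # true slope of 0.15
        print(n, "observations, p =", linregress(x, y).pvalue)

    # 2) No effect at all: count how often p < 0.05 shows up anyway.
    trials, n = 5000, 50
    honest = hacked = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        if linregress(x, rng.normal(size=n)).pvalue < 0.05:
            honest += 1
        # "p-hacked": test the same x against 10 unrelated outcomes, keep the best
        if min(linregress(x, rng.normal(size=n)).pvalue for _ in range(10)) < 0.05:
            hacked += 1
    print("honest false-positive rate:", honest / trials)       # ~0.05
    print("best-of-10 false-positive rate:", hacked / trials)   # ~0.40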
posted by Buckt at 7:50 PM on August 19, 2015 [4 favorites]


CORRECTION (Aug. 19, 12:10 p.m.): An earlier version of the p-hacking interactive in this article mislabeled one of its economic variables. It was GDP, not productivity.

*snerk*
posted by jedicus at 7:55 PM on August 19, 2015 [1 favorite]


Thing is, if your result is real, you'll first see a p-value of 0.05, but as you keep collecting data it'll just get lower and lower and lower.

I think this might be somewhat particular to genetics. It seems to me that genetics does pretty well with p-value analyses because the data sets tend to be huge, and maybe a little more straightforward than in other biomedical/economic/etc. fields, where you're working with pretty limited data sets, long and somewhat fuzzy trials, and a hard limit on the amount of data you can collect.

I'm not a geneticist, but it has certainly seemed to me in my field (audiology) that you'll get a pretty limited data set (maybe 100 subjects) with a whole bunch of variables, so finding a significant p-value often does become the sort of "p-hacking" game described in the article. When you look at drug-benefit data and the like, so much hinges on what you mean by "benefit" and how it's measured, usually over a relatively small sample, that I do personally think the over-emphasis on p-values has created this problematic culture in science (greatly exacerbated by the media).
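
To put a rough number on that p-hacking game with a small sample and a whole bunch of variables (this is just arithmetic, not anything from an actual audiology data set):

    # Back-of-the-envelope: testing m unrelated variables at alpha = 0.05,
    # and pretending the tests are independent (real clinical variables
    # never quite are), the chance of at least one spurious hit is 1 - 0.95**m.
    alpha = 0.05
    for m in (1, 5, 10, 20, 40):
        print(f"{m:3d} variables -> chance of a spurious p < 0.05: "
              f"{1 - (1 - alpha) ** m:.2f}")
    # One blunt fix is a Bonferroni-style threshold of alpha / m,
    # at the cost of power.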
posted by Lutoslawski at 8:57 PM on August 19, 2015 [2 favorites]


The premise of this article is good, but the example of linear regression (that's the method used for the p-value about the political parties) is a bit unfair.

Being 538, it shouldn't be too surprising that it's offering things with a social-science bias. I'm in political science and we do a lot of linear models (and not-OLS-but-still-GLM stuff like the various flavors of logit/probit). I don't see a lot of econ, but what I do see seems deeply wedded to a GLM world. Not snarking: what tools would geneticists use if they were having an in-lab argument about whether Democrats are good for the economy?

The other part that's a bit off in that example is that (again, in my field) just showing p < 0.05 isn't really publishable.

Sure, but you live in a world where you can actually make data. If you wanted even 10,000 observations about Democrats and the economy, you'd have to wait more than 9,100 months, or 758 years, for the data to accumulate. I mostly study state politics, and data availability means that I'm often stuck with cross-sections -- no matter what I do, I have 50 states. Well, 49, because Nebraska is fucked up.
posted by ROU_Xenophobe at 8:58 PM on August 19, 2015 [3 favorites]


I found this paper to be immensely enlightening on the history of the p-value:
p-Values are Not Error Probabilities

The main thrust is that back in the '20s and '30s, when a lot of the basic stats stuff was being figured out, there were two schools of thought:

School A (Fisher's camp) brought the p-value, and it was tailored towards scientists who had to work with a small number of expensive-to-gather observations.

School B (the Neyman-Pearson camp) was built around long-run error probabilities, which more or less correspond to the precision and recall values that we compute using cross-validation in the machine learning world. The p-value school chastised the error-rate people because who would ever have so much data?
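
For anyone who hasn't touched this workflow, the cross-validated precision/recall style of evaluation looks roughly like this (a minimal scikit-learn sketch on synthetic data; the model and the dataset are placeholders, not anything from the paper):

    # Sketch: estimate out-of-sample precision and recall by cross-validation
    # instead of quoting a p-value. Synthetic data; scikit-learn assumed.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            cv=5, scoring=["precision", "recall"])
    print("precision:", scores["test_precision"].mean())
    print("recall:   ", scores["test_recall"].mean())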

The paper argues that over time the two schools got sorta-kinda merged in a messy way, losing sight of the core concepts, to the point that most basic texts to this day contain fundamental errors. And meanwhile (my editorializing here), stats has been stuck in the small-experiment mindset, which is useful in lots of areas but has kind of led the field to drop the ball on machine learning and data science. ("The fact that data science exists as a field is a colossal failure of statistics," as Hadley Wickham put it.)

Bottom line is that p-values kinda suck, and just about the only thing they have going for them at this point is that most people* claim to understand them. On the other hand, there are a million other choices for measuring the success of your favorite hypothesis, which opens up whole new dimensions of value-hacking...

My guess (I've been drinking tonight) is that in the long run the main tools of science will be mathematics, machine learning, and literary criticism. Which is to say: prove it, cross-validate a strong predictive model, or GTFO.

* - By people, we mean PhD-wielding researchers.
posted by kaibutsu at 12:17 AM on August 20, 2015 [2 favorites]


I may not have learned any statistics from this, but I did learn that Republican presidents are inversely correlated with economic prosperity to a high degree of confidence.
posted by rum-soaked space hobo at 1:09 AM on August 20, 2015 [1 favorite]


The premise of this article is good, but the example of linear regression (that's the method used for the p-value about the political parties) is a bit unfair.

The point is to show how selecting the data inputs can completely change the results *and* still pass a statistical test. It's explicitly meant *not* to be fair; it's meant to show people who don't know statistics how p-hacking can happen -- that is, how you can select your data to fit an arbitrary standard.

It's like the frictionless, perfectly uniform sphere in a perfect vacuum that we're always using when teaching physics. No such beast exists anywhere, ever, but it makes the equations really simple and the demonstrations easy.
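
A bare-bones version of what that interactive is doing under the hood might look like this (pure noise, invented variable names, Python): search over a few arbitrary definitions of "party" and "the economy" and report whichever pairing gives the smallest p-value.

    # Toy specification search: every series below is pure noise, yet with
    # 4 x 5 = 20 combinations to choose from there's roughly a 64% chance
    # (1 - 0.95**20) that the "best" pairing comes out significant.
    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(42)
    n = 60
    predictors = {f"party_measure_{i}": rng.normal(size=n) for i in range(4)}
    outcomes = {f"economy_measure_{j}": rng.normal(size=n) for j in range(5)}

    best = min((linregress(x, y).pvalue, px, py)
               for px, x in predictors.items()
               for py, y in outcomes.items())
    print("best combination:", best[1], "vs", best[2], "p =", round(best[0], 3))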
posted by eriko at 2:59 AM on August 20, 2015



