Statistical significance is bad for science, p<.05
March 21, 2019 6:38 PM   Subscribe

Scientists rise up against statistical significance. In a comment piece published in the March 20 issue of the journal Nature, zoologist Valentin Amrhein, epidemiologist Sander Greenland, statistician Blake McShane, and over 800 co-signatories argue that the time has come to abandon the use of statistical significance in science.

For decades, the concept of statistical significance has been the gold standard for assessing scientific truth in many fields of research. An effect is deemed statistically significant if, assuming a "null hypothesis" of no effect, the probability of observing data at least as extreme as the actual data (the p-value) is lower than some threshold, usually 5%. But this concept is notoriously prone to misinterpretation: many scientists take a statistically significant effect to prove the alternative hypothesis (it doesn't), treat two studies with different results under significance testing as contradictory (they aren't necessarily), read a lower p-value as a larger or more important effect (it isn't), or take a non-significant result as evidence that there is no effect (it really, really isn't).
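
To make the definition concrete, here is a minimal sketch of a p-value in action; the simulated data, group sizes, and use of scipy's t-test are illustrative assumptions, not anything prescribed by the Nature comment:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Null world: both groups are drawn from the same distribution (true effect = 0).
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)

    t, p = stats.ttest_ind(a, b)
    # p is the probability of a t-statistic at least this extreme *if* the
    # null is true. It is not the probability that the null is true, and
    # p > 0.05 is not evidence that the two groups are identical.
    print(f"t = {t:.2f}, p = {p:.3f}")

The output says nothing about how large or how important a real effect would be; it only says how surprising the data would be in a world where the null is exactly true.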

Momentum against the (mis)use of significance testing has been growing for some time. In 2015, the journal Basic and Applied Social Psychology announced it would no longer accept papers relying on significance testing, and the editors found that their review process improved and their impact factor grew. While the number of opinion and commentary papers calling for caution in, or abandonment of, the use of p-values and/or significance testing has been growing over the last several years in specialist journals, the publication of this comment in the widely-read generalist journal Nature, with such a large number of co-signatories, may indicate that the culture of science has reached a tipping point.

The language of the comment is strongly worded but measured, with the authors writing that they're "frankly sick" of the widespread misuse and misinterpretation of statistical significance in "presentations, research articles, reviews and instructional materials." Published along with the comment, though, is what amounts to a political cartoon, depicting a young scientist cheerfully guiding her reluctant older colleague to place statistical significance in a dusty closet alongside phlogiston, the four humors, and other artifacts of the history of scientific mistakes.
posted by biogeo (63 comments total) 68 users marked this as a favorite
 
I am a scientific and medical research editor. This is the best news I've heard in a long time.
posted by catlet at 6:49 PM on March 21, 2019 [17 favorites]


I'm against the mis-use of statistical significance as much as the next person, but this seems like it's mostly an argument against sloppy language. Changing "showed no association" to "showed insufficient evidence of association" would address most of what they argue.
posted by If only I had a penguin... at 6:58 PM on March 21, 2019 [23 favorites]


This will solve the replication crisis!
posted by Going To Maine at 6:59 PM on March 21, 2019 [1 favorite]


And no scientist worth their salt argues that any empirical study "proves" anything. People may use sloppy shorthand and say "no association" when they mean "no/insufficient evidence of association" but claiming to have "proved" anything is beyond the pale.
posted by If only I had a penguin... at 7:00 PM on March 21, 2019 [2 favorites]


Excellent news!

Also, everything else in this thread pleases me. Yes, yes, and yes!
posted by darkstar at 7:03 PM on March 21, 2019


And no scientist worth their salt argues that any empirical study "proves" anything.

I think this might be a little too strong. Certainly any statement about empirical proof that relies on statistical hypothesis testing is beyond the pale, but some experiments certainly can prove a result. Especially for questions of the form "is phenomenon X possible," you only need to show it once to prove that it is. I'm picking nits, of course, but what's a post about statistics without pedantry?
posted by biogeo at 7:10 PM on March 21, 2019 [13 favorites]


I'm a great fan of hunches and statements like "well it happened to my uncle, so it must be true/real".
posted by greenhornet at 7:10 PM on March 21, 2019 [4 favorites]


And no scientist worth their salt argues that any empirical study "proves" anything. People may use sloppy shorthand and say "no association" when they mean "no/insufficient evidence of association" but claiming to have "proved" anything is beyond the pale.

If you put people in a context where they have to think about it, sure. But both scientists and consumers of science (doctors are especially bad on this) frequently treat p>0.05 as false, and p<0.05 as true.
posted by vogon_poet at 7:19 PM on March 21, 2019 [5 favorites]


sure! and if we throw out t-stats and correlation as well, the model that I looked at today might be salvageable from the 'burn it with fire!' pile.
posted by Nanukthedog at 7:19 PM on March 21, 2019 [1 favorite]


Especially for questions of the form "is phenomenon X possible," you only need to show it once to prove that it is.

This is true, but (and this is a real question, I don't claim to know the kinds of research questions asked in every field ever), how often is this the question? I mean it would basically be "does this thing exist?" (e.g. a black swan) and the evidence would have to be an actual black swan, not just evidence of a black swan. Are there fields where "does this thing exist" is a common research question and where one finds the actual thing, not just evidence of it?

I guess the one I can think of is essentially the black swan: "Is this animal extinct?" We found one, so we proved it's not. But that's not an experiment, and while it's technically empirical insofar as empirical means based on the senses, it doesn't seem like what one would traditionally call an empirical study.
posted by If only I had a penguin... at 7:21 PM on March 21, 2019 [3 favorites]


Confidence. Meh. Soft sciences attempting to emulate hard sciences with confounding language.
posted by CrowGoat at 7:24 PM on March 21, 2019 [3 favorites]


I think it absolutely depends on the field. In genetics, for example, historically there have been many questions of the form "is gene X necessary for trait Y?" Knock out gene X, and if you still observe Y, even in a subset of your animals, the answer is no. Of course, if you don't observe Y, the answer isn't necessarily "yes." You can then further refine the question to one that is interesting but not susceptible to hard proof in the same way: "does gene X have an effect on trait Y?" But in biology a lot of times the former type of question is already interesting and difficult to answer.
posted by biogeo at 7:29 PM on March 21, 2019 [4 favorites]


Are there fields where "does this thing exist" is a common research question

Yes! And in particle physics, the gold standard is 5 sigma, or a p-value of about 3×10⁻⁷. But also, things are looked at from other angles as well, so that is a necessary but not, ultimately, sufficient condition.
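
For anyone curious where that number comes from: 5 sigma is just the one-sided tail area of a standard normal distribution at five standard deviations. A quick sketch, assuming scipy is available:

    from scipy import stats

    # One-sided tail probability of a standard normal at 5 sigma.
    p_5sigma = stats.norm.sf(5)   # survival function, i.e. 1 - CDF
    print(p_5sigma)               # ~2.9e-07, roughly the quoted 3 x 10^-7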
posted by sjswitzer at 7:29 PM on March 21, 2019 [3 favorites]


Relevant XKCD.
posted by Quackles at 7:30 PM on March 21, 2019 [12 favorites]


I have mixed feelings about this, partly due to my own lack of knowledge on the topic, since I'm still relatively new to statistics. However, I can say that it just seems like sloppy science to not report effect sizes along with the exact p-values. In many cases, though, I think effect sizes with confidence intervals would suffice without falling into the "dichotomy trap".

Some more reading on the topic.
posted by piyushnz at 7:32 PM on March 21, 2019 [2 favorites]


I'd throw out "lines" for p-values altogether. Any arbitrary line is going to be interpreted as a yes/no because it's easier to interpret than giving a sense of what p = 0.04 vs p = 0.06 really entails.

Of course, an easier solution is likeable. It's easier.
posted by solarion at 7:33 PM on March 21, 2019 [1 favorite]


Are there fields where "does this thing exist" is a common research question

Yes! And in particle physics

The biology/gene example seems like a good one, but particle physics, which I acknowledge is one of those fields I know nothing about beyond what I've taken from reading newspaper-level articles about it, was exactly the kind of thing I was thinking about as: the question may be "does x exist" but the evidence isn't "I found X." It's more like "stuff acted in a way one would expect if X existed." It's like finding old paintings of black swans or mentions of black swans in diaries, but not an actual black swan. It's possible that though the behaviour is consistent with what one would expect if the particle existed, the particle doesn't exist and something else is going on.

Apparently I was wrong about the questions (as per gene example). Am I wrong about how evidence works in particle physics?
posted by If only I had a penguin... at 7:36 PM on March 21, 2019 [2 favorites]


it just seems like sloppy science to not report effect sizes along with the exact p-values.

Holy crap are there fields where people report p-values but not effect sizes? This is turning into an eye opening science education for me.

Don't worry, I'm going to bed now before I embarrass myself further with my lack of academic cosmopolitanism.
posted by If only I had a penguin... at 7:39 PM on March 21, 2019 [2 favorites]


I'd argue particle physics is actually an example of a field that can happily continue to do significance testing without much issue. The fact that physical systems are comparatively quite simple, the statistical models required are generally extremely simple, the field's accepted standard threshold is sufficiently conservative, and results usually come from giant consortia with hundreds of authors and extensive internal review, all mean that when the community reports a result at 5-sigma, it's probably real. But particle physics is the exception, in my opinion.

Holy crap are there fields where people report p-values but not effect sizes?

Oh my god, even the concept of effect size is something that's embarrassingly shaky for a surprising number of people in my field.
posted by biogeo at 7:44 PM on March 21, 2019 [13 favorites]


I am certainly not an expert, but looking for black swans is something that particle physicists do, so they are both very good at being able to detect them if they are there and at the same time quite wary of finding false confirmation if they are not.

It is not at all clear if their methods are transferable to other fields, though.
posted by sjswitzer at 7:47 PM on March 21, 2019


I guess this article is interesting, but I don't know whether it means anything.
posted by Joe in Australia at 8:20 PM on March 21, 2019 [5 favorites]


Yes, but if we keep everyone focused around statistical significance we can all continue to make easily identifiable and well understood errors. Take that away and every paper will have completely unique problems in interpretation.
posted by Tell Me No Lies at 8:38 PM on March 21, 2019 [5 favorites]


Yes, the way p-values are used (and abused) is a serious problem. p = 0.045 is not a more practically meaningful or 'real' result than 0.055, other than for career advancement maybe.

But I'm not convinced that we can solve the problems with it by just ditching it.

I would be more comfortable in saying that p on its own is not particularly useful, unless it is very high or very low.

Most of the critiques of p I have read sound more like arguments for it to be used more carefully, less evangelically, and not in isolation.

Maybe more of a cultural problem in the way we use it, than any inherent technical problem with it.
posted by Pouteria at 9:29 PM on March 21, 2019 [2 favorites]


I'm way more than 17 percent interested in all this
posted by philip-random at 10:05 PM on March 21, 2019 [2 favorites]


I skimmed a few articles on p-values yesterday and apparently the definition of a p-value is this conditional probability:

P(Δμ | null hypothesis), where Δμ = the difference between the actual measurement and the theoretical measurement.

And my first reaction as someone without any background in stats is, this seems totally useless in reality, unless you can prove the null hypothesis in the first place! Whatever you want to say using p is contingent on that. It's in the definition. So my mind is blown on the one hand because it's an intriguing and new idea to me, but on the other hand I'm going to have to read an introduction to it some time. One of the articles said that most scientists don't even have a correct conceptual understanding of what a p-value actually is: the presenter quizzed the scientists in the room, and their intuitions about the theory behind it kept turning out to be wrong.
posted by polymodus at 12:49 AM on March 22, 2019 [1 favorite]


We must learn to embrace uncertainty.
I'm on board with this idea, but let's not forget that, in many cases, someone will have to make a Yes-or-No decision at some point. Can we use this drug to cure people? Is this alloy strong enough? Do we authorize the sale of green jelly beans because they may cause acne? The answer can be "maybe, maybe not" for a while, but it cannot be postponed forever.
posted by elgilito at 2:35 AM on March 22, 2019 [2 favorites]


Especially for questions of the form "is phenomenon X possible," you only need to show it once to prove that it is. I'm picking nits, of course, but what's a post about statistics without pedantry?
That's not picking nits. This is picking nits:

You're still not proving it. To use the black swan example - ok you say you saw a black swan. Did you see it or was it an animatronic black swan? Or are you in the Matrix and you're being made to think you saw a black swan?

You can't prove anything in science in the mathematical or logical sense except when the claim itself is a mathematical or logical tautology. And if you're using 'prove' in a looser sense, you're just deciding on some level of probability (for carefully chosen definitions of probability).

Broadly, this is Cromwell's Rule.
posted by edd at 3:34 AM on March 22, 2019 [2 favorites]


....this seems totally useless in reality, unless you can prove the null hypothesis in the first place!

I don't think that's how this works.... The "null hypothesis" is that, say, there is no relationship between smoking and cancer rates. Then you can find that smokers have a much higher rate of cancer than non-smokers, which provides support for (not "proves") the idea that there is a causal relationship. You don't need to start by proving the null; not knowing whether there is a relationship is your natural starting place.
posted by thelonius at 3:43 AM on March 22, 2019


I’m not so much concerned about p values as I am about p hacking, which is basically analyzing your data 37 different ways until something comes up randomly significant and then pretending your results mean a damn thing.

Had a colleague who would repeatedly re-run experiments when they came up insignificant, claiming equipment issues or some nonsense, and then proudly proclaim success when a “valid” result came in. Completely ignoring the 5-6+ times the data showed the hypothesis was crap. (Perhaps not coincidentally, this colleague is not currently funded. Go figure.)
posted by caution live frogs at 5:06 AM on March 22, 2019 [3 favorites]


.... p hacking, which is basically analyzing your data 37 different ways until something comes up randomly significant and then pretending your results mean a damn thing.


It wasn't about this, but a philosophy teacher of mine had a metaphor of shooting arrows into a barn and then painting bullseyes around them
posted by thelonius at 5:18 AM on March 22, 2019 [6 favorites]


I'm philosophically not a fan of p-values for lots of reasons. But p-hacking is avoidable without much fuss through Bonferroni correction. If you're not going to use that (or some other sensible method) then you're not being rigorous and not doing statistics right, and you can do statistics badly and wrongly in lots of ways without having to use p-values and frequentist statistics more generally.
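
For illustration, a minimal sketch of a Bonferroni correction; the alpha level and the p-values are invented:

    alpha = 0.05
    p_values = [0.001, 0.04, 0.20, 0.012, 0.03]   # invented results of m tests
    m = len(p_values)

    # Bonferroni: compare each p-value to alpha / m (equivalently, multiply each
    # p by m). This controls the family-wise error rate across all m tests.
    significant = [p < alpha / m for p in p_values]
    print(list(zip(p_values, significant)))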
posted by edd at 5:23 AM on March 22, 2019 [5 favorites]


Ug. Let’s not even talk about the way some people mis-apply Bonferroni corrections. It’s ridiculous to think that my current analysis should cause prior work with that data to retroactively change its significance scores, yet that’s essentially what happens in (and scuttles the impact of) lots of modern research.
posted by SaltySalticid at 5:56 AM on March 22, 2019


Wouldn't a more honest phrase for p>.05 be "this study did not have sufficient power to detect a difference between the groups"?
posted by clawsoon at 6:06 AM on March 22, 2019 [1 favorite]


biogeo: Certainly any statement about empirical proof that relies on statistical hypothesis testing is beyond the pale, but some experiments certainly can prove a result. Especially for questions of the form "is phenomenon X possible," you only need to show it once to prove that it is.

I recently read The Golem after seeing it recommended on Jeremy Fox's ecology blog. I think you might find it interesting, especially if you're a working scientist with your own experiences to compare it to. They go through a bunch of case studies to see what actually happens when a scientist claims they have an experimental result proving that phenomenon X is possible against standard theory. Does the Platonic ideal of the scientific process that you've stated hold? According to them, it doesn't. The usual response to an anomalous experiment, from Pouchet's proof of spontaneous generation to Pons and Fleischmann's proof of cold fusion, is "you must've done your experiment wrong."

(As it turns out, Pouchet did his experiments just as carefully as Pasteur did; Pasteur only got experimental support for his anti-spontaneous-generation theory because he was lucky. Pasteur only got scientific consensus in support of his theory because his theory seemed more likely, so other scientists concluded that Pouchet must've done something wrong.)

There's an even longer and more interesting discussion of the experiments which were used to prove relativity correct. The attractiveness of the theory of relativity did more to shape scientists' views of the ambiguous early data than the data did to shape their view of the theory.

I'd be curious to know how many examples there are of a single experiment overthrowing an established theory which don't get a lot muddier when you look at them more closely. Maybe Poisson's ill-fated thought experiment, subsequently turned into a real experiment, about Fresnel's wave theory of light? After reading The Golem, I wonder whether even that one is as simple and triumphant as it's usually told.
posted by clawsoon at 6:33 AM on March 22, 2019 [2 favorites]


Wouldn't a more honest phrase for p>.05 be "this study did not have sufficient power to detect a difference between the groups"?

I think that assumes a difference between the groups (or an association) in the population. A study with great power could find a very precise effect size around 0. The issue in that case would not be insufficient power but rather that there really is no association between vaccines and autism (or whatever), and the study could show evidence in support of the null hypothesis (the confidence interval around the odds ratio is .99999999999999999999999999-1.000000000000000000000000001), which would mean a) that this is not about power, and b) that if there is an association (because of course the probability is not 0), then the effect size is so small that it is not meaningful (i.e. not substantively significant).

On the black swans, I realized finding a black swan (or a physics particle assuming one finds the actual particle, not just evidence of it), isn't really a relevant thing. "I found a black swan" isn't a situation where you need to generalize from a sample to a population. It literally just is "this thing exists." If it is the only one in the whole universe, that doesn't change that it exists. The "is it an animatronic swan" isn't a question about alpha error, it's a question about measurement error, which is a whole other thing (A thing, which btw, tests of significance assume does not exist. If you have measurement error, your standard errors are wrong).
posted by If only I had a penguin... at 6:34 AM on March 22, 2019 [3 favorites]


They go through a bunch of case studies to see what actually happens when a scientist claims they have an experimental result proving that phenomenon X is possible against standard theory. Does the Platonic ideal of the scientific process that you've stated hold?

The classic read on this is Kuhn's Structure of Scientific Revolutions. I highly recommend the 50th anniversary edition, which includes an introductory essay discussing how Kuhn's ideas changed over time after he wrote the book. If you've read the book and not the essay, the essay is worth getting if you're interested in such things.
posted by If only I had a penguin... at 6:38 AM on March 22, 2019 [2 favorites]


One of these days I'll get around to reading Kuhn. :-) Lakatos (sorry, haven't found a better link yet) was also recommended in the ecology blog I linked, as "much like Kuhn but more realistic".

On the black swans, I realized finding a black swan (or a physics particle assuming one finds the actual particle, not just evidence of it), isn't really a relevant thing. "I found a black swan" isn't a situation where you need to generalize from a sample to a population. It literally just is "this thing exists."

Do you have any examples of scientific black swans, where a scientific theory said "this thing cannot exist" and then that thing was found? I guess Poisson's dot would be one; any others? More philosophically... when do we ever "find the actual thing" rather than "evidence of the thing"? Isn't finding the actual thing just a case where confidence in our observations is high?
posted by clawsoon at 7:01 AM on March 22, 2019 [3 favorites]


If that question is stupid, it's because I'm still sorting through your statement about the differences between measurement error and sampling error. (It's also because I don't know what I'm talking about, and would be glad to be corrected.) I can envision a view of measurement which posits it as a sample of the infinite number of measurements which could be taken, and thus subject to standard statistical tools, but I'm not sure if that makes any sense.
posted by clawsoon at 7:04 AM on March 22, 2019


A week later, we had more than 800 signatories — all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica.
Getting an Antarctica signature for the comment would be pretty sweet. Should be possible, no?
posted by clawsoon at 7:37 AM on March 22, 2019


>Wouldn't a more honest phrase for p>.05 be "this study did not have sufficient power to detect a difference between the groups"?

Good studies (these are unfortunately a minority) do power calculations before performing the study. So if you have an appropriately powered study, a p over the threshold is a genuine negative.

Studies performed without prior power calculations should be given serious side eye.
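
As a rough illustration of what an a-priori power calculation looks like, here is a minimal sketch using statsmodels; the assumed effect size, alpha, and power target are invented for the example:

    from statsmodels.stats.power import TTestIndPower

    # Solve for the per-group sample size needed to detect an assumed effect
    # in a two-sample t-test.
    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=0.5,  # assumed Cohen's d
                                       alpha=0.05,
                                       power=0.8)
    print(round(n_per_group))  # roughly 64 participants per group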
posted by Easy problem of consciousness at 7:45 AM on March 22, 2019 [2 favorites]


Even with an appropriately powered study, p=0.049 is not meaningfully different from p=0.051.
posted by GCU Sweet and Full of Grace at 8:02 AM on March 22, 2019 [3 favorites]


Do you have any examples of scientific black swans, where a scientific theory said "this thing cannot exist"

Well I've said this is not my area and until biogeo told me, I didn't even know there were fields asking questions like this. But I don't think a black swan is something where a theory says a thing cannot exist. It's just a thing thought not to exist. I mean there's no theory that says unicorns can't exist, but we all generally believe they don't. The discovery of one unicorn would presumably disprove that (and if it turned out to be a horse with a tumor, that wouldn't be an alpha error, it would be measurement error). So if we think of them that way, I would say biogeo's example -- this trait requires gene X, then the trait is observed without gene X -- would be an example.

when do we ever "find the actual thing" rather than "evidence of the thing"?

Finding black swans is one example, obviously. When people who thought all swans were white found black swans, they found actual black swans, not just black feathers, or accounts of black swans, or some genetic material that they analyzed and realized was a) from a swan but b) contained a gene that always makes things black (yes, I realize genes aren't that simple or findable; just making an example).

For other things we tend to find evidence of the thing -- the slight flicker of a star suggests a planet orbiting around it, but we don't see the planet (if we had a *way* more powerful telescope and somehow saw the planet itself and watched it orbiting around a star, that would be finding the thing). When we find a new species, it's usually that a scientist sees/captures a specimen (finding the thing), not just that they deduce a position in the local ecology with some assorted evidence ("something must be eating all the mosquitos and here's some weird mosquito-filled poop"), which would be finding evidence of the thing without finding the actual thing.

Studies performed without prior power calculations should be given serious side eye.

This seems like an example of a statement that assumes what makes sense in one field makes sense universally. However, what makes sense in one field does not make sense universally.
posted by If only I had a penguin... at 8:07 AM on March 22, 2019


I can envision a view of measurement which posits it as a sample of the infinite number of measurements which could be taken, and thus subject to standard statistical tools, but I'm not sure if that makes any sense.

No, measurement error just means you're wrong about the properties of the thing. If you weighed your sample but the scale was miscalibrated, or you made a typo when you recorded the weight, or someone had their thumb on the scale, or you sneezed and so it was a tiny bit wet, or anything else that means the weight in your data is wrong, that's measurement error. So if you found this thing and thought it was a black swan, but actually it's an aardvark in its Halloween costume, that's measurement error.
posted by If only I had a penguin... at 8:13 AM on March 22, 2019 [1 favorite]


No, measurement error just means you're wrong about the properties of the thing. If you weighed your sample but the scale was miscalibrated, or you made a typo when you recorded the weight, or someone had their thumb on the scale, or you sneezed and so it was a tiny bit wet, or anything else that means the weight in your data is wrong, that's measurement error.

But aren't you always wrong - slightly, randomly wrong - about the properties of a thing? Isn't there always some tiny quantum thumb on the scale, one way or another? Don't you always have to take multiple measurements in order to account for random error?
posted by clawsoon at 8:31 AM on March 22, 2019


I recently read The Golem after seeing it recommended on Jeremy Fox's ecology blog.

The classic read on this is Kuhn's Structure of Scientific Revolutions.

Lakatos (sorry, haven't found a better link yet) was also recommended in the ecology blog I linked, as "much like Kuhn but more realistic".


I haven’t read The Golem, but it sounds quite similar to Feyerabend’s classic of “epistemological anarchy” Against Method. Feyerabend was Popper’s research assistant and a contemporary of Kuhn and Lakatos. The book was intended to be a collaboration with Lakatos called “For and Against Method”, but Lakatos died unexpectedly.

I’m probably not doing it justice (it’s been a while since I read it) but it takes a much more anthropological point of view of the “scientific process”. A lot of the book hinges around Galileo’s advocacy for the heliocentric model and how, in Feyerabend’s telling, the evidence at the time wasn’t very strong - it was more Galileo’s hunch, backed up by plenty of political manoeuvring.

Here’s the Stanford Encyclopedia of Philosophy on the book:
He emphasised that older scientific theories, like Aristotle's theory of motion, had powerful empirical and argumentative support, and stressed, correlatively, that the heroes of the scientific revolution, such as Galileo, were not as scrupulous as they were sometimes represented to be. He portrayed Galileo as making full use of rhetoric, propaganda, and various epistemological tricks in order to support the heliocentric position. The Galileo case is crucial for Feyerabend, since the “scientific revolution” is his paradigm of scientific progress and of radical conceptual change, and Galileo is his hero of the scientific revolution. He also sought further to downgrade the importance of empirical arguments by suggesting that aesthetic criteria, personal whims and social factors have a far more decisive role in the history of science than rationalist or empiricist historiography would indicate.
posted by chappell, ambrose at 8:38 AM on March 22, 2019 [2 favorites]


But aren't you always wrong - slightly, randomly wrong - about the properties of a thing? Isn't there always some tiny quantum thumb on the scale, one way or another?

Yes. However, at least some statistical methods assume the answer is no. They require an assumption of non-stochastic (i.e. they don't vary, they just are) properties of the thing. Since for a lot of things it's true to some level of precision, we make do and just assume the measurement error away (i.e. maybe the thing we assumed weighs 1.0000000000000 grams actually weighs 1.000000000000001 grams), and sometimes we know our object's properties are stochastic (e.g. there is no "true" weight for a person) and just pretend we don't so we can still do stats. And taking multiple measures doesn't fix it, because again the required assumption is that it doesn't vary.
posted by If only I had a penguin... at 8:39 AM on March 22, 2019 [3 favorites]


And my first reaction as someone without any background in stats is, [p-values against a null hypothesis] seems totally useless in reality, unless you can prove the null hypothesis in the first place! Whatever you want to say using p is contingent on that.

The null hypothesis is some specific way of saying "Nothing interesting is happening here and anything that appears interesting is just a result of bad luck in drawing your samples." There's no difference between the people who took the drug and the people who took nothing. The variance in part size on each of these two machines is the same. Race has no effect on voting. Etc.

The way (frequentist) hypothesis tests work is that you specify some reasonable null hypothesis or null process, some way of saying "The data we get from the world might appear to be interesting but are really just random crap," and then you ask "How hard would it have been to draw a sample like I actually got if that null process were true?" And if that probability is low enough, you reject it. It would be really hard to get these data if race had no effect on voting, so I can reasonably say race doesn't have no effect on voting.

You're right in thinking that the specific way you draw up the null hypothesis matters, especially for more complex measures or techniques. I've occasionally gotten something to review where I thought the authors were using an unreasonable / wrong null process.

But you don't need to prove the null -- the whole exercise is trying to say "It would be really hard to see an apparent effect this big if there really were no effect, so there must be an effect."
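
A minimal sketch of that logic as a permutation test, with invented data; the null process here is simply "the group labels don't matter":

    import numpy as np

    rng = np.random.default_rng(1)
    group_a = rng.normal(0.0, 1.0, 50)       # invented "control" data
    group_b = rng.normal(0.4, 1.0, 50)       # invented "treatment" data
    observed = group_b.mean() - group_a.mean()

    # Null process: the labels are meaningless, so shuffle them many times
    # and see how big a difference label-shuffling alone tends to produce.
    pooled = np.concatenate([group_a, group_b])
    null_diffs = []
    for _ in range(10_000):
        rng.shuffle(pooled)
        null_diffs.append(pooled[50:].mean() - pooled[:50].mean())

    # How hard would it have been to get a difference this big under the null?
    p = np.mean(np.abs(null_diffs) >= abs(observed))
    print(f"observed difference = {observed:.2f}, permutation p = {p:.4f}")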
posted by GCU Sweet and Full of Grace at 8:46 AM on March 22, 2019 [7 favorites]


>This seems like an example of a statement that assumes what makes sense in one field makes sense universally. However, what makes sense in one field does not make sense universally.

Obviously some fields have barriers to performing large adequately powered studies. I don't mean to malign researchers dealing with those barriers in any way. I just mean that if we don't/can't, then we have to discount positive findings of inadequately powered experiments roughly proportionally to how we discount negative findings or we have a systematic positive bias.
posted by Easy problem of consciousness at 8:54 AM on March 22, 2019


So if you found this thing and thought it was a black swan, but actually it's an aardvark in its Halloween costume, that's measurement error.

I don't think that's what people usually mean by the term "measurement error." What you're describing is what I'd call a mistake. In most work I'm familiar with, measurement error is just the difference between the observed value and the true value of a parameter. Measurement error occurs due to various factors, including limitations in the precision or accuracy of your measurement apparatus or the nature of the parameter being measured.

E.g., suppose I'm trying to measure the dry weight of a substance with a high degree of precision. If I take this measurement using a scale that's open to the room atmosphere, small air currents will push down or up on the scale, creating a force that looks like fluctuations in the weight of the sample. Vibrations in the room caused by heavy vehicles driving by outside can also affect the reading in a random way. These are sources of unbiased error, which can be reduced but never entirely eliminated. But if the substance I'm weighing is hygroscopic, it will pull moisture from the air, and my measured weight will depend on the humidity of the room and how long I've allowed the sample to equilibrate. My resulting measurement of the dry weight will always be an overestimate, and I have a biased measurement error.

I can take steps to reduce both biased and unbiased sources of error, and maybe if I'm careful even eliminate bias, but every measurement will always be accompanied by some source of unbiased error. The error in question may be well below the threshold I care about, allowing me to effectively treat my measurement as exact, but measurement error is always there. Ideally, data analysis should take measurement error into account and propagate it accurately (there's a textbook on the shelf next to me right now on exactly this topic, "Data Reduction and Error Analysis for the Physical Sciences"). In practice, outside of physics and statistics, I don't know how often this actually happens.
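
As a toy illustration of propagating unbiased errors (the numbers are invented, and this is just the standard add-in-quadrature rule for independent uncertainties, not anything specific to that textbook):

    import math

    m1, sigma1 = 10.3, 0.2   # invented measurement 1 and its standard uncertainty
    m2, sigma2 = 4.1, 0.1    # invented measurement 2 and its standard uncertainty

    # For a sum or difference of independent measurements, unbiased errors
    # combine in quadrature rather than adding directly.
    difference = m1 - m2
    sigma_diff = math.sqrt(sigma1**2 + sigma2**2)
    print(f"{difference:.2f} +/- {sigma_diff:.2f}")   # 6.20 +/- 0.22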
posted by biogeo at 9:23 AM on March 22, 2019 [3 favorites]


Obviously some fields have barriers to performing large adequately powered studies.

And some fields that do very large studies without doing power calculations first.
posted by If only I had a penguin... at 10:47 AM on March 22, 2019


biogeo: I don't disagree with anything you said in that last comment. It seems consistent with the way I learned about these things. I would add that regression analyses require the assumption that there is no measurement error, including random measurement error. Obviously there is always measurement error, so if it's not too big we kind of pretend it doesn't matter and it probably mostly doesn't, but technically speaking from a mathematical proof perspective, your estimates are wrong if you have measurement error.

Nitpick: well, you don't measure parameters, you estimate them. So the weight of the sample is a value on a variable. You measure that value. The correlation between that weight and...I dunno, the time of day...in the population is a parameter. The correlation you calculate from your data after weighing a bunch of samples of the substance and the time of day is your point estimate of the parameter.
posted by If only I had a penguin... at 10:56 AM on March 22, 2019 [3 favorites]


regression analyses require the assumption that there is no measurement error, including random measurement error

Well, in the independent variable. Measurement error in the dependent variable is obviously fine.
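
A small simulated illustration of why: noise in the predictor biases the estimated slope toward zero (the classic attenuation, or regression dilution, effect), while noise in the outcome only makes the estimate less precise. The numbers and the use of numpy are just for the example:

    import numpy as np

    rng = np.random.default_rng(2)
    x_true = rng.normal(0, 1, 10_000)
    y = 2.0 * x_true + rng.normal(0, 1, 10_000)     # noise in y only

    x_noisy = x_true + rng.normal(0, 1, 10_000)     # noise added to the predictor

    # Ordinary least squares slope with a clean vs. a noisy independent variable.
    slope_clean = np.polyfit(x_true, y, 1)[0]
    slope_noisy = np.polyfit(x_noisy, y, 1)[0]
    print(f"slope with clean x: {slope_clean:.2f}")  # close to the true 2.0
    print(f"slope with noisy x: {slope_noisy:.2f}")  # attenuated toward ~1.0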
posted by en forme de poire at 11:04 AM on March 22, 2019 [1 favorite]


looking for black swans is something that particle physicists do

Except that existing dark matter detectors would be completely insensitive to black swans... *returns to drawing board*
posted by heatherlogan at 11:48 AM on March 22, 2019 [1 favorite]


I’m not so much concerned about p values as I am about p hacking, which is basically analyzing your data 37 different ways until something comes up randomly significant and then pretending your results mean a damn thing.

Particle physicists deal with this too (and it's one reason why our standard of "discovery" is 5 sigma). The more "experiments" (read: analyses) you do, the more chances you have for statistics to give you that accidental one-time-in-20 that p<0.05 implies. This is known as the "look-elsewhere effect", and is accounted for in statistical analyses by incorporating the "trials factor" -- the additional probability hit that comes from performing multiple analyses. You'll hear this referred to when there's a bump in some distribution and the experiments quote a local significance (the probability of the number of events in this one bin having fluctuated upward to the observed level, given a background-only hypothesis) as well as a global significance (which accounts for the fact that there are a whole bunch of bins, any of which could have fluctuated upward).
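
A minimal sketch of the trials factor, under the simplifying assumption that the analyses are independent (real analyses usually aren't, which is why the actual corrections are more involved):

    # Probability of at least one accidental p < 0.05 across n independent looks.
    p_local = 0.05
    for n_analyses in (1, 5, 20, 100):
        p_global = 1 - (1 - p_local) ** n_analyses
        print(f"{n_analyses:3d} analyses -> chance of a false alarm = {p_global:.2f}")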
posted by heatherlogan at 11:59 AM on March 22, 2019 [2 favorites]


I recently read The Golem [...] They go through a bunch of case studies to see what actually happens when a scientist claims they have an experimental result proving that phenomenon X is possible against standard theory. Does the Platonic ideal of the scientific process that you've stated hold? According to them, it doesn't. The usual response to an anomalous experiment, from Pouchet's proof of spontaneous generation to Pons and Fleischmann's proof of cold fusion, is "you must've done your experiment wrong."

Oh my goodness no. The actual response to Pons and Fleischmann's "proof" (really, an experimental result indicating possible cold fusion) was for dozens of labs all over the world to immediately try to reproduce the phenomenon, which, if real, would have been a fantastic discovery. Nobody else could reproduce the reported effects. Other anomalies turned up too, like the apparent absence of the massive neutron dose that Pons and Fleischmann ought to have received if their reported results were true. Only then did the rest of the community conclude that "you must've done your experiment wrong".

The real world is real, and accessible through experimentation by anyone. Science doesn't stop after only one experiment.
posted by heatherlogan at 12:06 PM on March 22, 2019 [5 favorites]


Wouldn't a more honest phrase for p>.05 be "this study did not have sufficient power to detect a difference between the groups"?

Better is "the difference between the groups is smaller than such-and-such quantitative amount". Then your "null result" is actually a useful contribution to human knowledge. Particle physics does this when we set quantitative limits on things that we don't discover (i.e., the vast majority of particle physics publications). :)

Or what If only I had a penguin... wrote.
posted by heatherlogan at 12:13 PM on March 22, 2019


>And some fields that do very large studies without doing power calculations first.

Did you have a particular field/barrier in mind? I'd like to distinguish between "it would make it harder to do good science because X" and "most people in this field don't culturally do that but their work would be more informative if they did".
posted by Easy problem of consciousness at 12:20 PM on March 22, 2019


Health and epidemiology. Social Sciences. It's not just about culturally not doing it. With big studies you're not looking at the relationship between two variables. You're looking to understand a whole host of things within a given topic area and the people designing the study don't even know all the things it will ultimately be used for, so how would you do power calculations beforehand?
posted by If only I had a penguin... at 12:53 PM on March 22, 2019 [1 favorite]


Oh yeah, that kind of study has strengths (especially in uncovering associations to investigate further) but no findings can be considered nearly as convincing as a study (often even a much smaller study) designed to look specifically at that question. Sometimes in health even weak evidence can prevent further studies, of course, due to ethical concerns.
posted by Easy problem of consciousness at 1:23 PM on March 22, 2019


The way (frequentist) hypothesis tests work is that you specify some reasonable null hypothesis or null process, some way of saying "The data we get from the world might appear to be interesting but are really just random crap," and then you ask "How hard would it have been to draw a sample like I actually got if that null process were true?" And if that probability is low enough, you reject it. It would be really hard to get these data if race had no effect on voting, so I can reasonably say race doesn't have no effect on voting.

I can see why empirical scientists find this attractive, but by the definitions of formal probability theory, given the value of

P (X | H)

one cannot say anything about P (H | X), which is what the bold part means. The reason, as any introductory engineering course covers, is that the two, P (H|X) and P (X|H), are connected by P (H), i.e. by Bayes' Theorem. In contrast, there's a fundamental philosophical assumption required by frequentism, which of course may or may not be justifiable depending on the scenario. Even probability theory and intro information theory classes don't cover these more philosophical issues, so there's not a broad awareness of this ongoing work (I'm thinking of a plato.stanford.edu article explaining problems with the meaning of probability).
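
A minimal numeric sketch of that point, with invented numbers: holding the likelihood P(X | H) fixed, the posterior P(H | X) can come out very differently depending on the prior P(H). The helper function here is hypothetical, just Bayes' theorem written out:

    # Invented numbers: the same P(X | H_null) is compatible with very
    # different values of P(H_null | X), depending on the prior P(H_null).
    def posterior_null(p_x_given_null, p_x_given_alt, prior_null):
        p_x = p_x_given_null * prior_null + p_x_given_alt * (1 - prior_null)
        return p_x_given_null * prior_null / p_x   # Bayes' theorem

    for prior in (0.5, 0.9, 0.99):
        print(prior, round(posterior_null(0.05, 0.80, prior), 3))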

Actually, it sounds like the null hypothesis is just a statistics-theoretic analog of what in probability is simply called independence of random variables, plus an extra axiom to make it empirically usable.

But again this issue isn't something I'm at all familiar with, these are my random uninformed thoughts, it's just interesting to hear about in the news.
posted by polymodus at 3:12 PM on March 22, 2019


I found it interesting that the authors aren't promoting Bayesian approaches as the alternative:
The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.
posted by clawsoon at 3:27 PM on March 22, 2019 [1 favorite]


existing dark matter detectors would be completely insensitive to black swans...

Yes, obviously what you need is duck matter detectors instead.
posted by Joe in Australia at 4:01 AM on March 23, 2019 [3 favorites]


Better is "the difference between the groups is smaller than such-and-such quantitative amount".

There are actually testing paradigms in which you pick a minimum effect size Z, and then separate results into positive, negative (CI entirely contained within the interval (-Z, Z)), and inconclusive (CI includes zero but extends outside of (-Z, Z), indicating that a real effect is also consistent with the data). The one I linked to is called conditional equivalence testing. It's a more sensible option in many ways than standard NHST (null hypothesis significance testing), because while many different outcomes can result in a non-significant p-value, there's a big difference between a 95% CI that goes from (-1, 100) and one that goes from (-0.0001, 0.0001). It's also nice because it forces people to think about effect sizes more explicitly: in standard NHST effect size really only enters the picture during study design or in interpretation. And CIs are not a totally foreign idea to people; I think the forest plots of CIs you see in meta-analysis papers are actually a lot more intuitive than p-values.
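
A minimal sketch of that three-way classification; the classify helper, the CI endpoints, and the threshold Z are invented for illustration:

    # Classify a result by where its confidence interval (ci_low, ci_high)
    # falls relative to a smallest effect size of interest, z.
    def classify(ci_low, ci_high, z):
        if ci_low > 0 or ci_high < 0:
            return "positive"       # CI excludes zero: evidence of some effect
        if -z < ci_low and ci_high < z:
            return "negative"       # CI inside (-z, z): any effect is negligible
        return "inconclusive"       # CI straddles zero and reaches beyond (-z, z)

    for ci in [(0.2, 0.9), (-0.05, 0.04), (-1.0, 100.0)]:
        print(ci, classify(*ci, z=0.1))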

I don't think NHST is evil and I think sometimes its detractors get hung up on things that are not of much practical significance. The "all nulls are false" objection is one of these for me; maybe this is true in some absolute sense or maybe it isn't, but nevertheless, in lots of real-world settings a null hypothesis of zero effect sure does a pretty good job of explaining the data! Sometimes also the argument seems to be comparing the ideal way to do a different type of analysis with the worst way to do NHST: like I said above, it's true that a p-value by itself doesn't tell you about effect size, but also, you can't really get away without submitting a power analysis when you are, for example, registering a clinical trial. If you do one of those, that means you are actually thinking about effect sizes along with "just" significance. (Even if you don't do one beforehand, post-hoc power analyses are still helpful in interpretation, as long as you don't just test the effect size you happened to find.)

On balance, though, I do kind of agree with the authors that the best thing is probably to move away from binary NHST and towards just reporting how accurately you estimated a particular effect. I do think it's worth not throwing out the fact that theoretically false "null models" can be very useful in practice. Neutral and nearly-neutral models in ecology and evolutionary biology have been amazingly helpful in establishing a bar for, e.g., evidence of adaptation.
posted by en forme de poire at 1:07 PM on March 23, 2019 [4 favorites]




This thread has been archived and is closed to new comments