# Great. Now I'm even uncertain about how uncertain I should be.

August 2, 2019 11:10 AM

An uptick in the inability to reproduce recent scientific conclusions, aka the Replication Crisis, is calling into question the practice of using significance testing to make inferences. Some think we should use Bayes' Theorem instead to make inferences and quantify uncertainty.

Suddenly throwing results into the local pond to see if they float doesn't seem quite so backward and ignorant, does it?

Yeah, that's what I thought.

posted by Naberius at 11:49 AM on August 2, 2019 [10 favorites]

*An uptick in the inability to reproduce...*

To clarify for those who haven't been following this, there's been an increase in failed replications, but that is only because previously few attempts at replications were made. I don't know if anyone has any evidence on whether earlier findings were more reliable. Perhaps they were, and the growth of statistical software and number of publications and pressure to publish or get media attention has made it easier and more tempting to generate false findings. Or perhaps there was just as much bad science done in the 1950s through 1980s and no one bothered to check it. See "Has Much Changed?" in this article.

posted by Mr.Know-it-some at 12:07 PM on August 2, 2019 [14 favorites]

*An uptick in the inability to reproduce recent scientific conclusions*

Obviously the first thing we need to find out is whether this is a statistically significant uptick.

posted by flabdablet at 12:14 PM on August 2, 2019 [20 favorites]

Use of empirically-derived p-values seems fine. There are scenarios where you can count how many times you observe something in a subset, can count how many times you expect to observe something based on all events, and can derive a p-value from those counts. I worry that the take-away from these articles is that people hear the phrase "p-value" and get dismissive without really having a good understanding why.

posted by They sucked his brains out! at 12:16 PM on August 2, 2019 [4 favorites]

Empirically speaking, as I get older my bladder's p-value appears to be diminishing.

posted by Greg_Ace at 12:27 PM on August 2, 2019 [18 favorites]

The Nautilus article does a good job of explaining the base rate fallacy that makes many p value interpretations misleading, even when the analysis is pretty straightforward.

This comment in the Hacker News discussion of that article makes a very nice point that I hadn't considered before: measuring an effect as the Bayesian 'update' alone doesn't require inventing/agreeing on a prior, which fixes the usual argument against the Bayesian approach.

posted by kaibutsu at 12:29 PM on August 2, 2019 [2 favorites]
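
Editorial aside: the base rate point and the "update alone" idea in kaibutsu's comment can be sketched numerically. This is only an illustration under assumed numbers: the power of 0.8, the 0.05 threshold, and the function names `posterior_true` and `likelihood_ratio` are hypothetical choices, not taken from the Nautilus article or the Hacker News comment.

```python
def posterior_true(prior, power=0.8, alpha=0.05):
    """P(hypothesis is true | significant result), via Bayes' theorem.

    prior: assumed fraction of tested hypotheses that are actually true.
    power: assumed P(significant | hypothesis true).
    alpha: P(significant | hypothesis false), i.e. the p-value cutoff.
    """
    true_positive = prior * power
    false_positive = (1 - prior) * alpha
    return true_positive / (true_positive + false_positive)


def likelihood_ratio(power=0.8, alpha=0.05):
    """The 'update' on its own: how much a significant result multiplies
    the odds of the hypothesis. No prior is needed to compute it."""
    return power / alpha


# Same evidence, different base rates: the posterior moves a lot
# (roughly 0.64 for a 10% prior, roughly 0.94 for a 50% prior).
for prior in (0.01, 0.1, 0.5):
    print(prior, round(posterior_true(prior), 3))

# The update itself is the same factor whatever prior a reader holds.
print(likelihood_ratio())
```

Reporting the ratio lets each reader plug in their own prior odds, which is one way to read the "doesn't require agreeing on a prior" argument.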

I came across a study of publication bias in publications related to publication bias while writing an article about the subject for a client, if anyone else here gets the meta-giggles about that sort of thing.

posted by BS Artisan at 12:38 PM on August 2, 2019 [18 favorites]

*an uptick in the inability to reproduce recent scientific conclusions*

paging Liu Cixin to the white courtesy telephone please

posted by Two unicycles and some duct tape at 12:54 PM on August 2, 2019 [5 favorites]

Maybe send'em to a couple physics classes? That guy, you know... the e=mc2 guy, keeps getting replicated

posted by sammyo at 1:27 PM on August 2, 2019 [3 favorites]

When asked what it would mean if the highly anticipated experiment to measure the deflection of starlight by the Sun during an upcoming total Solar eclipse did not agree with his theory, didn't Einstein say 'then the experiment is wrong.' ?

posted by jamjam at 2:26 PM on August 2, 2019 [1 favorite]

I read the Nautilus article this morning and couldn't make heads or tails of it. Perhaps it's because I break into a cold sweat whenever the word "Bayesian" swims into view. I'm not what you'd call math-literate.

posted by kozad at 2:48 PM on August 2, 2019

*Maybe send'em to a couple physics classes? That guy, you know... the e=mc2 guy, keeps getting replicated*

If only the dang speed of light would just stay constant!

posted by hopeless romantique at 2:53 PM on August 2, 2019

In my experience, people tend to pull out Bayesian statistics when *they don't have enough data to draw robust frequentist conclusions*. Remember that the "standard of evidence" in medicine is (*shudder*) a p-value of 0.05, which means 95% confidence level, or the famous "19 times out of 20". When you slice-and-dice your data 20 different ways to see if you can drag out anything significant, chances are you're going to get a p < 0.05 just from statistical fluctuations.

The solution to the replication crisis isn't Bayesian statistics, it's collecting enough data to be able to make a more robust statistical statement.

Particle physics went to the 5 sigma standard for new physics discoveries like 20 years ago.

posted by heatherlogan at 3:08 PM on August 2, 2019 [9 favorites]
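
Editorial aside: the "slice the data 20 different ways" arithmetic in heatherlogan's comment is easy to check, as is the gap between p < 0.05 and 5 sigma. This is a rough sketch that assumes the 20 looks at the data are independent tests and that "sigma" means a one-sided Gaussian tail. The function names are made up for illustration.

```python
import math


def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests of true null
    hypotheses comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def sigma_to_p(n_sigma):
    """One-sided Gaussian tail probability beyond n_sigma standard
    deviations (assumed convention for quoting 'N sigma' results)."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


# About 0.64: better than even odds of a spurious hit somewhere.
print(prob_any_false_positive(20))

# Roughly 2.9e-07 for 5 sigma, versus the 0.05 cutoff.
print(sigma_to_p(5))
```

If the 20 slices overlap, as they usually do in practice, the tests are correlated and the exact figure changes, but the qualitative point survives.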

To anyone who may consider the word "Bayesian" intimidating: all it really means is that extraordinary claims require extraordinary evidence. Most people already think this way.

Laplace formulated this mathematically in 1814, per https://en.wikipedia.org/wiki/Sunrise_problem . The modern version applies a trivial mathematical identity to an enormous class of problems, formulated as "we believe this is what produces our observations," followed by testing that belief against evidence. If the evidence does not support the belief, then after a while, you discard the belief. Vigorous debates go on and on about where the tipping point is for discarding a belief (or theory, or whatever), but one might wonder whether such thresholds can ever be universal.

(Spoiler: they can't. There's a reason that a statistician's first answer is almost always "it depends" -- saying "the sun will rise tomorrow" implies a much lower burden of proof than "the world will end tomorrow". This is also the issue with having a "standard" p-value cutoff for "significance". Physicists decided that a false negative was more acceptable than a false positive many decades ago, and agreed that a much smaller chance of random error was acceptable in discoveries. As a result, physics rarely has replication issues any more.)

posted by apathy at 3:08 PM on August 2, 2019 [8 favorites]
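
Editorial aside: the sunrise problem that apathy links to has a compact closed form, Laplace's rule of succession. After `successes` out of `trials` observations, and starting from a uniform prior on the unknown probability, the estimated chance of success on the next trial is (successes + 1) / (trials + 2). The sketch below assumes that uniform prior. The function name and the example counts are hypothetical.

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Posterior mean of a success probability under a uniform prior,
    after observing `successes` out of `trials` (Laplace's rule)."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# With no data at all, the estimate is the prior mean, 1/2.
print(rule_of_succession(0, 0))

# Ten sunrises observed out of ten days: 11/12, not certainty.
print(rule_of_succession(10, 10))

# More evidence pushes the estimate toward 1 but never reaches it.
print(float(rule_of_succession(10_000, 10_000)))
```

The estimate never hits exactly 0 or 1, which is the "adjust beliefs gradually as evidence accumulates" behavior described in the comment.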

*Maybe send'em to a couple physics classes?*

Unpopular opinion: physics is the easiest science.

posted by klanawa at 3:10 PM on August 2, 2019 [13 favorites]

> The solution to the replication crisis isn't Bayesian statistics, it's collecting enough data to be able to make a more robust statistical statement.

Porque no los dos? At least in principle, the shrinkage from a prior will tend to INCREASE the amount of data required to show a convincing posterior effect. Of course, if your sample sizes are big enough, there won't be much difference between a shrunken (Bayesian or penalized) and unshrunken effect size estimate, which is kind of the point...

In situations where clear structure exists to the generating process (e.g. genetics) and multiple outcomes are measured, using an unbiased approach is demonstrably less powerful than an (empirical) Bayesian approach, cf. Stein's phenomenon. I don't consider myself "a Bayesian", but my lab uses empirical Bayes approaches all the time, simply to reduce the overall error of our estimates.

Personally, I suspect that a major reason for replication failures is simple bad incentives and bogus (sometimes fraudulent) claims, but bad intent isn't required for poor science -- and in fact, replication studies alone cannot stop the propagation of bad science. See The Natural Selection of Bad Science for a lovely dissection of these issues.

posted by apathy at 3:14 PM on August 2, 2019 [4 favorites]
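
Editorial aside: Stein's phenomenon, mentioned in apathy's comment, can be seen in a small simulation. The sketch uses the positive-part James-Stein estimator as a stand-in for the empirical Bayes methods the comment refers to. It is not the commenter's lab's actual method, and the ten "true effects", the unit noise variance, and the function names are all assumptions made for illustration.

```python
import random


def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate: shrink every coordinate of x
    toward zero by a common, data-dependent factor."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return [shrink * v for v in x]


def simulate(true_means, trials=2000, seed=0):
    """Average total squared error of the raw (unbiased) estimates and of
    the shrunken estimates, over repeated noisy measurements."""
    rng = random.Random(seed)
    raw_err = js_err = 0.0
    for _ in range(trials):
        x = [rng.gauss(m, 1.0) for m in true_means]  # one noisy measurement each
        js = james_stein(x)
        raw_err += sum((a - m) ** 2 for a, m in zip(x, true_means))
        js_err += sum((a - m) ** 2 for a, m in zip(js, true_means))
    return raw_err / trials, js_err / trials


true_means = [0.5 * i for i in range(-5, 5)]  # hypothetical effects, 10 outcomes
raw_risk, js_risk = simulate(true_means)
print("unbiased:", raw_risk)
print("shrunken:", js_risk)
```

The shrunken estimates are biased, yet their total squared error comes out lower than that of the unbiased ones, which is the trade described as reducing "the overall error of our estimates".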

For a moment, I was concerned we had a problem reproducing the Reproduction Crisis.

posted by I-Write-Essays at 3:28 PM on August 2, 2019 [1 favorite]

*Vigorous debates go on and on about where the tipping point is for discarding a belief*

Easy: That tipping point is itself a belief which we should test against empirical evidence.

posted by I-Write-Essays at 3:32 PM on August 2, 2019 [1 favorite]

I'm just mad that the Replication Crisis isn't about a swarm of robots endlessly building copies of themselves.

Who was it that said science is always coming up with cool names for boring stuff and boring names for cool stuff?

posted by Mr.Encyclopedia at 3:34 PM on August 2, 2019 [3 favorites]

*Unpopular opinion: physics is the easiest science.*

My bachelor's degrees are in biology and physics, and I approve this message.

posted by biogeo at 4:03 PM on August 2, 2019 [5 favorites]

*Who was it that said science is always coming up with cool names for boring stuff and boring names for cool stuff?*

Calvin and Hobbes, for one.

posted by Greg_Ace at 4:15 PM on August 2, 2019 [2 favorites]

*Calvin and Hobbes, for one.*

I knew it had to be either Calvin and Hobbes or XKCD.

posted by Mr.Encyclopedia at 4:52 PM on August 2, 2019 [1 favorite]

*The solution to the replication crisis isn't Bayesian statistics, it's collecting enough data to be able to make a more robust statistical statement.*

I'd argue more strongly than that. *Even* Bayesian statistics *and* more data aren't enough to solve the replication crisis on their own. What we need is more nuanced ways of thinking and talking about the outcomes of studies in terms of the quality of evidence they provide.

My research involves collecting quantitative behavioral data as well as physiological data. For certain kinds of physiological data (in my case, the firing rates of neurons), it's not that uncommon to run analyses that produce p=0. That is, p is exactly zero within the computer's ability to compute p-values accurately, so something like p < 10^-200. But that only reflects the fact that you're analyzing something like hundreds of thousands or millions of action potentials (just read "data points" if you don't know what those are), and because of the way the data is collected, very small biases can occur between conditions for all kinds of reasons that are unrelated to the phenomenon that you're actually interested in studying. So sometimes you look at these "p=0" results and find the actual effect size is something like 1%, and you got a wildly significant p-value just because of the number of spikes you recorded. This is unlikely to represent anything biologically meaningful, and so usually when I am filtering my data for effects, I filter on both p-value *and* effect size; generally the effect size part is what excludes most of the results. By contrast, my behavioral data rarely has this issue, because I don't generally get millions of behavioral samples. (I know, millions of samples may not seem that much in comparison to, say, particle physics, but for biology where your data sources are much messier, it's plenty for biases to dominate your results.)

So just collecting *more* data isn't really enough, because depending on the nature of the data it can easily be that you become almost certain to find *some* kind of positive result. Thinking in terms of effect sizes, and confidence intervals on effect sizes, is in my opinion more fruitful than just blindly collecting more data.

The reason that I tend to prefer Bayesian thinking about statistics is not because it makes it more likely that I get a significant result (I do my best to avoid reporting significance values at all if I can, because I don't think they tell us what we actually want to know). Rather, it's because Bayesian thinking *better captures my intuitions about data analysis*. There is almost never a case when I want to know the answer to the question "What is the probability of observing data at least as unusual as the data that I observed, under the assumption that my hypothesis is false?" What I *actually* want to know is, "How likely is it that my hypothesis is true?" Or maybe better, "After the observations I made, how much more or less should I believe my hypothesis?" You can actually approach these questions perfectly rigorously using either Bayesian or frequentist methods, but Bayesian approaches to this tend to be (to me) much more intuitive, and much more transparent about what assumptions you're making as you go about answering the question.

Based on all this, I don't think the 5 sigma standard is necessarily the right one for biomedical research. Biological systems are much, much more complex than particle interactions, and while I am not a particle physicist and may be wrong about this, I think most new hypotheses in particle physics are generally tested in the context of models with a comparatively small number of free parameters (i.e., the Standard Model) which are already very well bounded, and the important questions to answer are ones that are in some sense somewhat orthogonal to what we already know about that model. By contrast, new hypotheses in biology and biomedicine are developed and tested in the context of (often implicit) models with a huge number of (usually implicit) parameters, and the important questions to answer are often quite parallel to previous or concurrent research. Because of this, studies in biology and biomedicine quite commonly "pseudoreplicate" one another; that is, both depend, partially or entirely, on the same underlying model, even if the nature of the experiments is quite different. Any one study may provide only limited evidence for or against the underlying model, but the sum total of the literature can provide a strong test, similar to what the 5-sigma standard is intended to achieve. Unfortunately, because these models and their parameters are usually implicit rather than explicit, it's difficult to actually interrogate the literature in a rigorous way.

I think the replication crisis requires making the underlying models and hypotheses that go into the formulation of an experiment, and the assumptions that go into the statistical analyses of the results of those experiments, explicit rather than implicit. We need better cultural norms around how to talk about the strength of evidence that a given study provides, and we need to abandon the notion that the function of a scientific study is to produce a binary true/false assessment of a hypothesis. We need to recognize the independent value of both exploratory and confirmatory science, and recognize that there are different standards in play between the two (e.g., for whether you should prefer type I or type II statistical errors when doing traditional statistical inference), and stop forcing scientists to pretend that their studies are hypothesis-driven when they're actually observation-driven. I think adopting Bayesian methods and language for statistics can help with all of those things, but Bayesianism is neither necessary nor sufficient for achieving those goals.

posted by biogeo at 5:10 PM on August 2, 2019 [26 favorites]
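
Editorial aside: biogeo's "p=0 with a 1% effect" observation follows from how test statistics scale with sample size. The sketch below uses a two-sample z-test with equal group sizes and treats the effect as a standardized difference of 0.01, which is a simplifying assumption (the comment's 1% may be a relative change in firing rate). The thresholds in `worth_a_look` are hypothetical, not the commenter's actual filter.

```python
import math


def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value of a two-sample z-test, given a standardized
    mean difference and equal group sizes (known unit variance assumed)."""
    z = abs(effect_size) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))


def worth_a_look(p, effect_size, alpha=0.05, min_effect=0.2):
    """Filter on both p-value and effect size (thresholds are assumptions)."""
    return p < alpha and abs(effect_size) >= min_effect


tiny_effect = 0.01
p_small_n = two_sided_p(tiny_effect, 100)        # far from significant
p_huge_n = two_sided_p(tiny_effect, 1_000_000)   # wildly significant

print(p_small_n, p_huge_n)
print(worth_a_look(p_huge_n, tiny_effect))  # fails the effect-size filter
```

The same tiny effect goes from invisible to overwhelmingly "significant" purely because n grew, so a p-value alone cannot say whether the effect matters.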

Plus, nobody wants to do studies that would require killing 5 sigmas worth of mice. I think probably any animal research ethics board in the U.S. or Europe would reject a proposal designed at that level as a clear violation of the "reduce" aim of the 3Rs for animal research ethics.

posted by biogeo at 5:20 PM on August 2, 2019 [8 favorites]

apathy, I think Bayes goes beyond extraordinary claims requiring extraordinary evidence-- it's also a matter of adjusting your beliefs a little bit when you find ordinary evidence against them.

posted by Nancy Lebovitz at 6:09 PM on August 2, 2019 [1 favorite]

*The solution to the replication crisis isn't Bayesian statistics, it's collecting enough data to be able to make a more robust statistical statement.*

The solution to the replication crisis is incentivizing and funding replication. For things with little immediate consequence, we can just publish the stuff and let people attack it later. For things with stronger consequences, we can fund and insist on independent replication before we actually do anything with the information.

I really don't think it's an accident that the replication crisis is so heavily concentrated in disciplines that

(1) Are very difficult, in the sense that there are not any good theories at the scale of relativity to apply to particular problems

(2) Study areas where the data-generating processes almost certainly really are very noisy and probabilistic, even if you had relativity-scale theories in operation

(3) Are funded and therefore operated in ways that reward the creation of a series of novel datasets with 50-250 observations

*Particle physics went to the 5 sigma standard for new physics discoveries like 20 years ago.*

That makes some sense for disciplines that societies shovel money at as hard as they shovel it at particle physics, except that almost every discipline is messier, noisier, and harder in the sense of being less amenable to theory than particle physics is.

Following biogeo, it's also the case that sometimes getting the data you'd like is impossible. If I want data on only 1000 elected US Presidents, I don't have any alternative but to wait for at least another 3800 years.

posted by GCU Sweet and Full of Grace at 6:10 PM on August 2, 2019 [6 favorites]

*Use of empirically-derived p-values seems fine. There are scenarios where you can count how many times you observe something in a subset, can count how many times you expect to observe something based on all events, and can derive a p-value from those counts.*

That's a rough mathematical description of p-values, so of course you can calculate that. Then you need to decide what to do with that calculation.

The problem is that what a lot of people want to do is then say "I have a low p-value and therefore my hypothesis is likely true (or my intervention is effective or whatever)." Or maybe they conclude the opposite: a pretty high p-value, so likely your hypothesis is wrong and/or what you are doing is having no effect.

It seems like this should work but it turns out it usually doesn't. In practice the number of situations for which you can make the leap from "low p-value" to "likely to reproduce" is much smaller than people intuit. For example: Did you look at that data set of cancer rates while you were developing a hypothesis that you are now calculating a p-value for? Then the p-value isn't reliable as a significance test.

Fisher's original formulation, which I will paraphrase as "if you have a true hypothesis you should be able to repeatedly perform experiments that give you a p-value < 0.05" is not that bad, assuming it's the kind of hypothesis you can design repeated experiments for that way. But it's rarely how it's used in the standard paper or scientific discussion.

posted by mark k at 8:15 PM on August 2, 2019 [5 favorites]

*I came across a study of publication bias in publications related to publication bias while writing an article about the subject for a client, if anyone else here gets the meta-giggles about that sort of thing.*

Is this truly a representative study, or is there a bias in which publications about publication bias in publications about publication bias get published on Metafilter?

posted by roystgnr at 10:13 PM on August 2, 2019

I'd argue that if you get meaningfully different answers using frequentist and Bayesian techniques, in at least one case you're definitely asking the wrong question. If the Bayesian approach forces you to actually write down some of the assumptions that are implicit otherwise, maybe it's worthwhile. But it's not a magic bullet against bad experimental design and failing to take into account selection effects, the stopping problem, etc.

In my own corner of physics, we're plagued by both underestimating systematics in measurement and by failing to include uncertainties in the models themselves. I think most people are genuinely trying their best. Given how hard it is to honestly report significance when looking at something with a few dozen well-defined parameters and a very specific measurement goal, I can only imagine how hard it is when dealing with people.

I've published a paper that formally claimed a 92-sigma detection. (Of something that wasn't actually the point of the paper, and which has become a bit of a running joke for many decades since the first 2 sigma detection.) I can show you the MCMC plots. To what extent do I believe the underlying physical model used in those fits is actually true? 3 sigma would be optimistic. (Clearly, an alternative model would have to generate very similar values in the measured parameters.) To what extent do I believe the absolute calibration of the instruments that produced those data? That's a much harder question. Simulations and jackknives suggest we're not wrong enough to change the result.

posted by eotvos at 7:47 AM on August 3, 2019 [3 favorites]

posted by Huffy Puffy at 11:34 AM on August 2, 2019 [2 favorites]