the replication revolution
August 22, 2018 3:45 AM

The competing narratives of scientific revolution - "Scientific revolutions occur on all scales, but here let's talk about some of the biggies:

  • 1850-1950: Darwinian revolution in biology, changed how we think about human life and its place in the world.
  • 1890-1930: Relativity and quantum revolutions in physics, changed how we think about the universe.
  • 2000-2020: Replication revolution in experimental science, changed our understanding of how we learn about the world.
"We are in the middle of a scientific revolution involving statistics and replication in many areas of science, moving from an old paradigm in which important discoveries are a regular, expected product of statistically-significant p-values obtained from routine data collection and analysis, to a new paradigm of . . . weeelll, I'm not quite sure what the new paradigm is." (via)

also btw...
-The Knowledge Loop[*]
-Collective Awareness:1,2,3,4,5
-Real World vs. Book Knowledge
-How Econ Went From Philosophy to Science (what if... profit?)1,2,3
-Atomism is basic: emergence explains complexity in the Universe
-The Supreme Court isn't equipped to judge the statistics in the Harvard discrimination case
posted by kliuless (20 comments total) 28 users marked this as a favorite
 
What are the chances we could start doing rigorous replication of results of the form "read their methods section, follow those exactly, publish whether it replicates or not" and start only taking seriously studies that actually replicate? Of course it would double-triple the cost of science (or halve-third the rate of new findings) but it sure would fix a lot of problems and increase correctness of our knowledge.

It would require major shakeups to bring us back to the scientific method, of course.
posted by Easy problem of consciousness at 5:32 AM on August 22, 2018 [1 favorite]


But what is the cost of studies that later turn out to be wrong, invalidating everything predicated on them? I don't think a return to the scientific method would necessarily increase costs.
posted by merlynkline at 5:49 AM on August 22, 2018 [4 favorites]


or halve-third the rate of new findings

If a large percentage of new studies are not replicable, then what is the harm in reducing the rate of "discoveries"?
posted by RustyBrooks at 6:11 AM on August 22, 2018 [5 favorites]


I'm not sure it's a revolution by his definition. It's a big deal, but from my perspective "we can imagine a world in which the replication revolution in science was no revolution at all, but just a gradual reform" seems about what we're living through. Gelman's on the barricades, so he perceives it as bloody. The scale I'm seeing change at is the metaphorical equivalent of parliamentary reforms--Nature is adding data quality statements, norms are shifting on data sharing, undergrads know what p-hacking is, and so on. I guess whether it's a paradigm shift depends on your dominant paradigm.

Looking forward to digesting the other links when I have time.
posted by mark k at 6:54 AM on August 22, 2018 [7 favorites]


This is a deep and interesting post. Thank you!
posted by HumanComplex at 7:15 AM on August 22, 2018


What are the chances we could start doing rigorous replication of results of the form "read their methods section, follow those exactly, publish whether it replicates or not" and start only taking seriously studies that actually replicate

This is an interesting question, with several answers.

1. The chances are not exactly bad in theory - the major federal granting agencies like the NSF and NIH have funded, and will continue to fund, replication studies. The bigger problem is that it's hard to make a name and reputation for yourself as a replication scientist, and the bigger name and reputation you have as a scientist the better gigs you get and the more money you make.

2. The idea behind doing meta-analyses and systematic reviews is to get a sense of the overall answer to some question by looking at all the studies which have tried to answer it. It's not replication, but it does a similar sort of thing in theory by trying to weigh all of the studies that have looked at some similar question (a minimal sketch of that weighting appears at the end of this comment). It is the highest form of scientific evidence.

3. Actual replication is very difficult. Often we read a methods section and things seem clear and sensible enough on a read-through, but when you actually go back and try to follow them to the letter and re-create the study, it is much more difficult with a lot more ambiguity than you might have thought at first.

We should have much better mechanisms in place to do replications, particularly of studies which have been given a lot of weight. There are definitely efforts to try and make this more doable. And there are efforts to make all data from federally funded studies openly available, so at the very least other scientists can re-do stats and analyses on the same data to try and reduce things like p-hacking, which seems a very easy and reasonable first step.
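To make the weighting idea in (2) concrete, here is a minimal fixed-effect, inverse-variance pooling sketch in Python; the effect sizes and standard errors are invented for illustration, not taken from any real meta-analysis.

  # Minimal fixed-effect meta-analysis sketch; the numbers are invented, not real studies.
  # Each study is weighted by the inverse of its variance, so precise studies count more.
  import numpy as np

  effects = np.array([0.42, 0.15, 0.30, 0.05])     # per-study effect estimates
  std_errors = np.array([0.20, 0.10, 0.15, 0.08])  # per-study standard errors
  weights = 1.0 / std_errors**2                    # inverse-variance weights

  pooled = np.sum(weights * effects) / np.sum(weights)
  pooled_se = np.sqrt(1.0 / np.sum(weights))
  print(f"pooled effect = {pooled:.3f}, 95% CI = +/- {1.96 * pooled_se:.3f}")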
posted by Lutoslawski at 8:03 AM on August 22, 2018 [5 favorites]


The bigger problem is that it's hard to make a name and reputation for yourself as a replication scientist, and the bigger name and reputation you have as a scientist the better gigs you get and the more money you make.

Whatever incentive there is for people to be article reviewers should be nth-tupled (with whatever n is appropriate) for people to do replication studies. It should be a service that every academic scientist feels an obligation to occasionally do.

Actual replication is very difficult. Often we read a methods section and things seem clear and sensible enough on a read-through, but when you actually go back and try to follow them to the letter and re-create the study, it is much more difficult with a lot more ambiguity than you might have thought at first.


But that seems to be entirely on the authors of the original paper. Write your methods section extremely well or risk having your findings tossed out.
posted by straight at 8:14 AM on August 22, 2018 [2 favorites]


What are the chances we could start doing rigorous replication of results of the form "read their methods section, follow those exactly, publish whether it replicates or not" and start only taking seriously studies that actually replicate?

It depends a lot on the kind and scale of study.

If replication means downloading the same set of publicly-available datasets, performing the appropriate manipulations and merges, and conducting the same set of statistical analyses, that's trivial but not really a replication since you're using exactly the same datasets. That's just checking whether you fucked up doing your data manipulations and entering the commands for your analyses. Which still matters, duh, but is not the same thing.
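For what it's worth, here is a hypothetical sketch of that first, analysis-only kind of "replication"; the file names, variables, and published coefficient below are all invented.

  # Hypothetical sketch: re-run a published analysis on the same public data.
  # File, column, and coefficient values below are invented for illustration.
  import pandas as pd
  import statsmodels.formula.api as smf

  survey = pd.read_csv("public_survey_2016.csv")   # hypothetical public datasets
  county = pd.read_csv("county_covariates.csv")
  merged = survey.merge(county, on="county_fips", how="inner")

  model = smf.ols("turnout ~ age + income + education", data=merged).fit()
  published_coef = 0.031                           # value claimed in the (imaginary) paper
  print(model.params["income"], "vs. published", published_coef)
  # Agreement only confirms the arithmetic and data handling, not the finding itself.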

If replication means forcing another 25-100 undergraduates to spend an hour each doing some kind of psychological experiment as part of passing psych 101, that's nontrivial but obviously doable.

If replication means performing a new survey of 25,000 people at a cost of millions of dollars, that's obviously impractical.

If replication means waiting for a new dataset of presidential election results to self-generate or requires you to go back in time to perform a new survey of the American mass public in 1974, replication before publication is impossible.

tl;dr: Insisting on real replication is something that seems really only relevant to small-scale experimental work.
posted by GCU Sweet and Full of Grace at 8:23 AM on August 22, 2018 [2 favorites]


Well this dude certainly has a high opinion of himself.

Experimental Psychology has a problem that most scientific disciplines do not, in that most of the questions it asks are difficult to answer in more than one way, making rote replication unusually valuable, particularly given the dangers of the statistical illiteracy that remains endemic to many of its sub-disciplines. Indeed, unlike biologists, who can build their models on testable structure/function arguments like an appropriately strong foundation for a house, or physicists, who can anchor themselves with intricately testable mathematical models, experimental psychologists in many sub-disciplines have only the philosophical assumptions they've brought with them. Thus, especially in disciplines like Social Psychology that have largely been exercises in applying the same test to different aspects of a good just-so story, when the only support the story has is generally the one kind of test, it is especially important that the one kind of test be grounded in good statistical assumptions. However, most of the scientific enterprise has other, and generally more productive, ways to validate its work.

Rote reproduction is generally not necessary because when scientists are sufficiently clever, asking good questions that would lead them to develop good models, there should be better and more useful ways to attempt to falsify those models and their assumptions than just doing the same thing all over again. That is especially true if the tests were performed and analyzed with good statistical methods where they are necessary. A good scientist doing most forms of basic research is always thinking about what they will do with the answers they get, such that either they or others will be able to both verify their efforts and continue asking better questions of the new model proposed at the same time, rather than just wasting cycles asking the same one over and over again, which is generally likely to only produce trivial answers. This is part of being good stewards of the precious resources we get. We do have to be careful at each of these steps to check the assumptions of the models we use and interrogate, both with our own and others' data, and published scientific papers are intentionally designed to help us do this.

For example, if a bacteriophage geneticist wants to know what a bacteriophage gene does, they have a lot of options. They could start by cloning it into each member of a library of strains of yeast that have been prepared to perform what is called a yeast two-hybrid test, to see if the protein that the gene makes interacts with any one of a set of other bacterial proteins in its host, which would each give an indication of function. However, this test is notorious for producing both false-positive and false-negative results in different kinds of contexts; some of these false results are repeatable and some of them are not. So instead of just repeating the test over and over again, which will only allow you to eliminate the non-repeatable false results, it's a good idea to instead also do a bacterial two-hybrid screening. This asks essentially the same question with a somewhat different technique that has different advantages and disadvantages, but is also pretty notoriously unreliable for differently complex reasons.

Now let's say that one test comes up with binding activity by the unknown protein to DNA gyrase, a protein involved in spinning DNA as part of a process required for replication, and the other turns up a transcription factor (a protein that turns on or off specific genes by binding to DNA), but neither test confirms the other. A pure statistician with no particular expertise in any specific scientific field might be inclined to throw up his hands and call these tools clearly useless, while a microbial geneticist would be likely to hypothesize that maybe this unknown protein interacts with the DNA that these two different proteins would spend most of their time bound to in both critters. It's certainly not a clear demonstration of DNA binding activity by any means, but it's certainly strong enough evidence to make asking a different question make a lot more sense than just repeating either of those two tests again.

For example, DNA can be made to migrate across a salty gel if you push it with an electric current, and you can watch it go. If we were to incubate both the DNA of the bacteriophage and the DNA of its bacterial host with the protein, we could see if one or both get slowed down by binding activity from the protein. Let it incubate for longer and we can see if the protein degrades DNA by seeing if the DNA disappears. If any one of these things happens, then there will be a much stronger foundation for more investigation than anything that more two-hybrid screening could tell you.

In this context, as in most, the failure to replicate each result wasn't really an indication that anything actually went wrong, but was instead an indication that the right answer was likely to a question that wasn't being asked yet. In molecular genetics we have the luxury of assembling papers from a number of different kinds of tests. Even in disciplines like Social or Evolutionary Psychology, where the "replication crisis" does appear to be at least in part the direct result of abysmal research practices being normalized, a given failure to replicate is not by itself a direct indication of those bad research practices. This dude is still thinking of science as an enterprise concerned with generating TrueFacts to the exclusion of supposed facts that are not true. However, by treating scientific papers as if they were the units that truth is supposed to come in, he is fundamentally misunderstanding what they're for. Scientific papers are for communicating, as well as possible, findings, the methods used to arrive at those findings, and the certainty of the results, in order to allow the scientific community to collaboratively build better models of the natural world. If those findings are unreplicable, or even actively misleading, that isn't necessarily an indication that anything has actually gone wrong with the process as it should be functioning.

While mass replication of high-impact papers does actually make a lot of sense in the context of specific problems that are facing specific fields in experimental psychology, and has clearly done a lot of good for the field, the way this dude wildly over-interprets the results will only get in the way of that. Each "failed replication" will need to be interpreted individually by people who actually understand the context of the questions being asked, and who ideally have the statistical literacy that is not yet a given in some fields, to figure out what value each one has. Was the published result a fluke of p-hacking? Were the methods unsound in some other way? Are the results from undergrads being assessed in each effort diverse across institutions? Or across time? Similarly, there are very good reasons for even respectable and responsible scientists to be terrified of him training his deeply flawed understanding of the scientific process on them. While the specific fields he has been addressing are filled with unsound methods, the idea that unsound methods are the only reason why results might not be replicated by other researchers is not only wrong, it's profoundly toxic to the more ontologically complex things that those fields need to be doing to sort themselves out.
posted by Blasdelb at 9:15 AM on August 22, 2018 [5 favorites]


See also, Psychology Is Not In Crisis:
An initiative called the Reproducibility Project at the University of Virginia recently reran 100 psychology experiments and found that over 60 percent of them failed to replicate — that is, their findings did not hold up the second time around. The results, published last week in Science, have generated alarm (and in some cases, confirmed suspicions) that the field of psychology is in poor shape. But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works.

Suppose you have two well-designed, carefully run studies, A and B, that investigate the same phenomenon. They perform what appear to be identical experiments, and yet they reach opposite conclusions. Study A produces the predicted phenomenon, whereas Study B does not. We have a failure to replicate. Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon from Study A is true only under certain conditions. The scientist’s job now is to figure out what those conditions are, in order to form new and better hypotheses to test.

A number of years ago, for example, scientists conducted an experiment on fruit flies that appeared to identify the gene responsible for curly wings. The results looked solid in the tidy confines of the lab, but out in the messy reality of nature, where temperatures and humidity varied widely, the gene turned out not to reliably have this effect. In a simplistic sense, the experiment “failed to replicate.” But in a grander sense, as the evolutionary biologist Richard Lewontin has noted, “failures” like this helped teach biologists that a single gene produces different characteristics and behaviors, depending on the context. Similarly, when physicists discovered that subatomic particles didn’t obey Newton’s laws of motion, they didn’t cry out that Newton’s laws had “failed to replicate.” Instead, they realized that Newton’s laws were valid only in certain contexts, rather than being universal, and thus the science of quantum mechanics was born.

In psychology, we find many phenomena that fail to replicate if we change the context. One of the most famous is called “fear learning,” which has been used to explain anxiety disorders like post-traumatic stress. Scientists place a rat into a small box with an electrical grid on the floor. They play a loud tone and then, a moment later, give the rat an electrical shock. The shock causes the rat to freeze and its heart rate and blood pressure to rise. The scientists repeat this process many times, pairing the tone and the shock, with the same results. Eventually, they play the tone without the shock, and the rat responds in the same way, as if expecting the shock. Originally this “fear learning” was assumed to be a universal law, but then other scientists slightly varied the context and the rats stopped freezing. For example, if you restrain the rat during the tone (which shouldn’t matter if the rat is going to freeze anyway), its heart rate goes down instead of up. And if the cage design permits, the rat will run away rather than freeze.

These failures to replicate did not mean that the original experiments were worthless. Indeed, they led scientists to the crucial understanding that a freezing rat was actually responding to the uncertainty of threat, which happened to be engendered by particular combinations of tone, cage and shock. Psychologists are usually well attuned to the importance of context. In our experiments, we take great pains to avoid any irregularities or distractions that might affect the results. But when it comes to replication, psychologists and their critics often seem to forget the powerful and subtle effects of context. They ask simply, “Did the experiment work or not?” rather than considering a failure to replicate as a valuable scientific clue. As with any scientific field, psychology has some published studies that were conducted sloppily, and a few bad eggs who have falsified their data. But contrary to the implication of the Reproducibility Project, there is no replication crisis in psychology. The “crisis” may simply be the result of a misunderstanding of what science is.

Science is not a body of facts that emerge, like an orderly string of light bulbs, to illuminate a linear path to universal truth. Rather, science (to paraphrase Henry Gee, an editor at Nature) is a method to quantify doubt about a hypothesis, and to find the contexts in which a phenomenon is likely. Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.
posted by Blasdelb at 9:18 AM on August 22, 2018 [8 favorites]


Write your methods section extremely well or risk having your findings tossed out.

"Writing well" in the sense of clear communication of complex ideas or methods, is far too rare in almost all walks of life. And ironically, "English" classes in school don't really teach that skill.
posted by Greg_Ace at 9:18 AM on August 22, 2018 [1 favorite]


So, tracing the replication crisis (or whatever you want to call it) to the particular pitfalls and theory structure of social/evolutionary psych is problematic, because we are seeing similar troubles in fields with very different incentives, stakes, theory structure, etc. too:

http://www.slate.com/articles/health_and_science/future_tense/2016/04/biomedicine_facing_a_worse_replication_crisis_than_the_one_plaguing_psychology.html
posted by heyforfour at 11:06 AM on August 22, 2018 [1 favorite]


we are seeing similar troubles in fields with very different incentives, stakes...

I don't know that the incentives and stakes are all that different for the individual researchers. If you want a research career in academics, you have to bring in money, and if you don't bring in money, you have to go do something else. Sure, the stakes for society are very different between social psych and cancer research, but the decisions in the lab are made by the researchers, who have every incentive to get stuff published no matter what so they can get the next grant.

I have a suspicion that, functionally, the purpose of a great deal of research as it is done these days is not to solve problems or further knowledge, but to generate new problems to solve (using the same old approaches) and to manufacture interesting little facts that don't mean anything. If someone came along and cured cancer once and for all, all those billions that are divvied up amongst the researchers would disappear: no more grants, no more summer salary, no more travel to conferences in far off places, no more teaching buyouts, no more indirect costs to the university coffers. The incentives for researchers tend to be weighted on the side of not making real progress.
posted by logicpunk at 11:28 AM on August 22, 2018 [1 favorite]


"Rote reproduction is generally not necessary because when scientists are sufficiently clever, asking good questions that would lead them to develop good models, there should be better and more useful ways to attempt to falsify those models and their assumptions then just doing the same thing all over again. That is especially if the tests were performed and analyzed with good statistical methods if they are necessary. A good scientist doing most forms of basic research is always thinking about what they will do with the answers they get, such that either they or others will be able to both verify their efforts and continue asking better questions of the new model proposed at the same time, rather than just wasting cycles asking the same one over and over again, which is generally likely to only produce trivial answers. This is part of being good stewards of the precious resources we get. We do have to be careful at each of these steps to check the assumptions of the models we use and interrogate, both with our own and other's data, and published scientific papers are intentionally designed to help us do this."

Okay.

[2018] Don't characterize replications as successes or failures. Discussion of "Making replication mainstream," by Rolf A. Zwaan et al. Behavioral and Brain Sciences. (Andrew Gelman)
No replication is truly direct, and I recommend moving away from the classification of replications as “direct” or “conceptual” to a framework in which we accept that treatment effects vary across conditions. Relatedly, we should stop labeling replications as successes or failures and instead use continuous measures to compare different studies, again using meta-analysis of raw data where possible.
[2018] Large scale replication projects in contemporary psychological research. American Statistician. (Blakely B. McShane, Jennifer L. Tackett, Ulf Bockenholt, and Andrew Gelman)
Replication is complicated in psychological research because studies of a given psychological phenomenon can never be direct or exact replications of one another, and thus effect sizes vary from one study of the phenomenon to the next—an issue of clear importance for replication. Current large scale replication projects represent an important step forward for assessing replicability, but provide only limited information because they have thus far been designed in a manner such that heterogeneity either cannot be assessed or is intended to be eliminated. Consequently, the non-trivial degree of heterogeneity found in these projects represents a lower bound on heterogeneity.

We recommend enriching large scale replication projects going forward by embracing heterogeneity. We argue this is key for assessing replicability: if effect sizes are sufficiently heterogeneous—even if the sign of the effect is consistent—the phenomenon in question does not seem particularly replicable and the theory underlying it seems poorly constructed and in need of enrichment. Uncovering why and revising theory in light of it will lead to improved theory that explains heterogeneity and increases replicability. Given this, large scale replication projects can play an important role not only in assessing replicability but also in advancing theory.
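For readers wondering what "assessing heterogeneity" means operationally: below is a minimal sketch of the standard DerSimonian-Laird random-effects calculation, using invented replication estimates. tau^2 is the estimated between-study variance and I^2 is the share of total variation attributable to heterogeneity rather than sampling error.

  # Minimal DerSimonian-Laird sketch with invented replication estimates.
  import numpy as np

  effects = np.array([0.40, 0.10, 0.25, -0.05, 0.30])  # effect estimate from each replication
  ses = np.array([0.12, 0.10, 0.15, 0.11, 0.09])       # standard errors
  w = 1.0 / ses**2

  fixed = np.sum(w * effects) / np.sum(w)              # fixed-effect pooled estimate
  Q = np.sum(w * (effects - fixed) ** 2)               # Cochran's Q
  df = len(effects) - 1
  C = np.sum(w) - np.sum(w**2) / np.sum(w)
  tau2 = max(0.0, (Q - df) / C)                        # between-study variance
  I2 = max(0.0, (Q - df) / Q) * 100                    # % of variation due to heterogeneity
  print(f"tau^2 = {tau2:.3f}, I^2 = {I2:.0f}%")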
If you want to correctly understand his arguments and avoid merely setting up a hot take strawman, more can be found here.
posted by Ivan Fyodorovich at 12:01 PM on August 22, 2018 [3 favorites]


Tangentially, let's not forget the revolution in the Earth sciences circa 1955-1975 with the development of plate tectonics.
posted by TDIpod at 12:12 PM on August 22, 2018


If someone came along and cured cancer once and for all, all those billions that are divvied up amongst the researchers would disappear: no more grants, no more summer salary, no more travel to conferences in far off places, no more teaching buyouts, no more indirect costs to the university coffers. The incentives for researchers tend to be weighted on the side of not making real progress.

This is pretty disconnected from the reality of how grants, money and science itself actually work. It's like a more extreme version of saying climate scientists want to overhype climate results.

Success gets rewarded with grants and patents. Any team curing cancer would be billionaires if they wanted to be. Teams that could have but didn't, out of some misguided sense of professional courtesy, would be out in the cold. The whole setup is like a virtuous version of the prisoners' dilemma, where you have to assume the other folk are defecting so you need to defect first--only here defecting means "working for the good of humanity and science," not "ratting on your neighbor." (Google the CRISPR patent fight for the stakes on a breakthrough.) And the way science works, curing cancer would guarantee grant money to answer the now really meaningful "why" question in ever more detail for at least the career life of every PI now working.

I've seen lots of promised breakthroughs on cancer (from antibodies and angiogenesis inhibitors to I/O and CAR-T) and people have swarmed those 100% of the time when there's even a whiff of transformational progress.

More generally I think it's clear the lesson of the reproducibility issue in biomedical fields is that the incentives to publish quickly about "breakthroughs" are way too high, not too low. This leads to rushes to get things in print with a half-dozen mice and non-selective "probes" to stake your claim.
posted by mark k at 7:23 PM on August 22, 2018 [6 favorites]


Getting an R01 grant from the NIH generally involves having pilot data on all the experiments proposed - you need to show that you can do what you are proposing to do. The review panels are made up of senior scientists in the field who educated, directly or indirectly, the junior scientists applying for the grant - there is a body of knowledge and methodology that is assumed, and deviations from that are likely to hurt an application. The results of any studies you conduct need to run the gauntlet of journal reviewers who are busy with their own lives and don't have the time or patience to understand something new.

There are multiple constraints on how innovative a scientist can be, especially one who is hustling for tenure. Do you spend your pre-tenure years developing a novel methodology that will have problems getting funded, may not work anyway, and most likely won't get into the top journals, or do you do the same thing you did as a postdoc with a slight twist? It's a pretty clear answer for most scientists. Once you have tenure, you have a running lab, postdocs and grad students of your own that you're educating in the same methods that earned you tenure, and that you want to see succeed, which means continuing to get grants using the same methods, etc., etc.

There is considerable inertia in science that really does require funerals to overcome. I don't doubt that every single cancer researcher sincerely wants to be the one who cracks the problem, but given the fantastic odds against that, it simply makes sense for them to continue to do the things that ensure their livelihood. This includes the aforementioned salary supplement, travel budget, etc. After a sufficient period of time has elapsed in which scientists doing the usual things fail to produce a solution to a problem, it becomes clear that the incentive structure is not completely aligned with actual advances in knowledge. You don't need to imagine a conspiracy of scientists respectfully withholding results so that their conference drinking buddies can get their funding renewed; it's the product of individuals doing what is in their own best interest, just as you say.
posted by logicpunk at 5:23 AM on August 23, 2018


Ivan Fyodorovich: If you want to correctly understand his arguments and avoid merely setting up a hot take strawman, more can be found here.
If there is one thing that bothers me more than sloppy writers who make grand statements about how stupid all of the experts around them are, it's sloppy writers who do this so prolifically that they can easily be made to debate themselves.
"Relatedly, we should stop labeling replications as successes or failures and instead use continuous measures to compare different studies, again using meta-analysis of raw data where possible."
...Just do a find search for yourself for "failed replication" in the OP. It's truly amazing that anyone could be so internally inconsistent and yet so pompous as to write these two pieces in the same year.
posted by Blasdelb at 7:07 AM on August 23, 2018


also btw...
Online Bettors Can Sniff Out Weak Psychology Studies - "So why can't the journals that publish them?"
While the SSRP team was doing their experimental re-runs, they also ran a “prediction market”—a stock exchange in which volunteers could buy or sell “shares” in the 21 studies, based on how reproducible they seemed. They recruited 206 volunteers—a mix of psychologists and economists, students and professors, none of whom were involved in the SSRP itself. Each started with $100 and could earn more by correctly betting on studies that eventually panned out.
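The piece doesn't say how the SSRP market was implemented; purely as an illustration, here is a sketch of a logarithmic market scoring rule (LMSR), a common automated market maker for binary "will it replicate?" bets, in which the price of a YES share behaves like a crowd-sourced probability.

  # Illustration only; the mechanism behind the SSRP market isn't specified in the article.
  # Logarithmic market scoring rule (LMSR): prices of a binary contract act like probabilities.
  import math

  def lmsr_cost(q_yes, q_no, b=100.0):
      """Market maker's cost function; a trade costs C(after) - C(before)."""
      return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

  def lmsr_price(q_yes, q_no, b=100.0):
      """Current price of a YES ('this study will replicate') share."""
      e_yes, e_no = math.exp(q_yes / b), math.exp(q_no / b)
      return e_yes / (e_yes + e_no)

  # A bettor who believes in the study buys 30 YES shares from the market maker:
  trade_cost = lmsr_cost(30, 0) - lmsr_cost(0, 0)
  print(f"trade cost: {trade_cost:.2f}, new YES price: {lmsr_price(30, 0):.2f}")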
Ioannidis: Most Research Is Flawed; Let's Fix It - "John Ioannidis built a career on critiquing other researchers' work, and now he is looking for ways to improve the scientific process, he tells Eric Topol."

oh and :P
The Paradox of Karl Popper - "The great philosopher, renowned for his ferocious attacks on scientific and political dogmatism, could be quite dogmatic."

Compositionality – Now Open For Submissions - "Our new journal Compositionality is now open for submissions! Compositionality refers to complex things that can be built by sticking together simpler parts. We welcome papers using compositional ideas, most notably of a category-theoretic origin, in any discipline. This may concern foundational structures, an organising principle, a powerful tool, or an important application. Example areas include but are not limited to: computation, logic, physics, chemistry, engineering, linguistics, and cognition."
posted by kliuless at 4:19 AM on August 28, 2018


"Much of this 'replication crisis' is something else. It's a failure to understand that in social science there are few universal truths and most results are contingent on the environment."
posted by kliuless at 4:20 AM on August 30, 2018



