re-pro-duction
September 2, 2015 9:51 PM

The Reproducibility Project out of UVa recently published their findings in Science: Estimating the reproducibility of psychological science
We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. ... The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results;
CONCLUSION
No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.

Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
The Atlantic: How Reliable Are Psychological Studies?

Crooked Timber: The Great Replication Crisis

New York Times: Psychology Is Not In Crisis

In The Pipeline: Thoughts on Reproducibility
As you’ll have heard, the Reproducibility Initiative has come out with data on their attempts to replicate a variety of studies (98 of them from three different journals) in experimental psychology. The numbers are not good:
The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.
Right there you can see why the reporting on these results have been all over the map, because that paragraph alone assumes more statistical fluency than you could extract from an auditorium full of headline writers. So you can pick your headline – everything from “total crisis in all of science” to “not much of a big deal at all” is available if you dig around some.
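To make one of those indicators concrete, here is a minimal sketch of what "the original effect size was in the 95% confidence interval of the replication effect size" means for a correlation. It uses the standard Fisher z approximation. The sample size `n = 100` is an assumption made up for illustration, and treating the two reported mean effect sizes as if they came from a single pair of studies is also only illustrative; it is not how the paper computed its 47% figure.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation r from n pairs."""
    z = math.atanh(r)                              # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)                    # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided critical value
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Hypothetical single replication: r = 0.197 with an assumed n = 100.
low, high = fisher_ci(0.197, 100)
original_r = 0.403                                 # stand-in for the original effect
original_inside = low <= original_r <= high
print(f"95% CI: ({low:.3f}, {high:.3f}); original r inside: {original_inside}")
```

Under these assumed numbers the interval is wide but still excludes the original value. A larger replication sample narrows the interval, which is why this indicator and the "significant at P < .05" indicator can disagree.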
posted by the man of twists and turns (22 comments total) 23 users marked this as a favorite
 
nature: Over half of psychology studies fail reproducibility test, as was preliminarily reported back in April.

Maybe The Trouble With Scientists is a Publication Bias that's led to a Reproducibility Crisis in several fields.
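A rough simulation of how publication bias alone could shrink effects on replication: if only "significant" studies get published, the published estimates overstate a small true effect. Everything here is an assumption for illustration (a true standardized effect of 0.2, 20 participants per group, a normal approximation for each study's estimate); none of it is taken from the Reproducibility Project data.

```python
import random
from statistics import NormalDist, mean

def simulate_published_effects(true_d, n_per_group, n_studies, alpha=0.05, seed=0):
    """Simulate many small studies and keep only those reaching p < alpha."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5        # approximate standard error of the estimate
    all_estimates, published = [], []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)    # one study's estimated effect
        all_estimates.append(d_hat)
        if abs(d_hat / se) > crit:       # the "significance filter"
            published.append(d_hat)
    return all_estimates, published

all_d, published_d = simulate_published_effects(true_d=0.2, n_per_group=20, n_studies=5000)
print(f"mean of all estimates:       {mean(all_d):.2f}")
print(f"mean of published estimates: {mean(published_d):.2f}")
print(f"share published:             {len(published_d) / len(all_d):.0%}")
```

In this toy setup an unbiased replication of a published study would, on average, land near the true effect and so look like a substantial decline, even though nobody did anything wrong in any single study.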
posted by the man of twists and turns at 9:57 PM on September 2, 2015


Is that title a Grease 2 joke?
posted by His thoughts were red thoughts at 9:58 PM on September 2, 2015 [5 favorites]


I was waiting for this to show up on the Blue. Seems to me that the situation is normal. Which is to say, all fucked up. I was never a fan of deciding that small studies using undergraduates (which is sometimes the case) were somehow representative of human nature writ large. But that's because I'm a cranky pants. And rightfully so, it would seem. I gotta say, it was thrilling to hear that folks were attempting to reproduce these studies. I mean, isn't that one of the hallmarks of good science?
posted by Bella Donna at 10:20 PM on September 2, 2015 [10 favorites]


Re-pro-duction. Yes, indeed. You hear the most outrageous lies about it. Half-baked goggle-box do-gooders telling everybody it's old news. --Repro Man
posted by hades at 10:57 PM on September 2, 2015 [2 favorites]


Part of it is a relative ignorance of statistics - it seems as if a lot of researchers are taught that as long as you plug your numbers into the ANOVA, and p < .05, then the result is "significant". Which is wrong for a number of reasons: 1) at best, a p value can tell you about whether a finding is reliable - it has nothing to say about significance; 2) this approach implicitly encourages fishing for low-but-obtainable-by-chance p-values; and 3) it actually matters what statistical methods you use, both for validity as well as for maximizing the power of the method to find an effect.

If you're not getting p values of < .01 (or better yet .001), either your data/method is weak or you're fishing, or both.

Also, not getting a p value under .05? Doesn't mean there's no effect, even though some researchers are happy to say so. It could just mean you have a test with low power to find an effect.
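A back-of-the-envelope sketch of the fishing and low-power points above, using a normal approximation to a two-sided two-sample test. The effect size (0.3), the group sizes, and the count of 20 independent tests are assumptions picked for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for standardized effect d."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5   # expected z statistic under the true effect
    return STD_NORMAL.cdf(shift - crit) + STD_NORMAL.cdf(-shift - crit)

def chance_of_false_positive(k, alpha=0.05):
    """Probability that at least one of k independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** k

small_study = power_two_sample(0.3, 20)    # 20 per group
large_study = power_two_sample(0.3, 200)   # 200 per group
fishing_20 = chance_of_false_positive(20)  # 20 looks at pure noise

print(f"power with n=20 per group:  {small_study:.2f}")
print(f"power with n=200 per group: {large_study:.2f}")
print(f"chance of at least one p < .05 in 20 null tests: {fishing_20:.2f}")
```

With these assumed numbers the small study misses a real effect most of the time, while 20 tests on noise turn up at least one "significant" result more often than not.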
posted by parudox at 11:42 PM on September 2, 2015 [5 favorites]


This isn't surprising to me. And really, it shouldn't be to anyone. Science progresses by replicating results, not by magic p values. P values are a helpful indication, but there are lots of ways to get a significant result when you don't really have one. If this is an argument for replication, then it's absolutely important, but while I am unfamiliar with the full body of psychological literature, I am aware that key results at least have been replicated: if someone thinks a result is really interesting, they may well try to replicate it.

Psychology studies often have a lot of problems associated with them: their study population is usually not terribly random (bored and poor students being completely representative of the general population, right?), and they often use proxies to talk about things we're actually interested in (one thing I read about recently was the "strange situation", which puts a child in a situation where strangers come into the room while they play, and you see how they react. This tells you precisely how a child reacts in this situation, but it is often interpreted to say other things (like say, a child being more or less attached to their parent in general)).
posted by Cannon Fodder at 12:01 AM on September 3, 2015 [2 favorites]


If this is an argument for replication, then it's absolutely important, but while I am unfamiliar with the full body of psychological literature, I am aware that key results at least have been replicated: if someone thinks a result is really interesting, they may well try to replicate it.

I suppose it depends on how you define "key results" but is it not the case that very few results in psychology are replicated (because very few replications are attempted)? I don't think that enough replication is taking place to say that key results have been replicated, unless you define key results very narrowly.

As far as someone finding a result very interesting and trying to replicate it, that may happen sometimes, but the point the authors are trying to make here is that because the incentives for researchers are much stronger to start a new study than to attempt a replication, very little replication is happening. It seems clear that there just aren't enough replication attempts. Consequently, you can't really trust much of published psychology. That's a grim situation.
posted by ssg at 1:00 AM on September 3, 2015 [2 favorites]


While there are indeed severe problems with statistical illiteracy in way too many psychological disciplines, this project is too poorly designed on an ontologically fundamental level to actually assess them. Its authors are still thinking of science as an enterprise concerned with generating TrueFacts to the exclusion of supposed facts that are not true. Not only is this fundamentally impossible in very important ways, but the pitfalls this orientation will lead you straight into will only leave you incredibly frustrated if you ever need to directly evaluate scientific evidence to accomplish a useful purpose. The whole point of science is not to say this is true and that is not, but to use data in clever ways to generate or improve theoretical models that usefully explain natural phenomena.

Almost by definition a theoretical model cannot be perfectly correct, it is a model of the truth, our best attempt to create a mirror image of it in a form that we can understand - and our understanding will never be perfect or without distortion. The map is not the territory, and ceci n'est pas une pipe. For example, there are a variety of ways to produce really awesome models of the 3D shape of biological macromolecules but none of these strategies can produce for you the real structure. NMR spectroscopy, X-ray crystallography, and electron microscopy each have their advantages and their disadvantages, but none of them will ever give you quite exactly the biological truth, even though they can each provide incredibly valuable answers to specific kinds of questions. This problem is universal to all of science and it really ends up creating a lot of non-intuitive weirdness in the philosophy and communication of science.

Being fundamentally unable to create perfect models for understanding things we have to be content with generating good ones. For a theory to be a good one, it must be validated by solid data from diverse sources and approaches, explain natural phenomena, and be useful for making verifiable predictions of what those phenomena will do. For example, the theory of evolution by natural selection is all of these things while the theory of Intelligent Design is only able to explain phenomena based on subjective reasoning that can neither be repeated nor viewed and appreciated as the same from other perspectives. That does not mean that creationism as an artistic representation of who we are through a metaphorical description of where we come from is stupid, bad, or even unreasonable. However it does mean that as an explanation of natural phenomena it is unverifiable, as well as, more importantly, fundamentally not useful. This is the biggest reason why Intelligent Design has no place being taught next to or as a viable replacement for the extraordinarily useful theory of evolution in a science classroom. To do so would be to fail to teach science, not just as the collection of facts your teachers tried to cram into you once upon a time but as the practice of trying to understand the natural world in an intellectually honest way.

Rote reproduction is generally not necessary because when scientists are sufficiently clever, asking good questions that would lead them to develop good models, there should be better and more useful ways to attempt to falsify those models than just doing the same thing all over again. A good scientist doing most forms of basic research is always thinking about what they will do with the answers they get, such that either they or others will be able to both verify their shit and continue asking better questions of the new model proposed at the same time rather than just wasting cycles asking the same one over and over again, which is overwhelmingly likely to only produce trivial answers. This is part of being good stewards of the precious resources we get. We do have to be careful at each of these steps to check the assumptions of the models we use and interrogate, both with our own and others' data, and published scientific papers are intentionally designed to help us do this. When you get past the TrueFacts model for understanding science to a model-based one, you can see how even a paper that happens to develop a model that is 'wrong' can still be incredibly useful if the research is designed well and the paper is written properly, and even a paper that makes conclusions that are good can be really harmful to understanding if designed badly.

While the way this project is designed and the way the media covers science both seem to treat scientific papers as if they were the units that truth comes in, that is absolutely not what they're for. Scientific papers are for communicating the findings, the methods used, and the certainty of the results as well as possible, to allow the scientific community to collaboratively build models.

This is all also said much better by Dr. Lisa Feldman Barrett, a professor of psychology at Northeastern University, in the Times:
Psychology Is Not In Crisis:
An initiative called the Reproducibility Project at the University of Virginia recently reran 100 psychology experiments and found that over 60 percent of them failed to replicate — that is, their findings did not hold up the second time around. The results, published last week in Science, have generated alarm (and in some cases, confirmed suspicions) that the field of psychology is in poor shape. But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works.

Suppose you have two well-designed, carefully run studies, A and B, that investigate the same phenomenon. They perform what appear to be identical experiments, and yet they reach opposite conclusions. Study A produces the predicted phenomenon, whereas Study B does not. We have a failure to replicate. Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon from Study A is true only under certain conditions. The scientist’s job now is to figure out what those conditions are, in order to form new and better hypotheses to test.

A number of years ago, for example, scientists conducted an experiment on fruit flies that appeared to identify the gene responsible for curly wings. The results looked solid in the tidy confines of the lab, but out in the messy reality of nature, where temperatures and humidity varied widely, the gene turned out not to reliably have this effect. In a simplistic sense, the experiment “failed to replicate.” But in a grander sense, as the evolutionary biologist Richard Lewontin has noted, “failures” like this helped teach biologists that a single gene produces different characteristics and behaviors, depending on the context. Similarly, when physicists discovered that subatomic particles didn’t obey Newton’s laws of motion, they didn’t cry out that Newton’s laws had “failed to replicate.” Instead, they realized that Newton’s laws were valid only in certain contexts, rather than being universal, and thus the science of quantum mechanics was born.

In psychology, we find many phenomena that fail to replicate if we change the context. One of the most famous is called “fear learning,” which has been used to explain anxiety disorders like post-traumatic stress. Scientists place a rat into a small box with an electrical grid on the floor. They play a loud tone and then, a moment later, give the rat an electrical shock. The shock causes the rat to freeze and its heart rate and blood pressure to rise. The scientists repeat this process many times, pairing the tone and the shock, with the same results. Eventually, they play the tone without the shock, and the rat responds in the same way, as if expecting the shock. Originally this “fear learning” was assumed to be a universal law, but then other scientists slightly varied the context and the rats stopped freezing. For example, if you restrain the rat during the tone (which shouldn’t matter if the rat is going to freeze anyway), its heart rate goes down instead of up. And if the cage design permits, the rat will run away rather than freeze.

These failures to replicate did not mean that the original experiments were worthless. Indeed, they led scientists to the crucial understanding that a freezing rat was actually responding to the uncertainty of threat, which happened to be engendered by particular combinations of tone, cage and shock. Psychologists are usually well attuned to the importance of context. In our experiments, we take great pains to avoid any irregularities or distractions that might affect the results. But when it comes to replication, psychologists and their critics often seem to forget the powerful and subtle effects of context. They ask simply, “Did the experiment work or not?” rather than considering a failure to replicate as a valuable scientific clue. As with any scientific field, psychology has some published studies that were conducted sloppily, and a few bad eggs who have falsified their data. But contrary to the implication of the Reproducibility Project, there is no replication crisis in psychology. The “crisis” may simply be the result of a misunderstanding of what science is.

Science is not a body of facts that emerge, like an orderly string of light bulbs, to illuminate a linear path to universal truth. Rather, science (to paraphrase Henry Gee, an editor at Nature) is a method to quantify doubt about a hypothesis, and to find the contexts in which a phenomenon is likely. Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.
posted by Blasdelb at 2:43 AM on September 3, 2015 [10 favorites]


xkcd had this figured out.

A recent article from FiveThirtyEight is worth considering too.
posted by Pararrayos at 2:47 AM on September 3, 2015


In psychology, we find many phenomena that fail to replicate if we change the context.

while I sort of agree, I think this is a little misleading. Studies can fail to replicate because the effect wasn't there in the first place, not because the context was changed, and it's worth considering that possibility. The broad philosophical point is something I agree with though.

I think the danger comes (always comes) if people take a single study and start basing what they are doing on it. That was always bad science and that should be obvious, and maybe efforts like this remind people of that fact.
posted by Cannon Fodder at 3:00 AM on September 3, 2015 [1 favorite]


Has anyone tried to reproduce the results of the Reproducibility Project's study?
posted by eustacescrubb at 3:22 AM on September 3, 2015 [3 favorites]


If you want to create useful theoretical models, then you really need reproducible results. That's fundamental. If your results aren't reproducible, then in what way is your theoretical model useful? Sure, it might be an interesting intellectual exercise or it might lead to further study that is itself useful, but if a model isn't backed up by reproducible results, why should anyone have any reason to trust that model?

wasting cycles asking the same one over and over again, which is overwhelmingly likely to only produce trivial answers. This is part of being good stewards of the precious resources we get.

Well, a whole bunch of scientists just did exactly what you propose as a waste of cycles that is overwhelmingly likely to produce trivial answers and it turns out that it was not a waste and the answers they got were not trivial at all. That's science. That's being good stewards of their resources.
posted by ssg at 3:58 AM on September 3, 2015 [6 favorites]


I hope this is not a total derail but it seems like we're, how to say.., "running out of science". When Darwin, Freud and Newton and the other familiar giants made their discoveries there were not tens of thousands of PhD post docs running relief. Are there no clear cut theories available? Will there be just ambiguous subtleties and refinements of special cases? Will the next great discovery just be that chocolate is not good for health (to be refuted the next cycle)?
posted by sammyo at 4:20 AM on September 3, 2015


"I hope this is not a total derail but it seems like we're, how to say.., "running out of science". When Darwin, Freud and Newton and the other familiar giants made their discoveries there were not tens of thousands of PhD post docs running relief. Are there no clear cut theories available? Will there be just ambiguous subtleties and refinements of special cases? Will the next great discovery just be that chocolate is not good for health (to be refuted the next cycle)?"
We're doing nothing resembling running out of science, we're doing the exact opposite. If anything the questions we can now ask, the answers we can now expect to get, and the feats we can accomplish with them are only getting exponentially more powerful - just progressively more divorced from ordinary experience. Since Freud, we've been able to look back in time and watch the very first moments of creation, add letters to the language of life, forge elements the universe may have never seen before, mix and match the useful features of life's building blocks into revolutionary medicines and crops, create a permanent presence in the heavens rapidly circling around us to serenade us with Bowie's songs, build weapons out of the power of the sun so existentially horrifying and powerful that they have forced us to choose between an end to the escalation of war and the total eternal annihilation of everything we are and love, build a box colder than anything the universe may have ever seen before, cure bacterial disease by allying with the yet more powerful forces that cause disease in bacteria, defeat viruses by ripping out their weak spots and displaying that part of their artfully mutilated corpses to the immune systems of our children so that they can be educated in advance, carry around access to a huge portion of everything that has ever been written down in our pockets, send missions to every planet in our solar system and plenty more celestial bodies besides, exterminate smallpox, and get a clearer and perhaps more terrifying view of our place in cosmic, geological, and evolutionary time.
posted by Blasdelb at 5:11 AM on September 3, 2015 [14 favorites]


Part of it is a relative ignorance of statistics - it seems as if a lot of researchers are taught that as long as you plug your numbers into the ANOVA, and p < .05, then the result is "significant".

My Applied Linguistics workshop leader flat out told us that after we run the studies we should hand them over to the numbers-crunchers to do the statistics for us. He then proceeded to give us a cursory overview (for the length of just one class period) of the basic statistical tests and how they worked, "so that we wouldn't be caught out if someone asks a question during your dissertation defense," but then acknowledged that our committee would most likely be full of people who likewise didn't understand the statistical methods, so it was very unlikely that we would be called on it.

The mind boggles. And yet, although I'm not proud of it, I know that 1) math is my Achilles' heel, and 2) I do not have time to both research and bone up on my statistics. We aren't even required to take a stats course to progress and I sure won't do it on my own (nonexistent) time.

Highly considering side-stepping into Literature/Culture because of it, to be honest.
posted by chainsofreedom at 5:40 AM on September 3, 2015 [1 favorite]


My girlfriend was one of the people doing a replication study! It's been interesting to see this go from her to do list to publication to become national news.
posted by lownote at 6:08 AM on September 3, 2015 [1 favorite]


He then proceeded to give us a cursory overview (for the length of just one class period) of the basic statistical tests and how they worked

The important thing is perhaps not how they work, but what question they are able to answer. I think it's really important to know what a confidence interval is for example, and I'd bet that a lot of people publishing one wouldn't give you a correct answer if you asked them.
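For what it's worth, a small simulation of the question a 95% confidence interval answers: if you repeated the same study many times, about 95% of the intervals computed this way would contain the true value. The population mean, standard deviation, sample size, and number of trials are all made-up assumptions, and a known-sigma z interval is used to keep the sketch short.

```python
import random
from statistics import NormalDist, mean

def coverage(true_mean=10.0, sigma=2.0, n=30, trials=2000, level=0.95, seed=1):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = crit * sigma / n ** 0.5   # known-sigma z interval, for simplicity
    hits = 0
    for _ in range(trials):
        sample_mean = mean(rng.gauss(true_mean, sigma) for _ in range(n))
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1                      # this trial's interval covers the truth
    return hits / trials

rate = coverage()
print(f"intervals containing the true mean: {rate:.1%}")
```

The 95% describes the procedure over many repetitions. It is not the probability that any one published interval contains the true value.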
posted by edd at 6:13 AM on September 3, 2015


"If you want to create useful theoretical models, then you really need reproducible results. That's fundamental. If your results aren't reproducible, then in what way is your theoretical model useful? Sure, it might be an interesting intellectual exercise or it might lead to further study that is itself useful, but if a model isn't backed up by reproducible results, why should anyone have any reason to trust that model?"
Failures to replicate results are not necessarily, or even generally, an indication that anything has gone wrong with the scientific process but are for the most part an indication that something has gone intriguingly right. There are certainly trivial reasons why results might not be repeatable, like the misuse of proper statistical hypothesis testing or other kinds of errors, but this project is incapable of separating them from more meaningful and interesting kinds of failures in aggregate.
Well, a whole bunch of scientists just did exactly what you propose as a waste of cycles that is overwhelmingly likely to produce trivial answers and it turns out that it was not a waste and the answers they got were not trivial at all. That's science. That's being good stewards of their resources.
While I certainly hope that the individual authors picked studies that would be individually meaningful in different ways to replicate in the contexts they could create, as an aggregate effort the coordinators can't really do very much meaningful to look at the scope of the 'problem' due to the unavoidable selection bias they disclaim. Similarly, the diverse contexts that the data were gathered in mean that they couldn't really look very deeply into the nature of the 'problem,' even if the coordinators weren't so blinkered into seeing it as inherently problematic. The linked paper is itself an interesting microcosm of the power and limitations of what professional statisticians can bring to scientific discourse, the analyses are simultaneously exactly appropriate to the questions being asked, elegantly displayed, and almost entirely irrelevant to the problems the analyses are meant to address.
posted by Blasdelb at 6:25 AM on September 3, 2015


No one is saying that papers which fail to replicate are useless, and it doesn't mean that the original study was poorly designed. However, it may mean that the study was insufficiently documented. I mean, trivially, if I were to perform exactly the same experiments in exactly the same place and time as you, followed by the same post-processing and analysis steps, I would obtain the same result. A scientific publication is an attempt to comprehensively document a research activity; the interpretation of the result is separate from that.

Re: Newton's laws of motion: If you publish your findings on dropping an object from a tall building and I fail to replicate, it could mean that Newton's laws are wrong -- or, it could mean that you failed to mention in the paper that you dropped a bowling ball, and I tried a feather. Better documentation would have saved us both some time. (Why the difference exists is, as mentioned above, a perfect avenue for future work.)

Re: Fear learning: Regardless of the existence of the cage effect, if I set the experiment up the same way you did, I should get the same result.

Models are constructed from observations of nature. We need to at least agree on what we've observed (not necessarily the interpretation), or else a whole bunch of us are wasting time and money. This is why I'm a supporter of more automated data/metadata capturing tools in science.

(I'm speaking as a physical scientist here. I recognize there are huge practical challenges with getting a reproducible sample of a human population versus, e.g., a pure sample of physical matter, and that agreeing on what the salient features of a human population are in a particular case is precisely the issue in the social sciences.)
posted by anifinder at 9:25 AM on September 3, 2015 [2 favorites]


Failures to replicate results are not necessarily, or even generally, an indication that anything has gone wrong with the scientific process but are for the most part an indication that something has gone intriguingly right.

I don't think you share the same understanding of the scientific process as the scientific community in general. We have developed an extraordinarily effective scientific method, which, while clearly imperfect, has enabled all kinds of wonderful developments and improved human lives in countless ways (and enabled all kinds of horrible things too). If people don't want to work within that general paradigm, then they really aren't doing science as it is commonly understood.
posted by ssg at 10:59 AM on September 3, 2015


Pararrayos: this is the same xkcd currently collecting data apparently designed to cultivate bullshit correlations?
posted by edd at 4:02 PM on September 3, 2015


this is the same xkcd currently collecting data apparently designed to cultivate bullshit correlations?

Indeed. I filled out that survey and can't wait to see what he's going to do with that.
posted by Pararrayos at 5:55 AM on September 4, 2015

