Midlife Replication Crisis
October 6, 2016 12:11 PM

 
Somewhat related, someone wrote an automated tool to check for basic rounding errors in APA-formatted psychology papers. It turns out there are a lot, sometimes to the extent of altering the conclusions of the paper.
posted by figurant at 12:25 PM on October 6, 2016 [9 favorites]


"HARKING" (Hypothesis After Results Known)

It's easy to make bullseyes if you shoot an arrow into the barn, and then paint one around it....
posted by thelonius at 1:15 PM on October 6, 2016 [1 favorite]


Great link, thanks!
posted by languagehat at 1:43 PM on October 6, 2016


One thing that would be nice to see is negative results (i.e. hypothesis proves false) regarded as positive contributions to scientific knowledge, because they are. Learning that x does not do y is just as valuable as learning that it does. Ruling stuff out is important.

The surest way to find a needle in a haystack is to sort through it all bit by bit, removing hay until the needle is found. As things are today, scientists just keep adding new hay to the stack, saying, "This might be the needle!" That is not a contribution. That is an obstruction.
posted by Sys Rq at 1:51 PM on October 6, 2016 [13 favorites]


I think the Nib is doing some of the best comics reporting right now. I give a "meh" to the presidential race stuff, but things like this are phenomenal.
posted by jillithd at 2:05 PM on October 6, 2016


I don't understand small study bias. A smaller n means you're less likely to detect any effect.
posted by MisantropicPainforest at 2:12 PM on October 6, 2016


MisantropicPainforest: Green jelly beans.
posted by sukeban at 2:28 PM on October 6, 2016


About 7/10 of the way through the comic, I found myself scrolling to the end, and I thought to myself how much of science has been reduced to sound bites and hot takes by the TL;DR crowd. I then dutifully scrolled back up and read the comic again.

I wonder to what extent this affects even the harder scientific crowd.
posted by Mooski at 2:30 PM on October 6, 2016


Green jelly beans only relates to multiple comparisons and has nothing to do with sample size.
posted by MisantropicPainforest at 2:32 PM on October 6, 2016


MisantropicPainforest, you can think of it like this: given that I did detect a statistically significant effect with a small n (ie, that I have a publishable result) I am more likely to have overestimated the true effect than I would be with a larger study, precisely because my power with low n is small; I needed a bigger effect to get a publishable finding in the first place! You're right that smaller sample size makes it less likely to detect an effect, but the key here is that I'm conditioning on published findings -- "detected effects" that did get a significant p. Amongst the false discoveries that will have gotten into print, the ones with the smaller n will be "farther off" than the ones with larger n because they needed to have a bigger effect simply to get into print.

This becomes a problem when you then want to do a meta-analysis, because the combination of small n and the filter of "publishable finding" means you have some highly over-estimated effects that are type-I errors. (cf: http://www.bmj.com/content/341/bmj.c3515)
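Here's a rough simulation sketch of that filtering effect (all numbers hypothetical; a known-variance z-test stands in for the usual t-test): among the studies that clear the significance bar, the small-n ones overestimate the true effect far more than the large-n ones.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_EFFECT = 0.2  # hypothetical true effect, in standard-deviation units

def mean_published_effect(n, n_studies=20000):
    # Simulate n_studies experiments of size n; a z-test with known sigma = 1
    # stands in for the usual t-test. Keep only the "publishable" results
    # (significant at .05, effect in the right direction) and average them.
    means = rng.normal(TRUE_EFFECT, 1.0, size=(n_studies, n)).mean(axis=1)
    z = means * np.sqrt(n)            # z = mean / (sigma / sqrt(n)), sigma = 1
    return means[z > 1.96].mean()

small = mean_published_effect(10)     # small n: published effects overshoot badly
large = mean_published_effect(200)    # large n: published effects sit near the truth
print(small, large)
```

With n=10 a result can only be significant if the observed mean is several times the true effect, so the published small studies are all wild overestimates, while the n=200 studies land close to the truth.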
posted by Westringia F. at 2:33 PM on October 6, 2016 [5 favorites]


Ok, I think I got it -- so the small-sample bias isn't w/r/t whether something is significant or not, but rather w/r/t the size of the effect?
posted by MisantropicPainforest at 2:36 PM on October 6, 2016


if you have a small sample size, it's easier to retry until you get an effect you like.

The green jelly beans comic simply would not work if each panel was using 1,000,000 jelly beans. The law of large numbers would smooth out all those spurious correlations.
posted by So You're Saying These Are Pants? at 2:44 PM on October 6, 2016 [1 favorite]


The comic says "this is known as reproducibility or replication". However, from what I've read, "replication" consists of independent researchers doing the experiment again and gathering new data that confirms the original factual findings, where "reproducibility" is about whether the authors publish their data and methods, giving others the ability to crunch the original data and confirm the original analysis.

The article where I learnt this is A Simple Explanation for the Replication Crisis in Science, which is, to my layman's understanding, what it says on the tin, with extra points for not looking down on the "softer" sciences, but acknowledging their difficulty.

The summary is: some sciences have a strong basic theory (particle physics, astronomy) and some sciences don't (epidemiology, clinical studies). On another axis, some studies have a tradition of controlled experiments (physics, clinical studies) and others don't (epidemiology, astronomy). And here are some quotes:
The replication crisis in science is concentrated in areas where (1) there is a tradition of controlled experimentation and (2) there is relatively little basic theory underpinning the field.

Further, in general, I don’t believe that there’s anything wrong with the people tirelessly working in the upper right box. At least, I don’t think there’s anything more wrong with them compared to the good people working in the other three boxes.

(...)

[T]he replication crisis in science is largely attributable to a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields.
Really, read it all. It's a great article.
posted by kandinski at 2:56 PM on October 6, 2016 [2 favorites]


The green jelly beans comic simply would not work if each panel was using 1,000,000 jelly beans.

No, even with a million billion trillion jelly beans, if there were no effect you would still expect 5% of studies to show an effect that's significant at the .05 level. That's what "significant at the .05 level" is defined as. What would happen is that you'd be able to hit that .05 level with a much smaller effect size than if you had 100 jelly beans.
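For the skeptical, a quick numpy sketch (hypothetical numbers; each study's sample mean is simulated directly rather than drawing every bean):

```python
import numpy as np

rng = np.random.default_rng(1)

def false_positive_rate(n, n_tests=5000):
    # Null is true: jelly bean colour does nothing (true mean = 0, sigma = 1).
    # Each study's sample mean is distributed Normal(0, 1/sqrt(n)).
    means = rng.normal(0.0, 1.0 / np.sqrt(n), size=n_tests)
    z = np.abs(means) * np.sqrt(n)    # two-sided z-test statistic
    return (z > 1.96).mean()

print(false_positive_rate(100))        # ≈ 0.05
print(false_positive_rate(1_000_000))  # still ≈ 0.05: more beans don't change the null rate
```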
posted by ROU_Xenophobe at 3:00 PM on October 6, 2016 [6 favorites]


MP, yeah. So if I do my study with n=3, to get a significant p I'd need to have a pretty large effect... and if (by pure chance) I get it, now you're stuck trying to replicate my "really gigantic" effect.

It's sorta unfair, IMO, to call it a "small-study bias," because in a sense it's really just the publication bias all over again. The trouble is that the smaller the study, the more severe the impact that publication bias has on what's published.

Also, this:

> if you have a small sample size, it's easier to retry until you get an effect you like.

is totally true, and contributes to the problem of p-hacking!

That said, this:

> The green jelly beans comic simply would not work if each panel was using 1,000,000 jelly beans.

isn't right. Remember, we've set our type-I error rate at 0.05 a priori. Under the null -- no jelly bean associations -- we're going to get 5% false rejections by chance, regardless of whether n=10 or n=10^6. In general, if we do 20 studies, on average one of 'em will erroneously come up significant with p<0.05, no matter how large the sample is for those studies.

On preview: what ROU_Xenophobe said!
posted by Westringia F. at 3:07 PM on October 6, 2016 [3 favorites]


While p-hacking and HARKing are undeniably widespread, I truly believe that the deeper reason for the crisis is that the findings researchers are reporting are, very often, findings that you wouldn't expect to be replicable given the complex dynamics of human behavior.

This isn't immediately evident because most findings are reported in terms of operationalized variables rather than actual behavioral events (something which also makes the results more provocative), and because actually determining what the latter are is time-consuming.

The research on ego depletion is a perfect example. Ego depletion is an abstraction that could describe an enormous variety of different trajectories of human activity. Even if researchers attempt to replicate multiple people doing the same activity over and over, what's to say that, psychologically speaking, the activity is "the same," and that motivation should therefore change in systematic and similar ways from case to case?

Ultimately it boils down to what's been repeatedly leveled on psychology, but failed to really change the mainstream: human behavior occurs in specific contexts, whose meanings are often substantially conventional (and therefore open to wide variation).
posted by patrickdbyers at 3:22 PM on October 6, 2016


Also, I'll add that with the small-study thing, it's not just type-I errors that are affected; it's also correct rejections of H0. Suppose there's a real effect; the variability from sampling will sometimes overshoot the estimate of that effect, and sometimes undershoot. How far we overshoot or undershoot -- that is, the width of the sampling distribution of whatever we're estimating (aka, the standard error) -- depends on our sample size. Smaller n means a wider sampling distribution; we're more likely to WAY overshoot or WAY undershoot than we are with larger n. But since it's only the overshoots that are "significant" & publishable, what appears in the literature will tend to overestimate real effects.
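A small numpy sketch of that sampling-distribution width (effect size and sample sizes hypothetical): the empirical spread of the study-level estimates tracks sigma/sqrt(n), so small studies scatter much farther from the true effect in both directions.

```python
import numpy as np

rng = np.random.default_rng(2)
TRUE_EFFECT, N_SIMS = 0.3, 5000

spread = {}
for n in (10, 100, 1000):
    # Each row is one simulated study of size n; its mean is that study's estimate.
    estimates = rng.normal(TRUE_EFFECT, 1.0, size=(N_SIMS, n)).mean(axis=1)
    spread[n] = estimates.std()
    # Empirical spread of the estimates vs the theoretical standard error:
    print(n, round(spread[n], 3), round(1.0 / np.sqrt(n), 3))
```

Only the far-overshooting tail of that wide n=10 distribution clears the significance bar, which is exactly why the published small-study estimates run high.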
posted by Westringia F. at 3:25 PM on October 6, 2016


One thing that would be nice to see is negative results (i.e. hypothesis proves false) regarded as positive contributions to scientific knowledge, because they are. Learning that x does not do y is just as valuable as learning that it does. Ruling stuff out is important.

I agree that these findings would be useful to publish, but it is important to remember that an unsupported hypothesis does not mean (statistically, anyway) that the relationship does not exist. It means that the study did not provide evidence that the relationship exists. Studies start off assuming no relationship and look for evidence that this assumption is false. Failing to find evidence for a relationship is different from finding no relationship. This is a big distinction and is often forgotten.
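A quick simulation makes the point (all numbers hypothetical, pure numpy): with an underpowered design, a perfectly real effect usually yields a nonsignificant result, so "not significant" can't be read as "no relationship."

```python
import numpy as np

rng = np.random.default_rng(3)
TRUE_EFFECT, N, N_SIMS = 0.2, 15, 20000   # a real effect, a badly underpowered design

# Each simulated study's sample mean ~ Normal(TRUE_EFFECT, 1/sqrt(N)), sigma = 1.
means = rng.normal(TRUE_EFFECT, 1.0 / np.sqrt(N), size=N_SIMS)
z = np.abs(means) * np.sqrt(N)            # two-sided z-test statistic
power = (z > 1.96).mean()
print(power)   # low: most studies fail to reject H0 even though the effect is real
```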

This becomes a problem when you then want to do a meta-analysis, because the combination of small n and the filter of "publishable finding" means you have some highly over-estimated effects that are type-I errors.


That's true, but a good meta-analysis should include a test for publication bias. It's actually pretty cool: when you gather together a body of literature, studies should cluster around an effect size, which we'll call X. Good studies will cluster around it nice and close (because they can estimate X with a lot of precision) but shitty studies should have a wider spread (some much bigger than X, some much smaller, some close). Publication bias tends to happen when people do shitty (meaning imprecise) studies and then only publish when they find a relationship. So when you're looking at how the effect sizes cluster around X, and the shitty studies in your graph are all the same size or bigger than the number that the great studies cluster around, you can say with some confidence that there's publication bias in your field. Getting a sense of how many shitty studies found an effect greater than or equal to X also allows you to get a sense of how many studies would have been smaller than X. You can then plug those hypothetical smaller-than-X numbers in and re-estimate X with a good degree of precision.
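Here's a rough numpy sketch of that idea, along the lines of an Egger-style regression (all numbers hypothetical): simulate a literature where imprecise studies only get published when significant, then regress effect size on standard error. A symmetric funnel gives a slope near zero; a biased one gives a big positive slope.

```python
import numpy as np

rng = np.random.default_rng(4)
TRUE_EFFECT = 0.3   # hypothetical true effect shared by every study

effects, ses = [], []
while len(effects) < 300:
    n = int(rng.integers(10, 500))
    se = 1.0 / np.sqrt(n)                 # study's standard error (sigma = 1)
    est = rng.normal(TRUE_EFFECT, se)     # study's effect estimate
    # Publication filter: big studies always publish; small, imprecise ones
    # only make it into print when they come out "significant".
    if n > 200 or est / se > 1.96:
        effects.append(est)
        ses.append(se)

# Egger-style check (a sketch): regress effect size on standard error.
# Slope near zero -> symmetric funnel; large positive slope -> the noisy
# studies only appear in print when they overshoot.
slope = np.polyfit(ses, effects, 1)[0]
print(slope)
```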
posted by quiet coyote at 3:33 PM on October 6, 2016 [1 favorite]


In general I don't take any "science" seriously until it has spawned an engineering field (even if it's not called "engineering" as such). That means it is reliable enough so that its predictions can be used to create useful outcomes. Every time an engineer (or someone else) uses that science to design a useful product (or application), they are proving that the particular science is good.

Not necessarily perfect. Artillerists used Galilean mechanics to calculate their shots, and Galilean mechanics was close enough to being right so that the artillerists hit what they were aiming at. The Theory of Relativity rewrote Galilean mechanics, but the difference between the two wasn't significant at the scale of artillery.

Every time you turn on your computer, you are testing quantum mechanics. Because the computer turns on and runs (unless it's broken) you are a confirmation case of quantum mechanics.

Psychology should be able to be used this way, but the experience is that it mostly can't be. Such predictions as you can derive from it too often are either outright wrong or they are trivial and uninteresting. And the main reason is that researchers in this field are not very rigorous.

It's a field-wide cultural problem and the only way it can be solved from outside is to stop rewarding bogus research with grants and publishing. That's possible, but it probably won't happen. But grant-making institutions could impose such discipline and make it stick. DARPA already does, pretty much. NSF could do it, but the political cost of doing so is probably too great.
posted by Chocolate Pickle at 3:36 PM on October 6, 2016


Thanks for the link! The cartoonist's other work is also worth a gander.
posted by storybored at 3:47 PM on October 6, 2016


In general I don't take any "science" seriously until it has spawned an engineering field (even if it's not called "engineering" as such). That means it is reliable enough so that its predictions can be used to create useful outcomes.

Psychology is not a monolith. I think that intervention researchers do pretty well generating reliable/useful outcomes. Randomized controlled trials have a whole 'nother set of critiques, but people take basic animal research and use it to develop treatments that seem likely to work to treat human disorders, rigorously test them in controlled conditions, and voila, we have reliable treatments for lots of conditions and my patients' disorders are knocked out in a few months. And I get to replicate in the clinic what lots of researchers have replicated in the lab!

Opposite problem: A collaborator of mine did a treatment study recently for PTSD and substance use disorders where his entire sample was 100% abstinent at follow-up. He was hoping to publish a paper on the characteristics of the treatment delivery or participants that predicted whether people got better or not, but he had no variability to predict, so he couldn't publish the paper!
posted by quiet coyote at 3:51 PM on October 6, 2016 [1 favorite]


In general I don't take any "science" seriously until it has spawned an engineering field (even if it's not called "engineering" as such). That means it is reliable enough so that its predictions can be used to create useful outcomes. Every time an engineer (or someone else) uses that science to design a useful product (or application), they are proving that the particular science is good.

Teaching.

Therapy.

There are plenty of practical applications of psychology.
posted by GuyZero at 4:02 PM on October 6, 2016


All right, let me preface the following remark by stating that I have a PhD in a scientific field. I understand science, I understand logic and rationality and the need for replicability and all. I get it.

That said: The people I have met who are most attached to the whole psychology-is-bullshit-replicability-crisis stuff are the ones least capable of actually making a reasonable critique of it. I was once told by such an adherent (who himself had a math PhD) that "cognitive behavioral therapy has been shown to have better outcomes than traditional talk therapy, and he didn't believe in sitting around talking about your parents and all that stuff." Well, sure, and band-aids also are related to better outcomes than surgery; so maybe when you have a bowel impaction or a tumor you can insist that your physician treat you with band-aids.

There's also an element in those circles of really having a complete disregard for putting statistics in context, as if the statistics somehow overrule every other consideration of an issue. Someone once quoted me chapter and verse studies about how sunscreen hasn't been definitively shown to have an impact on melanoma rates, but did clearly have an impact on Vitamin D production. Well, duh, yes--but part of the recommendations to use sunscreen relate to the fact that one can take a pill to supplement vitamin D intake but there is no pill to prevent the development of one of the most deadly cancers known to humankind.

I'm just irritated by the whole cleverer-than-thou crowd that gloms onto this stuff.
posted by Sublimity at 4:22 PM on October 6, 2016 [4 favorites]


And the main reason is that researchers in this field are not very rigorous.

I would say that the main reason is that artillery, or the quantum mechanics in pcs, or the physics of nuclear fusion, are absolute tinkertoys compared to human mentalities. Much less what happens when human mentalities interact with each other, like the other social sciences look at.
posted by ROU_Xenophobe at 4:23 PM on October 6, 2016 [5 favorites]


I was once told by such an adherent (who himself had a math PhD) that "cognitive behavioral therapy has been shown to have better outcomes than traditional talk therapy, and he didn't believe in sitting around talking about your parents and all that stuff." Well, sure, and band-aids also are related to better outcomes than surgery; so maybe when you have a bowel impaction or a tumor you can insist that your physician treat you with band-aids.

What? Speaking as a psychology PhD, the idea that CBT is superior to talk therapy for mental disorders is pretty widely agreed upon and has been backed up by multiple meta-analyses. Talk therapy is often used as the placebo against which other treatments are compared, and treatments (including CBT variants) don't get to move forward unless they're superior. I'm not sure why CBT is the band-aid here or what this has to do with replicability.
posted by quiet coyote at 5:21 PM on October 6, 2016


My favorite comment on psychology studies was from an industrial-organizational psychologist who said that if a psych researcher mentioned 'longitudinal studies' it meant they had exactly two data points in time from which they attempted to extrapolate a trend. The engineers in the room (CS, CEG, and EE) just wept a little bit. I do not think I impressed a group of them when, during a discussion of the 'incredible new idea called statistical power', I handed them my twenty-year-old statistics textbook and pointed to the paragraph right after p-value that pointed out its weaknesses and noted that statistical power was more important.

To be fair, there are many areas of psychology that are very rigorous mathematically, and I would generally trust research results from those areas; but then there are other areas that are much 'softer', where, whenever I hear a new breakthrough idea, I assume it's a cool anecdote.

(I admit to horrible bias re: psychology as an engineer, so... take my comments with a grain of salt of some, uh, indeterminate size)
posted by combinatorial explosion at 5:25 PM on October 6, 2016


quiet coyote, CBT is the band-aid here because it presupposes that the (new, naive) patient knows what the problem is--that someone who's struggling with psychological or mental disorders can self-diagnose and get themselves right into ship shape by just lining up the right cognitive-behavioral buttons to push.
posted by Sublimity at 5:36 PM on October 6, 2016


Speaking as a psychology PhD, the idea that CBT is superior to talk therapy for mental disorders is pretty widely agreed upon and has been backed up by multiple meta-analyses.

A lot depends on the particular type of "mental disorder", and the time scales used for outcomes. Also, how you operationalise the outcome for "superiority".

The Efficacy of Psychodynamic Psychotherapy

The low power of all these studies is a problem, as is the presence of a nocebo framing in any head-to-head comparison of manualised vs non-manualised studies in real, living people.
posted by meehawl at 9:47 PM on October 6, 2016 [1 favorite]


For all you engineers out there looking for psychology in engineering applications, it's so widespread that you can't really see it.

Psychology informs every discipline, everything from the gauges on your dashboard to the design of the traffic controls. It is a major backbone of education, developmental psychology as well as how to teach and how to learn. Right in front of you, you can see the design of the computer, the browser and the OS are at least partially designed around the psychology of the human using them. It is perhaps a little too useful when applied by modern PR firms and media outlets.

We are literally swimming in it.
posted by psycho-alchemy at 12:28 AM on October 7, 2016 [3 favorites]


This was interesting. Thanks for posting it.
posted by Tell Me No Lies at 1:19 AM on October 7, 2016


Those deep learning networks that play Go and Atari 2600 games: brought to you by psychology. Rumelhart and McClelland were psychologists whose work sparked the connectionist revolution (pdf) in the '80s, which in turn derived from Frank Rosenblatt's (another psychologist) Perceptron in the '50s. Parts of those networks run on reinforcement learning algorithms, which you can trace back to Skinner, Pavlov, and Thorndike.

That "engineering" enough for you?
posted by logicpunk at 2:15 AM on October 7, 2016 [1 favorite]


Don't forget marketing! The chemical weapons of psych research.
posted by Potomac Avenue at 5:33 AM on October 7, 2016 [1 favorite]


Andrew Gelman: What has happened down here is the winds have changed.
The title and headings of this post allude to the fact that the replication crisis has redrawn the topography of science, especially in social psychology, and I can see that to people such as Fiske who’d adapted to the earlier lay of the land, these changes can feel catastrophic.
posted by rollick at 5:57 AM on October 7, 2016 [1 favorite]


For all you engineers out there looking for psychology in engineering applications, it's so widespread that you can't really see it. ... Psychology informs every discipline, everything from the gauges on your dashboard to the design of the traffic controls.

The challenge is that, while the behaviors psychology research attempts to quantify are important and apply to all disciplines, the actual published research is misinforming practitioners. For example, I sat in a search advocate seminar today in which we discussed, among other things, how things as simple as blood glucose levels affect potentially life-changing parole decisions. This work is by a pair of biz school researchers who largely lean on Baumeister's work on ego depletion, which this post is in part about. Without a solid biological explanation, it's much easier to consider the substantial number of uncontrolled variables possibly unrelated to cognitive bias. Maybe the clerk was really good at sorting the docket. Maybe the lawyers knew when their clients had a great case for parole, or maybe prisoners with a great case for parole hire lawyers who show up early. Or maybe the socioeconomic status that comes with having a lawyer who shows up early is entangled with the parole granting. Or hell, maybe the whole thing was bunk. But hey, please grab another cookie, because we need to engage system 2 thinking this afternoon.

This stuff is important, which is why it's important even to engineers that the researchers get this stuff right. And why a Statistics for Psychologists course should have the same reputation as Statistics for Engineers.
posted by pwnguin at 10:37 PM on October 25, 2016 [1 favorite]




This thread has been archived and is closed to new comments