Many Labs Replication Project
November 27, 2013 6:48 AM

Nature reports that a large international group set up to test the reliability of psychology experiments has successfully reproduced the results of 10 out of 13 past experiments. The consortium also found that two effects could not be reproduced. posted by a snickering nuthatch (22 comments total) 34 users marked this as a favorite
 
Awesome. The Open Science Framework people are doing good work, checking to see if the results of prominent experiments can be repeated.

One thing, though: I had assumed that this work was already being done.

I guess it turns out not? (Or at least not in every case)
posted by Jacob Knitig at 6:57 AM on November 27, 2013 [1 favorite]


In Experiment 1, which was conducted online during the 2008 U.S. presidential election, a single exposure to an American flag resulted in a significant increase in participants’ Republican voting intentions, voting behavior, political beliefs, and implicit and explicit attitudes, with some effects lasting 8 months after the exposure to the prime.
This sounds like an episode of Look Around You where the next scene after the experiment being explained would be a naked undergrad being towel-whipped with an American flag by a scientist in a cheap rubber Obama mask.
posted by griphus at 7:06 AM on November 27, 2013 [21 favorites]


I don't think it's that the work is not being done - rather, if the previous findings are replicated, then the journal says "This is nothing new, no paper." And if the previous findings are not replicated, the journal says "Your methodology was different, something must have been off, no paper."
posted by rebent at 7:10 AM on November 27, 2013 [2 favorites]


I'm happy to hear this. I recall my psych 101 prof lamenting this problem 20 years ago: a psych major questioned a finding and was told "we've already run that experiment and these are the results" -- as in, don't replicate or reproduce; that problem is solved, done and closed.

That said, I couldn't find out whether these folks are still using undergrads as the sample pool for their experiments. I still think that's a horrible flaw.
posted by k5.user at 7:12 AM on November 27, 2013 [4 favorites]


I'd rather know if any of these meet the threshold of actual significance. The social "sciences" discount is not talked about nearly enough. 95% is merely interesting, not proof of anything.
posted by SkinnerSan at 7:15 AM on November 27, 2013


well, "proof"... what is required for that? 99.9% confidence still means that 1 experiment in a thousand makes an error in judgement.
posted by rebent at 7:32 AM on November 27, 2013


a naked undergrad being towel-whipped with an American flag by a scientist in a cheap rubber Obama mask.

How did you get into my head?
posted by CynicalKnight at 7:50 AM on November 27, 2013 [1 favorite]


95% is merely interesting, not proof of anything.

Statistical significance is not the same as actual significance.

Andrew Gelman has a nice related post today:
So, just to say this again, I think that researchers of all sorts (including statisticians, when we consider our own teaching methods) rely on two pre-scientific or pre-statistical ideas:

1. The idea that effects are “real” (and, implicitly, in the expected direction) or “not real.” By believing this (or acting as if you believe it), you are denying the existence of variation. And, of course, if there really were no variation, it would be no big deal to discard data that don’t fit your hypothesis.

2. The idea that a statistical analysis determines whether an effect is real or not. By believing this (or acting as if you believe it), you are denying the existence of uncertainty. And this will lead you to brush aside criticisms and think of issues such as selection bias as technicalities rather than serious concerns.
posted by leopard at 8:18 AM on November 27, 2013 [12 favorites]


Replication is a huge problem for all the sciences. What you frequently get, at least in psychology, are "pseudoreplications" - experiments designed to test a slightly different flavor of the same hypothesis. Take, for example, the finding that seeing an American flag affects one's tendency to disclose the intent to vote for a Republican candidate. That experiment is a "pseudoreplication and extension" of similar previous work on the priming phenomenon suggesting, for example, that subliminal exposure to national flags affects political thought and behavior. Notice that the idea and logic are the same, but it is not a direct replication of the methods and measurements used.

There are multiple reasons (some good, some less good) to perform pseudoreplications and extensions instead of direct replications. 1: It's of greater theoretical interest to see that an effect does not depend on a specific set of stimuli, methods, lab environment, and so forth. 2: The publish-or-perish atmosphere tends to reward novelty rather than replication. 3: If an effect is directly replicated, we have not learned anything "new." But with pseudoreplications, we can (it feels like, anyway) sort of cross-validate previous work and learn something new. Pseudoreplications are also a good way to reveal issues with original experiments. This piece by Ed Yong traces the history of one such effect - the finding that priming people with words related to old age makes them walk more slowly - which turned out to be driven by researcher bias in the measurements rather than by any actual change in people's walking speed.

But clearly, replication has value. Without actually testing for direct replications, we can never truly know whether an effect was particular to that lab or group of participants. There has been a very recent push to encourage direct replications, and it shows that science can indeed be self-policing.

A last note on undergraduate participants as subjects: yes, this is very much still the norm and can be problematic if one wishes to generalize findings beyond WEIRD (Metafilter FPP) populations.

Thanks for putting this FPP together, Jpfed!
posted by nicodine at 8:24 AM on November 27, 2013 [5 favorites]


This sounds like an episode of Look Around You where the next scene after the experiment being explained would be a naked undergrad being towel-whipped with an American flag by a scientist in a cheap rubber Obama mask.

Note the results down in your copy book.
posted by The 10th Regiment of Foot at 8:53 AM on November 27, 2013


a naked undergrad being towel-whipped with an American flag by a scientist in a cheap rubber Obama mask.

This would never get through a risk assessment or ethical review. The mask might suffocate the investigator.
posted by cromagnon at 8:55 AM on November 27, 2013 [7 favorites]


As a huge fan of the lines of psychological research pioneered by Kahneman and Tversky, seeing them on the confirmed-list makes me happy.
posted by LSK at 10:17 AM on November 27, 2013


The social "sciences" discount is not talked about nearly enough. 95% is merely interesting, not proof of anything.

Can we seriously not do this? Please don't use scare quotes. And what social scientist is treating 95% significance as proof?
posted by MisantropicPainforest at 10:29 AM on November 27, 2013 [1 favorite]


What's really funny is that, in the era of gigantic social datasets vs. untestable theoretical physics, the question of who exactly has the claim to being "hard science" is kind of reversing.
posted by effugas at 10:46 AM on November 27, 2013


Are they replicating the findings among American undergrads?
posted by docgonzo at 10:48 AM on November 27, 2013 [1 favorite]


CynicalKnight: "a naked undergrad being towel-whipped with an American flag by a scientist in a cheap rubber Obama mask.

How did you get into my head?
"

He's from the NSA. He just dug through your computer search records.

(Naughty, naughty CynicalKnight! No more porn for you today!)
posted by IAmBroom at 11:13 AM on November 27, 2013


SkinnerSan: "I'd rather know if any of these meet the threshold of actual significance. The social "sciences" discount is not talked about nearly enough. 95% is merely interesting, not proof of anything."

Citation needed.
posted by IAmBroom at 11:16 AM on November 27, 2013


95% is merely interesting, not proof of anything.

After sampling myself, I'm p = 0.049999 likely to accept that hypothesis. I'm not sure what your point is.
posted by jaduncan at 12:18 PM on November 27, 2013 [1 favorite]


Not surprisingly, the 3 poorly reproducible studies were the most recent ones.
posted by euphorb at 12:53 PM on November 27, 2013


Quite a solid vindication for Kahneman after the doubt and controversy thrown up around some of those priming studies recently.
posted by smoke at 3:54 PM on November 27, 2013


cheap rubber Obama mask

Try as I might, I can't picture that. My brain refuses to link "cheap rubber mask" with anything but "Nixon".

Why is that, Doctor?
posted by flabdablet at 1:11 AM on November 28, 2013


well, "proof"... what is required for that? 99.9% confidence still means that 1 experiment in a thousand makes an error in judgement.

Just a quick note about this, so please forgive any oversimplifications. I'm guessing you're talking about p-values here. I'm going to talk about this stuff from a frequentist perspective (even though I prefer Bayesian reasoning) because that's how most psychology research is done and because that's the framework in which these hypothesis tests make the most sense.

I'd like to clarify that if you shoot for p<0.001, that doesn't mean that 1 experiment in 1000 makes an error in judgment. If only things were that simple!

--------(*puts on frequentist hat*)--------

There is an effect, or there isn't. Whether there's an effect is not subject to probability; it's just a fact about the universe that we have yet to uncover.

Alpha

I'm about to run an experiment. I don't know if there's an effect or not. I'm willing to tolerate a certain risk that I will incorrectly report that there is an effect. There is some probability that (in the case that there's no effect) I will incorrectly report that there is an effect. I would like that probability to be 0.05, or 0.01, or (as you suggest) 0.001. I'm going to call my choice for this probability "alpha".

To achieve this, I will make some assumptions about the population of possible observations. For example, a common set of assumptions in psychology is that the observed variables I'm interested in characterizing are the result of adding lots of small unobserved influences that we kinda don't care about and a few (possibly larger) influences from variables that the experiment measures or manipulates (that is, they are normally distributed about some linear combination of the values of manipulated variables).
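
If it helps to see that assumption concretely, here's a rough Python sketch of the kind of data-generating model being assumed (the variable names like flag_exposure and all the numbers are made up purely for illustration, not taken from any of these studies):

```
# Sketch of the standard assumption: each observation is a baseline, plus an
# effect of the manipulated variable, plus lots of small unobserved influences
# lumped together as normally distributed noise.
import numpy as np

rng = np.random.default_rng(0)
n = 100                                      # hypothetical participants
flag_exposure = rng.integers(0, 2, size=n)   # 0 = no flag prime, 1 = flag prime
true_effect = 0.0                            # the "no effect" universe
baseline = 3.0

outcome = baseline + true_effect * flag_exposure + rng.normal(0.0, 1.0, size=n)
```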

I then formulate hypotheses that make sense within those assumptions. One hypothesis (the "null hypothesis" for my experiment) is that the manipulated variables do not influence the observed variables - there is no effect. Another hypothesis is that the manipulated variables do influence the observed variables.

Once we have chosen our assumptions, then for each hypothesis, we can calculate for any given dataset the probability that we would observe that dataset. Each different hypothesis says different things about the probability of observing a particular dataset: if I had a hypothesis that a coin was fair, then the probability of observing 8 heads in a row is just 1 in 256, but if my hypothesis was that the coin had two heads, then the probability of observing 8 heads in a row is 1.
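
In code, that coin comparison is just evaluating the same data under two different hypotheses (a toy calculation, nothing more):

```
# Probability of the observed data (8 heads in a row) under each hypothesis.
p_heads_if_fair = 0.5
p_heads_if_two_headed = 1.0

p_data_given_fair = p_heads_if_fair ** 8              # 1/256, about 0.0039
p_data_given_two_headed = p_heads_if_two_headed ** 8  # exactly 1.0
print(p_data_given_fair, p_data_given_two_headed)
```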

To enforce our desire that (in the case that there is no effect) I will incorrectly report that there is an effect only 0.001 of the time, I will only reject the null ("no effect") hypothesis if the null hypothesis says that the probability of observing the particular data I get is 0.001 or less.

Why not set alpha super low (ideally, zero)?

As an experimenter, you don't know if there is an effect or not. Which universe are you living in - a universe with an effect, or one without?

If you are an experimenter living in a universe with an effect, you'd like to find out about it and publish it and get paraded about town on the shoulders of your colleagues. But a super-low alpha doesn't just make you more likely to be correct in a universe without an effect - it also makes it more likely for you to report "no effect" in a universe with an effect.

As alpha becomes smaller, the results you need to reject the null hypothesis become more extreme. If I decided my alpha was 0.001 (1 in 1000) I would not be ready to say that my coin was unfair just because I observed eight heads in a row. With that alpha, I'd need at least 10 heads in a row. If I actually did have a biased coin, but don't reject the "fair coin" hypothesis, I'm still making an error in judgment! (And I'm missing out on my parade!)
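
To make the 8-vs-10-heads point concrete, here's the same arithmetic spelled out (assuming a one-sided "biased toward heads" alternative):

```
# How many heads in a row before a fair coin gets rejected at alpha = 0.001?
alpha = 0.001
for k in range(8, 12):
    p = 0.5 ** k    # probability of k heads in a row if the coin is fair
    print(k, round(p, 5), "reject the null" if p <= alpha else "keep the null")
# 8 and 9 heads are not extreme enough (1/256 and 1/512); 10 heads (1/1024) is.
```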

The probability that you will incorrectly fail to reject the null hypothesis is called "beta". It's also an amusing phrase because of how many negatives it contains. Reducing alpha increases beta by requiring more (or more extreme) data before you are willing to accept an effect.

You can mitigate the tradeoff between alpha and beta by gathering more data (that is, more data => lower beta for any given alpha). But that's expensive, and how much data can you really gather before your grant runs out?
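
If you'd rather see that tradeoff than take my word for it, a quick simulation along these lines works (the effect size, the sample sizes, and the use of a plain t-test here are all just illustrative assumptions):

```
# Estimate beta (the miss rate) at a strict alpha for a few sample sizes,
# assuming a real effect of 0.5 standard deviations exists.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha, true_effect, n_sims = 0.001, 0.5, 2000

for n in (20, 50, 100, 200):
    misses = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        p = ttest_ind(treated, control).pvalue
        if p > alpha:      # failed to detect an effect that really exists
            misses += 1
    print(n, "per group, estimated beta:", misses / n_sims)
```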

What is the probability that an experiment makes an error in judgment?

You can't actually say this, for a few reasons.

First, alpha only tells you your probability of making an error if the null hypothesis is correct. If you're studying an effect that actually exists, then your probability of making an error (saying the effect doesn't exist) is beta, and you can't know beta as easily as you can know alpha.

Second, alpha is only the probability that experimenters think characterizes the risk of incorrectly rejecting the null hypothesis, because what the null hypothesis says about the calculation of the probability of observing a particular set of data is predicated on deeper assumptions. If those assumptions are wrong, then an experimenter's choice of alpha doesn't correctly characterize the probability of their making a mistake in a no-effect universe. These assumptions include how the variables are distributed, but also how the data are gathered (remember those 8 heads in a row? I actually flipped the coin 1000 times and found a run of 8 heads somewhere in the middle...).
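
On that last parenthetical: if you want to convince yourself how badly the "found a run of 8 heads somewhere in 1000 flips" procedure breaks the nominal p-value, a quick simulation like this one (sizes arbitrary) makes the point:

```
# How often does a run of 8+ heads show up *somewhere* in 1000 fair-coin flips?
# Most of the time - so cherry-picking that run and quoting p = 1/256 badly
# misstates the real false-positive risk.
import numpy as np

rng = np.random.default_rng(2)
n_sims, hits = 5000, 0
for _ in range(n_sims):
    flips = rng.integers(0, 2, size=1000)
    longest = run = 0
    for f in flips:
        run = run + 1 if f == 1 else 0
        longest = max(longest, run)
    if longest >= 8:
        hits += 1
print(hits / n_sims)   # far, far larger than 1/256
```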
posted by a snickering nuthatch at 8:07 AM on December 3, 2013 [14 favorites]

