P-hacking
February 14, 2014 10:39 AM

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.
posted by dhruva (101 comments total) 53 users marked this as a favorite
 
Or to put it another way, many scientists don't know what they're doing?

This is unfortunately a long standing problem.
posted by edd at 10:45 AM on February 14, 2014 [14 favorites]


Kent Brockman: Mr. Simpson, how do you respond to the charges that petty vandalism such as graffiti is down eighty percent, while heavy sack beatings are up a shocking nine hundred percent?
Homer Simpson: Aw, you can come up with statistics to prove anything, Kent. Forty percent of all people know that.
Kent Brockman: I see. Well, what do you say to the accusation that your group has been causing more crimes than it's been preventing?
Homer Simpson: Oh, Kent, I'd be lying if I said my men weren't committing crimes.
Kent Brockman: [pause] Well, touché.
posted by blue_beetle at 10:51 AM on February 14, 2014 [7 favorites]


I wrote a little explanation about p-values here.
posted by a snickering nuthatch at 10:52 AM on February 14, 2014 [2 favorites]


I think it's actually quite unfair to portray p values as unreliable. They do a perfectly good job of calculating what they are supposed to. It's just these weird broken standards that people have layered on top of them that are wrong-headed.
posted by edd at 10:58 AM on February 14, 2014 [34 favorites]


I'm basically on board with everything she says except this:

That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%,

A reduction of greater than 20% is not tiny!
posted by aubilenon at 10:59 AM on February 14, 2014 [7 favorites]


The paper's points are all correct, but it's also true that a whole lot of scientists understand all this pretty well already. A lot of statistics departments teach these concepts well. We don't need a revolution. We do need journals to take more of a stand against bad statistics, but that's always been true and always will be true.
posted by gurple at 11:01 AM on February 14, 2014 [7 favorites]


I think the reason p values are so important in Intro Stats classes is that they are easy for underprepared students to understand. If the p value is less than 5%, the result is statistically significant; if not, not.

But this doesn't tell you anything about the size of the effect. So, I always make students calculate a confidence interval too (which is particularly easy, since printouts already contain the standard error).

But students tend to cling to the p value as the arbiter, because interpreting a confidence interval is sticky (how do I decide if the effect is "large"?), but p value decisions are cut and dried.

This probably leads to overreliance on p values later on, if students use this material in their careers.
posted by wittgenstein at 11:03 AM on February 14, 2014 [2 favorites]
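
A minimal R sketch of the p-value-plus-confidence-interval habit described above, using made-up numbers rather than data from any study in the thread:

# Hypothetical data, purely for illustration
set.seed(1)
treatment <- rnorm(30, mean = 5.5, sd = 2)
control   <- rnorm(30, mean = 4.5, sd = 2)

result <- t.test(treatment, control)
result$p.value    # the usual "is it under 0.05?" decision
result$conf.int   # 95% CI for the difference in means, i.e. a sense of the effect size

The same printout that carries the p-value already carries the interval, which is why adding the confidence interval costs students so little extra work.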


A reduction of greater than 20% is not tiny!

Yeah, but the difference between a six percent chance of something happening and an eight percent chance of something happening is pretty tiny.
posted by Diablevert at 11:03 AM on February 14, 2014 [2 favorites]


A lot of statistics departments teach these concepts well.

At the university I attended, undergrads generally took their stats courses not in the stats department, but in their own major's department. This contributes to a certain amount of major-specific inertia (you might even say "traditions") in statistical thinking.
posted by a snickering nuthatch at 11:03 AM on February 14, 2014 [5 favorites]


Anyone doing science relying on statistical testing really needs a fully licensed and bonded statistician working with them. The problem with p-values is in how they're applied; in particular, most scientists just look to the p-value test as this black box they are fervently hoping will pass. "Feed the numbers in, push the button, woohoo it's 0.03 we can publish!". Fair enough, statistics is hard and a whole different discipline than, say, cognitive science. So work with a professional statistician.
posted by Nelson at 11:04 AM on February 14, 2014 [8 favorites]


Diablevert just proved my point. Is a 20% reduction big or small? Well, it depends...
posted by wittgenstein at 11:04 AM on February 14, 2014 [4 favorites]


Or to put it another way, many scientists don't know what they're doing with statistics.

The second I see a scientist relying heavily or solely on P-values, I instantly assume their results are suspect, their methods are suspect, and perhaps even their data is suspect, in that they may be manipulating it to achieve a desired result. I wonder whether they are not only throwing various statistical tests at their data without understanding the results or the question, but whether they even understand the reasoning behind the statistical methods they chose.

It isn't that P-values are bad, it's that many scientists abuse them as a test because they really don't understand statistics, having had ~one class somewhere along the way. I would strongly support a "revolution" in education wherein every scientist was required to minor in statistics. (As well as some sort of graphics program, since the ability to show your results visually, without accidentally and/or purposely manipulating them, is also important for a lot of disciplines.)
posted by barchan at 11:19 AM on February 14, 2014 [4 favorites]


If a large reduction of a small quantity is a small reduction, then are we not doomed to see all small percentages as insignificant? So the unemployment rate going up 2 percentage points is only a small increase? Or is millions of people being fired a large increase? How do you want to define big and small, anyway?
posted by I-Write-Essays at 11:20 AM on February 14, 2014


A reduction of greater than 20% is not tiny!

The change from 7.67% to 5.96% is not 20%. It's 1.71%.
posted by jimmythefish at 11:21 AM on February 14, 2014


> The change from 7.67% to 5.96% is not 20%. It's 1.71%.

No, it is a change of 1.71 percentage points.
posted by I-Write-Essays at 11:22 AM on February 14, 2014 [23 favorites]


Anybody know if there are any good online courses, preferably free, to get a refresher on stats? By now, while I was good enough to test out in college, I haven't actually had to do all of the math around them since junior year of high school, which is pretty much half my life ago. (I should probably brush up on algebra too — I had to a couple years ago for a class, but I ended up getting into the weeds over exponential fractions and the weird stuff they do). I use this stuff intermittently to deal with reporting on studies that other people run, but I'd like to get back to the level where I could check their work and have a broader grasp of the context.
posted by klangklangston at 11:23 AM on February 14, 2014


P-values ain't nothin' but a number.
posted by GuyZero at 11:24 AM on February 14, 2014


But students tend to cling to the p value as the arbiter, because interpreting a confidence interval is sticky (how do I decide if the effect is "large"?), but p value decisions are cut and dried.
I and many others would argue that making these 'cut and dried' decisions with p-values is exactly the problem.
posted by edd at 11:24 AM on February 14, 2014


No, it is a change of 1.71 percentage points.

Never underestimate MetaFilter's ability to pedant.
posted by jimmythefish at 11:25 AM on February 14, 2014 [1 favorite]


(7.67 - 5.96) / 7.67 = .223

I think a bigger problem in the world of science is the fact that negative results don't get awards or press — positive ones do. Prizes like the Nobels that reward significant positive findings are great and all, but there ought really to be an even more prestigious prize that rewards excellence in research methodology, regardless of results. Science isn't about making groundbreaking discoveries; it's about doing hard, usually unrewarding work because very, very occasionally that tedium pays off for human knowledge.
posted by cthuljew at 11:26 AM on February 14, 2014 [14 favorites]


no jimmy, it's 20%.

does statistics exist primarily to reveal truth, or disguise it?
posted by bruce at 11:26 AM on February 14, 2014


Jimmy, it's not just pedantry if you're claiming something that is actually not true. Imprecise thinking in science is part of the problem with people misusing statistics.
posted by I-Write-Essays at 11:28 AM on February 14, 2014 [16 favorites]


I think the reason p values are so important in Intro Stats classes is that they are easy for underprepared students to understand.

My impression is that this is false. Even working scientists -- at least in psychology and other social sciences -- routinely misinterpret p-values as posterior probabilities, i.e. the probability that the null hypothesis is true given the observed data. And that interpretation is, of course, wrong.
posted by Jonathan Livengood at 11:29 AM on February 14, 2014 [7 favorites]


Never underestimate MetaFilter's ability to pedant.

The reason we have to say "percentage points" and not "percent" is because a change of "20 percent" already means something else. It's not pedantry, just disambiguation.
posted by my favorite orange at 11:29 AM on February 14, 2014 [18 favorites]


no jimmy, it's 20%.

Weren't they discussing the difference in the original value? Not the difference between the two percentage points?
posted by jimmythefish at 11:29 AM on February 14, 2014


the original value dropped by over 20% from what it was before. it's a substantial drop.
posted by bruce at 11:31 AM on February 14, 2014


If, for example, you were a beer manufacturer, and your market share went from 7.67% down to 5.96% you would almost certainly consider that quite a serious change. A point of market share in such a large market is very serious money.
posted by Bovine Love at 11:32 AM on February 14, 2014 [5 favorites]


klangklangston,

You might find CMU's Open Learning Initiative courses helpful.
posted by Jonathan Livengood at 11:33 AM on February 14, 2014 [5 favorites]


If, for example, you were a beer manufacturer, and your market share went from 7.67% down to 5.96% you would almost certainly consider that quite a serious change. A point of market share in such a large market is very serious money.

Yes, but nobody would say that the share went down 20%. That's because it only went down 1.71%.
posted by jimmythefish at 11:34 AM on February 14, 2014


Yes, but nobody would say that the share went down 20%.

I would, and I do stats and love beer.
posted by Mapes at 11:35 AM on February 14, 2014 [5 favorites]


It would depend, actually. You may well see it stated as 20%. If they gave the 1.71% figure, it would definitely be stated as "percentage points" (or perhaps points).
posted by Bovine Love at 11:36 AM on February 14, 2014 [2 favorites]


jimmy, in a beer market of fixed size, if you went from 7.67% down to 5.96% of the market, you'd have to lay off about 20% of your help.
posted by bruce at 11:36 AM on February 14, 2014 [11 favorites]


It would depend on what is being discussed, likely. In an article discussing over all market share, it would probably be given in points. In an article focussing on that manufacturer, it would likely be given as a percentage (i.e. the 20% figure).
posted by Bovine Love at 11:37 AM on February 14, 2014 [1 favorite]


Admittedly, it is confusing. Or maybe better to say subtle. For instance, Jpfed's very nice explanation contains the following totally accurate definition:
The probability that you will incorrectly fail to reject the null hypothesis is called "beta".
posted by benito.strauss at 11:39 AM on February 14, 2014 [1 favorite]


I'm pretty uncomfortable with the inquisitorial posturing of some of the replicability gang.

Their fervor makes me think they are trying to oust heretics and cleanse the field. Which can't happen because it turns out researchers are human.


Yes, but nobody would say that the share went down 20%. That's because it only went down 1.71%.


You don't read newspapers much, do you? Pretty much every health finding and business swing is reported in precisely the way you say they are not. Even better, they usually don't bother to tell you the base rates, so minuscule changes to low base rates sound like huge swings.
posted by srboisvert at 11:40 AM on February 14, 2014 [3 favorites]


The probability that you will incorrectly fail to reject the null hypothesis is called "beta".

That's not called THAC0?
posted by Navelgazer at 11:40 AM on February 14, 2014 [16 favorites]


I like their argument for Bayesianism. If you have prior reason to think an effect is extremely unlikely to exist, an experiment with a p-value of 0.01 should be taken to show that the effect still probably doesn't exist, but also that the theory that it exists is now a lot more plausible, so the effect deserves further study.
posted by justsomebodythatyouusedtoknow at 11:41 AM on February 14, 2014 [3 favorites]


Science isn't about making groundbreaking discoveries; it's about doing hard, usually unrewarding work because very, very occasionally that tedium pays off for human knowledge.

As an applied scientist, I will humbly submit that my version of science is different from yours: To marshal robust and rigorous epidemiological methods to improve human health.
posted by docgonzo at 11:42 AM on February 14, 2014 [3 favorites]


No, it is a change of 1.71 percentage points.

no jimmy, it's 20%.


Why do you two keep starting your sentences with "no"? You're both right. The confusion between percentages and percentage points actually makes these sorts of issues really difficult for mere mortals and this sort of bickering doesn't help - I think it makes the non-mathematically inclined more likely to throw up their hands and say to hell with the whole process.

That's why, even though it doesn't make any formal difference, I like to use fractions or ratios for one of the values to show that I'm talking about different things. Like, it would be completely accurate to say "the study indicates that [XYZ factors] reduce the divorce rate by about 1/5 - from 7.67% to 5.96%" - n'est ce pas?
posted by Joey Buttafoucault at 11:45 AM on February 14, 2014 [6 favorites]


jimmy, in a beer market of fixed size, if you went from 7.67% down to 5.96% of the market, you'd have to lay off about 20% of your help.

I'm sorry, but you're talking about two different rates here. One is the share of the market (1.71%). The other is your production - in that your portion of that 7.67% is 100%. Yes, you're laying off 20% of your help, because you lost 1.71% of the total market share.
posted by jimmythefish at 11:46 AM on February 14, 2014


(You lost 20% of your own production, but it's 1.71% of the total market). I think that's where the confusion was...it's Friday and my kids were up all night...
posted by jimmythefish at 11:49 AM on February 14, 2014


jimmy this is really simple.
1.71 = the percentage points.
~20% = the percent change.

(Now watch me have screwed that up.)
posted by Navelgazer at 11:50 AM on February 14, 2014 [2 favorites]


I'm sorry, but you're talking about two different rates here. One is the share of the market (1.71%). The other is your production - in that your portion of that 7.67% is 100%. Yes, you're laying off 20% of your help, because you lost 1.71% of the total market share.


Yes - and one of those rates we refer to as "percent change" and one we refer to as "percentage points" precisely to avoid the confusion that would arise if both were referred to as percent.
posted by pemberkins at 11:50 AM on February 14, 2014 [4 favorites]


But 1.71% of the total market share is 20% of your share. Unless you are InBev, you care much more about relative than absolute percentages.

This confusion is why "basis points" are used.
posted by PMdixon at 11:52 AM on February 14, 2014 [2 favorites]


Yes - and one of those rates we refer to as "percent" and one we refer to as "percentage points" precisely to avoid the confusion that would arise if both were referred to as percent.

The market share and the company's share of production were conflated above.
posted by jimmythefish at 11:53 AM on February 14, 2014


Klangklangston, MIT Opencourseware has a great introductory course to Probability and Statistics.
posted by barchan at 11:53 AM on February 14, 2014 [5 favorites]


no joey, we are not both right. i am right and he is wrong. i am not responsible for the fate of the "non-mathematically inclined".

jimmy, it's all good. you can still be a respected pillar of your community even if you're non-mathematically inclined. i'm logging off now because i have a catbox to clean before i go into town; i can't believe i got into an argument over THIS.
posted by bruce at 11:53 AM on February 14, 2014


This is when a P value of 0.05 became enshrined as 'statistically significant', for example. “The P value was never meant to be used the way it's used today,” says Goodman.

Brings to mind a classic paper on statistical tests, The Earth is Round, p < .05 (PDF):

One problem [with null hypothesis significance testing using p values] arises from a misapplication of deductive syllogistic reasoning [when such reasoning is made probabilistic]:

If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

...consider this:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)
This person is a member of Congress.
Therefore, he is probably not an American.


Oops.
posted by shivohum at 11:55 AM on February 14, 2014 [15 favorites]


If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)

Ha, I used that example in a stats presentation (slide 10)!
posted by Mapes at 11:57 AM on February 14, 2014 [3 favorites]


bruce: While I appreciate the put-down tremendously, this sentence here:

jimmy, in a beer market of fixed size, if you went from 7.67% down to 5.96% of the market, you'd have to lay off about 20% of your help.

is referring to two different things. One is total market share, and the other is share of the company's production. I was pointing out that it was a loss of 1.71% of the total market share, and I'm still right. The 20% figure is referring to something else.
posted by jimmythefish at 11:58 AM on February 14, 2014


Percentage change is a change in a quantity from past to present, or present to future. A change from 7.67 million bottles of beer to 5.96 million bottles of beer is a 22.3% reduction. A change from 7.67% to 5.96% is a 22.3% reduction. It is also a decrease of 1.71 percentage points, but it is not a 1.71 percent change, because "percent change" is a rigorously defined term that refers to relative change. I think the argument comes because jimmythefish said

The change from 7.67% to 5.96% is not 20%. It's 1.71%.

And everyone read that as saying "It's a 1.71% change." Saying "It's a 1.71% change" would not be true, but it's not what he said. (Though the fact that he later said "nobody would say that the share went down 20%" didn't help the matter, because people obviously would say that.) What he said was ambiguous, and now we're fighting about the ambiguity. Can we agree that there is a difference between percent change and change in number of percentage points, and be done with the derail?
posted by agentofselection at 12:05 PM on February 14, 2014 [7 favorites]
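
For anyone who wants to check both readings, here is the arithmetic in R, using the figures quoted from the article:

old_rate <- 7.67   # divorce rate (%) for couples who met offline
new_rate <- 5.96   # divorce rate (%) for couples who met online
new_rate - old_rate                     # -1.71: change in percentage points
(new_rate - old_rate) / old_rate * 100  # -22.3: percent change, i.e. relative change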


Maybe I misunderstood, but it looks like you are comparing your actual results with hypothetical results which you call the null hypothesis. How do you come up with those hypothetical results? Does that have to be based on something?
posted by SugarFreeGum at 12:09 PM on February 14, 2014


I've read this twice and I'm not convinced it is that great an article. The problem with P-hacking isn't p-values or statistics, but people testing repeatedly until they get the answer they want, while leaving out the multiple-testing corrections that would suggest that the repeated test that finally succeeded was just lucky. The math is fine; the people are just not applying it correctly, and so reproducibility suffers.
posted by Blazecock Pileon at 12:09 PM on February 14, 2014 [5 favorites]
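
A toy simulation of the repeated-testing problem described above, with a Bonferroni correction applied via R's p.adjust; the number of tests and the sample sizes are arbitrary:

set.seed(42)
# 20 comparisons where the null hypothesis is true by construction
pvals <- replicate(20, t.test(rnorm(25), rnorm(25))$p.value)

sum(pvals < 0.05)                           # "significant" results from pure noise (expect about 1 in 20)
sum(p.adjust(pvals, "bonferroni") < 0.05)   # survivors after correcting for 20 tests (usually none)

Report only the uncorrected winner and you have p-hacked; apply the correction and the lucky test usually goes away.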


"The other is your production - in that your portion of that 7.67% is 100%. Yes, you're laying off 20% of your help, because you lost 1.71% of the total market share."

Yes, but unless it's clear from your sentence that you're referring to the loss of total market share, using "percent" (sans "points") is unclear and should be rewritten.
posted by klangklangston at 12:12 PM on February 14, 2014


We've done this already on the blue, right? Several times?

Because if so, I need to Bonferroni correct how much I care.
posted by cromagnon at 12:16 PM on February 14, 2014 [8 favorites]


SugarFreeGum: The null hypothesis might be based on random chance (e.g. can this alleged psychic predict coin flips at better than 50%?) Or it might be based on some other comparison metric which was actually tested in a separate "experimental treatment" (can your drug make cancer patients live longer than this other drug? Can it make them live longer than a placebo?) What you test against depends on what your null hypothesis is.
posted by agentofselection at 12:17 PM on February 14, 2014 [1 favorite]
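
To make the coin-flip version concrete, here is a quick R sketch; the 60-out-of-100 hit count is invented for illustration:

# Hypothetical: the alleged psychic calls 60 of 100 flips correctly
binom.test(x = 60, n = 100, p = 0.5, alternative = "greater")
# p is about 0.028: a hit rate this high would be unusual under the null of pure guessing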


> does statistics exist primarily to reveal truth, or disguise it?

And by way of this, my opinion is that Statistics exists primarily to describe data, and any attempt to extrapolate from there to "Truth" is some different discipline. Perhaps you could call it Metastatistics.
posted by I-Write-Essays at 12:17 PM on February 14, 2014 [2 favorites]



We've done this already on the blue, right? Several times?

Because if so, I need to Bonferroni correct how much I care.


The only thing more useless than your comment is mine.
posted by MisantropicPainforest at 12:19 PM on February 14, 2014 [2 favorites]


Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place.

Not quite. For any specific study, the probability* that there was a real effect in the first place can't be anything other than exactly zero or exactly one. There either is an effect in the real world, or there ain't... we don't know which, but there's no probability about it.

*I hate odds.
posted by ROU_Xenophobe at 12:24 PM on February 14, 2014


I hope it is not too Pepsi Blue to mention it in the thread (maybe to de-Pepsi it somewhat I won't link) but the book I have coming out in June has a long chapter about exactly this issue, where I try to treat the very interesting mathematical, statistical, and philosophical questions around p-values with the greater depth and range that a book allows.
posted by escabeche at 12:29 PM on February 14, 2014 [12 favorites]


escabeche, looks like an interesting book.
posted by I-Write-Essays at 12:36 PM on February 14, 2014 [4 favorites]


Yeah, but the difference between a six percent chance of something happening and an eight percent chance of something happening is pretty tiny.

Not if it's death.
posted by Mental Wimp at 12:40 PM on February 14, 2014 [2 favorites]


One result is an abundance of confusion about what the P value means4. Consider Motyl's study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.

I've now read this five times and am ready to confess that I flatly do not understand it. Is it simply making the point that a poorly-designed study (with false assumptions, sloppy methodology, criteria calculated to get the desired result, failure to control for lurking variables, etc.) is likely to spit out bad data, and that your P value won't be meaningful if you're a crackpot scientist? In other words, that your results are only as good as your experiment? Or is there something deeper here about the nature of statistics? I don't grok what it means to say that a P-value "cannot . . . make statements about the underlying reality" -- isn't the whole point of doing the calculation to make some sort of statement about the "underlying reality" (viz., the likelihood that it arose by chance) at least to the extent your experiment was successful in getting at the effect you're trying to measure? Otherwise what is it good for?
posted by eugenen at 12:42 PM on February 14, 2014


If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.


Almost right, except the major premise is "Under the null hypothesis these data are less likely than under the alternative."
posted by Mental Wimp at 12:47 PM on February 14, 2014


Almost right, except the major premise is "Under the null hypothesis these data are less likely than under the alternative."

...and then, if you understand why that matters, you're a Bayesian.
posted by Philosopher Dirtbike at 12:55 PM on February 14, 2014 [3 favorites]


eugenen:

It's trying to point out the logic of Bayesian updating without really describing it or going through examples. The basic logic is the same as medical tests that are 95% accurate but where most of the positives are false, and the short answer about how that can be is that if the condition is a rare one (say one percent of people have it), the false positives (5% of 99% or 4.95% of the people you test) can far outnumber the true positives (95% of 1% or 0.95% of the people you test).

What a p-value tells you is how hard it would be to observe a sample like you actually see if the true fact was that there was nothing going on. If we lived in a world where rutabaga consumption had no effect on height, how hard would it be to observe a sample like we did where rutabaga-eaters were on average 4mm taller just because of bad luck in drawing the sample?

It's not, as it's commonly used, the probability that the results are due to chance. It's closer to how hard it would be to observe the results if they were driven by nothing but chance.
posted by ROU_Xenophobe at 12:55 PM on February 14, 2014 [6 favorites]
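
The arithmetic behind that medical-test analogy, written out in R with the same numbers (95% detection, 5% false positives, 1% prevalence):

prevalence <- 0.01                    # 1% of people have the condition
true_pos  <- 0.95 * prevalence        # 0.0095: true positives, as a share of everyone tested
false_pos <- 0.05 * (1 - prevalence)  # 0.0495: false positives, as a share of everyone tested
true_pos / (true_pos + false_pos)     # ~0.16: only about 1 in 6 positives is real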


What a p-value tells you is how hard it would be to observe a sample like you actually see if the true fact was that there was nothing going on.

That's not right. It's the probability of a test statistic as extreme or more extreme than the one you observed. This is critical, because it is precisely this definition that makes significance testing violate the likelihood principle.
posted by Philosopher Dirtbike at 12:58 PM on February 14, 2014 [2 favorites]


It's not, as it's commonly used, the probability that the results are due to chance. It's closer to how hard it would be to observe the results if they were driven by nothing but chance.

I think I get it -- and this is because the probability that the results are due to chance depends on facts about the effect you're trying to measure (e.g. is it so super-rare that it's easily drowned out by even minor statistical fillips or experimental flaws), which the P-value cannot take into account?
posted by eugenen at 1:01 PM on February 14, 2014


And by way of this, my opinion is that Statistics exists primarily to describe data, and any attempt to extrapolate from there to "Truth" is some different discipline.

There is descriptive statistics and inferential statistics. Descriptive statistics is what you're thinking of, and inferential statistics is "using data from a sample to make probabilistic statements about parameters/properties of the population". [I'll give you a point or two if you can just repeat that back on the final exam.]
posted by benito.strauss at 1:02 PM on February 14, 2014 [3 favorites]


The Bayes-for-Dummies explanation I just got (which I'm certain will be super-familiar to statisticians and the like but was new to little ol' me and so might be helpful to others) goes as follows:

There's this disease out there, let's call it Clownitis. Thankfully, there's a test for it.

Now, if you've got Clownitis, the test has a 95% chance of detecting it.
If you don't have Clownitis, the test has a 99% chance of giving you a true negative.

And this all seems pretty compelling.

However, what this means is that, if there's a 1% incidence of Clownitis in the population at large, and you get a positive result, once you do the math, it's actually still more likely that you are negative for the disease. Because your chance of having the disease, multiplied by likelihood of true positive, is (roughly) .95%, while your chance of not having it, multiplied by potential for false positive, is .99%.

I'm interested to know if I got that wrong, of course, but it seemed pretty illustrative to me. (and thanks to inara for the explanation.)
posted by Navelgazer at 1:06 PM on February 14, 2014 [2 favorites]
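
Those Clownitis numbers check out, and they can also be verified by brute force: simulate a large population, apply the test, and count. A rough R sketch (the population size is arbitrary):

set.seed(7)
n <- 1e6                                      # arbitrary simulated population
sick <- rbinom(n, 1, 0.01)                    # 1% incidence of Clownitis
test_positive <- ifelse(sick == 1,
                        rbinom(n, 1, 0.95),   # 95% chance of detection if sick
                        rbinom(n, 1, 0.01))   # 1% false positives if healthy
mean(sick[test_positive == 1])                # ~0.49: a positive result is roughly a coin flip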


a lot of scientists also don't understand Propagation of Error. This was one of the best and most hardcore lessons I've ever learned.

I had an "evil" professor who gave us only 3 assignments for an entire semester. One assignment he deliberately set up so that we thought we were working with a data set comprised of geologic data for different localities, for which we all came up with different explanations what it represented and correlated it. The data turned out to be stock market fluctuations, addresses from our campus he had manipulated into random GPS coordinates, and ages of all his students run through random arithmetic and subtraction formulas. I still shiver at how happy I was to correlate and explain it.

The second assignment had real data for real locations, but had been turned upside down geologically speaking - so it worked out statistically but not realistically: our first lesson about error as well as thinking about what your results *really* mean. None of us caught it.

For the third assignment we had to collect our own data using instruments in his lab, run statistics on it, and interpret the results. We also had to design the experiment from scratch, including formulate our own hypothesis.

What we didn't know is that he was deliberately messing with his instruments and software. He tempted us with instrument and data collection we didn't need to complete the experiment (we all fell for it). He also introduced an element with the software we were using that required us to input the data multiple times, including moving it from Excel into another program, so that we unwittingly introduced our own errors. The whole experiment was designed to teach us about propagation of error and reproducibility.

Having wised up due to the previous assignments we all compared our results (thus the reproducibility element) and realized something was going on. It took us some time to track down most of what was wrong although we never caught that he was messing with the software - because none of us had bothered to try and understand the software besides the brief explanation of how to use it, or else we would have realized immediately.

Not only that, about 90% of our time was used in tedious data collection - giving us a great insight into what it means to be a scientist.

Because he had (deliberately) introduced a limit on time we could use the instruments in his lab, we could not redo our experiments. In the end, half of us took a big breath and reported that our experiments had not produced any usable results - and received A's. The rest tried to do something with the data anyway and received F's.

The class wasn't ostensibly about analysis or statistics. But the lessons I learned about data, instrument calibration, understanding your software, scientific honesty, and results in that class have stayed with me far more than any reading, problem sets, or simple lab experiments ever have.
posted by barchan at 1:12 PM on February 14, 2014 [67 favorites]


and any attempt to extrapolate from there to "Truth" is some different discipline.

No, this is exactly what statistics is. The whole enterprise is about inferring from data to the state of nature. This is why p-values were invented and why Bayesian statistics are done.
posted by Mental Wimp at 1:19 PM on February 14, 2014


Not quite. For any specific study, the probability* that there was a real effect in the first place can't be anything other than exactly zero or exactly one. There either is an effect in the real world, or there ain't... we don't know which, but there's no probability about it.
The definition of probability is extendable.
posted by edd at 1:21 PM on February 14, 2014 [2 favorites]


klangklangston, I did the free Statistics One course on Coursera and found it really helpful. The labs component of the course teaches you a bit of how to use R, so you can use a freely available tool to do your stats properly afterwards.

There is a lot of discussion in the course of precisely this issue of over-reliance on p-values and the ways to avoid making erroneous conclusions from them.
posted by pulposus at 1:23 PM on February 14, 2014 [4 favorites]


>>That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%,

>A reduction of greater than 20% is not tiny!

[Innumerable replies]


This is unfortunate wording that obscures any real discussion of "effect size." When trying to build a model of this kind, we begin by making a set of observations, and will determine a number of factors about these observations. Among these is the uncertainty of the observations (also informally called the variability, or the dispersion). We then try to explain those observations in terms of a variety of factors. "Effect size" (which is the key idea that the article is trying to get at) is not a measure of the difference in rate between these two groups. Instead, it is a measure of how much the factor reduces our uncertainty. Whether a drop from 7.67% to 5.96% is a big effect or a small effect cannot be determined by comparing those two numbers without also discussing the underlying variability.

The test statistic for the 7.67% vs. 5.96% comparison is a "chi-square" statistic, whose value is 9.95, according to the paper. A common metric of "Effect size" for this kind of comparison is the "phi coefficient," which divides the chi-square statistic by the sample size and takes a square root. According to this formula, the effect size for this comparison (where values range from -1 to 1 under ideal conditions, with 0 being "no effect at all") is 0.023, very nearly no effect. So, to that first commenter: This reduction is indeed trivial in size. Knowing whether a couple met online or not tells you almost nothing about how likely they are to divorce.
posted by belarius at 1:28 PM on February 14, 2014 [4 favorites]
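
For anyone who wants to see the phi calculation spelled out: the chi-square value below is the one reported in the paper, but the sample size is only a rough figure implied by the reported effect size, not a number taken from the paper itself.

chisq_stat <- 9.95       # chi-square statistic reported in the paper
n <- 19000               # approximate sample size (assumed; see note above)
sqrt(chisq_stat / n)     # ~0.023: very nearly no effect on this scale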


"Significance" is easily one of the most misleading words in science. Having a p-value below whatever level tells you that the null hypothesis is not likely to show data like you observed. At best, your result is reliably different from the null hypothesis. That does not make it significant! For that you need an effect size that is meaningful for what you're studying.

So many papers conflate statistical "significance" level with scientific significance that the former term should simply be abolished. Just say "condition 1 reliably differed from condition 2 in our data (p < .01)" or something else more defensible and less ambiguous.
posted by parudox at 2:17 PM on February 14, 2014 [1 favorite]


Going to throw out this link again.

Cosma Shalizi talks through a gedanken experiment in which we consider an Earth 2 where scientists work much the way they do here, and particularly use p-values in the same way - but unfortunately all of their hypotheses are wrong.

cstross could use that as a premise for a Laundry short.
posted by PMdixon at 2:20 PM on February 14, 2014 [3 favorites]


That's not right. It's the probability of a test statistic as extreme or more extreme than the one you observed. This is critical, because it is precisely this definition that makes significance testing violate the likelihood principle.

The one is a colloquial and nontechnical way of describing the other.
posted by ROU_Xenophobe at 2:25 PM on February 14, 2014


> Yeah, but the difference between a six percent chance of something happening and an eight percent chance of something happening is pretty tiny.

That shows a complete lack of understanding of statistics. Sorry to be brutal, but it's true.

Your chances of having AIDS in the United States are about 0.3%. If we could just lower this by 0.1%, "pretty tiny", that would mean 400,000 fewer Americans with AIDS - which would save about $150 billion over their lifespans (given a lifetime cost of AIDS of about $350,000).
posted by lupus_yonderboy at 2:36 PM on February 14, 2014 [3 favorites]


It's not at all the case that you can't publish negative results. Any given issue of NEJM probably has a negative result in it. To publish a negative result, you need a well-designed, well-powered study of an interesting question along with technical competence, which is a higher bar than to get published with a positive result. It's still possible though, and if you compare datasets of similar provenance you have a good chance of not being misled.
posted by a robot made out of meat at 2:39 PM on February 14, 2014


To publish a negative result, you need a well-designed well-powered study of an interesting question along with technical competence, which is a higher bar than to get published with a positive result.

Yeah, a much higher bar than for stuff like discussed a few threads down. There don't have to be literally no negative results published to introduce a significant bias.
posted by PMdixon at 2:47 PM on February 14, 2014 [2 favorites]


A common metric of "Effect size"...the "phi coefficient,"...is 0.023, very nearly no effect.

Epidemiologists use the odds ratio, and for this reduction that would be 0.763, or a ~24% decrease in the odds of divorce. Whether that's "tiny" or not depends upon how much you value preventing divorce. The studies that showed the value of colorectal cancer screening showed mortality reductions of 20-33%, corresponding to odds ratios of 0.8 to 0.67.
posted by Mental Wimp at 3:01 PM on February 14, 2014 [4 favorites]
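
The odds-ratio arithmetic, for anyone who wants to reproduce the 0.763 figure from the two rates in the article:

p_online  <- 0.0596                        # divorce rate, met online
p_offline <- 0.0767                        # divorce rate, met offline
odds_online  <- p_online / (1 - p_online)
odds_offline <- p_offline / (1 - p_offline)
odds_online / odds_offline                 # ~0.763, i.e. roughly 24% lower odds of divorce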


Well, Mental Wimp, let's say I've got a couple coins. One of them has turned up heads 75% of the time, and the other has turned up heads 50% of the time.

You might think this is a pretty big difference! But actually they're both fair coins; I only flipped them four times each and got lucky with the first coin. In order to determine the effect of choosing one coin over another, we need to know how many trials went into these percentages. If instead I had done hundreds of coin flips and gotten those percentages, then yeah, that would be a big deal.
posted by a snickering nuthatch at 3:13 PM on February 14, 2014 [1 favorite]


Maybe I misunderstood, but it looks like you are comparing your actual results with hypothetical results which you call the null hypothesis. How do you come up with those hypothetical results? Does that have to be based on something?

I'm going to try to answer these questions, but I may make some mistakes.

The null hypothesis isn't a result or set of results, it's more of a heuristic, a theoretical construct that guides thinking. When you do research, one big goal is to identify important differences between distinct groups with respect to a single attribute or outcome. Generally speaking, the null hypothesis in any given situation states that no patterned difference exists between two qualitatively distinct groups of observations for a single variable of interest.

Let's say you're a medical researcher, and you see lots of men having bad outcomes from a surgical procedure, but few women having them from the same procedure. So you suspect that men have worse outcomes, as a group, than women do for this procedure. Let's go on to say your research hypothesis, then, would be that the probability of a bad outcome for the two groups is not equal (you can also have directional or compound hypotheses, but let's keep it simple). Note that you're not yet characterizing how bad those outcomes are, or how big the between-group difference might be. The null hypothesis there would be that there isn't any actual difference in the bad outcome rate between men and women, and you have to answer that charge first because if there's no difference, any variation you see in individual outcomes is either random or attributable to some factor other than gender, and in either case you wouldn't want to waste your time investigating a pattern that doesn't exist.

So the null hypothesis represents the lowest bar your theory (so to speak) has to pass to be worth discussing or analyzing any further, because in rejecting the null hypothesis, you establish that you now have a reasonable basis for thinking that you've identified a non-random pattern which corresponds with a categorical difference between the two groups, which you can then go on to further investigate and describe.
posted by clockzero at 3:13 PM on February 14, 2014 [1 favorite]
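
Putting invented numbers on the surgery example may help: suppose 30 of 200 men and 15 of 200 women had bad outcomes (both counts are made up). A two-sample proportion test in R then asks how plausible a gap that size is under the null hypothesis of equal rates:

# Hypothetical counts, purely for illustration
bad_outcomes <- c(men = 30, women = 15)
patients     <- c(men = 200, women = 200)
prop.test(bad_outcomes, patients)
# p is about 0.03: under the null of equal rates, a gap this large would be unusual,
# so you would reject the null and go on to characterize the difference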


In the end, half of us took a big breath and reported that our experiments had not produced any usable results - and received A's. The rest tried to do something with the data anyway and received F's.

One problem is that the F-receiving approach has a nonzero chance of getting into Nature or Cell or Science, and the A-receiving approach doesn't.
posted by Mapes at 3:21 PM on February 14, 2014 [4 favorites]


That's not right. It's the probability of a test statistic as extreme or more extreme than the one you observed. This is critical, because it is precisely this definition that makes significance testing violate the likelihood principle.

The one is a colloquial and nontechnical way of describing the other.


That's sloppy; Philosopher Dirtbike's distinction is indeed crucial. If you allow the ambiguity, for example, you'll get two different answers for the p-value of flipping a coin 20 times and getting 17 heads. The correct p-value for null hypothesis significance testing* of a fair vs. weighted coin is the chance of flipping 17, 18, 19, or 20 heads, or 17, 18, 19, or 20 tails, given a fair coin.

*Regardless of whether one thinks this approach is useful or not.
posted by Mapes at 3:34 PM on February 14, 2014 [2 favorites]
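
That coin example worked out in R; both lines give the same two-sided p-value, about 0.0026:

# Two-sided p for 17 heads in 20 flips of a supposedly fair coin
sum(dbinom(c(0:3, 17:20), size = 20, prob = 0.5))   # add up both tails directly
binom.test(17, 20, p = 0.5)$p.value                 # same answer via binom.test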


You might think this is a pretty big difference! But actually they're both fair coins; I only flipped them four times each and got lucky with the first coin.

But this is what statistical significance - i.e., the p-value - is supposed to guard against. And indeed it does, because if you do a Fisher's test here you get p = 1 for the difference between the two coins.

It's definitely true that reporting the ratio of divorce rates doesn't give you any idea of the dispersion around those rates, and it could be very high. However, I think in order to evaluate whether this reduction in divorce rates were actually "trivial" you would need to know, for example:
1. Are there any other predictors of divorce? If so, how good are they? If this is the only reliable effect, that's noteworthy even if it's not very large. Or perhaps there are other predictors, but meeting online is the factor that explains the most variance, or is at least in the top two or three. I mean, I kind of doubt it, but stranger things have been observed.

2. What is the application case for this information, and how practical is it to collect? Maybe you're a lawyer who wants to improve ad targeting and it turns out to be very easy to predict whether a particular viewer met their partner online: in that case, you only need to do a very little better than random in order to improve your targeting, because you're planning to get the ad shown to millions of people. However, if it's hard or expensive to tell whether they met their partner online, or if there is a big penalty to guessing wrong, then suddenly even at the same effect size it's not worth it anymore.
My point is that properly evaluating an effect size usually requires some domain knowledge as well as statistical knowledge.
posted by en forme de poire at 3:38 PM on February 14, 2014 [11 favorites]
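
The Fisher's test mentioned above, written out in R for the two coins (3 heads out of 4 flips vs. 2 heads out of 4):

coin_table <- matrix(c(3, 2,    # heads: coin 1, coin 2
                       1, 2),   # tails: coin 1, coin 2
                     nrow = 2)
fisher.test(coin_table)$p.value   # exactly 1: four flips apiece cannot tell the coins apart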


I've now read this five times and am ready to confess that I flatly do not understand it.

A concise way of stating it is that the p-value is the probability of the result you saw GIVEN that the null hypothesis is true, or P(Result | Null). What people are actually usually interested in, though, is P(Null | Result). It turns out that those two quantities are related, but not the same. In order to go from the first one to the second one, you get something like:

P(Null | Result) = P(Result | Null) * P(Null) / P(Result)

or, to expand on that a bit:

P(Null | Result) = P(Result | Null) * P(Null) / [ P(Result | Null)P(Null) + P(Result | !Null)P(!Null) ]

So in order to actually get an estimate of P(Null | Result), it turns out you need to know P(Null). That's the prior probability - i.e., not knowing the result, what is the baseline probability of its being true?

I usually find that a good illustration here is a very rare tropical disease with very common symptoms. Let's say you live in the Northeastern USA, you have a headache, and your (irresponsible) doctor gives you a test for RareTD. The test only has a 5% chance of false positives and a 5% chance of false negatives. You take it and you get a positive result. So what's the chance that you're actually sick with RareTD? Is it 95%?

Well, it actually turns out to be less than 5% - because as it turns out, the prevalence of this disease in the Northeastern USA is less than 0.05% of people. That baseline rate is the P(Null) above.

So you can see, it's a higher risk than it was if you had never taken the test - but it's nowhere near conclusive.

Of course it gets way more complicated when we can't actually measure P(Null), but that doesn't mean it's not there lurking in the background, complicating all of our naive interpretations of p-values.
posted by en forme de poire at 3:54 PM on February 14, 2014 [5 favorites]
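
That formula, wrapped as a small R function and fed the rare-tropical-disease numbers (here the hypothesis H is "you actually have RareTD", and the 0.05% prevalence plays the role of the prior):

# P(H | Result) from P(Result | H), P(Result | not H), and the prior P(H)
posterior <- function(p_result_given_h, p_result_given_not_h, p_h) {
  p_result_given_h * p_h /
    (p_result_given_h * p_h + p_result_given_not_h * (1 - p_h))
}

posterior(p_result_given_h     = 0.95,    # true positive rate of the test
          p_result_given_not_h = 0.05,    # false positive rate of the test
          p_h                  = 0.0005)  # prevalence in the Northeastern USA
# ~0.0094: even after a positive test, under a 1% chance of actually having RareTD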


Or what Navelgazer posted above! Sorry, missed that on my first go through the thread.
posted by en forme de poire at 4:06 PM on February 14, 2014 [2 favorites]


Interestingly, in social science (econ and sociology especially), while we all still use p-values, the emphasis is on showing your results multiple times in multiple ways. We tend to work with large data sets, where there are often many clearly significant effects. To publish, though, you need to be able to explain why something happens, not just that there is an effect.

For example, I was working on a paper on the way crowdfunding allows women to overcome historical constraints on getting funds to launch businesses. Just showing that women are more likely to succeed than men (by 14%!) is cool, but not useful. I need to show why. I argue that it is due to women disproportionately supporting other women out of a kind of activism. I can't simply show this is true in one test; instead, I have to be able to test this multiple ways (in historical data, in a lab experiment, using multiple methods, etc.), while my peers grill me about possible errors. Though any single test may come in at p less than .05, the multiple approaches to the same problem get around many of the issues here.

It isn't about p-values, it is about experimental design.

Also, I would say that anything that changes the divorce rate by 20% is an enormous change. It is rare for any policy to have such a large effect.

Determining effect size relies on a few factors besides significance alone (which others have discussed): How big the effect is X How large the population is X How important the problem is. There are around 877k divorces in the US. Lowering that by 20% would result in around 700k divorces. Divorce increases the chance of poverty, along with other costs, so the reduction of 177k divorces would be a very big deal.
posted by blahblahblah at 4:57 PM on February 14, 2014 [2 favorites]


I should qualify my divorce study comments - I was working off the reported results, not the study itself. And the study is not as well-designed as I hoped.

This is mostly because the results they reported ignored covariates that would predict divorce, and any systemic differences in population. There is likely to be an endogenous effect; that is, populations looking online are likely different from those looking offline. They don't have enough controls to figure this out, and it seems pretty thin on the social science side.
posted by blahblahblah at 5:09 PM on February 14, 2014 [1 favorite]


When I was in grad school I knew these guys who wrote a data mining program that would sift through data sets and pick dependent, independent and control variables to find the largest possible R^2. Invariably, the equations were nonsense.
posted by Crotalus at 5:10 PM on February 14, 2014


When I was in grad school I knew these guys who wrote a data mining program that would sift through data sets and pick dependent, independent and control variables to find the largest possible R^2. Invariably, the equations were nonsense.

library("overfitting")
model <- overfit(sexy.variable ~ weather * stock.prices * whatever.the.fuck + 1)
publish(model)

posted by en forme de poire at 6:08 PM on February 14, 2014 [5 favorites]


I had an "evil" professor who gave us only 3 assignments for an entire semester.

Wanna guess what's gonna happen to the next crop I have to teach intro stats to?
posted by ROU_Xenophobe at 6:14 PM on February 14, 2014 [13 favorites]


One problem is that the F-receiving approach has a nonzero chance of getting into Nature or Cell or Science, and the A-receiving approach doesn't.

Yeah, it's a widely acknowledged problem - many of us have probably dreamed of a journal or forum where we could publish those failed results and hypotheses.

However, I'm not sure if it was an intended consequence, but when I read a paper by one of those who received an F, I'm highly suspicious of their scientific integrity. Others in that class have admitted the same. In a way, that was the most valuable lesson of all.

Wanna guess what's gonna happen to the next crop I have to teach intro stats to?

hahaha! You'll like this, then - I found out later that in another class, this professor gave everyone data but told them what the result should be. Of course it was nonsense, but he wanted to see to what extremes people would manipulate the data to get the result.
posted by barchan at 6:46 PM on February 14, 2014 [2 favorites]


I did once make them use ideology to predict age and interpret the results.
posted by ROU_Xenophobe at 7:45 PM on February 14, 2014 [2 favorites]


> Epidemiologists used the odds ratio, and for this reduction that would be 0.763, or a ~24% decrease in the odds of divorce.

(1) An epidemiologist will also point out that the range of values for odds ratios does not have a ceiling of 1, so 0.763 is not necessarily a large or small value. Given that this finding was a single result from a massive survey of around 20,000 people, it's very likely not among the best predictors of divorce.

(2) An epidemiologist will also be aware that odds ratios are not measures of risk.

I'm not arguing about the relative merits of changing the divorce rate (after all, reducing the infection rate of a rare disease is laudable). The issue here is whether p less than 0.05 corresponds to the size of the effect (it does not), and nearly all of the comments here suggest that most people In The Blue don't have much (or any) idea what effect size means in a technical sense.
posted by belarius at 9:15 PM on February 14, 2014


clockzero: So the null hypothesis represents the lowest bar your theory (so to speak) has to pass to be worth discussing or analyzing any further, because in rejecting the null hypothesis, you establish that you now have a reasonable basis for thinking that you've identified a non-random pattern which corresponds with a categorical difference between the two groups, which you can then go on to further investigate and describe.

This is what is commonly thought, but it is wrong. Rejecting the null hypothesis does not establish anything in an epistemic sense. Neyman was quite clear that epistemology had nothing to do with statistical testing, and Fisher only asserted the link; but, as pointed out up thread, without further (strong) premises, the link is invalid. Rejecting the null hypothesis does not give you a reasonable basis for thinking anything other than that you observed a p value less than a specific criterion.
posted by Philosopher Dirtbike at 12:02 AM on February 15, 2014 [3 favorites]


Given that this finding was a single result from a massive survey of around 20,000 people it's very likely not among the best predictors of divorce.

Prediction isn't the only nor necessarily the most frequent use of OR. More frequently the desire is to identify potential causal associations (not saying this is what the study was trying to do; just a general observation). An optimal predictive model can be confounded for causal associations depending upon the data available.

An epidemiologist will also be aware that odds ratios are not measures of risk.

Of course not. They are ratios of risk, either directly using odds as the measure of risk or indirectly as an approximation of relative event rates. The latter can, of course, be directly estimated from the given data and if you do so, you'll see they are about equal.
posted by Mental Wimp at 8:24 AM on February 15, 2014


This is what is commonly thought, but it is wrong. Rejecting the null hypothesis does not establish anything in an epistemic sense. Neyman was quite clear that epistemology had nothing to do with statistical testing, and Fisher only asserted the link; but, as pointed out up thread, without further (strong) premises, the link is invalid. Rejecting the null hypothesis does not give you a reasonable basis for thinking anything other than that you observed a p value less than a specific criterion.

Thank you. I was hoping that someone more knowledgeable than I would point out if I'd gotten anything wrong.
posted by clockzero at 9:47 AM on February 15, 2014


But this is what statistical significance - i.e., the p-value - is supposed to guard against.

Oops, you are right. My bad.
posted by a snickering nuthatch at 6:02 AM on February 17, 2014

