Lies, damned lies, and exaggerated significance tests
January 18, 2011 8:14 AM

Apparently They, of "They say..." fame, have been misusing the statistical significance test. So much of what "They say" might not actually be. (via)
posted by cross_impact (51 comments total) 29 users marked this as a favorite

I knew this would end this way.
posted by chavenet at 8:15 AM on January 18, 2011

The statistical significance test cannot be used correctly, because it is a flawed method for statistical inference. That's why my colleagues and I (including those interviewed for this article) have consistently argued against its use, and for the use of Bayesian methods.

Friends don't let friends compute p values.
posted by Philosopher Dirtbike at 8:25 AM on January 18, 2011 [6 favorites]

I think the other factor that goes into this is the interestingness problem. As far as the scientific method is concerned, a non-effect is still a result, and equally valuable as the experiment that produces a result. However, there is a strong bias to publish only interesting results, on the part of the experimenter, his financial backers, and the journal's editors.

Consider the 5% threshold. If you perform 20 studies on a drug that is no better than a placebo, you would expect one of those studies to measure an effect greater than placebo with p=0.05, simply due to chance. The people performing the other 19 studies would have little motivation to publish, so we're left with the one statistical anomaly hitting the journal.
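
A quick sketch of that arithmetic in Python, assuming the 20 studies are independent and the drug truly has no effect (the variable names are just for illustration):

```python
# Assumption: 20 independent studies of a drug that is no better than placebo,
# each tested at the conventional 5% significance threshold.
alpha = 0.05
n_studies = 20

# On average, how many of the null studies come out "significant" by chance.
expected_false_positives = n_studies * alpha

# Chance that at least one of the 20 studies is a false positive.
prob_at_least_one = 1 - (1 - alpha) ** n_studies

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {prob_at_least_one:.3f}")
```

So "one in 20" is the expected count; the chance of seeing at least one such fluke is closer to two in three.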

I'm not necessarily convinced the solution is as simple as finding a new threshold for evaluating statistical significance, although that's probably a part. I think we need to not jump on every new study as uncovering some awesome new truth about the universe. However, science reporting thrives on sensational stories, so I'm guessing this won't happen.
posted by knave at 8:26 AM on January 18, 2011 [5 favorites]

I didn't even realize They were computing statistical significance before They said things. If anything, They have gone up in my estimation.
posted by DU at 8:32 AM on January 18, 2011 [3 favorites]

I used to be They. When I worked at the library answering the phone, sometimes people would call us to settle arguments, usually from a bar, to answer some bit of sports trivia. Or every St. Patrick's Day someone would drunkenly call from a bar and ask what the colors on the Irish flag stood for.

We would answer the question, then the person would turn from the phone and say: "They said xxx...."

I was They.
posted by marxchivist at 8:38 AM on January 18, 2011 [4 favorites]

previously
posted by nangar at 8:44 AM on January 18, 2011

Great post! This article goes a bit deeper.

Big Science should be deeply ashamed of their addiction to this fraudulent but convenient substitute for real significance, and the concomitant rise of statistical significance chasing in publication strategies that is doing huge harm to the achievement and credibility of scientific research. See e.g. this great piece on Lies, Damned Lies, and Medical Science just published in The Atlantic.
posted by Philosopher's Beard at 9:03 AM on January 18, 2011

Already someone close to me (a religious person prone to alien conspiracies) is using this article to refute the reliability of science and the scientific method. I think this is a well written piece but it highlights the problems with interpretation. Interpretation is a subjective act. Not only must data be interpreted, but NYT articles written about data interpretation must be interpreted, and it's not always done by others as YOU would do it.
posted by madred at 9:04 AM on January 18, 2011

Me: Hey, I know I'm just a post-doc, but I think we should all suddenly switch to using a completely different statistical methodology--a change that will ultimately lead to our lab publishing fewer papers and getting less grant money!
Everyone else: ...

Which is to say, I don't see how a shift towards using Bayesian stats is ever going to happen.
posted by IjonTichy at 9:09 AM on January 18, 2011

I was They.

Hello? Santa Claus's reindeer? Of course I can... let's see, there's Dopey, Grouchy, Sneezy, Sleepy, Happy, Bashful, Rudolph and Blitzen! You're welcome!

(marxchivist: if you haven't seen Desk Set, rent it now!)
posted by The Bellman at 9:10 AM on January 18, 2011

As far as the scientific method is concerned, a non-effect is still a result, and equally valuable as the experiment that produces a result.

The first professor I worked for was feuding with a few other professors with competing ideas. To score points against his opponents, he would have to find certain statistically-significant effects that his ideas predicted and his opponents' would not predict.

But let's imagine a world where we can publish non-effect results easily. Then scoring points against his opponents becomes much easier; he can just do an experiment looking for an effect that would be predicted by his opponents, but introduce enough error variance (or under-power the study) so that the effect (even if it existed) would not be found. Likewise, his opponents would be able to attack his ideas with the same ease.

It would be a very different game.
posted by a snickering nuthatch at 9:13 AM on January 18, 2011 [1 favorite]

Bayesian methods are commonplace, if not dominant, in my little bit of science. It should be so in other areas and, as I think the article hints, will be more so in the future.
If you think you'll be lacking publications then just write a bunch of papers on The One True Way of Statistics for your field to make up for it.
posted by edd at 9:16 AM on January 18, 2011

Jpfed: Bayesian inference naturally accounts for error bounds and predictive strength in models. The approach you suggest would simply result in an indecisive result.
It's a direct and fair competition between models.
posted by edd at 9:20 AM on January 18, 2011

Correct me if I'm wrong, but I thought the problem with the psy article is that they were doing exploratory data analysis and then interpreting the resultant p-values as tests of their hypothesis. If you're doing a Bayesian technique, then who's to say that the subjectivity of the prior won't lead us back to a similar or worse situation, where experimenters adjust their prior until they find significance? To me, the problem seems more a general misapplication of frequentist and/or Bayesian techniques. But it seems the coverage in the press paints it as a Bayesian refutation of frequentist methodology.
posted by ehassler at 9:28 AM on January 18, 2011

Also previously!

In health sciences there is a very slow shift towards Bayesian methods, but at least when I was in school there was still a clear split between the older faculty using old school frequentist methods (many of whom couldn't have passed their dept's own current stats/methods courses) and those who were incorporating Bayesian analysis. As graduate students we felt like ooh we're ahead of the curve learning this stuff, but the reality is more like what IjonTichy describes. I figure it'll happen eventually, but there's still a long ways to go.
posted by mandymanwasregistered at 9:40 AM on January 18, 2011

ehassler, there wasn't only a single problem with the methodology in Bem's article. There were several, and one of them was a reliance on significance testing. And some of the other problems (like not correcting for multiple comparisons in exploratory research, for instance) arise from this problem. And some of them don't.

The interesting thing about this is that most of psychology depends on significance testing. Bayesians in psychology are hoping that this article, and the responses, are a wake-up call to psychologists. Carey, the author of the NY Times article, decided to focus on this problem for a followup article.
posted by Philosopher Dirtbike at 9:54 AM on January 18, 2011

And some of the other problems (like not correcting for multiple comparisons in exploratory research, for instance) arise from this problem. And some of them don't.

I think the underlying problem in most of psychological research is the chronic under-powering of their studies, so that, coupled with significance testing, the dominant effect is publication bias. When effects are at the margin of power, and you do a lot of experiments, then the "positive" ones, those that get published, will largely be false positives, since real new scientific findings are rare.
posted by Mental Wimp at 10:13 AM on January 18, 2011 [2 favorites]

I should add, this is true of epidemiology as well.
posted by Mental Wimp at 10:14 AM on January 18, 2011 [1 favorite]

Mental Wimp: This is true of _bad_ epidemiology. I work on a prospective cohort with decades of follow-up and no lack of power for most things. Given the trend towards multi-site studies, linked data, etc., I think the problem of underpower will diminish.

Also: Fighting over p-values is the Mac vs. PC of the biostats world.
posted by docgonzo at 10:24 AM on January 18, 2011

Can someone explain the alternative in small words for a historian with a little brain?

I had statistical significance explained to me in small words: basically, if your sample is small compared to the size of your universe, your effect has to be larger to be statistically significant.

Not just psychology - but also epidemiology relies on statistical significance. And a little bit of history.
posted by jb at 10:25 AM on January 18, 2011

Could someone describe what is meant by "under-powering" for the cheap seats?
posted by leotrotsky at 10:34 AM on January 18, 2011

Also: Fighting over p-values is the Mac vs. PC of the biostats world.

Mac vs. PC? In my house, it's an Israel/Palestine-type issue.
posted by Philosopher Dirtbike at 10:35 AM on January 18, 2011 [2 favorites]

The article Philosopher's Beard linked to is a much better write up of this topic and does address some of the problems and solutions inherent in stats. Specifically, it points out that a study never gets done just once and gets accepted as proof (not that any statistician would ever claim any p-value as proof). Also, that the relevance of the significance depends a great deal on the powering of the study, and that there is a huge gulf between statistical and practical/clinical significance.
posted by Panjandrum at 10:40 AM on January 18, 2011

edd- I should have clarified that I was talking about publishing null results within the framework of frequentist statistical inference.

If we move the discussion to a Bayesian framework, then the game between the feuding profs would become bickering over priors.
posted by a snickering nuthatch at 10:47 AM on January 18, 2011

Could someone describe what is meant by "under-powering" for the cheap seats?

Quick and dirty
Statistical power is tied up in sample size. Basically it addresses the question: Can this particular study actually find a relevant result, or will it just spit out garbage that could mean anything? Quite a lot of study designs use a power calculation in order to determine the required sample size for a study to be relevant.

However, the formula in question requires a bit of foreknowledge of what you are testing (the video I linked above is a humorous look at this):
- The standard deviation of the groups being tested
- The probability of a Type I Error (.05 is the standard)
- The probability of a Type II Error (this is really the "power" of the study; typically set at .2)
- How big of a practical/clinical effect is desired

This page has a pretty friendly rundown of the details, along with the formula to which I am referring.
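
For a rough feel of how those four inputs combine, here is a minimal Python sketch. It assumes a common textbook normal-approximation formula for comparing two group means (equal group sizes, equal standard deviations, two-sided test), which may not be exactly the formula on the linked page; the function name and the example numbers are made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sd, effect, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a difference in
    means of size `effect`, given a common standard deviation `sd`.

    Assumption: normal approximation, two-sided test, two equal-sized groups.
    `power` is 1 minus the Type II error rate.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error rate
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


# Hypothetical numbers: SD of 10 units, clinically relevant difference of 5 units.
print(sample_size_per_group(sd=10, effect=5))
# Halving the effect you want to detect roughly quadruples the required sample.
print(sample_size_per_group(sd=10, effect=2.5))
```

The point of the sketch is the shape of the relationship: required sample size grows with the square of the noise-to-effect ratio, which is why small effects are so expensive to study.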
posted by Panjandrum at 10:58 AM on January 18, 2011 [1 favorite]

They say of the Acropolis, where the Parthenon is...
posted by xedrik at 11:14 AM on January 18, 2011 [1 favorite]

Already someone close to me (a religious person prone to alien conspiracies) is using this article to refute the reliability of science and the scientific method.

People will grasp at any straws to support their delusions.

The robustness of the scientific method is that we can use the scientific method to evaluate how people are doing science sloppily.
posted by Sidhedevil at 11:14 AM on January 18, 2011 [1 favorite]

Panjandrum's got it; to be even briefer (and a bit too crude), an underpowered study is one in which there are insufficient participants or observations to come to any statistically valid conclusion. Typically in epidemiology the number of participants recruited into a study is just enough, given the formula above, to come up with a valid result. (Funding bodies are usually loath to give funding to recruit/interview/etc. "extra" participants.) The obverse can also be true: It can be difficult to study some diseases because the number of individuals does not allow valid comparisons to be made.
posted by docgonzo at 11:15 AM on January 18, 2011 [1 favorite]

Panjandrum, the power is the complement of the Type II error rate; a commonly recommended power is 0.8.
posted by Philosopher Dirtbike at 11:23 AM on January 18, 2011 [1 favorite]

Philosopher Dirtbike, I'm familiar with problems due to exploratory data analysis and multiple hypothesis testing. When I read the criticism of the Psy article I fully agreed with their contention that using data to generate a hypothesis and then using that same data to test that hypothesis is flawed.

The second point they made, about needing an extremely skeptical prior, I wasn't sure I followed them in their reasoning, since the construction of the prior is subjective and can be tweaked to yield erroneous results in the same way that doing the exploratory data analysis can lead one to hypothesis tests that are optimistic.

When you write "one of [the problems] was a reliance on significance testing" it sounds as though there is a more general problem with frequentist significance tests, and that a Bayesian perspective such as recommended in the criticism can remedy such a problem. I'm interested to learn more about this argument in general; can you elaborate and/or point me towards some literature on the subject?
posted by ehassler at 12:05 PM on January 18, 2011

As a result, these experts say, the literature is littered with positive findings that do not pan out.

Just like business reporting...
posted by Alexandra Kitty at 12:47 PM on January 18, 2011

Could someone describe what is meant by "under-powering" for the cheap seats?

If a study has a sample size that reduces the error just to the point where only a large effect can be detected above the noise (or random variation), then frequently the noise will look like an effect (in fact, if the significance cut-off is 5%, noise will appear to be an effect 5% of the time). Small sample sizes mean that significant effects are much more likely to be noise, since the real effects are unlikely to be seen due to the low power of the sample to detect them.

I tried to express the idea without using much technical jargon, but may have failed. To compute the power of the test, you first assume that an effect of a specific size actually exists, then compute the probability that the test of significance will be positive ("reject the null hypothesis" in frequentist statspeak). If, for example, you design your study for 80% power to detect a 50% increase in the measured outcome, but there really is only a 10% increase, you will most likely miss it. If you do 200 studies and only 10 address a real effect and those are underpowered, all you will likely see is 10 positive studies, none of which identify a real effect.
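
The 200-study example can be put in numbers with a small Python sketch. The power values below (80% and 20%) are assumptions chosen only to contrast a well-powered literature with an underpowered one; the function name is illustrative.

```python
def expected_positives(n_studies, n_real, power, alpha=0.05):
    """Expected counts of significant results across a set of studies.

    Assumption: `n_real` studies address a real effect and each detects it
    with probability `power`; the rest test a true null at level `alpha`.
    """
    true_pos = n_real * power
    false_pos = (n_studies - n_real) * alpha
    share_real = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, share_real


for power in (0.8, 0.2):
    tp, fp, share = expected_positives(n_studies=200, n_real=10, power=power)
    print(f"power={power:.1f}: {tp:.1f} true positives, "
          f"{fp:.1f} false positives, {share:.0%} of positives are real")
```

Under these assumptions the 190 null studies contribute about 9.5 false positives either way, so the lower the power, the smaller the share of published "findings" that reflect a real effect.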
posted by Mental Wimp at 12:55 PM on January 18, 2011

Let's say I'm testing a drug. I hypothesize that it may have an effect on a symptom which I can measure, but I don't know yet how much effect it will have, or whether it will have any effect at all. To test this, I'll compare assessments of the symptom in patients given the drug to those of patients given a placebo. (We'll assume I know what results are expected from people given the placebo, but not the drug.)

How do I come up with a Bayesian prior in this situation?

Philosopher Dirtbike, edd, anyone who understands Bayesian statistics, can you explain this to someone who has no background in Bayesian statistics? I have an undergraduate degree in psychology, which means I studied the supposedly "bad" kind of statistics - and I'm rusty at that. My understanding of Bayesian statistics is limited to 'I basically get how Bayesian spam filters work.'

I've read Wagenmakers, et al's critique of Bem's article, "Feeling the Future." I get (and agree with) their criticism that Bem's analysis of his experiments was exploratory, or partially a "fishing expedition" (basically that if you dice data up enough you'll eventually get positive results somewhere). I don't follow their argument that all experimental research should use Bayesian statistics.

I understand how Bayesian statistics could be used to determine whether someone, or a population, had been given a known drug, with a known effect size, or a placebo, but not how you would use it to determine whether a possible drug had some effect or not.

I realize that the real answer to this question is 'learn Bayesian statistics, and you'll understand the critique and be able to evaluate it,' and maybe this question would be better for AskMe. But I'm hoping for a quick education.
posted by nangar at 12:59 PM on January 18, 2011

ehassler, there are several lines of argument that Bayesians level at significance tests. I'll outline the ones I find particularly valuable here, with some background for people with less knowledge of significance testing.

In order to do a statistical significance test, we first specify a "null" hypothesis. Actually, a better word for this hypothesis is "default" hypothesis, because we set this hypothesis up as a straw man to be falsified. An example of this type of hypothesis is, for instance, that a coin is fair (that is, the probability of "heads" is 0.5). The important thing about this hypothesis is that it usually corresponds to a single value of a statistical parameter (like that a probability is exactly 0.5, or a mean is 100).

We then collect data, and it is our goal to quantify how much the data is inconsistent with the hypothesis. If we were flipping a coin 500 times, and we got 227/500 heads, this would be inconsistent with a fair coin. It wouldn't be impossible of course (in fact, this outcome would occur 0.4% of the time, given a fair coin) but it happens only rarely if the coin is fair. Now, of course, our question is "how inconsistent with the hypothesis is 227/500?"

The way a significance test works is that we take our data, and we compute a measure of how inconsistent with the default hypothesis it is. Now, strictly speaking, all possible outcomes of our 500 flips are unlikely; even the most likely outcome, 250 heads, would only occur 3.5% of the time. So we can't just say that 227/500 is unlikely, and conclude that the default hypothesis is false. Instead, in a significance test, we compute a p value, which is the probability, if the default hypothesis were true, of obtaining a result as extreme as, or more extreme than, the data we observed.

In the case of our coin flipping, the probability of obtaining 227 or fewer, or 273 or more, heads out of 500 flips (all of which outcomes are as inconsistent or more inconsistent with the default hypothesis as 227 heads is) is something like p = 0.04, which is a fairly low value. This value is traditionally enough to reject the default hypothesis in the social sciences. Since the probability of getting something as extreme or more extreme is so low, the logic goes: "Either something extremely rare happened, or the hypothesis is false. This event is sufficiently rare under the default hypothesis, that I no longer believe the hypothesis is tenable."
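
For anyone who wants to check the coin-flip numbers, here is a minimal Python sketch using the exact binomial distribution. The helper names are illustrative, and "as extreme or more extreme" is taken to mean 227 or fewer, or 273 or more, heads.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, observed = 500, 227
distance = abs(observed - n // 2)  # 23 heads away from the expected 250

# Probability of the single observed outcome, and of the most likely outcome.
p_observed = binom_pmf(observed, n, 0.5)
p_most_likely = binom_pmf(n // 2, n, 0.5)

# Two-sided p value: every outcome at least as far from 250 as 227 is.
p_value = sum(
    binom_pmf(k, n, 0.5)
    for k in range(n + 1)
    if abs(k - n // 2) >= distance
)

print(f"P(exactly 227 heads) = {p_observed:.4f}")
print(f"P(exactly 250 heads) = {p_most_likely:.4f}")
print(f"two-sided p value    = {p_value:.4f}")
```

Note how the p value adds up many outcomes that were never observed, which is what the objections in this thread turn on.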

At this point our critic (the Bayesian, or anti-significance test frequentist) speaks up. They note several things. First, they note that in traditional significance testing, there is no way to accept the null hypothesis. It was the assumed hypothesis; had it not been rejected, we simply would have said we have insufficient evidence to reject it. We can't accept a null hypothesis. If the null hypothesis were tenable a priori, though, then we've ruled it out by our choice of method. This is critical, especially when the null hypothesis is something like "there is no ESP".

Our critic also notes that in order to compute the p value, we counted outcomes that did not occur against the null hypothesis. We computed the probability of an event as extreme or more extreme. But why would we use events that did not occur as evidence against a hypothesis? That's almost the Chewbacca defense, except in statistical form.

But the heart of the matter is when the critic asks, "How much evidence is there against the null hypothesis, compared to every other hypothesis, in only the data we observed?" To compute this, we need to compare the probability of getting 227/500 heads given the default hypothesis, to that for every other possible probability of heads. As it turns out, the most "likely" other hypothesis is that the probability of heads is 227/500=0.454. To compute the evidence against the default hypothesis, compared to the most likely other hypothesis, we compute the ratio of the probabilities of 227/500 heads, given each hypothesis. This ratio is 8.3 - that is, the outcome we observed is 8.3 times more likely under the cherry-picked alternative hypothesis, compared to the null hypothesis.
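
The ratio is easy to reproduce. A minimal Python sketch, working with logarithms to avoid multiplying very small numbers (the binomial coefficient is the same under both hypotheses, so it cancels; the function name is illustrative):

```python
from math import exp, log


def log_likelihood(p, heads, flips):
    """Log-probability of the observed flips given P(heads) = p,
    leaving out the binomial coefficient, which cancels in any ratio."""
    return heads * log(p) + (flips - heads) * log(1 - p)


heads, flips = 227, 500
p_best = heads / flips  # the best-fitting alternative, 0.454

ratio = exp(log_likelihood(p_best, heads, flips) - log_likelihood(0.5, heads, flips))
print(f"likelihood ratio, best alternative vs fair coin: {ratio:.1f}")
```

Because the alternative is picked after seeing the data, this is the largest ratio any single alternative value could achieve, so it is an upper bound on the evidence against the fair coin.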

This is taken by frequentist and Bayesian statisticians to be a measure of the evidence against the null hypothesis; that is, 8.3 to 1 against. This isn't staggering evidence (especially given that we would have rejected the hypothesis outright, given the p value), and even worse: for larger sample sizes, the p value could tell you to reject the null, while the evidence ratio tells you that it is as likely as the alternative (this is called Lindley's paradox).

So, we arrive at a situation where we are rejecting hypotheses, based on very little evidence for two reasons: first, our method doesn't allow us to accept the null in the first place, no matter how much evidence we have; and second, p values unfairly count events that did not occur against the null hypothesis.

And those are some of the major reasons why Bayesians hate p values. If you want more, check out Wagenmakers et al's "Bayesian versus frequentist inference", which you can find on my website here (but don't tell anyone I have an illegal copy there :)
posted by Philosopher Dirtbike at 1:15 PM on January 18, 2011 [6 favorites]

nangar: How do I come up with a Bayesian prior in this situation?

Philosopher Dirtbike, edd, anyone who understands Bayesian statistics, can you explain this to someone who has no background in Bayesian statistics?

You might try to read our paper on Bayesian t tests, which is written for people with little exposure to Bayesian statistics, but who have a familiarity with traditional methods. See the paper here, complete with an online Bayesian inference interface here. The Bayesian test we advocate there uses only the traditional t statistic and the sample size to yield an inferential statistic, so it is pretty easy to use if you know about classical statistics.
posted by Philosopher Dirtbike at 1:21 PM on January 18, 2011 [1 favorite]

Philosopher Dirtbike, in your coin flip example, the unobserved events are counting in favor of the null hypothesis, since you obtain p by adding the probability of the more extreme events. Since the null hypothesis specifies that those events are possible, it isn't remotely Chewbacca-esque to count them when considering it.

Once we begin to consider a very large number of potential hypotheses (e.g. bias in the coin may be any particular number) we run into the same problem as genetic screens against a large number of genes --the evidence in favor of any one is likely to be small, but many are likely to appear to have some amount of support. This requires a scientist to stop playing with numbers for a minute and think. As a math-oriented biologist, I've found that experimentalists are rarely bamboozled by p-values, and are usually pretty good about using significance tests to guide their reasoning rather than rule it.

In any case, if you know any radical Bayesians who are content to play a gambling game where a coin comes up heads 227 times out of 500, tell them I'm in, as long as the bets are small and I get to bet tails.
posted by Humanzee at 5:22 PM on January 18, 2011

Humanzee, there is no evidence FOR the null hypothesis in a significance test, only against. The point is that those events that are more extreme, that did not occur, are somehow relevant in computing evidence contributed by what did occur. They aren't, and this is not a uniquely Bayesian perspective. Frequentist likelihood ratios are based on the same idea.
posted by Philosopher Dirtbike at 11:45 PM on January 18, 2011

I dunno. Most of the problems you cite are only relevant if you take an exceedingly narrow, lawyerly, almost autistic view of what a significance test is telling you and ignore what the model actually says.

You can certainly find evidence in favor of a null hypothesis from frequentist statistics. The most obvious example would be finding a confidence interval around an effect that is very narrow and brackets zero.

But unless you have a theory-based prediction that's very, very strong, you'd also have to look at results where an effect has a p-value of 0.001 but a coefficient that's not substantively important as, like the first case, the model telling you that it's pretty damn sure that the effect of the variable is almost exactly nil.

Likewise, a model where the p-value is 0.04 and the coefficient is large and positive, which implies a confidence interval running from just barely different from zero to some very large amount, is the model telling you that it really has very little idea what the effect might be, but it's probably not negative, which is in many ways a weak finding.

I have a lot of sympathy for Bayesian approaches. But a lot of the problems that are brought up with frequentist approaches really seem to me to be more problems of fundamentally misunderstanding the limited things they're saying and just having weak theories.
posted by ROU_Xenophobe at 6:33 AM on January 19, 2011

I dunno. Most of the problems you cite are only relevant if you take an exceedingly narrow, lawyerly, almost autistic view of what a significance test is telling you and ignore what the model actually says.

We're talking about statistical methods of inference. The method of inference is what it is - there's nothing "autistic" in talking about the logical implications of an inferential theory.

You can certainly find evidence in favor of a null hypothesis from frequentist statistics. The most obvious example would be finding a confidence interval around an effect that is very narrow and brackets zero.

There are some really great examples out there of frequentist confidence intervals (that is, confidence intervals which "cover" the true value X% of the time), and how they fail to quantify evidence at all. Here's one: suppose we are sampling from a uniform distribution between m-0.5 and m+0.5. The parameter m indexes the middle of the distribution, and we want to learn about it by sampling two observations (that's all we can afford). So, we build a frequentist 50% confidence interval, which is easy: it is simply the interval between the minimum observation and the maximum observation. This will contain the parameter m 50% of the time. We thus have 50% "confidence" that m is in the interval, regardless of the actual values of the observations (frequentist techniques are defined by a concern for the long run properties of their techniques, not about what information they yield with any particular data).

What's silly about this is that if the difference between the minimum observation and the maximum observation is >0.5, we know for a fact that m is included within the interval. We thus have only 50% frequentist "confidence" in a conclusion which we know must be true. Since frequentist coverage doesn't lead to a result we can interpret as "evidence" (for any reasonable meaning of "evidence", at least) we can't take small confidence intervals per se as evidence that the null is true.
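
A small simulation makes the point concrete. This is only a sketch: the true value m = 0, the number of trials, and the random seed are arbitrary choices, and the function name is illustrative.

```python
import random


def simulate(m=0.0, trials=100_000, seed=1):
    """Draw two observations from Uniform(m - 0.5, m + 0.5) many times and
    check how often the interval [min, max] contains m."""
    rng = random.Random(seed)
    covered = wide = wide_and_covered = 0
    for _ in range(trials):
        a = rng.uniform(m - 0.5, m + 0.5)
        b = rng.uniform(m - 0.5, m + 0.5)
        lo, hi = min(a, b), max(a, b)
        inside = lo <= m <= hi
        covered += inside
        if hi - lo > 0.5:  # the two observations are far apart
            wide += 1
            wide_and_covered += inside
    return covered / trials, wide_and_covered / wide


overall, when_wide = simulate()
print(f"coverage over all samples:        {overall:.3f}")
print(f"coverage when max - min > 0.5:    {when_wide:.3f}")
```

Overall coverage comes out near 50%, as advertised, but among samples where the two observations are more than 0.5 apart the interval always contains m, even though the procedure still only claims 50% "confidence".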

Of course, insomuch as the frequentist CI agrees with the Bayesian credible interval, we can; but that is an accident of the particular modeling setup, not a general state of affairs.
posted by Philosopher Dirtbike at 7:10 AM on January 19, 2011

We're talking about statistical methods of inference. The method of inference is what it is - there's nothing "autistic" in talking about the logical implications of an inferential theory.

If you don't understand that any method of inference can be taken too seriously in exactly the wrong ways, then we disagree so fundamentally that there's little point talking any more.
posted by ROU_Xenophobe at 7:47 AM on January 19, 2011

If you don't understand that any method of inference can be taken too seriously in exactly the wrong ways, then we disagree so fundamentally that there's little point talking any more.

No, I don't understand what you mean; but I'd be happy to entertain your argument if you have one to offer. Isn't that the point of these threads? Discussion?

The bottom line is, for me (as a statistician) if you don't take your methods "seriously", then you have no way of understanding the properties of the methods, and the interpretation of the quantities they generate. Without understanding the properties and output of your method, the epistemological justification for using the method evaporates.
posted by Philosopher Dirtbike at 8:17 AM on January 19, 2011

ROU_Xenophobe - I'd also like to add that, regarding your example of the CI, I'd agree that this generally represents evidence for the null; but the precise reason why that is so requires the use of Bayesian statistics. You are informally a Bayesian by using this logic. If you want to think like a Bayesian, you should adopt the formal mathematical framework.
posted by Philosopher Dirtbike at 8:30 AM on January 19, 2011

Philosopher Dirtbike, you're just playing with words:
What I was responding to:
Our critic also notes that in order to compute the p value, we counted outcomes that did not occur against the null hypothesis. We computed the probability of an event as extreme or more extreme.

The more extreme events are not counted "against" the null hypothesis. They were added to p, which, however you want to word it, is most definitely for the null hypothesis. I get that you don't accept the null hypothesis, you fail to reject it. My point is that the original objection isn't a failure of semantics, it's just backwards in concept.

These sorts of discussions remind me a lot of the objections raised by philosophers of science. They'll say something along the lines of "you can't base science on the concept of falsifiability" and if you ever trap them into expanding on that, it boils down to something like, "hypotheses aren't falsifiable on their own, but only within a framework of theory". No shit, thanks for that insight.

Scientists want to know if an effect they observe is real, and important. There are a number of tests you can run that can shed insight on those questions. There is no test that can answer it. That always requires human judgement, it requires knowledge of the field, as well as understanding of the mind set of the human scientists you're going to be communicating with. No one I know sees something with p = 0.4 and takes that particularly seriously, we just generally refrain from even bothering to mention things with p = 0.6. Even a really small p-value with a marginal effect size is not really interesting. Working scientists understand this, which is what ROU_Xenophobe was getting at. Certain kinds of mathematicians can't get past this idea (As a former math grad student, I've certainly heard math professors express amazement that scientists get anything done since they never prove anything). That's okay. You guys stay in your building, and I'll stay in mine.

There are real problems with p-value calculation that biologists can be made aware of and, who knows, maybe get better at. Mining data for hypotheses is easy to abuse; it's even easy to mine for hypotheses accidentally sometimes. Also, most scientists I know are too lazy to determine the N they need before doing their analysis to make sure they have sufficient power. Those things can be fixed, but they're never going to stop calculating p-values, so you can forget about that dream.
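For anyone who wants to see what "determine the N you need" looks like in practice, here is a minimal sketch. It assumes a two-sided, two-sample comparison of means and uses the normal approximation; the function name `n_per_group` and the default alpha and power are illustrative choices, not anything stated in the thread.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (mean difference / common SD).
    This is the normal approximation, so a t test needs slightly more.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Halving the effect size you hope to detect roughly quadruples the N required.
for d in (0.8, 0.5, 0.25):
    print(d, n_per_group(d))
```

Running this before collecting data is the cheap part; the hard part is committing to an effect size worth detecting.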
posted by Humanzee at 8:41 AM on January 19, 2011 [1 favorite]

Oops, obviously I meant p = 0.04 and p = 0.06 in the above examples.
posted by Humanzee at 8:42 AM on January 19, 2011

I'll try, then.

Let's take a simple null that a regression coefficient is 0 in a two-tailed test.

If we want to take it too seriously in the wrong way, we would note that this means that the null is that it is zero. Exactly. Deviating from zero not in the slightest degree whatsoever; it is not epsilon or -epsilon, it is exactly at the point value 0.

Lots of things are or can be structured about this. Classical hypothesis testing is built around rejecting that null at some alpha level. Other work generates p-values against that null.

But, no sane and skilled practicing scientist would (in the absence of very, very strong theory) ever assert that they're really interested in testing that the coefficient is anything different, even in the slightest degree, than zero. Almost all coefficients in the real world, even for relationships that by any reasonable definition don't exist, are going to be something other than exactly zero. "Zero" in this case is the operationalized version of "The coefficient is negligible."

In that case, a CI that's entirely near zero supports the null, because it tells you that the coefficient is pretty sure to be negligible. No Bayesian statistics needed: this CI says 0.001-0.004, and even the upper bound is substantively irrelevant, so this variable doesn't seem to matter.
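The reasoning above can be sketched in a few lines. This is only an illustration: the estimate, standard error, and the `threshold` below which a coefficient counts as negligible are all hypothetical numbers, and choosing that threshold is exactly the substantive judgement the argument says you cannot delegate to a test.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval for a coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def is_negligible(interval, threshold):
    """True only if the whole interval lies inside (-threshold, threshold)."""
    low, high = interval
    return -threshold < low and high < threshold


# Hypothetical coefficient: tightly estimated and tiny (CI roughly 0.001 to 0.004).
tight = confidence_interval(0.0025, 0.00075)
# Same point estimate, but so noisy that the data say very little.
wide = confidence_interval(0.0025, 0.05)

print(tight, is_negligible(tight, threshold=0.05))
print(wide, is_negligible(wide, threshold=0.05))
```

The second case is the one to watch: a non-significant coefficient with a wide interval is not support for the null, it is just a lack of information.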
posted by ROU_Xenophobe at 9:26 AM on January 19, 2011

More broadly, the answer to a problem of finding support for the null is not to escalate statistical methods to try to support it with the same data.

If you think the null is right, then you must have some theoretical reason why. You must have some process in mind such that you expect that effect to be negligible.

The right way to deal with this is to go back to theory. In a regression oriented world like I live in, you have some theory that tells you that b=0 in the normal sense. So... what are the theoretical consequences of b=0? If b=0, what else will be true in the world? What are the other implications of your theory that tells you that b=0? Then GO AND LOOK FOR THOSE THINGS. Show me other things that are consistent with whatever you think is actually going on in the world.

Similarly, the paper you were kind enough to link to writes that "The main challenge of hypothesis testing or model selection is to identify the model with the best predictive performance."

This is simply incorrect as far as scientific inference goes. Deeply, fundamentally, shockingly wrong to a practicing (social) scientist like me.

There is a right model to test in the purest of all possible worlds, and that model is THE MODEL THAT IS IMPLIED BY YOUR THEORY. No other. The goal is not to fit the data, not to maximize explanatory power, the goal is to test the implications of your theory.

What variables should you include in your analysis? All of those that are implied by your theory as it relates to this particular implication. Which should you exclude? All others.

In the "real" academic world, which is far from pure, we have to make the wicked and sinful accommodations of including other variables not clearly implied by the theory that the existing literature insists on; this is not a good thing.
posted by ROU_Xenophobe at 9:43 AM on January 19, 2011 [1 favorite]

The more extreme events are not counted "against" the null hypothesis. They were added to p, which, however you want to word it, is most definitely for the null hypothesis. I get that you don't accept the null hypothesis, you fail to reject it. My point is that the original objection isn't a failure of semantics, it's just backwards in concept.

Yes, that should have read "we counted outcomes that did not occur when computing the evidence against the null hypothesis." The point is the same. A computation of evidence provided by data should include only the data observed.
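To make the "outcomes that did not occur" point concrete, here is a small sketch that splits a one-sided binomial p value into the part contributed by the observed outcome and the part contributed by more extreme outcomes that were never observed. The data (9 successes in 12 trials against a null of 0.5) are a made-up example, not from the study under discussion.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def upper_tail_p(k_obs, n, p=0.5):
    """Return (observed part, unobserved more-extreme part, one-sided p value)."""
    observed = binom_pmf(k_obs, n, p)
    more_extreme = sum(binom_pmf(k, n, p) for k in range(k_obs + 1, n + 1))
    return observed, more_extreme, observed + more_extreme


observed, more_extreme, p_value = upper_tail_p(9, 12)
print(f"P(X = 9)  = {observed:.4f}")      # the outcome that actually happened
print(f"P(X > 9)  = {more_extreme:.4f}")  # outcomes that did not happen
print(f"p value   = {p_value:.4f}")
```

About a quarter of this p value comes from results of 10, 11, or 12 successes, none of which were seen.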

And no, bigger p values are not evidence for the null hypothesis, unless you are a Bayesian. Why? Because under the null hypothesis, p values are uniformly distributed across (0,1). If the null hypothesis is true, then you are as likely to get a p value in (0.95,1) as you are in (0.05,0.1). Under the logic of significance testing, bigger p values are not, in themselves, evidence for the null hypothesis. This is a basic fact about significance testing, and is not controversial.
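The uniformity claim is easy to check by simulation. The sketch below assumes a two-sided z test of a zero mean with known unit variance, run on data generated with the null exactly true; the number of simulated studies, the sample size, and the seed are arbitrary.

```python
import random
from statistics import NormalDist


def simulate_null_p_values(n_studies=20000, n=20, seed=1):
    """Two-sided z-test p values for studies in which the null is exactly true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_studies):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5  # known sigma = 1
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def share_in(p_values, low, high):
    """Fraction of p values falling strictly between low and high."""
    return sum(low < p < high for p in p_values) / len(p_values)


p_values = simulate_null_p_values()
# Both intervals have width 0.05, so each should hold about 5% of the p values.
print("share in (0.95, 1.00):", share_in(p_values, 0.95, 1.00))
print("share in (0.05, 0.10):", share_in(p_values, 0.05, 0.10))
```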

How can this be? We all feel that bigger p values should be more evidence for the null, right? That's because everyone reasons roughly like a Bayesian, even when they are using significance testing. Under any reasonable alternative, larger p values become less likely. So, comparing the null hypothesis to any alternative given a big p value will lead to support for the null. (Strictly speaking, this isn't Bayesian yet, because you can do this with likelihood ratios; but Bayesians are the only ones who are able to interpret the likelihood ratio directly, in my opinion.)
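Here is a rough sketch of that comparison as a likelihood ratio. It assumes a two-sided z test and a single point alternative under which the z statistic is shifted by `delta`; both the setup and the value `delta = 2.5` are illustrative assumptions. Under the null the density of the p value is flat, so the ratio below is just the density of the p value under the alternative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def likelihood_ratio_alt_vs_null(p_value, delta):
    """Likelihood of a two-sided p value under the alternative relative to the null.

    Under the alternative, the z statistic is assumed to be Normal(delta, 1).
    Values above 1 favor the alternative; values below 1 favor the null.
    """
    z = STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| that produces this p value
    under_alt = STD_NORMAL.pdf(z - delta) + STD_NORMAL.pdf(z + delta)
    under_null = 2 * STD_NORMAL.pdf(z)
    return under_alt / under_null


for p in (0.01, 0.05, 0.50, 0.90):
    print(p, round(likelihood_ratio_alt_vs_null(p, delta=2.5), 3))
```

With this alternative, small p values give ratios well above 1 and large ones give ratios well below 1. It is the comparison to an alternative, not the size of p on its own, that lets a big p value count in favor of the null.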

The main point is that your objection actually informally supports the Bayesian point of view.

Scientists want to know if an effect they observe is real, and important. There are a number of tests you can run that can shed light on those questions. There is no test that can answer them. That always requires human judgement, it requires knowledge of the field, as well as understanding of the mindset of the human scientists you're going to be communicating with.

In addition to being a statistician, I am also a scientist (a cognitive psychologist, actually). I know what scientists want when they analyze data. Just because no tests can answer the question completely, doesn't mean that all tests are equal. They are not. Significance testing has been criticized for a century, but you know why people like it? It isn't because it is the best. It is because significance testing is easy, and it gives scientists the illusion that they are computing meaningful measures of evidence, in spite of loads of evidence that that simply isn't the case.

You say "working scientists" understand how to interpret a p value, but I review plenty of papers and go to conferences and read the literature, and that's simply not true. I've seen plenty of over interpreted, small p values, and you know what? p=0.04 is a good amount of evidence, if the sample size is small, as is p=0.06. The criterion p<0.05 doesn't make sense, because whether this is convincing depends on the sample size, as you understand. And not just that, but it depends on sample size in a particular way that is describable using Bayesian statistics. The funny thing about this argument, again, is that you are arguing that "working scientists" are informally Bayesians. So they don't need Bayesian statistics. That's kind of a weird argument.
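One way to see how the evidence behind a fixed p value depends on sample size is to convert it to a Bayes factor. The sketch below is an illustration under assumptions of my own choosing: a normal mean with known standard deviation `sigma`, a point null at zero, and a Normal(0, `prior_sd`^2) prior on the mean under the alternative. Different priors give different numbers, but the trend is similar.

```python
import math
from statistics import NormalDist


def bf01_for_fixed_p(p_value, n, prior_sd=1.0, sigma=1.0):
    """Bayes factor for the null over the alternative, given a two-sided p value.

    Values above 1 favor the null; values below 1 favor the alternative.
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)  # z statistic implied by the p value
    r = n * prior_sd ** 2 / sigma ** 2         # prior variance relative to sampling variance
    return math.sqrt(1 + r) * math.exp(-0.5 * z ** 2 * r / (1 + r))


# Same p = 0.04 at every sample size, very different evidence.
for n in (10, 100, 1000, 10000):
    print(n, round(bf01_for_fixed_p(0.04, n), 2))
```

Under these assumptions, p = 0.04 from a small sample leans toward the alternative, while the same p from a very large sample leans toward the null. That is why a single cutoff like p < 0.05 cannot mean the same thing across studies.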

Scientists can get far with just means and standard errors, and the power of replication. So fairly good science continues to get done in spite of p values, but they sure cloud the waters, due to the nasty properties of p values. The sooner p values are replaced (and they are beginning to be) the better. The Bem "findings", hopefully, will be one of the drivers of that replacement.

posted by Philosopher Dirtbike at 9:47 AM on January 19, 2011

No Bayesian statistics needed: this CI says 0.001-0.004, and even the upper bound is substantively irrelevant, so this variable doesn't seem to matter.

In many commonly applied situations, the CI is an approximate Bayesian credible interval, so that's nice. ;>

In the larger sense, frequentist and Bayesian (and "likelihoodists" to coin a term) are all trying to do the same thing, but deal with the necessary assumptions in a different way. Any Bayesian procedure can be evaluated as a frequentist procedure (and vice versa), but the differences lie in how the assumptions are presented (or not). There is a set of prior distributions that are implied by a particular significance test (or confidence interval construction) if it is considered a Bayesian rule (or credible interval). Any Bayesian rule can be described by its frequentist Type I and Type II error properties (or its coverage probabilities for a confidence interval). The decision boils down to emphasizing certain assumptions and the flexibility afforded by the model chosen. That's it. It really doesn't go any deeper.
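A small sketch of the correspondence, assuming the simplest case: a normal mean with known `sigma` and a conjugate normal prior. The numbers are hypothetical. With a nearly flat prior the 95% credible interval matches the usual 95% confidence interval; with an informative prior it does not, which is where the assumptions become visible.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def confidence_interval(xbar, sigma, n):
    """Frequentist 95% interval for a normal mean with known sigma."""
    half_width = Z95 * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width


def credible_interval(xbar, sigma, n, prior_mean=0.0, prior_sd=1e6):
    """Bayesian 95% interval under a Normal(prior_mean, prior_sd^2) prior."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * xbar)
    half_width = Z95 * post_var ** 0.5
    return post_mean - half_width, post_mean + half_width


print(confidence_interval(0.3, 1.0, 25))
print(credible_interval(0.3, 1.0, 25))                # nearly flat prior
print(credible_interval(0.3, 1.0, 25, prior_sd=0.1))  # informative prior pulls toward 0
```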
posted by Mental Wimp at 10:06 AM on January 19, 2011

ROU_Xenophobe, we've got a bit of cross-talk here regarding methods. There is a distinction between parameter estimation and hypothesis testing. And there is another distinction between Bayesian methods and significance testing. You can reject hypothesis testing, on the basis that point hypotheses are unreasonable, and still be a Bayesian.

But, no sane and skilled practicing scientist would (in the absence of very, very strong theory) ever assert that they're really interested in testing that the coefficient is anything different, even in the slightest degree, than zero.

But scientists do this every day. There's a ton of literature trying to NOT get them to do this (I just submitted a paper today on this very issue, as a matter of fact). Since the p value is computed given that the null hypothesis is true, if we assume that null hypotheses can never be true, two things become clear: first, the p value is meaningless. It is conditioned on an impossible event. Second, there was no reason to compute the p value in the first place, since you've rejected the null hypothesis a priori.

In that case, a CI that's entirely near zero supports the null, because it tells you that the coefficient is pretty sure to be negligible. No Bayesian statistics needed: this CI says 0.001-0.004, and even the upper bound is substantively irrelevant, so this variable doesn't seem to matter.

But how do you interpret the CI? You are implicitly interpreting the CI as a credible interval (that is, it tells you what you should believe given particular data). In other words, you are using Bayesian logic, while at the same time denying that Bayesian methods are needed. Bayesian methods are needed, because you are using Bayesian logic informally, and you should be using it formally. For frequentist CIs, the move from the method (that is, constructing the CI) to the inference ("what do these data tell me?") is left completely unsaid. Bayesian methods make this explicit and formal.

To me, this issue above is critical. The numbers you compute must mean something, and that meaning must be related to the inference desired. If they are not, the justification for using the method is simply not there, and our methods simply become numerology.

Similarly, the paper you were kind enough to link to writes that "The main challenge of hypothesis testing or model selection is to identify the model with the best predictive performance."

This is simply incorrect as far as scientific inference goes. Deeply, fundamentally, shockingly wrong to a practicing (social) scientist like me.

I think that if you delved deeper into the idea of "predictive performance", you'd find that you actually agree with it more than at first glance. The "implications of your theory", as you say, are predictions of future data. That's exactly what is meant. There's a lot of technical stuff behind quantifying predictive performance (and arguments about what "true models" are). But it sounds like you've got the same idea about model testing; you're just unfamiliar with couching it in the language of "prediction".

What variables should you include in your analysis? All of those that are implied by your theory as it relates to this particular implication. Which should you exclude? All others.

I'm not sure I understand what you mean here. If we want to test a model, we would typically throw in some variables that we don't think are related to the dependent variable, (especially if a rival theory believes they should affect the dependent variable) and test to see whether our expectations, or the rivals', are confirmed. The specifics of how you do that depend on whether you are Bayesian or frequentist, but the logic is similar.

Mental Wimp: The decision boils down to emphasizing certain assumptions and the flexibility afforded by the model chosen. That's it. It really doesn't go any deeper.

This is technically true, but one of the main problems is that people choose a method that emphasizes one thing (controlling Type I error rate) and use it for something else (quantifying statistical evidence). Is Type I error rate control important in science? I agree with Fisher here; Type I error rate control is for shopkeepers and factories, not science. The goal in science should be quantification of evidence, and a focus on what we should believe given data. Bayesian statistics provides the latter.
posted by Philosopher Dirtbike at 10:50 AM on January 19, 2011 [2 favorites]

Type I error rate control is for shopkeepers and factories, not science.

And brewers. God bless William Sealy Gosset.
posted by Mental Wimp at 1:57 PM on January 19, 2011

I mean, William Sealy Gosset.
posted by Mental Wimp at 2:03 PM on January 19, 2011


This thread has been archived and is closed to new comments
