Dance of the p values
August 23, 2021 6:51 PM

"You could play this at parties, develop a dance for the dance of the p values and reflect on how ridiculous it is that p values are right at the centre of our thinking about drawing conclusions from research."
posted by clawsoon (27 comments total) 27 users marked this as a favorite
 
Ah, I remember coming across this video a few years ago. Wonderfully clear explanation, so much so that I forgive him for using Excel.

I take a certain pride in the fact that while my dissertation involved a lot of heavy statistics, I was able to make my arguments without using a single p-value. There is almost always a better way to make the point you want to make.
posted by biogeo at 7:41 PM on August 23, 2021 [4 favorites]


Ah, I remember coming across this video a few years ago. Wonderfully clear explanation, so much so that I forgive him for using Excel.
And Comic Sans - a true sign of content triumphing over form.

I also like "P-values Broke Scientific Statistics—Can We Fix Them?", which features the story of a dead salmon whose "neural activation" was measured while it was shown photos in an MRI scanner. And the original story of Fisher's tea with milk added first or second. Interesting that Fisher seems to have been rather forced into saying which levels of his test should and should not count as "significant" - once he did that, everybody just took those numbers.
posted by rongorongo at 11:06 PM on August 23, 2021 [3 favorites]


I think people misinterpreted Fisher's example - he used the 5%, or 1 in 20, case as one that was clearly not significant, and contrasted it with an example where p was 1 in 70, which would be sufficiently improbable. So people calling significance at .049 are not really getting the point.
posted by jamespake at 2:05 AM on August 24, 2021 [2 favorites]


Richard McElreath has a textbook, "Statistical Rethinking", that has been recommended as a way to learn Bayesian inference without reference to any p-values. This introductory lecture by the author is pretty fun: Bayesian Inference is Just Counting
posted by are-coral-made at 3:15 AM on August 24, 2021 [1 favorite]


But is 1/70 still convincing if 70 researchers are running the experiment, and only the one that gets a positive result will publish?
posted by starfishprime at 3:44 AM on August 24, 2021 [4 favorites]


"'Approaching significance.' How do you know it's not running away from it as hard as it can go?" I laughed way too loud at this for the hour.
posted by eirias at 4:47 AM on August 24, 2021 [8 favorites]


One thing that would've been nice - and if anybody has a video about it, I'd like to watch it - would have been if he had pointed out that if you see a whole bunch of n=32 studies with different effect sizes that all have p<0.05, you're almost certainly looking at publication bias, since p-values should jump around dramatically for small samples with moderate effect sizes.
posted by clawsoon at 5:47 AM on August 24, 2021 [9 favorites]
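
A rough back-of-the-envelope sketch of that point, using the numbers from the video (n = 32 per group, a true effect of 0.5 standard deviations) and base R's power.t.test; the "10 studies" figure is just an illustrative assumption:

pw <- power.t.test(n = 32, delta = 0.5)$power   # power of a single study of this size, roughly 0.5
pw^10                                           # chance that 10 independent such studies ALL reach p < .05: about 1 in 1000

So a literature in which every small study of a moderate effect comes out significant is itself wildly improbable without selective publication.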


The widespread acceptance of the p-value as a standard suggests that it strikes a balance between being too lax to be useful and being too strict to be practicable. In any particular case, it has to be taken along with sample size and replication. Probably, two studies of 100 samples in different labs make a more reliable result than one sample of 200.

I'd be more worried about using a Normal distribution-based method on decidedly non-Normal data than about the p-value methodology.
posted by SemiSalt at 7:04 AM on August 24, 2021 [1 favorite]


Question: Is the problem to do with a bad phrasing of the definition of p-values? I've always been told that the p-value is the probability of your experimental results assuming the null hypothesis is true. But that's not complete, is it? It's actually the probability of your experimental results assuming the null hypothesis is true *using your sample data*. Change the sample data and the p-value changes. Or am I misunderstanding something?
posted by storybored at 7:43 AM on August 24, 2021 [1 favorite]


storybored, the sample data is what is meant by "your experimental results." We're not talking about the qualitative interpretation of those results there; we're assessing how well the empirical distribution you got fits a particular theoretical distribution implied by your null hypothesis.
posted by eirias at 8:25 AM on August 24, 2021 [3 favorites]


Can confirm that Statistical Rethinking is a pedagogical masterpiece.
posted by MisantropicPainforest at 9:02 AM on August 24, 2021 [2 favorites]


Yeah, the fundamental problem with p-values is that they seem really similar to the thing that researchers actually care about, but in fact are completely different from it. A p-value is essentially the probability of observing the data that you observed, given that the hypothesis you have is wrong. This is not what anyone actually wants to know. What people want to know is: what is the probability that the hypothesis they have is true, given the data they observed? Symbolically, a p-value is P(data | I'm wrong), but people actually want to know P(I'm right | data). The reason people use p-values is that under a frequentist worldview it's possible to define them rigorously and unambiguously, whereas the second quantity, the thing that describes what people really care about, is difficult or impossible to define unambiguously. And so, as so often happens, researchers have chosen to pursue the thing that they know how to measure, even though it's not the thing they care about, and in so doing confuse themselves and others into believing that what they're measuring is actually what they care about.

One strategy for avoiding this problem is to go full Bayesian, and try to develop rigor around the question of P(hypothesis | data). Personally, I like Bayesianism because it renders explicit the assumptions in your hypothesis testing that frequentism usually hides implicitly: things like the prior distribution are really just formalizations of the same kinds of assumptions that frequentists make when doing null hypothesis significance testing, which people operating within the frequentist framework often don't even realize they're making. But you don't need to go full Bayesian to avoid the trap of p-values. As the video points out, the standard frequentist framework already provides much, much better tools for thinking about your results, such as confidence intervals. Instead of saying "My result is significant p<.05," you can say "My data are only weakly consistent with a hypothesis of no effect. The effect most consistent with my data is [my point estimate], but effects of [the lower bound of my 95% CI] ranging to [the upper bound of my 95% CI] are all generally consistent with my observations." Then you can go on to discuss the implications of various parts of the range of values spanning your confidence interval: for example, if you're testing a drug, how would you assess the overall efficacy if the true effect is at the low end of your CI, the high end of your CI, or your point estimate? Of course, doing so requires more work, more thinking about the implications of your data, and more willingness to accept uncertainty and ambiguity in scientific results, which is why very few people actually take this approach.
posted by biogeo at 9:26 AM on August 24, 2021 [10 favorites]
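
A minimal sketch of that reporting style in R, on made-up data (the group sizes and effect here are arbitrary illustrations, not anyone's real experiment):

set.seed(42)
control   <- rnorm(32)        # hypothetical control group
treatment <- rnorm(32, 0.5)   # hypothetical treatment group
fit <- t.test(treatment, control)
fit$estimate                  # the two group means; their difference is the point estimate
fit$conf.int                  # 95% CI: the range of effects generally consistent with the data
fit$p.value                   # the p-value, shown only for comparison

The confidence interval carries the same significance information the p-value does, plus an effect size and the uncertainty around it.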


Probably, two studies of 100 samples in different labs make a more reliable result than one sample of 200.

Even better is to combine the 100 samples in two different labs into a single study, and model the influence of the lab collecting the data directly as having a random effect on the data within a mixed-effects model. This lets you appropriately pool the data between the labs to narrow the confidence interval on your effects, while directly accounting for the fact that conditions and "researcher degrees of freedom" vary between labs, and can even let you say something about how much of an effect the lab where the data was collected had on the data itself. This approach is sometimes taken with meta-analyses, when the raw data from both labs is actually available, but unfortunately that's not always true.
posted by biogeo at 9:33 AM on August 24, 2021 [7 favorites]
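
A sketch of what that model could look like in R with the lme4 package, assuming a combined data frame with hypothetical columns outcome, treatment, and lab:

library(lme4)
# fixed effect of treatment, plus a random intercept for each lab
fit <- lmer(outcome ~ treatment + (1 | lab), data = combined_data)
summary(fit)    # the lab variance component says how much the labs differ
confint(fit)    # confidence interval on the pooled treatment effect

A random slope, (1 + treatment | lab), would additionally let the treatment effect itself vary by lab.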


Metafilter: in so doing confuse themselves and others into believing that what they're measuring is actually what they care about.
posted by sammyo at 11:00 AM on August 24, 2021 [1 favorite]


A p-value is essentially the probability of observing the data that you observed, given that the hypothesis you have is wrong.

Even worse, it's the probability of observing the data that you observed (or data even more extreme) if the data were actually being generated by a specific process you think is uninteresting. If you do a bad job of specifying that null process, both it and your idea can be wrong at the same time. I'm not sure what can really be done about this, though. It's not like people who are already misunderstanding, misusing, and playing sneaky tricks with p-values aren't going to misunderstand, misuse, and play sneaky tricks with Bayesian methods.

More than anything, the specific "dance of the p-values" video is about small experiments being underpowered. Even when there's a real but small effect, as in his simulation, it's often not going to be distinguishable from no effect with 64 observations. If you increase the n to 200, 100 in each group, there's no dance any more.
posted by GCU Sweet and Full of Grace at 11:43 AM on August 24, 2021 [3 favorites]
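
The power arithmetic behind that, again with base R's power.t.test (0.5 is the effect size used in the video's simulation):

power.t.test(n = 32, delta = 0.5)$power    # roughly 0.5: a coin flip whether the effect is detected
power.t.test(n = 100, delta = 0.5)$power   # roughly 0.94 with 100 per group
power.t.test(delta = 0.5, power = 0.9)$n   # roughly 85 per group needed for 90% power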


Surprised no one has linked this classic XKCD yet.
posted by TedW at 12:57 PM on August 24, 2021


I remember a Vox or FiveThirtyEight cardstack that allowed you to play with the data and see the p-hacking in real time. I tried finding it, but I could not. Vox has shut down their cardstack page, for some reason.

My statistics use was more geared towards experimental design and curve fitting, and I only vaguely remember learning about p-value determination and null hypothesis testing etc., because I never really used them. Chi-square tests were about the most I did when dealing with unknown priors etc. So seeing the cardstack was eye-opening: it showed how you can get into trouble even if you are not being malicious or shady. So I can only imagine how easy it is when you are actively trying to obfuscate and mislead.
posted by indianbadger1 at 1:15 PM on August 24, 2021 [1 favorite]


But is 1/70 still convincing if 70 researchers are running the experiment, and only the one that gets a positive result will publish?

No, but you can't blame Fisher for the file drawer effect and Bayesian approaches aren't immune to this either.
posted by jamespake at 3:29 PM on August 24, 2021 [2 favorites]


Instead of saying "My result is significant p<.05," you can say "My data are only weakly consistent with a hypothesis of no effect. The effect most consistent with my data is [my point estimate], but effects of [the lower bound of my 95% CI] ranging to [the upper bound of my 95% CI] are all generally consistent with my observations." Then you can go on to discuss the implications of various parts of the range of values spanning your confidence interval

And then Reviewer #2 and/or your manager and/or your client all say “But what’s the p-value? Is it significant or not?” because that’s the only way they know to draw a bright line and make a decision. And then your Bayesian heart breaks.

Not that I know this from experience or anything.
posted by snowmentality at 4:25 PM on August 24, 2021 [5 favorites]


There's a more fundamental problem with how many people use p-values: It is often the case that the data were not generated or collected as the result of any random design. For example, a survey of people that uses a sample of convenience from whoever is available (i.e., college students who volunteer), rather than a random sample from any well-defined population.

What is the meaning of a p-value when there is no random sample, or no randomization into a control group versus a treatment group? It's usually meaningless, especially the way it gets calculated and presented. There's a statistical/probabilistic model underlying the data, but the model has no connection to the real world conditions under which the data were collected.
posted by mikeand1 at 6:22 PM on August 24, 2021 [2 favorites]


I'm starting to get a feeling of fragile understanding on this.
So now one further question: In the video, the p-values dance and jump around. What can we say about the distribution of the p-values generated? Is the concept of p-value distribution useful at all in determining significance of experimental results if we repeat an experiment many times?
posted by storybored at 6:59 PM on August 24, 2021 [1 favorite]


What can we say about the distribution of the p-values generated? Is the concept of p-value distribution useful at all in determining significance of experimental results if we repeat an experiment many times?

I don't know a lot about it, but I do know that what you're talking about has been used in meta-analyses to detect patterns of file-drawer effects (negative results don't get published) and p-hacking (a bunch of different analyses are tried until one yields a statistically significant p<.05 result that can be published).

It looks like p-curve is the term you're looking for to find out more. (Caveat: I have not watched the video in the link and can't speak to its quality or clarity.)
posted by clawsoon at 7:22 PM on August 24, 2021 [1 favorite]
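
A minimal simulation sketch of the shape p-curve analyses look at: under the null, p-values are uniform, so the significant ones are spread roughly evenly between 0 and .05, whereas under a real effect they pile up near zero (the sample size and effect size are just the video's values, reused for illustration):

set.seed(1)
p_null   <- replicate(10000, t.test(rnorm(32), rnorm(32))$p.value)        # no true effect
p_effect <- replicate(10000, t.test(rnorm(32), rnorm(32, 0.5))$p.value)   # true effect of 0.5 SD
hist(p_null[p_null < .05], breaks = 10)      # roughly flat
hist(p_effect[p_effect < .05], breaks = 10)  # piled up near zero ("right-skewed" in p-curve terms)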


...also this person argues that p-curves are less useful for detecting misuse of statistics than we'd like to think, and investigates the most productive methods of p-hacking. (Short version: Adding dependent variables is the best use of your time as a p-hacker.)
posted by clawsoon at 7:28 PM on August 24, 2021 [2 favorites]


The core problem is that most people who are interested enough in specific scientific topics to pursue them as careers are not necessarily interested in becoming hardcore statisticians.
posted by srboisvert at 2:41 AM on August 25, 2021 [1 favorite]


What can we say about the distribution of the p-values generated?

It depends on a lot. Here's an R script that'll generate a density plot of p-values:
trials <- 100       # number of simulated experiments
n <- 32             # size of each group in each experiment (value from the video)
difference <- 0.5   # true difference between group means (value from the video)
output <- matrix(nrow = trials, ncol = 1)
count <- 1

while (count < trials + 1) {
    low <- rnorm(n)                              # "control" group
    high <- rnorm(n, difference)                 # "treatment" group, shifted by the true difference
    output[count] <- t.test(low, high)$p.value   # p-value from a two-sample t-test
    count <- count + 1
}
plot(density(output))
posted by GCU Sweet and Full of Grace at 4:27 AM on August 25, 2021 [1 favorite]


srboisvert: The core problem is that most people who are interested enough in specific scientific topics to pursue them as careers are not necessarily interested in becoming hardcore statisticians.

I think that's true up until you run into bad intentions or perverse incentives. For the first case, the data fraud story from the other day wouldn't have been so easy to spot if the fraudster weren't so laughably bad at statistics. This video has some interesting things to say about the second case of perverse incentives:
...which in lab experiments in social psychology and even some branches of experimental economics happens. Like, massively underpowered analyses. For me as an outsider to this literature I've always asked myself - and I think that's part of the motivation for what they're doing - why don't these folks do half as many experiments and have twice as big a sample size to actually say something definitive? But of course if you're in a world of false positive results that's the last thing you want to do.

You want to have the possibility of false positives and big samples will kill all your zero results, so you want small samples and you live on the sampling variation and publish off that.
If you're swimming in a sea of false positives and your career chances are driven by them, more knowledge of statistics gives you more ways to be tempted into turning to the dark side, a.k.a. standard practice in your discipline.

Of course, we hope that those motivations apply to only a small portion of scientists, and that most scientists will be motivated by more statistics knowledge to do better studies.
posted by clawsoon at 6:39 AM on August 25, 2021 [3 favorites]


Thanks for the p-curve links, clawsoon. Intriguing. There are some deep woods there: two contrary opinions about the use of the curves - and these are the experts! Given how tricky this all is, it'd be nice if there were a statistical ratings agency that could grade studies according to their integrity. (Reminds me to take a look at cochrane.org to see what their policy is on p-values.)
posted by storybored at 7:40 AM on August 25, 2021 [2 favorites]



