Selecting a 'better' threshold for p-values being 'statistically significant'

post by freedomandutility · 2022-01-07T11:23:42.943Z · EA · GW · 3 comments

I suspect this is true for some other fields, but medical research relies very heavily on the arbitrary p < 0.05 threshold for concluding whether interventions in RCTs are effective or not.

If there is good reason to use a different threshold than p < 0.05, switching to this new threshold would improve biomedical research. With the US National Institutes of Health alone spending $37.3 billion annually on biomedical research, improving the efficiency of biomedical research seems to have very high expected value, despite it not being 'neglected'.

A lower threshold (e.g. p < 0.01) would mean that we conclude that fewer interventions are effective. This would reduce our false positive rate but increase our false negative rate.

A higher threshold (e.g. p < 0.1) would mean that we conclude that more interventions are effective. This would reduce our false negative rate but increase our false positive rate.
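To make this tradeoff concrete, here is a quick simulation sketch (not from the post; it assumes two-arm RCTs with normally distributed outcomes, a hypothetical true effect size of 0.5 for the effective interventions, and 50 patients per arm):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_patients = 1000, 50

def run_trials(effect):
    """Simulate two-arm RCTs and return the p-value of each."""
    pvals = []
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_patients)
        treated = rng.normal(effect, 1.0, n_patients)
        pvals.append(stats.ttest_ind(treated, control).pvalue)
    return np.array(pvals)

null_p = run_trials(effect=0.0)  # interventions with no real effect
real_p = run_trials(effect=0.5)  # interventions with a true effect

for alpha in (0.01, 0.05, 0.10):
    fp = np.mean(null_p < alpha)   # false positive rate (tracks alpha)
    fn = np.mean(real_p >= alpha)  # false negative rate (missed effects)
    print(f"threshold {alpha:.2f}: FP {fp:.3f}, FN {fn:.3f}")
```

Raising the threshold from 0.01 to 0.10 shifts errors from false negatives to false positives, which is exactly the tension described above.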

This can make it look like there is no way for one threshold to be 'better' than another.

But that is hard to accept when we look at extremes - surely a p < 0.99 threshold is worse than a p < 0.05 threshold.


I think a better threshold would simply be one that is less arbitrary. Given that the current threshold is essentially 100% arbitrary, just loosely rooting a new threshold in empirical evidence would make it better than the current one.


Here is how I think this can be achieved:


(Obviously, it is impossible to literally be 100% certain about the effect or lack of effect of an exposure based on reason / theoretical considerations, but I think this exercise is still useful even without the 100% certainty, because the bar for selecting a less arbitrary threshold is so low)


Comments sorted by top scores.

comment by tobycrisford · 2022-01-07T13:01:02.360Z · EA(p) · GW(p)

"deciding, based on reason, that Exposure A is certain to have no effect on Outcome X, and then repeatedly running RCTs for the effect of exposure A on Outcome X to obtain a range of p values"

If the p-values have been calculated correctly and you run enough RCTs, then we already know what the outcome of this experiment will be: p < 0.05 will occur 5% of the time, p < 0.01 will occur 1% of the time, and so on for every threshold between 0 and 1.
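The commenter's point is that p-values are uniformly distributed when the null hypothesis is true, so the proposed experiment has a known answer. A minimal sketch checking this (my own illustration, assuming two-arm RCTs with identical arms and a standard t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate RCTs where the exposure truly has no effect:
# both arms are drawn from the same distribution.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=100), rng.normal(size=100)).pvalue
    for _ in range(5000)
])

# Under the null, p-values are uniform on [0, 1], so each
# threshold is crossed in proportion to its value.
for alpha in (0.01, 0.05, 0.10):
    print(f"P(p < {alpha}) is roughly {np.mean(pvals < alpha):.3f}")
```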

The other way round is more interesting: it will tell you what the "power" of your test was, but that depends strongly on the size of the effect of B on X, as well as the sample size in your study. You'll probably miss something if you pick a single B and X pair to represent your entire field.
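How strongly power depends on effect size and sample size can be shown with a standard approximation (my own sketch, using the textbook two-sided two-sample z-test formula, not anything from the thread):

```python
import numpy as np
from scipy.stats import norm

def power_two_sample(effect, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized effect size `effect` with `n_per_arm` per group."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = effect * np.sqrt(n_per_arm / 2)
    return norm.cdf(noncentrality - z_crit) + norm.cdf(-noncentrality - z_crit)

# Power varies enormously across plausible effect sizes and sample sizes,
# so a single (B, X) pair can't characterize a whole field.
for effect in (0.2, 0.5, 0.8):
    for n in (25, 100, 400):
        print(f"d={effect}, n={n}/arm: power is about {power_two_sample(effect, n):.2f}")
```

A small effect with a small sample can have power near the false positive rate itself, while a large effect with a large sample is detected almost every time.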

I think the point is that any p-value threshold is arbitrary. The one you should use depends on context. It should depend on how much you care about false positives vs false negatives in that particular case, and on your priors. Also maybe we should just stop using p-values and switch to using likelihood ratios instead. Both of these changes might be useful things to advocate for, but I wouldn't have thought changing one arbitrary threshold to another arbitrary threshold is likely to be very useful.

Replies from: Matt_Sharp
comment by Matt_Sharp · 2022-01-07T20:19:08.930Z · EA(p) · GW(p)

"The one you should use depends on context. It should depend on how much you care about false positives vs false negatives in that particular case"

Yep, exactly! Suppose you're a doctor with a group of patients who have a disease that will definitely kill them tomorrow, and there is a new, very low-cost possible cure. Even if the only study of this possible cure shows a p-value of 0.2, you really should still recommend it!
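The expected-value logic behind this can be sketched with entirely hypothetical numbers (none of these figures appear in the comment; they just make the asymmetry explicit):

```python
# Hypothetical numbers illustrating the comment's decision logic.
cost = 10.0            # cost of the treatment per patient, in dollars
value_of_cure = 1e6    # value assigned to saving the patient's life
p_effective = 0.3      # subjective probability the treatment works,
                       # given one study with p = 0.2 plus your priors

# With no treatment the patient dies for certain, so the only
# comparison is expected benefit versus cost.
expected_benefit = p_effective * value_of_cure
recommend = expected_benefit > cost
print(recommend)  # True: weak evidence can justify a cheap, safe treatment
```

The point is that the right evidential bar depends on the payoffs, which no single universal p-value threshold can capture.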

comment by David_Moss · 2022-01-07T12:49:15.068Z · EA(p) · GW(p)

There's been a fair amount of discussion of this in the academic literature.