In this post I might come across as sounding dumb… but it won’t be the first or last time that happens. I came across this article about how scientists can’t even explain p-values in an understandable way. I know the students and residents I teach understand the p-value only as an indicator of statistical significance. It then occurred to me, as I thought about the definition, that maybe we should get rid of p-values altogether and let users decide if what they are seeing is clinically significant.
What is a p-value?
First, we have to understand one other concept: the null hypothesis. When doing a study, researchers have to construct a null hypothesis that their study will then try to reject (a study can never prove the null hypothesis, only fail to reject it). For a superiority trial, the null hypothesis is that there is no difference between the intervention and control treatments. For an equivalence trial, the null hypothesis is that there is a difference between the intervention and control treatments. Data from each arm of the study are compared with an appropriate statistical test, yielding a test statistic (e.g. a t statistic if using a t-test). That test statistic is compared to a table of critical values to determine whether your calculated value is greater than the critical value required to reject the null hypothesis at the chosen alpha level. If it is, you reject the null hypothesis and say the finding is statistically significant. If not, you fail to reject the null hypothesis and the finding is not statistically significant.
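To make the "compare the test statistic to a critical value" step concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it uses a normal (z) approximation instead of looking up a t table, and the function names `z_statistic` and `reject_null` are made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def z_statistic(a, b):
    """Difference in group means divided by its estimated standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def reject_null(a, b, alpha=0.05):
    """Two-sided decision: is |z| larger than the critical value for alpha?

    Assumption: the normal approximation stands in for a t table, which is
    only reasonable when both groups are fairly large.
    """
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return abs(z_statistic(a, b)) > critical
```

With small groups a real analysis would use the t distribution, which has a slightly larger critical value than the normal approximation used here.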
P-values are simply probabilities. A p-value is the probability of finding what was found in the study (or an even bigger finding) if the null hypothesis were true.
Here’s an example. Let’s say I conduct a study to examine the effect of reading for 30 minutes each day on medical certification examination scores. My null hypothesis is that reading will not improve exam scores. I randomize one group of participants to read for 30 minutes every day and the control group to no reading. Both groups take the certifying examination and I calculate the mean score for each group. I can compare these means with a t-test (assuming the requirements of a parametric test are met), and I find the mean examination score is 5 points higher in those who read for 30 minutes than in those who didn’t, with a p-value of 0.04. So, this p-value of 0.04 means that there is a 4% probability that you would see a mean score at least 5 points higher in the reading group given that there is no effect of reading on exam scores. What if the p-value was 0.2? Then there is a 20% probability that you would see a mean score at least 5 points higher in the reading group given that there is no effect of reading.
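One way to see that definition in action is to simulate the null hypothesis directly. The sketch below is a permutation test, which is a swapped-in technique and not the t-test from the example: if reading has no effect, the group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference is at least as large as the one observed. The function name `permutation_p_value` and its parameters are hypothetical.

```python
import random


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Estimate a one-sided p-value by shuffling group labels.

    Returns the fraction of shuffles in which the difference in means
    (treated minus control) is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    n = len(treated)
    pooled = list(treated) + list(control)
    observed = sum(treated) / n - sum(control) / len(control)

    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the labels were assigned at random
        fake_treated, fake_control = pooled[:n], pooled[n:]
        diff = sum(fake_treated) / n - sum(fake_control) / len(fake_control)
        if diff >= observed:
            hits += 1
    return hits / n_shuffles
```

Calling `permutation_p_value(reading_scores, control_scores)` on two lists of exam scores (names assumed here) gives the share of "no effect" worlds that look at least as favorable to reading as the real data did, which is what the p-value is meant to describe.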
A common mistake pointed out by statisticians is for someone to interpret the p-value as the probability of what was seen in the study being due to chance alone. I think many people think of it this way because it’s easy to comprehend. But chance isn’t the only explanation for a false positive finding in a study.
All this is very confusing, right? Exactly. So, if no one really understands p-values, is it time to abandon them and the concept of statistical significance? After all, the 0.05 cutoff is just a tradition. Is 0.07 all that different? Or 0.051?
I know that if I am suggesting we get rid of the p-value, I have to suggest an alternative. The only one I can think of is the confidence interval, but its statistical definition is confusing and not clinically useful. So, should we abandon confidence intervals too? Both the p-value and the confidence interval give useful information, but if that information is interpreted incorrectly, can it really be useful?
What should be the role of the p-value and the confidence interval? Maybe we just need to better educate users of the scientific literature on what these values can and cannot tell us. Then all of this would be moot.
What do you think?
I agree that the designation of 0.05 as the ‘definition’ of significance is arbitrary. I think that designating a p-value cutoff a priori is one way of keeping everyone honest about the results. If a researcher decides in advance to use 0.05 as the level of significance, they can’t, after the fact, say that a result with a p-value of 0.051 was ‘almost significant’. On the other hand, payers should not say that a result with a p-value of 0.049 was ‘almost not significant’. I have heard both.
In either case, some additional assessment should be made about clinical significance. Perhaps also reporting a standard minimal clinically important difference (MCID) would help to inform the results. For example, a very large study comparing two lipid-lowering therapies might be statistically significant in favor of one therapy even though the absolute difference is very small (e.g. < 1 mg/dL).
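A rough sketch of that last point, using made-up numbers and a normal approximation: the same small difference can be far from significant in a small study and highly significant in a very large one. The function name `two_sided_p`, the 0.5 mg/dL difference, the standard deviation of 10 mg/dL and the sample sizes are all assumptions chosen for illustration, not data from any trial.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_arm):
    """Two-sided p-value for a difference in means between two equal arms.

    Assumes both arms share the same standard deviation and that the
    normal approximation is adequate.
    """
    se = sd * sqrt(2 / n_per_arm)  # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical inputs: a 0.5 mg/dL difference with an SD of 10 mg/dL.
small_study = two_sided_p(diff=0.5, sd=10, n_per_arm=100)
large_study = two_sided_p(diff=0.5, sd=10, n_per_arm=50_000)
```

Here `small_study` comes out well above 0.05 and `large_study` well below it, even though the difference is the same in both and would sit under any MCID larger than 0.5 mg/dL.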