If no one can give an understandable definition of p-value why do we use it? Or quit saying statistically significant.

In this post I might come across as sounding dumb… but it won’t be the first or last time that will happen. I came across this article about how scientists can’t even explain p-values in an understandable way. I know the students and residents I teach only understand the p-value as indicating statistical significance. It then occurred to me, as I thought about the definition, that maybe we should get rid of p-values altogether and let users decide if what they are seeing is clinically significant.

What is a p-value?

First, we have to understand one other concept: the null hypothesis. When doing a study, researchers have to construct a null hypothesis that they will attempt to reject (or fail to reject) with their study. For a superiority trial, the null hypothesis is that there is no difference between the intervention and control treatments. For an equivalence trial, the null hypothesis is that there is a difference between the intervention and control treatments. Data from each arm of the study are compared with an appropriate statistical test, yielding a test statistic (e.g. a t statistic if using a t-test). That test statistic is compared to a table of critical values to determine whether your calculated test statistic is greater than the critical value required to reject the null hypothesis at the chosen alpha level. If it is, you reject the null hypothesis and say the finding is statistically significant. If not, you fail to reject the null hypothesis and the finding is not statistically significant.

P-values are simply probabilities. They are the probability of finding what was found in the study (or an even bigger finding) if the null hypothesis were true.

Here’s an example. Let’s say I conduct a study to examine the effect of reading for 30 min each day on medical certification examination scores. My null hypothesis is that reading will not improve exam scores. I randomize one group of participants to read for 30 min every day and the control group to no reading. Both groups take the certifying examination and I calculate mean scores for each group. I can compare these means with a t-test (assuming the requirements of a parametric test are met), and I find the mean examination score is 5 points higher in those who read for 30 minutes compared to those who didn’t, with a p-value of 0.04. So, this p-value of 0.04 means that there is a 4% probability that you would see at least a 5 point higher mean score in the reading group given that there is no effect of reading on exam scores. What if the p-value was 0.2? Then there is a 20% probability that you would see at least a 5 point higher mean score in the reading group given there is no effect of reading.
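To make that definition concrete, here is a minimal sketch of the calculation. The standard deviation (20 points) and group size (100 per arm) are numbers I made up for illustration, and I use a normal approximation rather than a full t-test, so treat it as a sketch and not as the analysis of any real study.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(mean_diff, sd, n_per_group):
    """Probability of a difference at least this large if the null is true.

    Normal approximation for a difference between two group means,
    assuming equal group sizes and a common standard deviation.
    """
    se = sd * sqrt(2 / n_per_group)      # standard error of the difference
    z = mean_diff / se                   # test statistic
    return 1 - NormalDist().cdf(z)       # upper-tail probability under the null


# Hypothetical inputs: 5-point difference, SD of 20, 100 readers and 100 controls.
p = one_sided_p_value(mean_diff=5, sd=20, n_per_group=100)
print(f"p = {p:.3f}")
```

With these made-up inputs the result lands close to the 0.04 in my example. Change the standard deviation or the group size and the same 5 point difference gives a very different p-value, which is part of why the p-value alone says little about clinical importance.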

A common mistake pointed out by statisticians is for someone to interpret the p-value as the probability of what was seen in the study being due to chance alone. I think many people think of it this way because it’s easy to comprehend. But chance isn’t the only explanation for a false positive finding in a study.

All this is very confusing, right? Exactly. So, if no one really understands p-values is it time to abandon them and the concept of statistical significance? After all, the 0.05 cutoff is just a tradition. Is 0.07 all that different? Or 0.051?

I know if I am suggesting we get rid of the p-value I have to suggest an alternative. The only one I can think of is the confidence interval, but its statistical definition is confusing and not clinically useful. So, should we abandon confidence intervals too? Both the p-value and the confidence interval give useful information, but if that information is interpreted incorrectly can it really be useful?

What should be the role of the p-value and the confidence interval? Maybe we just need to better educate users of the scientific literature on what these values can and cannot tell us. Then all of this would be moot.

What do you think?

Is tinzaparin better than warfarin in patients with VTE and cancer or not?

The CATCH trial results were published this week in JAMA. The abstract is below. Do you think this drug is useful for venous thromboembolism (VTE) treatment?

Importance  Low-molecular-weight heparin is recommended over warfarin for the treatment of acute venous thromboembolism (VTE) in patients with active cancer largely based on results of a single, large trial.

Objective  To study the efficacy and safety of tinzaparin vs warfarin for treatment of acute, symptomatic VTE in patients with active cancer.

Design, Settings, and Participants  A randomized, open-label study with blinded central adjudication of study outcomes enrolled patients in 164 centers in Asia, Africa, Europe, and North, Central, and South America between August 2010 and November 2013. Adult patients with active cancer (defined as histologic diagnosis of cancer and receiving anticancer therapy or diagnosed with, or received such therapy, within the previous 6 months) and objectively documented proximal deep vein thrombosis (DVT) or pulmonary embolism, with a life expectancy greater than 6 months and without contraindications for anticoagulation, were followed up for 180 days and for 30 days after the last study medication dose for collection of safety data.

Interventions  Tinzaparin (175 IU/kg) once daily for 6 months vs conventional therapy with tinzaparin (175 IU/kg) once daily for 5 to 10 days followed by warfarin at a dose adjusted to maintain the international normalized ratio within the therapeutic range (2.0-3.0) for 6 months.

Main Outcomes and Measures  Primary efficacy outcome was a composite of centrally adjudicated recurrent DVT, fatal or nonfatal pulmonary embolism, and incidental VTE. Safety outcomes included major bleeding, clinically relevant nonmajor bleeding, and overall mortality.

Results  Nine hundred patients were randomized and included in intention-to-treat efficacy and safety analyses. Recurrent VTE occurred in 31 of 449 patients treated with tinzaparin and 45 of 451 patients treated with warfarin (6-month cumulative incidence, 7.2% for tinzaparin vs 10.5% for warfarin; hazard ratio [HR], 0.65 [95% CI, 0.41-1.03]; P = .07). There were no differences in major bleeding (12 patients for tinzaparin vs 11 patients for warfarin; HR, 0.89 [95% CI, 0.40-1.99]; P = .77) or overall mortality (150 patients for tinzaparin vs 138 patients for warfarin; HR, 1.08 [95% CI, 0.85-1.36]; P = .54). A significant reduction in clinically relevant nonmajor bleeding was observed with tinzaparin (49 of 449 patients for tinzaparin vs 69 of 451 patients for warfarin; HR, 0.58 [95% CI, 0.40-0.84]; P = .004).

Conclusions and Relevance  Among patients with active cancer and acute symptomatic VTE, the use of full-dose tinzaparin (175 IU/kg) daily compared with warfarin for 6 months did not significantly reduce the composite measure of recurrent VTE and was not associated with reductions in overall mortality or major bleeding, but was associated with a lower rate of clinically relevant nonmajor bleeding. Further studies are needed to assess whether the efficacy outcomes would be different in patients at higher risk of recurrent VTE.

When I approach a study with marginally negative results I consider several things to help me decide if I would still prescribe the drug:

  1. Was the study powered properly? Alternatively, were the assumptions made in the sample size calculations reasonable? Sample size calculations require several inputs. The main ones are the desired power, the type 1 error rate, and the expected difference in event rates between the arms of the trial. The usual offender is the authors overestimating the benefit they expect to see. The authors expected a 50% relative reduction in event rates between the 2 arms of the study. That seems high but is consistent with a meta-analysis of similar studies and the CLOT trial. They only saw a 31% reduction (7.2% vs 10.5%). Detecting a smaller effect requires more patients, so the study was underpowered (post hoc power 41.4%).
  2. How much of the confidence interval is on the side of being beneficial? Most of the CI in this case is below 1.0 (0.41-1.03). Thus, I pay more attention to this than the p-value (0.07). There is potentially a 59% reduction in the hazard of VTE and only a potential 3% increase. This is a clinically important reduction in VTE.
  3. What are the pros and cons of the therapy? Preventing VTE is important. The risk of clinically relevant nonmajor bleeding was lower with tinzaparin. Had the bleeding been higher, I might have had different thoughts about prescribing this drug.
  4. Are the results of this trial consistent with previous studies? If so, then I fall back on it being underpowered and likely would prescribe the drug. A meta-analysis of 7 studies found a similar reduction in VTE (HR 0.47).

Thus, I think the study was underpowered for the event rates they encountered. Had there been more patients enrolled they likely would have found a statistically significant difference between groups. I would not anticipate the results shifting from benefit to harm with more patients. It is likely the patients in this trial were “healthier” than patients in the previous trials.  I feel comfortable saying tinzaparin is likely beneficial and I would feel comfortable prescribing it.
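For anyone who wants to check the power argument, here is a rough sketch using the observed 6-month cumulative incidences (7.2% vs 10.5%) and the group sizes from the abstract. It uses a simple normal approximation for comparing two proportions, which is my assumption and not necessarily the method the trial used (the trial analysis was time-to-event).

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_effect = abs(p1 - p2) / se
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Probability the test statistic falls beyond either critical value.
    return Z.cdf(z_effect - z_crit) + Z.cdf(-z_effect - z_crit)


def n_per_group(p1, p2, power=0.80, alpha=0.05):
    """Approximate patients needed per arm to reach the desired power."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


observed_power = power_two_proportions(0.072, 0.105, 449, 451)
needed = n_per_group(0.072, 0.105)
print(f"power with the observed rates: {observed_power:.2f}")
print(f"patients per arm for 80% power: {needed}")
```

Under these assumptions the power comes out at roughly 40%, in line with the post hoc figure above, and the number needed per arm is well over double the roughly 450 patients each arm actually had.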

This demonstrates the importance of evaluating the confidence interval and not just the p-value. More information can be gleaned from the confidence interval than a p-value.
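One way to see that the confidence interval carries at least as much information as the p-value is that you can approximately recover the p-value from the interval. The sketch below assumes the interval is symmetric on the log scale (a standard approximation for ratios such as hazard ratios); the function name is my own.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_ci(estimate, lower, upper, level=0.95):
    """Approximate two-sided p-value from a ratio estimate and its CI."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Width of the CI on the log scale gives the standard error.
    se_log = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(estimate) / se_log
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Values reported in the CATCH abstract.
p_vte = p_from_ratio_ci(0.65, 0.41, 1.03)        # recurrent VTE
p_bleed = p_from_ratio_ci(0.58, 0.40, 0.84)      # clinically relevant nonmajor bleeding
print(f"recurrent VTE: p is about {p_vte:.2f}")
print(f"nonmajor bleeding: p is about {p_bleed:.3f}")
```

Both results land close to the published P values (.07 and .004). You cannot go the other way: the p-value alone does not tell you the size of the effect or the range of effects compatible with the data.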

Publication Bias is Common in High Impact Journal Systematic Reviews

A very interesting study was published earlier this month in the Journal of Clinical Epidemiology assessing publication bias reporting in systematic reviews published in high impact factor journals.  Publication bias refers to the phenomenon that statistically significant positive results are more likely to be published than negative results. They also tend to be published more quickly and in more prominent journals. The issue of publication bias is an important one because the goal of a systematic review is to systematically search for and find all studies on a topic (both published and unpublished) so that an unbiased estimate of effect can be determined from including all studies (both positive and negative). If only positive studies, or a preponderance of positive studies, are published and only these are included in the review then a biased estimate of effect will result.

Onishi and Furukawa’s study is the first study to examine the frequency of significant publication bias in systematic reviews published in high impact factor general medical journals. They identified 116 systematic reviews published in the top 10 general medical journals in 2011 and 2012: NEJM, Lancet, JAMA, Annals of Internal Medicine, PLOS Medicine, BMJ, Archives of Internal Medicine, CMAJ, BMC Medicine, and Mayo Clinic Proceedings. For each systematic review that did not report an assessment of publication bias, they assessed publication bias themselves using the Egger test of funnel plot asymmetry, contour-enhanced funnel plots, and tunnel effects. RESULTS: The included systematic reviews were of moderate quality as shown in the graph below. About a third of “systematic reviews” didn’t even perform a comprehensive literature search while 20% didn’t assess study quality. Finally, 31% of systematic reviews didn’t assess for publication bias. How can you call your review a systematic review when you don’t perform a comprehensive literature search and you don’t determine if you missed studies?

Quality of included reviews

From J Clin Epi 2014;67:1320

Of the 36 reviews that did not report an assessment of publication bias, 7 (19.4%) had significant publication bias. Saying this another way, if a systematic review didn’t report an assessment of publication bias there was about a 20% chance publication bias was present. The authors then assessed what impact publication bias had on the estimated pooled results and found that the estimated pooled result was OVERESTIMATED by a median of 50.9% because of publication bias. This makes sense as mostly positive studies are published and negative studies aren’t. Thus, you would expect the estimates to be overly optimistic.
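To give a feel for what the Egger test looks at, here is a bare-bones sketch of the regression behind it: each study's standardized effect (effect divided by its standard error) is regressed on its precision (1 divided by the standard error), and an intercept far from zero suggests funnel plot asymmetry. The study data below are synthetic numbers I constructed for illustration, not data from the paper, and the sketch only computes the intercept; the real test also checks whether that intercept differs significantly from zero.

```python
def egger_intercept(effects, std_errors):
    """Intercept of the Egger regression (effect/SE regressed on 1/SE)."""
    x = [1 / se for se in std_errors]                          # precision
    y = [eff / se for eff, se in zip(effects, std_errors)]     # standardized effect
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x


# Synthetic example: five studies, from large (small SE) to small (large SE).
std_errors = [0.05, 0.10, 0.15, 0.20, 0.30]
symmetric = [0.20 for _ in std_errors]                  # same effect in every study
small_study = [0.20 + 1.5 * se for se in std_errors]    # smaller studies look better

print(f"no asymmetry: intercept = {egger_intercept(symmetric, std_errors):.2f}")
print(f"small-study effect: intercept = {egger_intercept(small_study, std_errors):.2f}")
```

When every study estimates the same effect the intercept is essentially zero. When the smaller studies report systematically larger effects, which is the pattern you expect if small negative studies go unpublished, the intercept moves away from zero.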

The figure below reports the results for individual journals. JAMA had significant publication bias in 50% of the reviews that didn’t assess publication bias while the Annals had 25% and BMJ 10%. It is concerning that these high impact journals publish “systematic reviews” that are of moderate quality and have a significant number of reviews that don’t report any assessment of publication bias.

Results by journal

From J Clin Epi 2014;67:1320

Bottom Line: Always critically appraise systematic reviews published in high impact journals. Don’t trust that an editor, even of a prestigious journal, did their job….they likely didn’t.

Do You Have An Uncomfortable Relationship With Math? A Study Shows Most Doctors Do

If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person with a positive test result actually has the disease? Assume the test is 100% sensitive.

Everyone taking care of patients, especially in primary care, needs to be able to figure this out. This is a basic understanding of what to do with a positive screening test result. If you can’t figure this out how would you be able to discuss the results with a patient? Or better yet how would you be able to counsel a patient on the implications of a positive test result prior to ordering a screening test?

Unfortunately, a study released online on April 21st found that 77% of respondents answered the question incorrectly. These results are similar to the results of a study in 1978, which used the same scenario. This is unfortunate as interpreting diagnostic test results is a cornerstone of EBM teaching and almost all (if not all) medical schools and residency programs teach EBM principles. So what’s the problem?

Here are some of my thoughts and observations:

  1. These principles are probably not actually being taught because the teachers themselves don’t understand them or if they do they don’t teach them in the proper context. This needs to be taught in the clinic when residents and medical students discuss ordering screening tests or on the wards when considering a stress test or cardiac catheterization, etc.
  2. The most common answer in the study was 95% (wrong answer). This shows that doctors don’t understand the influence of pretest probability (or prevalence) on post test probability (or predictive value). They assume a positive test equals disease. They assume a negative test equals no disease.  Remember where you end up (posttest probability) depends on where you start from (pretest probability).
  3. I commonly see a simple lack of thinking when ordering tests. How many of you stop to think: What is the pretest probability? Based on that do I want to rule in or rule out disease? Based on that do I need a sensitive or specific test? What are the test properties of the test I plan to order? (or do I just order the same test all the time for the same diagnosis?)
  4. I also see tests ordered for presumably defensive purposes. Does everyone need a CT in the ER? Does everyone need a d-dimer for every little twinge of chest pain? When you ask why a test was ordered I usually hear something like this: “Well I needed to make sure something bad wasn’t going on”.  I think this mindset transfers to the housestaff and students who perpetuate it.  I commonly see the results of the ER CT in the HPI for God’s sake!!!
  5. Laziness. There’s an app for that. Even if you can’t remember the formula or how to set up a 2×2 table your smartphone and Google are your friends.  Information management is an important skill.

So what’s the answer to the question above? 1.96%. (Remember PPV = true pos / (true pos + false pos). In 1000 people there is 1 true positive and about 50 false positives, so 1 / (1 + 50) = 0.0196, or 1.96%.) If it’s easier, set up a 2 x 2 table.

This very sensitive (100%) and fairly specific (95%) test (positive LR is 20!) wasn’t very informative when positive. Probability only went from 0.1% to 2%. The patient is still not likely to have disease even with a positive test.  It would have been more useful if the test result was negative. Thus, in a low probability setting your goal is to rule out disease and you should choose the most sensitive test (Remember SnNout).
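If you would rather let a computer do the arithmetic, here is a small sketch of the same calculation done two ways: from the 2 x 2 table proportions and from the likelihood ratio. The function names are my own.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) from prevalence and test characteristics."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert pretest probability to posttest probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# The scenario above: prevalence 1/1000, 100% sensitive, 5% false positive rate.
ppv, npv = predictive_values(prevalence=0.001, sensitivity=1.0, specificity=0.95)
positive_lr = 1.0 / (1 - 0.95)                     # sensitivity / (1 - specificity)
via_lr = post_test_probability(0.001, positive_lr)

print(f"PPV = {ppv:.2%}, NPV = {npv:.2%}")
print(f"posttest probability via LR+ = {via_lr:.2%}")
```

Both routes give the same answer of about 2%. Rerun it with a pretest probability of 30% and the same test becomes far more informative, which is the whole point: where you end up depends on where you start.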


What Does Statistically Significant Mean?

Hilda Bastian has written an important and well-written post on this topic in a recent Scientific American blog.

I don’t think I have much else to add other than read this blog. There are some great links inside her blog to further understand this topic.

I think we are too focused on p <0.05. What if the p value is 0.051? Does that mean we should ignore the finding? Is it really any different than p value of 0.0499?


Confidence intervals give information on both statistical significance and clinical significance but I worry about how they are interpreted also. (Disclaimer: the interpretation and use of the confidence interval that follows is not statistically correct but is how we use them clinically.) Let’s say a treatment improves a bad outcome with a relative risk (RR) of 0.94 with a 95% CI of 0.66-1.12. So the treatment isn’t “statistically significant” (the CI includes 1.0) but there is potential for a clinically significant benefit [the lower bound of the CI suggests a potential 34% reduction in the bad outcome (1 - RR = relative risk reduction, so 1 - 0.66 = 0.34 or 34%)]. There is also potential for a clinically significant increase in risk of 12%. So which is more important? It somewhat depends on whether you believe in this treatment or not. If you believe in it, you focus on the potential 34% reduction in outcomes. If you don’t believe in the treatment, you focus on the 12% increased risk. So that’s the problem with confidence intervals, but they give much more information than p-values do.
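The arithmetic for turning a ratio and its confidence interval into percent changes is simple enough to sketch. The helper name is my own, and this is the clinical shorthand described above, not a formal statistical interpretation.

```python
def ratio_to_percent_change(ratio):
    """Percent change in risk implied by a relative risk or hazard ratio.

    Negative values are reductions, positive values are increases.
    """
    return (ratio - 1) * 100


# The example above: RR 0.94 with a 95% CI of 0.66 to 1.12.
for label, value in [("point estimate", 0.94),
                     ("lower bound", 0.66),
                     ("upper bound", 1.12)]:
    change = ratio_to_percent_change(value)
    direction = "reduction" if change < 0 else "increase"
    print(f"{label}: {abs(change):.0f}% {direction}")
```

Laying out all three numbers side by side makes it harder to focus only on the bound that fits what you already believe.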

Should Traditional Intention To Treat Analysis Be Abandoned?

A commenter on my video about intention to treat analysis  asked about my thoughts on a twist on intention to treat analysis in which an adjustment is made (via an instrumental variable) for “treatment contamination”. A disclaimer: I am not a statistician or epidemiologist.

First, let’s start with some definitions:
1) intention to treat analysis: once randomized always analyzed in the group to which the patient was assigned (even if you don’t get the intervention in the intervention arm or you do get it in the control arm)
2) Superiority trial: study designed to “prove” one intervention is better than the other. Null hypothesis is that there is no difference between the groups.
3) Noninferiority trial: study designed to “prove” that one intervention is not worse than another treatment by some prespecified amount. Null hypothesis is that there is a difference between the groups.
4) Instrumental variable: variable associated with the factor under study but not directly associated with the outcome variable or any potential confounders.


The authors of this paper, “An IV for the RCT: using instrumental variables to adjust for treatment contamination in randomised controlled trials”, state:

Intention to treat analysis estimates the effect of recommending a treatment to study participants, not the effect of the treatment on those study participants who actually received it. In this article, we describe a simple yet rarely used analytical technique, the “contamination adjusted intention to treat analysis,” which complements the intention to treat approach by producing a better estimate of the benefits and harms of receiving a treatment. This method uses the statistical technique of instrumental variable analysis to address contamination

So what do I think about this?
1) A main role of intention to treat (ITT) analysis is to be conservative in a superiority trial. That means we don’t want to reject the null hypothesis falsely and claim treatment is better than the control. Another main role of ITT analysis is to preserve randomization (remember, once randomized always analyzed).

2) The authors of the BMJ paper point out that “Intention to treat analysis estimates the effect of recommending a treatment to study participants, not the effect of the treatment on those study participants who actually received it.” This is true but isn’t that what real life is like? I recommend a treatment to my patients. Some take it, some don’t. Some who I tell not to use something wind up using it.

3) The authors of the BMJ paper further point out that ITT analysis “underestimates value of receiving the treatment.” That is possible also but it’s also the point (see #1 above).

4) As I understand the scheme, randomized group assignment serves as the instrumental variable, and the analysis adjusts for whether or not a patient actually received treatment (no matter what group they were assigned to). ITT analysis would still be used but be adjusted for treatment receipt. I worry that this could lead to overfitting the model: a situation where you add too many variables to a model and start to detect noise rather than real relationships.

5) I think it would be difficult in a trial to judge adherence- what is the cutoff? Is it 100%? What about 60%? 40%? How much use by the control group is important? I think there are issues in judging what is contamination or not.

Time will tell if this technique should be used. We will have to study the treatment estimates from traditional ITT analysis and contamination adjusted ITT analysis. Until then I will stick with what is recommended…traditional ITT analysis.
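For the curious, here is a rough sketch of the simplest instrumental variable estimator I know of for this situation (often called the Wald ratio): divide the ITT difference in outcome rates by the difference between arms in the proportion who actually received treatment. Whether this matches the BMJ authors’ exact procedure is my assumption, and the numbers are hypothetical.

```python
def contamination_adjusted_effect(risk_assigned_tx, risk_assigned_ctrl,
                                  received_in_tx_arm, received_in_ctrl_arm):
    """ITT risk difference scaled by the between-arm difference in treatment receipt."""
    itt_effect = risk_assigned_tx - risk_assigned_ctrl
    uptake_difference = received_in_tx_arm - received_in_ctrl_arm
    if uptake_difference <= 0:
        raise ValueError("assignment must increase the chance of receiving treatment")
    return itt_effect / uptake_difference


# Hypothetical trial: 8% vs 10% event rates by assigned group;
# 70% of the intervention arm and 10% of the control arm actually got the treatment.
itt = 0.08 - 0.10
adjusted = contamination_adjusted_effect(0.08, 0.10, 0.70, 0.10)
print(f"ITT risk difference: {itt:.3f}")
print(f"contamination adjusted risk difference: {adjusted:.3f}")
```

Note that the adjusted estimate is always at least as large in magnitude as the ITT estimate, and the two are identical when adherence is perfect. That is exactly why I think of traditional ITT as the conservative choice.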

Journal Club- Basic Stats: Answers

Here are my answers to the journal club questions. I have also added links to some of my YouTube videos to answer questions.

1) The authors designed the study to have a “power of more than 80%“. What does this mean?
Power is the probability of the study finding a difference given that one truly exists. So this study was designed with at least an 80% chance of finding a difference between treatment and control groups (given that one truly exists). This video explains power in a little more depth.
2) What was the planned type 1 error rate in this study? Type 1 error is also called the alpha error. They planned on a 5% type 1 error rate. This video explains type 1 error in a little more detail.
3) What is a type 2 error and how is it related to power? Type 2 error is also called beta error. It is related to power in that power is 1 (or 100%) minus the beta error. So if power is 80% the type 2 error rate is 20%. This video  explains type 2 error in more detail.
4) What are the determinants of sample size in this study? How does varying the estimates of these components affect sample size? Sample size is determined by a variety of factors: power, type 1 and 2 error rates, estimated difference between study groups and variability in the data (though this last one has less of an effect). See this video explaining these factors and their effect on sample size.
5) The authors use a variety of statistical tests (chi-square, Fisher’s exact, t-tests, etc) to analyze the data. In general, what do statistical tests do?
Statistical tests look at the data and calculate a test statistic (e.g. t statistic for a t test). The test statistic is then used to determine the p-value associated with the data.

Review Table 2 and answer the following questions:
1) The primary outcome occurred in 1.92/100 person-yrs in the control group compared to 1.83/100 person-yrs in the intervention group. The p-value associated with this comparison is 0.51. What does this p-value mean? Can p-values be used to detect bias in the study? The simple interpretation is that the difference is not statistically significant because the p-value is > 0.05. More precisely, if there were truly no difference between the groups, a difference at least as large as the one seen would be expected 51% of the time. (It is not the probability that the finding is due to chance.) P-values cannot detect bias (systematic errors) in a study. Critical appraisal detects bias.
2) The hazard ratio comparing the intervention group to the control group for the primary outcome is 0.95 with a 95% confidence interval of 0.83-1.09. What does this confidence interval tell you about the effect? Can confidence intervals be used to detect bias in the study? It tells you a couple of things: 1. that the difference is not statistically significant as the CI included the point of no difference…1.0 and 2. that the benefit could be up to 17% reduction in cardiovascular events or 9% increase. This video explains how to interpret hazard ratios and this video confidence intervals.

Finally the extra credit: These 4 things can explain study findings: truth, chance, bias, confounding

I hope this was somewhat helpful. I will have another journal club next month on another EBM topic.

Journal Club- Basic Stats: Cardiovascular Effects of Intensive Lifestyle Intervention in Type 2 Diabetes

I decided to start a new feature that hopefully you and I will find useful. It will only be useful if you work thru the questions and have a dialogue thru the comments section.

My plan is to post a different article about once a month and have questions for you to answer. 1 week later (or so) I will then either post a video review of my answers or just write about the answers. I plan to focus mostly on the basics but at times I will cover advanced topics also as “extra credit”. I am mostly going to parallel the journal club curriculum we are using this year at UAB. I welcome comments to make this better or articles you want to read.


58 yo M with DM-2, hyperlipidemia and HTN presents to you for a follow-up visit. He takes metformin 1000mg BID, lisinopril 20mg daily, and pravastatin 40mg nightly. His most recent HgA1C was 6.9% and LDL was 88 mg/dl. In the office his blood pressure is 128/67 mm Hg and BMI is 32. You counsel him to lose weight and he responds “My blood pressure, cholesterol and A1C are good. How is losing weight going to help my heart?” What do you tell him?

Article: The Look AHEAD Research Group. Cardiovascular Effects of Intensive Lifestyle Intervention in Type 2 Diabetes. NEJM 2013;369:145-54.

After reading the statistical analysis section (pgs 147-148) of the article answer the following questions:
1) The authors designed the study to have a “power of more than 80%“. What does this mean?
2) What was the planned type 1 error rate in this study?
3) What is a type 2 error and how is it related to power?
4) What are the determinants of sample size in this study? How does varying the estimates of these components affect sample size?
5) The authors use a variety of statistical tests (chi-square, Fisher’s exact, t-tests, etc) to analyze the data. In general, what do statistical tests do?

Review Table 2 and answer the following questions:
1) The primary outcome occurred in 1.92/100 person-yrs in the control group compared to 1.83/100 person-yrs in the intervention group. The p-value associated with this comparison is 0.51. What does this p-value mean? Can p-values be used to detect bias in the study?
2) The hazard ratio comparing the intervention group to the control group for the primary outcome is 0.95 with a 95% confidence interval of 0.83-1.09. What does this confidence interval tell you about the effect? Can confidence intervals be used to detect bias in the study?

Extra Credit:
List the 4 things that can explain study findings