Small Studies Can Lead To Big Results

An interesting article was published in the British Medical Journal in April. Researchers looked at the influence of sample size on treatment effect estimates. The bottom line is that treatment effect estimates were significantly larger in smaller trials than in larger trials (up to 48% greater!). The figure below shows this relationship. The left graph groups studies into quartiles by sample size; the right graph uses arbitrary cutoffs based on raw sample size.

Comparison of treatment effect estimates between trial sample sizes

So what does this mean for the average reader of medical journals? Pay attention to sample size. Early studies on new technology (whether meds or procedures) are often carried out on a fairly small group of people. Realize that what you see is likely overestimated (compared to a large study). If benefits are marginal (barely clinically significant), realize they likely will go away with a larger trial. If benefits are too good to be true, they likely are too good to be true and you should temper your enthusiasm. I always like to see more than one trial on a topic before I jump in and prescribe new meds or recommend new procedures.
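One plausible mechanism behind this pattern can be shown with a quick simulation. This is a minimal sketch and not the BMJ authors' analysis: it assumes a hypothetical true benefit of 0.2 standard deviations, assumes that only "statistically significant" trials get noticed, and draws each trial's estimate directly from its sampling distribution. The function name and all parameters are illustrative.

```python
import random
import statistics


def mean_significant_estimate(n_per_arm, true_diff=0.2, n_trials=2000, seed=1):
    """Average effect estimate among simulated trials that reach z > 1.96.

    Assumptions (hypothetical, for illustration only): two equal arms,
    a continuous outcome with a known SD of 1, and a true benefit of
    `true_diff` standard deviations.
    """
    rng = random.Random(seed)
    se = (2.0 / n_per_arm) ** 0.5  # standard error of the difference in means
    significant = []
    for _ in range(n_trials):
        estimate = rng.gauss(true_diff, se)  # one trial's observed effect
        if estimate / se > 1.96:             # keep only "positive" trials
            significant.append(estimate)
    if not significant:
        raise ValueError("no simulated trial reached significance")
    return statistics.mean(significant)


if __name__ == "__main__":
    for n in (20, 100, 500):
        print(n, round(mean_significant_estimate(n), 3))
```

Under these assumptions the small trials that reach significance report an effect well above the true value, because a small trial can only cross the significance threshold when its estimate happens to be large. The large trials land much closer to the truth.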

Why Can’t Guideline Developers Just Do Their Job Right????

I am reviewing a manuscript about the trustworthiness of guidelines for a prominent medical journal. I have written editorials on this topic in the past. The authors of the paper I am reviewing examined the recommendations made by 3 separate medical societies on the use of a certain medication for patients with atrial fibrillation. The data on this drug can be summarized as follows: little benefit, much more harm. But as you would expect, these specialists recommended its use in the same sentence as other safer and more proven therapies. They basically ignored the side effects and only focused on the minimal benefits.

Why do many guideline developers keep doing this? They just can’t seem to develop guidelines properly. Unfortunately their biased products have weight with insurers, the public, and the legal system. The reasons are complex but solvable. A main reason (in my opinion) is that they are stuck in their ways. Each society has its guideline machine and they churn them out the same way year after year. Why would they change? Who is holding them accountable? Certainly not journal editors. (As a side note: the journals that publish these guidelines are often owned by the same subspecialty societies that developed the guidelines. Hmmmm. No conflicts there.)

The biggest problem though is conflicts of interest. There is intellectual COI. Monetary COI. Converting data to recommendations requires judgment and judgment involves values. Single specialty medical society guideline development panels involve the same types of doctors that have shared values. But I always wonder how much did the authors of these guidelines get from the drug companies? Are they so married to this drug that they don’t believe the data? Is it ignorance? Are they so intellectually dishonest that they only see benefits and can’t understand harm? I don’t think we will ever truly understand this process without having a proverbial fly on the wall present during guideline deliberations.

Until someone demands a better job of guideline development I still consider them opinion pieces or at best consensus statements. We need to quit placing so much weight on them in quality assessment especially when some guidelines, like these, recommend harmful treatment.

Danish Osteoporosis Prevention Trial Doesn’t Prove Anything

The overstatement of the DOPS trial results has bothered me this week. So much so that even though I am on vacation I wanted to write something about it. Thankfully, comments linked to the article show that at least a few readers were smart enough to detect the limitations of this study. What has bothered me is the ridiculous headline about this trial:

HRT cuts CVD by 50%, latest “unique” data show

First off, all data are unique, so that’s stupid. And CVD cut by 50%? When something seems too good to be true and goes against what we already know, take it with a grain of salt. Almost nothing in medicine is 50% effective, especially as primary prevention. But I digress.

The authors of the trial point out their study was different from the previous large HRT study – the Women’s Health Initiative (WHI). So why do these studies contradict each other?

Whenever you see a study finding, always consider 4 things that can explain what you see; it’s your job to figure out which one it is: truth, chance, bias, and confounding. So let’s look at the DOPS study with this framework.

Truth: maybe DOPS is right and the Cochrane review with 24,283 total patients is wrong. Possible but unlikely. DOPS enrolled 1006 patients and has very low event rates (much lower than other studies in this area).

Chance: The composite outcome (which I’ll comment on in a minute) did have a p value <0.05 but none of its components were statistically significant. Each study we do can be a false positive study (or a false negative). So it’s possible the study is a false positive and if repeated would not give the same results. Small studies are more likely to have false positives and false negatives.

Bias: Biases are systematic errors made in a study. There are a couple in this study: no blinding (this leads to overestimation of effects) and poorly concealed allocation (again leads to overestimation of effects).

Confounding: Women in the control group were about 6 months older than treated patients but this was controlled for in the analysis phase. What else was different about these women that could have affected the outcome?

So far my summary of this study would be that it is small with potential for overestimation of effects due to lack of blinding and poorly concealed allocation.

But there’s more:

  • This study ended years ago and is now just getting published. Why? Were the authors playing with the data? The study was industry funded and the authors have industry ties. Hmmm.
  • The composite outcome they used is bizarre and not the typical composite used in cardiovascular trials. They used death, admission to the hospital for myocardial infarction or heart failure. This isn’t a good composite because patients wouldn’t consider each component equally important and the biology of each component is very different. Thus you must look at individual components and none are statistically significant by themselves.
  • The WHI is the largest HRT trial done to date. Women in the WHI were older and fatter than the DOPS participants and thus are at higher risk. So why would women at higher risk for an outcome gain less benefit than those at lower risk for the outcome? Things usually don’t work that way. A big difference though in these 2 trials is that DOPS women started HRT earlier than WHI women. So maybe timing is important.

Thus, I think this trial at best suggests a hypothesis to test: starting HRT within the first couple of years compared to starting later is more beneficial. DOPS doesn’t prove this. The body of evidence contradicting this trial is stronger than DOPS. Thus I don’t think I will change what I tell my female patients.

Drug companies (and the FDA) undermining EBM

This is a nice TED talk on the problem of publication bias. The FDA is complicit in this problem because it doesn’t force drug companies to publish their studies. Consumers (doctors and patients) never get the whole story. No amount of critical appraisal skills can overcome this problem.

Ben Goldacre: What doctors don’t know about the drugs they prescribe

Truly sad… Not sure what any of us can do though short of going to the FDA website for every drug we prescribe (or at least the new ones) and seeing what was submitted for approval.

Critical Appraisal of Studies is Really Important

This week in the Annals of Internal Medicine another study has been published showing that biases in studies can lead to inaccurate results. Thus it’s really important to critically appraise primary studies. Unfortunately few doctors take the time to do so (I suspect, though I don’t have empiric proof to cite) and, despite EBM skills being taught for a decade now, few probably even remember how to do so.

Savovic and colleagues have made the most comprehensive attempt to quantify the effect of 3 design elements on the outcomes of randomized controlled trials: random-sequence generation, allocation concealment, and double blinding. First, what the heck do those terms even mean? In a randomized trial participants are assigned to study groups in a random fashion, akin to a coin flip. No one actually flips a coin; researchers usually use a computer program to generate a random number (random sequence generation) and this number determines the group to which a patient is assigned. For example, if the number is odd the patient goes into the control arm; if the number is even, the intervention arm. The number generation needs to be unpredictable (i.e., random) and not just alternating odd and even numbers. Authors of studies should give enough information on how the random sequence generation was undertaken. As of 2006, only 34% of PubMed indexed trials did this adequately.
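The difference between a predictable and an unpredictable sequence can be sketched in a few lines of Python. This is an illustration of the odd/even rule described above, not the procedure of any particular trial; the function names are made up, and the `seed` argument is there only so the example is reproducible.

```python
import random


def alternating_allocation(n_patients):
    """Predictable scheme: whoever knows the last assignment knows the next."""
    return ["control" if i % 2 == 0 else "intervention"
            for i in range(n_patients)]


def random_allocation(n_patients, seed=None):
    """Simple randomization using the odd/even rule from the text.

    Each patient gets an independent random number: odd goes to the
    control arm, even goes to the intervention arm.
    """
    rng = random.Random(seed)
    arms = []
    for _ in range(n_patients):
        number = rng.randint(1, 1_000_000)
        arms.append("control" if number % 2 == 1 else "intervention")
    return arms


if __name__ == "__main__":
    print(alternating_allocation(10))
    print(random_allocation(10))
```

With the alternating scheme an enrolling clinician can always guess the next arm. With simple randomization the next assignment cannot be guessed from the previous ones, although the two arms may end up with somewhat unequal numbers.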

We don’t want those trying to enroll a patient into a study to be able to figure out to which arm the patient will be allocated or assigned. We want the allocation concealed. This is blinding of the randomization order or scheme. Concealed allocation helps guard against someone getting preferentially placed in one arm of a trial or another based on their prognosis. We don’t want sicker patients preferentially put in one arm and healthier ones in another. This would clearly bias the findings of the study. In a 2005 study, only 18% of randomized trials indexed in PubMed reported any allocation concealment.

Most doctors understand blinding. What they don’t understand is who should be blinded: everyone possible is the short answer. Blinding the trial participants and trial personnel prevents participants from being treated differently based on the arm of the study they are in. But what if you can’t blind the patients or the study personnel (for example in a study of a surgical procedure vs medical management)? You blind the outcomes assessors. Statisticians should also be blinded. Interestingly, Benjamin Franklin is credited with being the first person to blind participants in a scientific study. Blinding is especially important if the outcomes are subjective (for example quality of life). Conversely, blinding is less important for objective outcomes like death.

Back to the study by Savovic and colleagues. The authors used some sophisticated techniques to acquire and analyze the data and I won’t bore you with the details. Just accept that they did a good job (don’t all authors of studies want us to trust them and they usually disappoint us?). What did they find? Inadequate or unclear random sequence generation, allocation concealment, and blinding led to exaggeration of intervention effects by an average of 11%. As expected, the effect was greatest for subjective outcomes. The greatest overestimate of treatment effect was seen with inadequate blinding (23% overestimation) followed by inadequate allocation concealment (18% overestimation).
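To get a feel for what those percentages mean, here is a rough back-of-the-envelope sketch. It rests on two assumptions that do not come from the paper: that an "exaggeration of 11%" can be read as a ratio of odds ratios of 0.89, and that an odds ratio below 1 means benefit. The reported odds ratio of 0.70 is hypothetical, and an average across many trials cannot truly correct any single trial.

```python
def adjust_odds_ratio(reported_or, exaggeration):
    """Deflate a reported odds ratio by an assumed average exaggeration.

    `exaggeration` is a fraction (0.11 for 11%) and is interpreted here,
    as an assumption, as a ratio of odds ratios of 1 - exaggeration.
    Odds ratios below 1 are taken to favor the intervention.
    """
    if reported_or <= 0:
        raise ValueError("odds ratio must be positive")
    if not 0 <= exaggeration < 1:
        raise ValueError("exaggeration must be in [0, 1)")
    ratio_of_odds_ratios = 1.0 - exaggeration
    return reported_or / ratio_of_odds_ratios


if __name__ == "__main__":
    reported = 0.70  # hypothetical flawed trial
    for label, size in [("average flaw", 0.11),
                        ("inadequate concealment", 0.18),
                        ("inadequate blinding", 0.23)]:
        print(label, round(adjust_odds_ratio(reported, size), 2))
```

The point is the direction and rough size of the shift: under these assumptions an apparently solid benefit from an unblinded trial moves noticeably closer to no effect.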

These kinds of findings always bother me for 2 reasons:

  1. We come to the conclusion that interventions are better than they are. We are falsely led to believe in much greater benefit than there likely is. We offer things to patients with the promise of more benefit than they will likely offer.
  2. Why do these flawed studies get published? Why don’t reviewers and editors reject the publication of these studies or at least put a black box warning that the results are biased? I still can’t understand why we publish flawed research without labelling it as such. Why can’t researchers just design the study properly in the first place? It’s not like the elements of good study design are a secret.

What should doctors do to avoid using biased information?

  • Read the pre-appraised literature like ACP Journal Club. The articles published in ACPJC are structured summaries of critically appraised articles. To be published in ACPJC a study has to be methodologically sound and clinically important. Articles with important methodological weaknesses will not be published.
  • Find answers to questions in evidence-based textbooks, like Dynamed.
  • If you have to read primary studies CRITICALLY APPRAISE THEM! It’s not hard. Each study design has its own set of questions against which you should judge the quality of the study. If you find the study is flawed either throw it away and find another one or realize biases almost always result in overestimation of treatment benefits and adjust your expectations accordingly.


This post is not about desirable personal characteristics but about 2 studies, FAME2 and COURAGE, that have attempted to determine if PCI combined with optimal medical therapy (OMT) is better than OMT alone in patients with stable angina. This is an important question because a lot of costly PCIs are done on patients with stable angina. I am not a cardiologist so I will not comment on the technical aspects of these studies but will instead focus on design issues that I think temper the results of these important studies, especially FAME2.

FAME2 was released online this week by the New England Journal of Medicine, while COURAGE was published in 2007. COURAGE was the first trial to combine state-of-the-art (at the time) PCI with state-of-the-art medical therapy (which still holds true today) for CAD. COURAGE has been criticized because many of the patients were VA patients and for the use of mostly bare metal stents. Critics have ignored the fact that bare metal stents are similar to drug eluting stents (DES) in most outcomes except for restenosis.

FAME2 enrolled patients with stable angina who had one coronary vessel with at least 50% stenosis that was suitable for PCI. Inherent in these inclusion criteria is the angiographic knowledge of the patient’s coronary anatomy. Often in clinical practice we make decisions on treatment without this knowledge, based on symptoms and noninvasive assessments alone. Study design quality was generally good: randomization was used, randomization scheme was concealed, intention to treat analysis was used, the 2 study groups were similar at the start of the study, and the groups were treated equally other than the treatment under study. Patients and clinicians were not blinded but this is less important in this study as the outcomes were fairly objective.

So what problems do I have with FAME2? Whenever you read a study and identify a limitation you should always ask yourself what impact that limitation could have on the results of the study. I am especially trying to identify design issues that bias the results towards one group over the other.

  1. The endpoint of the trial is a composite of death from any cause, nonfatal MI, or unplanned hospitalization leading to urgent revascularization. In a good composite, patients would consider each component equally important, and clearly death would be much less preferred than urgent revascularization. Each of the components of the composite should have a similar biological mechanism and clearly they don’t. Finally the components of a good composite should be affected fairly equally by the intervention and here they aren’t (death is nonsignificantly reduced by 67%, MI is nonsignificantly increased by 5%, and revascularization is reduced by 87%). This doesn’t mean we reject the trial; it just means you should look at each individual component instead of using the composite.
  2. All stenoses with a fractional flow reserve (FFR) <0.8 were treated with the current state-of-the-art DES. Sounds great: fix everything you see. The problem is that this biases the study positively toward the PCI group because if you fix everything and not just lesions that cause ischemia by functional testing you will leave fewer lesions to cause any problems in the future and thus need less revascularization.
  3. The trial was stopped early. Too early considering there were no formal stopping rules. Furthermore the trial was stopped because the PCI group needed fewer urgent revascularizations than the OMT group, a finding that was predictable as I mention in #2 above. By stopping a trial early you never know what the longer term effects of your study will be (both good and bad).
  4. The landmark analysis using day 7 as the landmark point seems arbitrary and its rationale isn’t explained in the manuscript. This will also bias findings positively towards PCI because PCI has immediate effects whereas medications take more time to work.
  5. The study wasn’t blinded and knowledge of the treatment arm could definitely influence treatment decisions. Knowing a patient was in the OMT arm and still having angina might lead to the recommendation of PCI more often than intensifying OMT.
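The composite problem in point 1 is easier to see with numbers. The sketch below uses made-up event counts, not the FAME2 data, chosen only so that the component changes resemble the percentages quoted above. It also assumes equal arm sizes and that no patient contributes more than one event.

```python
# Hypothetical counts per arm (NOT trial data): (PCI events, OMT events)
HYPOTHETICAL_COUNTS = {
    "death": (1, 3),
    "nonfatal MI": (21, 20),
    "urgent revascularization": (6, 46),
}
N_PER_ARM = 400  # hypothetical, equal in both arms


def relative_risk_change(events_pci, events_omt, n_pci, n_omt):
    """Relative change in risk with PCI versus OMT (negative means fewer events)."""
    risk_pci = events_pci / n_pci
    risk_omt = events_omt / n_omt
    return (risk_pci - risk_omt) / risk_omt


def summarize(counts, n_per_arm):
    """Relative change for each component and for the composite.

    Assumes no patient has more than one event, so component counts
    can simply be added to form the composite.
    """
    summary = {
        name: relative_risk_change(pci, omt, n_per_arm, n_per_arm)
        for name, (pci, omt) in counts.items()
    }
    total_pci = sum(pci for pci, _ in counts.values())
    total_omt = sum(omt for _, omt in counts.values())
    summary["composite"] = relative_risk_change(
        total_pci, total_omt, n_per_arm, n_per_arm)
    return summary


if __name__ == "__main__":
    for name, change in summarize(HYPOTHETICAL_COUNTS, N_PER_ARM).items():
        print(f"{name}: {change:+.0%}")
```

In this made-up example nearly all of the difference in the composite comes from urgent revascularization, which is why reading the components individually tells a different story than the headline composite.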

So what should readers of FAME2 take away from the study? In patients with stable angina PCI only reduces angina and the need for future “urgent” revascularizations and this reduction is likely overestimated. PCI doesn’t prevent death and it doesn’t prevent MIs. COURAGE showed us the same thing. Optimal medical therapy works and we should strive to get patients on OMT as outlined by COURAGE and FAME2. Finally, all studies have limitations; there is no perfect study but understanding what effect the limitations would be expected to have on the outcomes can help us better temper the results.