Good science, together with proper statistics, has a dual role. The first role is to decrease uncertainty about which hypotheses are true; the second is to properly measure the remaining uncertainty. These are carried out in part through a process called statistical inference. Statistical inference is the process of summarizing data, estimating the uncertainty around that summary, and using the summary to reach conclusions about the underlying truth that gave rise to the data.

The two main approaches to statistical inference are the standard “frequentist” approach and the Bayesian approach. Each has distinctive strengths and weaknesses as a basis for decision-making, and including both approaches in the technical and conceptual toolbox can be extraordinarily important for making proper decisions in the face of complex evidence and substantial uncertainty. The frequentist approach to statistical inference is familiar to medical researchers and is the basis for most FDA rules and guidance. The Bayesian approach is less widely used and understood; however, it has many attractive properties that can both elucidate the reasons for disagreements and provide an analytic model for decision-making. This model allows decision-makers to combine the chance of being wrong about risks and benefits with the seriousness of those errors in order to support optimal decisions.
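The decision model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from this report: the event counts, the uniform Beta(1, 1) priors, and the loss values are all hypothetical, and the function names `posterior_prob_harm` and `choose_action` are invented for illustration. The sketch estimates the posterior probability that a drug raises risk (the chance of being wrong if no action is taken) by Monte Carlo sampling, weights each possible error by an assumed seriousness, and picks the action with the smaller expected loss.

```python
import random


def posterior_prob_harm(events_trt, n_trt, events_ctl, n_ctl, draws=50_000, seed=1):
    """Monte Carlo estimate of P(risk_trt > risk_ctl | data).

    Assumes (hypothetically) independent uniform Beta(1, 1) priors on each
    group's risk, so each posterior is Beta(1 + events, 1 + non-events).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    harm_draws = 0
    for _ in range(draws):
        p_trt = rng.betavariate(1 + events_trt, 1 + n_trt - events_trt)
        p_ctl = rng.betavariate(1 + events_ctl, 1 + n_ctl - events_ctl)
        if p_trt > p_ctl:
            harm_draws += 1
    return harm_draws / draws


def choose_action(p_harm, loss_missed_harm, loss_false_alarm):
    """Return the action with the smaller expected loss, plus both expected losses.

    loss_missed_harm: hypothetical cost of taking no action when the harm is real.
    loss_false_alarm: hypothetical cost of acting when there is no real harm.
    """
    expected_loss_no_action = p_harm * loss_missed_harm
    expected_loss_act = (1 - p_harm) * loss_false_alarm
    if expected_loss_act < expected_loss_no_action:
        return "act", expected_loss_no_action, expected_loss_act
    return "no action", expected_loss_no_action, expected_loss_act


# Hypothetical data: 30/1000 events on treatment vs 18/1000 on control.
p_harm = posterior_prob_harm(30, 1000, 18, 1000)
# Hypothetical losses: missing a real harm judged 10 times worse than a false alarm.
action, loss_no_action, loss_act = choose_action(p_harm, loss_missed_harm=10, loss_false_alarm=1)
print(f"P(harm | data) = {p_harm:.3f}; expected losses: no action {loss_no_action:.2f}, act {loss_act:.2f}; choose: {action}")
```

Changing only the loss values, with the data held fixed, can reverse the recommended action; that separation of the evidence from the seriousness of each kind of error is what the decision model contributes.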

The frequentist approach employs such measures as P values, confidence intervals, and type I and II error rates, as well as practices such as hypothesis testing. Evidence against a specified hypothesis is measured with a P value. P values are typically used within a hypothesis-testing paradigm that declares results “statistically significant” or “not significant,” with significance usually defined as a P value less than 0.05. By convention, type I (false-positive) error rates in individual studies are set in the design stage at 5 percent or lower, and type II (false-negative) rates at 20 percent or below (Gordis, 2004).
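To show how the conventional design-stage error rates translate into study size, the following sketch applies the standard normal-approximation sample-size formula for comparing two proportions with a two-sided test. The risks of 2 and 4 percent are hypothetical, and `n_per_group` is an illustrative name, not a function from any package.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided test of two proportions.

    alpha is the type I (false-positive) error rate; power = 1 - type II rate.
    Uses the normal-approximation formula with the pooled variance under the null.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical risks: 2% in the comparison group vs 4% in the exposed group.
print(n_per_group(0.02, 0.04))              # conventional 5% type I, 20% type II
print(n_per_group(0.02, 0.04, power=0.90))  # stricter type II rate needs more subjects
print(n_per_group(0.02, 0.04, alpha=0.01))  # stricter type I rate needs more subjects
```

Tightening either error rate increases the required number of subjects, which is one reason the conventional thresholds are fixed before a study begins.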

In the colitis example, if the null hypothesis posits that broad-spectrum antibiotics do not increase the risk of colitis, a P value less than 0.05 would lead one to reject that null hypothesis and conclude that broad-spectrum antibiotics do increase the risk of colitis. The confidence interval would capture the range of risk elevations statistically consistent with the evidence. If the P value exceeded 0.05, different conclusions could be supported, depending on the location and width of the confidence interval: either that a clinically negligible effect is likely, or that the study cannot rule out either a null or a clinically important effect and is therefore inconclusive. In the drug-approval setting, the FDA regulatory threshold of “substantial evidence”² for effectiveness is generally defined as two well-controlled trials that have achieved statistical significance on an agreed-upon endpoint, although there can be exceptions (Carpenter, 2010; Garrison et al., 2010).
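The colitis reasoning can be sketched as code. This is a minimal illustration, not the analysis used by FDA or in this report: it assumes a Wald test and confidence interval on the log relative risk (one common choice among several, alongside chi-square or Fisher's exact tests), hypothetical event counts, and a hypothetical “clinically important” relative risk of 1.5 chosen only for illustration. The function names `relative_risk_test` and `interpret` are invented.

```python
import math
from statistics import NormalDist


def relative_risk_test(a, n1, c, n0, alpha=0.05):
    """Wald test and (1 - alpha) CI for the relative risk (a/n1) / (c/n0).

    a, c: colitis cases among exposed (n1) and unexposed (n0) patients; both must be > 0.
    """
    rr = (a / n1) / (c / n0)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    z = math.log(rr) / se_log_rr
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (math.exp(math.log(rr) - z_crit * se_log_rr),
          math.exp(math.log(rr) + z_crit * se_log_rr))
    return rr, ci, p_value


def interpret(rr, ci, p_value, alpha=0.05, important_rr=1.5):
    """Map a result onto the conclusions described in the text.

    important_rr is a hypothetical threshold for a clinically important increase.
    """
    if p_value < alpha:
        if rr > 1:
            return "significant: reject the null, risk increased"
        return "significant: reject the null, risk decreased"
    if ci[1] < important_rr:
        return "not significant: a clinically negligible increase is likely"
    return "not significant: inconclusive"


# Hypothetical studies: (exposed cases, exposed n, unexposed cases, unexposed n).
for counts in [(30, 1000, 15, 1000), (1020, 50000, 1000, 50000), (6, 200, 4, 200)]:
    rr, ci, p = relative_risk_test(*counts)
    print(f"RR={rr:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}, P={p:.3f}: {interpret(rr, ci, p)}")
```

Because the P value and the interval come from the same Wald statistic, P falls below 0.05 exactly when the 95 percent interval excludes a relative risk of 1. The second and third hypothetical studies are both “not significant,” but only the large, narrow-interval study supports the conclusion that any increase is clinically negligible.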


² 21 USC § 355(d) (2010).

The National Academies of Sciences, Engineering, and Medicine
500 Fifth St. N.W. | Washington, D.C. 20001

Copyright © National Academy of Sciences. All rights reserved.