Confidence intervals in research evaluation
ACP J Club. 1992 Mar-April;116:A28. doi:10.7326/ACPJC-1992-116-2-A28
No one who reads a medical journal can be unaware of the widespread use of statistics in research papers. In particular, virtually all papers with statistical analyses contain one or more P values. Most of us think we understand these, but studies have shown that P values are widely misinterpreted. The P value relates to the null hypothesis of no effect (for example, that 2 treatments are equally effective). It is the probability of obtaining the observed data, or more unlikely data, when the null hypothesis is true. In other words, the P value measures the compatibility of the data with the null hypothesis. The smaller the P value, the less plausible is the null hypothesis and the more likely we are to reject it and be convinced, for example, that 2 drugs differ in effectiveness. The P value does not indicate the magnitude of the effect of interest, or even its direction, nor does it indicate how much uncertainty is associated with the results.
By contrast, the use of confidence intervals is founded on the idea that what we most wish to know is the magnitude of the effect of interest, together with some measure of uncertainty. The principle is to use the data from the sample studied to obtain a best estimate of the true effect in the whole relevant population (such as the difference in the effectiveness of 2 drugs for patients with a certain disease) and to give a range of uncertainty around that estimate. Confidence intervals are not a new concept, nor is the suggestion to use them in medical research (1, 2), but only recently have they begun to be used widely.
Because confidence intervals indicate the strength of evidence, they are of particular relevance to ACP Journal Club. In small studies or in large studies where the outcome of interest is rare, confidence intervals are wide, indicating imprecise estimation of the effect of interest. For example, in a study to evaluate the ability of diabetologists to screen diabetic patients for retinopathy (3), the serious error rate was given as 1 in 20 (5%). The 95% confidence interval for the true error rate is 0.1% to 25%. When a comparative study has not found a statistically significant effect (i.e., P > 0.05), a confidence interval is especially valuable. It will often indicate that the interpretation of “not significant” as “no difference” cannot be supported by the data because the results are compatible with large real effects. By comparison, P values alone allow a much more restricted interpretation.
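The interval quoted for the retinopathy example can be reproduced with an exact (Clopper-Pearson) calculation for a binomial proportion. The following is a minimal illustrative sketch, not part of the original article; it finds the limits by bisection on the binomial tail probabilities using only the Python standard library.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_ci(x, n, level=0.95):
    """Clopper-Pearson (exact) confidence interval for a binomial
    proportion, found by bisection on the tail probabilities."""
    alpha = 1 - level

    def solve(condition):
        lo, hi = 0.0, 1.0
        for _ in range(60):            # bisection on [0, 1]
            mid = (lo + hi) / 2
            if condition(mid):
                lo = mid               # condition holds: move lower bound up
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: largest p with P(X >= x) <= alpha/2 (0 when x == 0).
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) <= alpha / 2)
    # Upper limit: smallest p with P(X <= x) <= alpha/2 (1 when x == n).
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper

lo, hi = exact_ci(1, 20)   # 1 serious error among 20 patients
print(f"{lo:.2%} to {hi:.2%}")
```

Run on 1 error in 20 patients, this gives limits of about 0.1% and 25%, as quoted above.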
The contrast between P values and confidence intervals is well illustrated by 2 consecutive sentences in a recent report in ACP Journal Club (4): “Overall hay fever symptom scores were lower for the Alutard SQ group (at peak season, 2.2 vs 5.5; CI -4.8 to -0.5; P = 0.02). Postseasonal assessment by both patients and the study coordinator showed improvement in favor of Alutard SQ (P < 0.001).” In the second sentence, the P value looks impressive, but neither the size of the difference in improvement nor the uncertainty associated with the estimate of improvement is given.
The 2 values (limits) that define a confidence interval indicate the range of values of the true effect that is consistent with the data. A 95% confidence interval means that the data are not significantly different (at the 5% level) from any true effect between the limits of the interval. If many studies of the same problem are done, 95% of the 95% confidence intervals from all these studies will include the true value. Thus, a more common (although not absolutely correct) interpretation is that we can be 95% confident that the true value lies within the stated range of values.
The convention of using the value of 95% is arbitrary, just as is that of taking P < 0.05 as being significant, and authors sometimes use 90% or 99% confidence intervals. There is a close relation between confidence intervals and P values. If the 95% confidence interval excludes the null value (usually 0, but 1 if the estimate is an odds ratio or relative risk), then P < 0.05. (This relationship is not exact in some cases.) In general, it is recommended that both confidence intervals and P values be presented, the latter as exact values (e.g., P = 0.13 or P = 0.005 rather than P < 0.05 or P < 0.01). However, the estimate and confidence interval are often sufficient. Confidence intervals can be obtained in most circumstances, even for some nonparametric analyses (5). A computer program is available to carry out all the common types of calculations (6). When the authors have not provided confidence intervals, the intervals can often be constructed using the results in the paper.
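When a paper reports only an estimate and an exact P value, the close relation between the two quantities can be exploited to reconstruct an approximate interval: the P value implies a z statistic, which implies a standard error. The sketch below illustrates this normal-approximation trick on made-up numbers (the estimate and P value shown are hypothetical, not taken from any study cited here).

```python
from statistics import NormalDist

def ci_from_p(estimate, p_value, level=0.95):
    """Reconstruct an approximate confidence interval from a reported
    estimate and its exact two-sided P value, assuming a normal (z) test."""
    nd = NormalDist()
    z_test = nd.inv_cdf(1 - p_value / 2)   # z statistic implied by the P value
    se = abs(estimate) / z_test            # implied standard error
    z_crit = nd.inv_cdf(0.5 + level / 2)   # 1.96 for a 95% interval
    return estimate - z_crit * se, estimate + z_crit * se

# Hypothetical report: an estimated effect of 0.8 with P = 0.03
lo, hi = ci_from_p(0.8, 0.03)
print(f"95% CI {lo:.2f} to {hi:.2f}")
```

The reconstruction is only as good as the normal approximation behind it, so it should be treated as a rough guide rather than a substitute for the authors' own analysis.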
Confidence intervals are commonly used in meta-analyses. Results from many clinical trials or observational studies that appear contradictory are often shown to be compatible with some consistent true value when confidence intervals are constructed for each study. In addition, confidence intervals are routinely given in conjunction with the pooled estimate of effect.
For many comparative studies (including clinical trials and meta-analyses), the effect of interest is the difference between two groups, so the confidence interval should be for this difference (7). It is, however, common to see only within-group confidence intervals given. Indeed, within-group confidence intervals were presented in several abstracts of clinical trials reported in early issues of ACP Journal Club. For example, in a controlled trial in insulin-dependent diabetes (8), serious episodes of hypoglycemia occurred in 25 of 44 patients receiving intensified conventional treatment (57%, 95% CI 44% to 73%) and in 12 of 53 patients receiving regular treatment (23%, CI 11% to 34%, P < 0.001). The 95% confidence interval for the difference in proportions (of 34%) can be calculated as 16% to 53%. Alternatively, the data could be used to calculate the relative risk (RR) of serious episodes of hypoglycemia, which is RR 2.51 (95% CI 1.4 to 4.4). It is often, but not always, possible to construct the required between-group confidence interval using information given in the paper.
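Both between-group intervals quoted for the diabetes trial can be obtained from the counts alone with standard large-sample formulas: a normal approximation for the difference in proportions, and a normal approximation on the log scale for the relative risk. A minimal sketch, not part of the original:

```python
from math import sqrt, log, exp

Z = 1.96  # critical value for 95% confidence

def diff_ci(x1, n1, x2, n2):
    """Normal-approximation 95% CI for a difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d, d - Z * se, d + Z * se

def rr_ci(x1, n1, x2, n2):
    """95% CI for a relative risk, computed on the log scale."""
    rr = (x1 / n1) / (x2 / n2)
    se_log = sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    return rr, exp(log(rr) - Z * se_log), exp(log(rr) + Z * se_log)

# Hypoglycemia counts from the trial: 25 of 44 vs 12 of 53
d, d_lo, d_hi = diff_ci(25, 44, 12, 53)
rr, rr_lo, rr_hi = rr_ci(25, 44, 12, 53)
print(f"difference {d:.0%} (95% CI {d_lo:.0%} to {d_hi:.0%})")
print(f"relative risk {rr:.2f} (95% CI {rr_lo:.1f} to {rr_hi:.1f})")
```

With these counts the calculation reproduces the figures quoted above: a difference of 34% (16% to 53%) and a relative risk of 2.51 (1.4 to 4.4).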
In observational studies, confidence intervals are useful to give a range of uncertainty for estimates of prevalence, risk, and so forth. For example, the risk for HIV-1 infection after percutaneous exposure to HIV-infected body fluids was estimated as 0.56% on the basis of a single occurrence after 179 exposures (9). The 95% confidence interval was naturally very wide (CI 0.01% to 3.06%), despite a sample size of over 2000.
Confidence intervals are valuable in assessing published papers. A statement such as “there was an increased risk of breast cancer among cases (odds ratio, 3.1; 95% CI 1.8 to 4.8)” is far more informative than “the risk of breast cancer was significantly higher among cases than controls (P < 0.01).” Several journals now encourage or even require the use of confidence intervals. Whenever possible, entries in ACP Journal Club will include them for at least the main outcomes.
Douglas G. Altman