Subgroup analyses: primary and secondary
ACP J Club. 1995 May-June;122:A12. doi:10.7326/ACPJC-1995-122-3-A12
Clinical investigators do randomized trials to determine whether therapeutic interventions such as surgical procedures and medications improve outcomes for patients. Many trials are designed to show efficacy, that is, that an intervention can achieve a stated goal when used in optimal circumstances. These trials often enroll a homogeneous group of patients. Some trials are designed to show effectiveness, namely, that an intervention does more good than harm when used in usual clinical circumstances. These trials may enroll a wider variety of patients. At the end of these experiments, the investigators will draw 1 of 4 conclusions: The intervention helps, the intervention hurts, the intervention neither helps nor hurts, or the effect of the intervention is still unknown. They also estimate the size of the effect and a plausible range in which the true effect lies.
Regardless of the conclusions, clinicians cannot resist asking whether the results of these trials apply to their own patients, who may differ from the “average” patient included in the trial. For example, a geriatrician might be interested in knowing whether the medication or surgical procedure works for her patients, who are all older than 75 years, or may ask whether the intervention works differently for patients who are elderly compared with patients who are not. In cases in which randomized trials do not include any patients similar to those of interest to the particular clinician, the answer can only be extrapolated from the trials. More commonly, however, randomized trials include enough variation in patient characteristics that clinicians ask to see, or investigators offer to show, variation in event rates between treated and untreated patients for subgroups of patients defined by particular characteristics. This analysis is known as a subgroup analysis.
The table below shows a subgroup analysis for the recently published results of the Global Utilization of Streptokinase and Tissue Plasminogen Activator for Occluded Coronary Arteries (GUSTO) trial of thrombolysis for acute myocardial infarction (1). The GUSTO study was a randomized trial of 41 021 patients in which a regimen of accelerated tissue plasminogen activator (tPA) was compared with streptokinase in a factorial design. The Table shows the published results for patients older and younger than 75 years. The P value associated with the difference in 30-day mortality for patients younger than 75 years was significant (P < 0.05), but the P value for those older than 75 years was not (P > 0.05). Many clinicians might conclude that tPA therapy has a beneficial effect on this outcome for persons younger than 75 years but is of no benefit for those older than 75 years. Another possible conclusion is that the therapies worked differently for those older and younger than 75 years.
Table. 30-Day Mortality for Age Subgroups in the GUSTO Trial*
|Age, y||Patients in Trial, %†||30-day Mortality Rates, %||Absolute Difference in 30-day Mortality Rates, %|
*Derived from figure 3 of reference 1. tPA = tissue plasminogen activator.
†A correction was made at this point in the table. See correction for details.
‡P value inferred from 95% confidence interval shown in the figure.
It would be wrong to draw these conclusions from these data. The appropriate analysis is not to compare the P values in the 2 subgroups but rather to do a statistical test that directly asks the following question: Does the absolute difference in the 30-day mortality rate between patients treated with tPA and those treated with streptokinase differ between those older and younger than 75 years? In other words, does the 1.1% difference in mortality for patients younger than 75 years differ from the 1.3% difference for those older than 75 years? This analysis is also known as a test of interaction between treatment and age. It is the statistical significance of this interaction term that tells the reader whether the “difference in differences” between subgroups is statistically significant.
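The "difference in differences" test described above can be sketched in Python. This is only a minimal illustration and not the analysis done by the GUSTO investigators: the function names are our own, the death counts are hypothetical (chosen only to give one "significant" and one "nonsignificant" subgroup), and a simple large-sample z test on absolute risk differences is assumed.

```python
import math

def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def risk_difference(deaths_control, n_control, deaths_treated, n_treated):
    """Absolute risk difference (control minus treated) and its standard error."""
    p_control = deaths_control / n_control
    p_treated = deaths_treated / n_treated
    diff = p_control - p_treated
    se = math.sqrt(p_control * (1 - p_control) / n_control
                   + p_treated * (1 - p_treated) / n_treated)
    return diff, se

def interaction_test(subgroup_a, subgroup_b):
    """Test whether the risk difference in subgroup A differs from that in B."""
    diff_a, se_a = risk_difference(*subgroup_a)
    diff_b, se_b = risk_difference(*subgroup_b)
    dd = diff_a - diff_b                      # the "difference in differences"
    se_dd = math.sqrt(se_a ** 2 + se_b ** 2)  # subgroups are independent
    return dd, two_sided_p(dd / se_dd)

# HYPOTHETICAL counts, not GUSTO data:
# (deaths on control, n on control, deaths on treatment, n on treatment)
younger = (550, 10_000, 450, 10_000)
older = (300, 1_500, 280, 1_500)

for label, subgroup in (("younger", younger), ("older", older)):
    diff, se = risk_difference(*subgroup)
    print(f"{label}: difference = {diff:.4f}, P = {two_sided_p(diff / se):.3f}")

dd, p_interaction = interaction_test(younger, older)
print(f"difference in differences = {dd:.4f}, interaction P = {p_interaction:.3f}")
```

With these made-up counts, the younger subgroup is significant when tested alone and the older subgroup is not, yet the interaction test gives no evidence that the 2 treatment effects differ. That is exactly the pattern that comparing subgroup P values would misread.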
The statistical analysis reported by the GUSTO investigators indicated that the interaction term was not significant for age. Furthermore, if absolute risk reduction or the absolute difference in 30-day mortality rates is the variable of interest, it is clear that 1.3% is actually larger than 1.1%. Of course, when put in terms of odds ratios or proportionate risk reduction, the observed difference is smaller for persons older than 75 years. However, once again, according to the GUSTO investigators, this difference is not statistically significant.
The lack of statistical significance for the interaction term brings up a second issue, namely, statistical power, the ability of the study to detect a clinically important effect. First, it is important to note that one of the reasons the P value was greater than 0.05 for the subgroup of persons older than 75 years might be that only 12% of the participants enrolled in GUSTO were older than 75 years. Second, the analysis of the interaction term may also have been limited by the sample size. If a trial of this size lacks statistical power for subgroup analysis, then, clearly, most other trials also will have this problem. It may well be that, despite our well-intentioned clinical curiosity, we only get to answer 1 question from a trial—does the therapy reduce the risk for adverse events (and what is our best estimate of the treatment effect) in the total population?
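The power problem can be illustrated with a rough calculation. The sketch below assumes a hypothetical 2-arm trial of 40 000 patients with equal allocation, of whom 12% fall in the older subgroup (the only figure taken from the text). The mortality rates, the function names, and the normal approximation are all assumptions made for illustration.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_two_sided(effect, se, z_crit=1.96):
    """Approximate power of a two-sided z test at alpha = 0.05."""
    shift = abs(effect) / se
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)

def se_risk_difference(p_control, p_treated, n_per_arm):
    """Standard error of a risk difference with equal arm sizes."""
    return math.sqrt((p_control * (1 - p_control)
                      + p_treated * (1 - p_treated)) / n_per_arm)

# HYPOTHETICAL trial: 40 000 patients, 2 equal arms, 12% older than 75 years
total = 40_000
n_older_arm = int(total * 0.12) // 2
n_younger_arm = (total - 2 * n_older_arm) // 2

# HYPOTHETICAL truth: 1.0% absolute benefit in younger patients, none in older
se_younger = se_risk_difference(0.055, 0.045, n_younger_arm)
se_older = se_risk_difference(0.20, 0.20, n_older_arm)

power_younger = power_two_sided(0.010, se_younger)
se_interaction = math.sqrt(se_younger ** 2 + se_older ** 2)
power_interaction = power_two_sided(0.010 - 0.0, se_interaction)

print(f"power to detect the benefit in younger patients: {power_younger:.2f}")
print(f"power to detect the interaction:                 {power_interaction:.2f}")
```

Under these assumed rates, the standard error of the interaction is dominated by the small older subgroup, so the interaction test has far less power than the test of the treatment effect in the larger subgroup, even though the trial is very large.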
A third important point concerns when the variables to be included in the subgroup analyses were selected. The choice of variables for subgroup analyses should always be made before the trial is started, with hypotheses as to how different subgroups might react differently to the interventions being studied. Investigators and clinicians, however, are often tempted to sift through the results after the trial has been completed to look for quantitative differences in effectiveness among subgroups and then do statistical tests on those that “look different.” An analysis done after looking at the data (post hoc) is also known as “data dredging.” Perhaps the most famous example of poking fun at post hoc subgroup analyses was the report of ISIS-2 (2), another randomized trial of a thrombolytic agent in acute myocardial infarction. When editors asked the investigators to look for possible subgroups in which the treatments might differ, the first one reported was based on the astrological sign of the patient. Post hoc analyses may be useful for generating hypotheses for future study (although, as mentioned above, they will require huge sample sizes) but not for testing hypotheses.
Finally, there is the problem of multiple comparisons. A simple way to think of this is to imagine that the “difference in differences” is examined for 100 subgroups: even if no true differences exist, 5 subgroups may be expected by chance alone to show statistically significant differences at P < 0.05. Which 5? Any 5! This problem occurs for both prespecified and post hoc subgroup analyses. Various ways are available for adjusting for this problem, all of which establish more stringent criteria for calling a result statistically significant (i.e., they require a P value smaller than 0.05). The bottom line is that investigators simply will not have enough power to ask many (or, often, any) questions beyond the primary efficacy of the intervention being studied. Other authors (3, 4) have expanded on these and other issues pertaining to subgroup analyses of primary data, and we refer the interested reader to these excellent reviews.
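The arithmetic behind the multiple-comparison problem can be sketched as follows. The function names are our own, the tests are assumed to be independent with no true subgroup differences, and the Bonferroni correction is shown only as one common example of a more stringent criterion; the text does not single out any particular method.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when no true effect exists."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P value threshold under the Bonferroni correction."""
    return alpha / n_tests

for n in (1, 5, 20, 100):
    print(f"{n:>3} subgroups: "
          f"expected false positives = {expected_false_positives(n):.2f}, "
          f"P(at least one) = {prob_at_least_one_false_positive(n):.3f}, "
          f"Bonferroni threshold = {bonferroni_threshold(n):.4f}")
```

For 100 subgroups, the expected number of chance findings is 5, matching the example above, and a Bonferroni-style correction would require P < 0.0005 for each test, which further reduces the power available for any single subgroup question.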
We move now from these “primary subgroup analyses” to another approach that we call “secondary subgroup analyses.” These analyses may attempt to derive clinical policies (sometimes going as far as practice guidelines) that incorporate more than just the efficacy or effectiveness of interventions. For example, clinicians and policy analysts often want to consider other variables such as safety or side effects, health care costs, and risk in untreated patients. In some cases, multiple pieces of data are examined together without formal analyses (5); in other cases, formal modeling techniques, such as decision analysis, are used to combine these data (6). Once again, the choice of thrombolytic agent for persons older and younger than 75 years provides a good example of some of these issues.
The first issue we consider is differences in side effects between age subgroups. The major difference in side effects between patients receiving tPA and those receiving streptokinase is in the risk for hemorrhage. The primary report of the GUSTO trial showed that the risk for hemorrhagic stroke was lower for patients younger than 75 years than for patients older than 75 years. The trial also showed that the observed difference in hemorrhagic stroke rates between patients receiving accelerated tPA and those receiving streptokinase was larger for the group older than 75 years. The statistical significance of this “difference in differences” was not reported. However, the investigators did report that when the combination of death and nonfatal disabling stroke (including hemorrhagic stroke) was considered, no statistically significant difference was observed in the absolute effect size between the groups. The trend for this combined end point actually favored patients older than 75 years despite the increased risk for hemorrhagic stroke. Therefore, this argument does not support more conservative use of tPA in elderly persons.
Secondary analyses also may consider the issue of cost, or, more precisely, cost-effectiveness. In the thrombolysis example, Naylor and colleagues (7) calculated the incremental cost per life-year gained from using tPA, the more expensive agent, compared with streptokinase. They calculated that the number of dollars spent per life-year gained depends on the duration of survival among patients who survive 1 month after a myocardial infarction. If that duration is 10 years, then the cost per life-year gained is only $28 000. If it is shorter, however, the cost increases: for 5 years of survival, the incremental cost per life-year gained is $56 000. One might assume that survival beyond 30 days is shorter for patients older than 75 years than for younger patients and therefore that the incremental cost-effectiveness ratio is higher (less attractive) for the older group. The data for these values are still missing, and further analyses on this issue are pending.
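The relation between survival duration and cost per life-year can be sketched with simple arithmetic. The sketch assumes an undiscounted model in which the cost per life-year is inversely proportional to the years of survival. This is consistent with the 2 figures quoted above but is not necessarily the model used by Naylor and colleagues. The function name and the $280 000 cost per additional survivor are our own, the latter back-calculated from the quoted figures rather than reported in the source.

```python
def cost_per_life_year(cost_per_additional_survivor, years_of_survival):
    """Incremental cost per life-year gained (no discounting assumed)."""
    return cost_per_additional_survivor / years_of_survival

# Back-calculated from the quoted figure ($28 000 per life-year at 10 years);
# an ASSUMPTION for illustration, not a number reported in the source.
implied_cost_per_survivor = 28_000 * 10

for years in (10, 5, 2.5):
    cost = cost_per_life_year(implied_cost_per_survivor, years)
    print(f"{years:>4} years of survival: ${cost:,.0f} per life-year gained")
```

Under this simplification, halving the expected survival doubles the cost per life-year, which is why shorter survival in older patients would make the more expensive agent look less attractive.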
Finally, secondary analyses may examine subgroup differences in the baseline risk for an adverse outcome, that is, the risk in patients treated with the control therapy. In some cases, subgroups that are at a very low baseline risk can be identified, which makes the risk-benefit or cost-benefit ratio for treatment less attractive. This would occur if the much lower baseline risk resulted in a much smaller absolute risk reduction attributable to the better therapy. In the GUSTO trial, patients older than 75 years were actually at an increased baseline risk compared with younger patients and had, as noted above, a slightly larger absolute risk reduction attributable to tPA.
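How baseline risk drives the absolute risk reduction can be sketched as follows. The sketch assumes, purely for illustration, that the relative risk reduction is the same in every subgroup. The function names, the 15% relative risk reduction, and the baseline risks are hypothetical and are not GUSTO results.

```python
def absolute_risk_reduction(baseline_risk, relative_risk_reduction):
    """Absolute risk reduction implied by a given relative risk reduction."""
    return baseline_risk * relative_risk_reduction

def number_needed_to_treat(arr):
    """Number of patients who must be treated to prevent 1 adverse outcome."""
    return 1 / arr

# HYPOTHETICAL values chosen only to show the pattern
relative_risk_reduction = 0.15
for label, baseline_risk in (("low-risk subgroup", 0.02),
                             ("high-risk subgroup", 0.20)):
    arr = absolute_risk_reduction(baseline_risk, relative_risk_reduction)
    nnt = number_needed_to_treat(arr)
    print(f"{label}: baseline risk {baseline_risk:.0%}, "
          f"absolute risk reduction {arr:.1%}, "
          f"number needed to treat about {nnt:.0f}")
```

With a constant relative effect, a tenfold lower baseline risk means a tenfold smaller absolute benefit, while the costs and side effects of treatment are incurred by every treated patient.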
In summary, subgroup analyses of the primary data in a randomized trial that try to answer the clinically relevant question of differences in effectiveness across patients with different characteristics must be interpreted with extreme caution. Such analyses should be limited to prespecified subgroups and should use the appropriate statistical technique, a test of interaction between treatment and the patient characteristic of interest. Secondary analyses move beyond these issues to look at other data, such as side effect rates, cost-effectiveness, and baseline risk. Some of these secondary analyses will use multiple pieces of data, including observed subgroup data derived from the same randomized trials that provide the efficacy data.
In doing so, regardless of whether explicit quantitative models are used, judgment is always required to determine whether the analysis should use different estimates of effectiveness for 2 subgroups when the difference between them is not statistically significant but may be clinically important. A conservative approach would be to continue to consider these differences as the play of chance and use the estimate of effectiveness derived from the entire trial sample for both subgroups.
Clinicians who treat individual patients and others who create policies for groups of patients must always apply judgment in interpreting and extrapolating data. We in no way wish to dampen the enthusiasm for applying research results to the clinical and policy world. Instead, we wish to help those who do so to understand the limitations of using observations in subgroups.
Allan S. Detsky, MD, PhD
I. Gary Naglie, MD
2. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2. Lancet. 1988;2:349-60.
6. Krumholz HM, Pasternak RC, Weinstein MC, et al. Cost effectiveness of thrombolytic therapy with streptokinase in elderly patients with suspected acute myocardial infarction. N Engl J Med. 1992;327:7-13.