Preventing false discovery of heterogeneous treatment effect subgroups in randomized trials

Background Heterogeneous treatment effects (HTEs), or systematic differences in treatment effectiveness among participants with different observable features, may be important when applying trial results to clinical practice. Current methods suffer from a potential for false detection of HTEs due to imbalances in covariates between candidate subgroups. Methods We introduce a new method, matching plus classification and regression trees (mCART), that yields balance in covariates in identified HTE subgroups. We compared mCART to a classical method (logistic regression [LR] with backwards covariate selection using the Akaike information criterion ) and two machine-learning approaches increasingly applied to HTE detection (random forest [RF] and gradient RF) in simulations with a binary outcome with known HTE subgroups. We considered an N = 200 phase II oncology trial where there were either no HTEs (1A) or two HTE subgroups (1B) and an N = 6000 phase III cardiovascular disease trial where there were either no HTEs (2A) or four HTE subgroups (2B). Additionally, we considered an N = 6000 phase III cardiovascular disease trial where there was no average treatment effect but there were four HTE subgroups (2C). Results In simulations 1A and 2A (no HTEs), mCART did not identify any HTE subgroups, whereas LR found 2 and 448, RF 5 and 2, and gradient RF 5 and 24, respectively (all false positives). In simulation 1B, mCART failed to identify the two true HTE subgroups whereas LR found 4, RF 6, and gradient RF 10 (half or more of which were false positives). In simulations 2B and 2C, mCART captured the four true HTE subgroups, whereas the other methods found only false positives. All HTE subgroups identified by mCART had acceptable treated vs. control covariate balance with absolute standardized differences less than 0.2, whereas the absolute standardized differences for the other methods typically exceeded 0.2. The imbalance in covariates in identified subgroups for LR, RF, and gradient RF indicates the false HTE detection may have been due to confounding. Conclusions Covariate imbalances may be producing false positives in subgroup analyses. mCART could be a useful tool to help prevent the false discovery of HTE subgroups in secondary analyses of randomized trial data.


Background
Precision medicine aims to direct the right medication to the right patient at the right time [1]. A requirement of precision medicine is that the effect of a treatment on a given patient must be accurately estimated, to determine if that patient systematically differs from the average in a randomized trial, for instance.
Heterogeneous treatment effects (HTEs)-or systematic differences in treatment effects among participants with different observable features-are increasingly identified through post-trial data analyses [2,3]. Because common univariate subgroup analyses of randomized trial data have been found to lack statistical power to detect HTEs [4], some researchers have deployed multivariate models to detect how covariates may produce different expected treatment effects for different participants based on a multivariate risk/benefit score [5]. Both traditional statistical regression approaches [6] and newer machine-learning methods such as random forest (RF) methods [7] are increasingly being used to estimate HTEs.
HTE estimation approaches attempt to predict individual-level treatment effects and identify subgroups of participants with above average or lower average benefit from a treatment. In this paper, we focus on identifying subgroups with HTE, rather than individual-level treatment effects. A major challenge in HTE subgroup analysis is that, even if the data come from a randomized trial, subgroups that are identified may be imbalanced on important clinical characteristics [8]. For example, for a blood pressure treatment trial, one subgroup with high benefit from treatment may be individuals of black race with baseline systolic blood pressure >140 mmHg. Yet it may be that among individuals with black race and baseline systolic blood pressure >140 mmHg, those randomized to the treatment arm had much better kidney function than those randomized to the control arm. Hence, race and baseline blood pressure may not have correctly identified a HTE, since the subgroup identified may have been falsely detected due to an imbalance in another feature (kidney function) between the treatment and control groups. While randomization aims to avoid such confounding in the overall trial, it does not guarantee that all subgroups of interest will achieve adequate balance to avoid confounding in the estimation of a subgroup-specific effect. If HTE subgroups were known in advance, restricted or constrained randomization within subgroups would be an attractive option for inference on the HTEs. However, in practice, HTE subgroups are rarely known in advance and creating subgroups from combinations of all levels of prognostic categorical variables (such as those listed in Table 1) would likely lead to an impractical randomization scheme.
The consequences of falsely identifying a subgroup can be dire: a life-saving treatment could be withheld or an ineffective treatment prescribed that increases the chances of a serious adverse event. One potential strategy to reduce the likelihood of false HTE detection is to reduce the observed imbalance between treatment and control groups when applying methods to detect HTE [8].
Here, we propose a method for detecting and potentially avoiding imbalance in observed characteristics in trial data when constructing HTE models for precision medicine applications. We specifically combine matching (to control for treated vs. control covariate imbalances) with classification and regression trees (to identify subgroups with an easily visualized decision tree), and demonstrate the virtues of this approach compared to popular alternatives currently implemented in the post-hoc trial analysis literature.

Methods
We simulated multiple trials in which the true HTEs were known, to compare the rates of finding true and false positive subgroups from: (i) the matching with classification and regression trees approach (mCART), (ii) a logistic regression (LR) model with interaction terms between participant covariates and the treatment arm, (iii) the common machine-learning approach of RF analysis, and (iv) the newer machine-learning approach specifically designed for detecting HTEs, of gradient forest analysis (sometimes referred to as causal forest analysis, a term we intentionally avoid here [9]). These methods are detailed further below. Statistical code for replication is available at https://github.com/ joerigdon/HTE.

Simulated datasets
We simulated trial data to test the ability of comparator methods to detect HTEs. We simulated randomized trials in which each individual participant has a potential adverse medical event when randomized to treatment, Y 1 , or when randomized to placebo, Y 0 . In practice, only one of Y 1 or Y 0 is revealed by the trial [10]. The outcome variable is labeled 0 if no event occurs and 1 if an event occurs. The unobservable true treatment effect for each individual in the trial, δ = Y 1 -Y 0 , is the difference between the outcome for the treatment group and the outcome for the control group. δ = −1 if the treatment prevents the event (benefit), δ = 0 if the treatment has no effect on the event, and δ = 1 if the treatment induces the event (harm). We considered two study settings (outlined in Table  1) representative of clinical trials often seen in practice: (1) a smaller phase II oncology trial (N = 200) and (2) a larger phase III trial of a cardiovascular disease (CVD) treatment (N = 6000). In setting (1), we considered a trial where researchers were interested in replicating the finding that a combination of treatments was more effective than a single treatment in increasing progression-free survival among patients with advanced, estrogen receptor-positive, HER2-negative breast cancer [11]. We designed the study to have 80% power to detect a change from an event rate of 59/81 (about 73%) in the single treatment group (Y 0 ) to an event rate of 41/84 (about 49%) in the combination group (Y 1 ). Such a study design would require n = 64 per group [12], but given the high number of anticipated dropouts and conservative effect-size estimation (73% rounded to 70% and 49% to 50%), we recruited n = 100 per group in the hypothetical trial.
Let V~N(m, s) be shorthand for variable V is normally distributed with mean m and standard deviation s. Let V~Bern(b) be shorthand for variable V is drawn from a Bernoulli distribution with mean b, or Pr[V = 1] = b, and let V~Multinom(p A , p B , …, p Z ) be shorthand for V is drawn from a multinomial distribution where V can take on the values (A, B, …, Z) with corresponding probabilities (p A , p B , …, p Z ). For setting (1), the oncology trial, we simulated n = 200 records of the following six baseline covariates: age~N(65, 5), disease stage = 4 B(0.98), disease site = (visceral, bone only, other)M ultinom(0.5, 0.17, 0.33), previous treatment = (none, chemo only, hormonal only, chemo + hormonal)M ultinom(0.5, 0.2, 0.2, 0.1), Eastern Cooperative Oncology Group (ECOG) score~Bern(0.55), and disease-free interval >12 months from adjuvant to re-currence~Bern(0.35).
We simulated the treatment effects in two ways for the oncology trial. In simulation 1A, we considered a setting where there were no HTE subgroups, i.e., δ = (-1, 0, 1)~Multinom(0.5, 0.2, 0.3) for all N = 200 individuals in the trial, independent of covariates such that the average treatment effect was approximately -0.2. For individuals with δ = −1, Y 1 and Y 0 immediately follow as 0 and 1, respectively, and for individuals with δ = 1, Y 1 and Y 0 immediately follow as 1 and 0, respectively. For individuals with δ = 0, Y 1 and Y 0 were set to 1, so that our sample means were approximately Y 1 = 0.5 and Y 0 = 0.7.
In simulation 1B, we considered a setting where there were HTE subgroups in the oncology trial in two groups: women over 65 years old versus women 65 years and younger, each constituting about half of the sample population of the trial. In particular, for women aged ≤65, δ~Multinom(0.6, 0.2, 0.2), such that the average treatment effect (ATE) was approximately −0.4 (Y 1 = 0.4 and Y 0 = 0.8), and for women aged >65, δ~Multinom(0.4, 0.2, 0.4), such that the ATE was approximately 0 (Y 1 = 0.6 and Y 0 = 0.6), and such that the overall ATE was still approximately −0.2.
In setting (2), we considered a trial where researchers were interested in testing the effect of a more intensive blood pressure target of systolic pressure <20 mmHg versus the standard target for systolic pressure <140 mmHg for preventing a composite CVD outcome [13]. We designed the study to have 80% power to detect a change from an event rate of 6.8% in the standard (Y 0 ) to an event rate of 5.2% in the intensive target group (Y 1 ). Such a study design would require n = 3000 per group [12] to have 80% power to reject the null hypothesis that standard and intensive are equal versus the alternative specified above (6.8% standard versus 5.2% intensive).
In setting 2B, we simulated HTEs by setting:  In simulation 2C, we consider the same N = 6000 CVD trial where ATE is 0 but there are HTE subgroups. In particular:

mCART methodological approach
To identify HTE subgroups that are balanced on the covariates, we propose a novel algorithm using rank-based Mahalanobis distance matrix matching followed by classification trees for inference on the HTEs. Henceforth, we term this method matching plus classification and regression trees (mCART): 1. We select a set of K prognostic variables of interest, P = (X 1 , …, X K ), where practice or literature suggests a potential effect modification in δ, e.g., age, sex, race, medical history, etc. These K variables are often selected beforehand by research teams (e.g., the demographic and risk factors displayed in a typical Table 1) and can be continuous or categorical. 2. For each participant i = 1, ..., N in the randomized trial, the covariate vector P i = (x i1 , …, x iK ) is collected and stored. 3. Suppose there are C individuals randomized to the control group and T to the treatment group (such that C + T = N). Then, a rank-based Mahalanobis distance matrix with T rows (treated individuals) and C columns (control individuals) is formed. If C > T the matrix is transposed. [14] is applied to the matrix in step 3 to create G = min(C, T) pair matches, each containing one treated individual and one control individual. 5. For match g = 1, …, G, the covariate vectors P g t (treated) and P g c (control) are compared. If for match g, any of the categorical variables, e.g., race or sex, in P g t are unequal to their counterpart in P g t , then match g is discarded from the set of G matches. After this step, there are G 2 ≤ G matches remaining. We do not anticipate losing an impactful number of matches as mCART is designed for settings where there are 10-15 prognostic variables of interest to be balanced at baseline (shown in Table 1), of which perhaps 7-10 are categorical. 6. We use the G 2 matched pairs to create an averaged data set as follows. For match

A pair-matching algorithm
, and the vector of covariates P g is equal to (P g t + P g c ) / 2, the average of the treated and control participants in pair g. 7. We apply a single conditional inference tree [15] to model δ g as a function of P g in the averaged data set from step 6. This yields a decision tree that estimates where there are differences in the distribution of δ g , i.e., where there are heterogeneous treatment effects. By virtue of the match, any identified subgroups have an approximately equal distribution of risk factors between the treated individuals and controls. The tree will split important categorical variables by level and continuous variables by cut points. 8. We apply the model estimated in step 7 to our N individual by K variable observed data collected in the randomized trial to estimate Within the terminal nodes identified in step 7, we can estimate the ATE in the original data set using methods for inference on a risk difference.

Comparison methods
We compared our method to three strategies commonly applied or proposed for identifying HTEs: LR, RF, and gradient RF. Unless otherwise specified, default settings for parameters in RF or gradient RF were used. In LR, all variables, treatments, and the interaction of treatment with each of the individual variables were entered into the model. The backwards Akaike information criterion was used to select the most parsimonious model, as is typical in the literature for HTE detection [16,17]. After obtaining the final model, a probability of the outcome for the treatment group, p 1 , and a probability of the outcome for the control group, p 0 , were estimated for each of the 6000 individuals. Treatment effects were estimated for each individual as p 1 -p 0 . Estimated treatment effects were partitioned into subgroups using one classification and regression tree (via the R package 'party') [15].
In the RF method, all variables and a treatment dummy variable were entered into the model. The RF method searches across all available variables to find the first variable that explains the largest variance in the outcome, and it chooses a value of that variable to split the population into subgroups. Then, a second variable is chosen, then a third, producing a tree where the branches identify subgroups. The process is repeated hundreds of times with bootstrapped samples of the data and covariates, to produce a forest of these trees, and the predicted outcome for an individual is taken as the average prediction from among the trees in the forest [18]. We did not specifically enter any interaction terms because the RF method searches for interactions by construction. The RF algorithm was applied by taking 500 bootstrap samples with replacement of the data. The best split at each node in each tree is chosen among a randomly sampled group of the square root of the total number of variables. The R package 'randomForest' was used to for the modeling [19]. A predicted outcome (equal to 0 or 1) was obtained for each individual as the most common prediction of 0 or 1 from the 500 trees, for both the treatment group and the control group. The difference in predicted outcomes served as each individual's treatment effect estimate. Estimated treatment effects (−1, 0, or 1 for each individual) were again partitioned into subgroups using one classification and regression tree in the R package 'party'.
In the gradient RF method, the RF is built to yield an estimate in the interval [−1, 1] for δ for each individual. A key difference between the gradient RF method and the RF method is the process known as honest estimation, which means that the gradient RF approach selects the variables defining each split point/branch of the decision tree from one subset of the data, then estimates the values of each variable that define the split in a different subset, to reduce the bias in treatment effect estimation and the influence of outliers [9]. Additionally, the gradient RF algorithm applies a classification tree to each of 2000 bootstrap samples. It (i) finds terminal nodes of individuals with similar covariates (X) and (ii) computes an effect estimate within each node as the proportion experiencing the outcome in the treated group minus the proportion experiencing the outcome in the control subset of the trial. Using similar logic as the RF method, the predicted treatment effects are averaged across the 2000 trees to yield an estimated causal effect for each individual in the trial. Gradient RF was carried out using the R package 'grf' [20]. After building the risk model, estimates of δ were again partitioned into subgroups using a classification tree.
Simulations were performed in R [21], using the simulation code posted at https://github.com/joerigdon/HTE. In simulations 1A, 1B, 2A, 2B, and 2C, one hypothetical trial was generated by assigning half of the simulated participants to the treatment group (Z = 1) and half to the control group (Z = 0), and the outcome for each individual was calculated as Y = Z × Y 1 + (1 -Z) × Y 0 . For all methods, after subgroup identification, HTEs were computed within identified subgroups along with treated minus control absolute standardized differences (ASDs) for every covariate. Subgroups were deemed to have acceptable balance if all ASDs were below 0.2, a cutoff point chosen because it is of the same order as small effect sizes [22]. Bias was also computed within each subgroup as the estimated ATE for the subgroup. Bias is the difference in proportions of the outcome for treated individuals minus controls in the subgroup, minus the true ATE for the subgroup (known as the average of the true δ's in the subgroup).

Simulation 1A
The characteristics of the simulated trial population for simulation 1A are displayed in Table 2. It shows that the important covariates in the simulated trial were balanced across the overall treated and control groups, indicating successful randomization, and thus allowing average differences in outcomes (Y) to be attributed to control or treatment exposure without concerns due to confounding arising from baseline covariates. The outcomes were: 46/ 100 individuals in the combination group experienced the event, while 70/100 individuals in the single agent arm experienced the event. The risk difference wasΔ = -0.24 [95% confidence interval (0.38, -0.01)], which is close to the true value Δ = -0.2. The treatment was effective in the sense that the upper limit of the confidence interval of the treatment effect was less than 0, indicating a decrease in the number of deaths for the combination treatment group.
In subgroup analyses of the data shown in Table 2, mCART did not find any subgroups (no false positives), whereas LR found 2, RF 5, and gradient RF also 5 (all false positives). Tables 4, 5 and 6 in the appendix show the balance statistics for the discovered subgroups for LR, RF, and gradient RF, respectively. Figure 1 displays a plot of the maximum ASD within subgroup versus the bias for each of the methods. Table 7 in the appendix shows the data at baseline for simulation 1B. Data are balanced across the single agent and combination groups with the minor exception of previous treatment equals none (ASD = 0.22; note that ASD < 0.2 is the rule of thumb for small effect-size differences or an acceptable balance [22]). The trial is a success as the combination group (29/ 100) is shown to have a lower event rate than the single-agent group (56/100) with a risk difference of −0.27 (−0.41, −0.13).

Simulation 1B
Simulation 1B contained two subgroups with HTEs: women ≤65 years of age and women >65. In the subgroup analyses, mCART did not find any subgroups (two false negatives), whereas LR found 4, RF 6, and gradient RF also 10 (all false positives). Fig. 4 in the appendix is a plot of the maximum ASD within subgroup versus the bias for each of the methods.   Table 3 shows the data at baseline for simulation 2A. Data are balanced across the treatment and placebo groups with all ASDs < 0.2. The trial is a success as the treatment group (157/3000) is shown to have a lower event rate than the control group (203/3000) with a risk difference of −0.015 (−0.028, −0.0030). Simulation 2A contained no subgroups with HTEs. In subgroup analyses, mCART did not find any subgroups, whereas LR found 448, RF 2, and gradient RF also 24 (all false positives). Fig. 5 in the appendix is a plot of the maximum ASD within subgroup versus the bias for each of the methods. Table 8 in the appendix shows the data at baseline for simulation 2B. Data are balanced across the treatment and placebo groups with all ASDs< 0.2. The trial is a success as the treatment group (160/3000) is shown to have a lower event rate than the control group (203/3000) with a risk difference of −0.014 (−0.027, −0.0019).

Simulation 2B
Simulation 2B contained four subgroups with HTEs as outlined in 'Methods': individuals taking aspirin with eGFR ≤ 72, individuals taking aspirin with eGFR > 72, individuals not taking aspirin with eGFR ≤ 72, and individuals not taking aspirin with eGFR > 72. In the subgroup analyses, mCART found four subgroups (see Fig. 2), whereas LR found 436, RF 3, and gradient RF 37 (all false positives). The four subgroups found by mCART approximately equaled the four true subgroups: individuals taking aspirin with eGFR ≤ 76.649 (versus 72), individuals taking aspirin with eGFR > 76.649 (versus 72), individuals not taking aspirin with eGFR ≤ 72.743 (versus 72), and individuals not taking aspirin with eGFR > 72.743 (versus 72). Figure 3 is a plot of the maximum ASD within subgroup versus the bias for each of the methods. Notably, all the subgroups discovered by mCART had maximum ASDs < 0.2 and bias never exceeding 0.016.

Simulation 2C
Simulation 2C had no ATE but the same four HTE subgroups as in simulation 2B: individuals taking aspirin with eGFR ≤ 72, individuals taking aspirin with eGFR > 72, individuals not taking aspirin with eGFR ≤ 72, and individuals not taking aspirin with eGFR > 72. Table 9 in the appendix displays the study characteristics at baseline. The trial has a null result as the treatment group (193 / 3000) is shown to have the same event rate as the control group (190 / 3000) with a risk difference of 0.001 (−0.012, 0.014).
In the subgroup analyses, mCART found four subgroups (Fig. 6 in the appendix), whereas LR found 442, RF 3, and gradient RF 46 (all false positives). The four subgroups found by mCART approximately equaled the four true subgroups: individuals taking aspirin with eGFR ≤ 72.057 (versus 72), individuals taking aspirin with eGFR > 72.057 (versus 72), individuals not taking aspirin with eGFR ≤ 72.796 (versus 72), and individuals not taking aspirin with eGFR > 72.796 (versus 72). Fig. 7 in the appendix is a plot of the maximum ASD within subgroup versus the bias for each of the methods. Notably, all the subgroups discovered by mCART had maximum ASDs < 0.2 and bias never exceeding 0.008.

Discussion
Precision medicine requires the detection of HTEs from randomized trial data to provide personalized effect estimates-that is, to determine if a particular patient is likely to experience benefits, no effects, or harms from therapy. This contextualizes the ATE in a trial for the individual patient.
Here, we found that while common standard regression and alternative machine-learning methods   Fig. 2 mCART results from simulation 2B. eGFR estimated glomerular filtration rate, mCART matching plus classification and regression trees can identify HTE subgroups, they may also yield imbalances between the study arms within identified subgroups, such that differences in outcomes are falsely attributed to differences in treatment effect, but are in fact due to imbalances in covariates. We strongly recommend that researchers report the balance between study arms in identified subgroups to reduce the risk of false HTE reporting. Guidelines already recommend that studies estimating causal effects have a detailed discussion of the covariate balance of the groups under discussion [23,24].
We also tested the method of matching followed by CART analysis and found it may reduce the imbalance in observable covariates and thereby prevent false HTE detection. The method yielded subgroups with a balance in observable characteristics, suggesting that differences in outcomes in identified subgroups were attributable to treatment or unobserved covariates. The method also produced an interpretable decision tree that may be more transparent to clinicians than alternative machine learning methods.
A limitation of our study is that we considered a handful of simple data-generating processes that we believe to be representative of trials hypothetically seen in clinical practice, with the intention of demonstrating a situation where an imbalance in subgroups can cause confounding. We do not know how often this will occur in practice, as this requires a further systematic review of the literature. Importantly, a second limitation is that real data could have unmeasured confounders that we cannot control for.
In future work, we hope to study how the choice of different matching algorithms impacts the performance of the mCART algorithm. Other avenues of future research include applying mCART to non-binary outcomes (e.g., continuous or survival outcomes), considering methods of inference for smaller sample sizes within identified subgroups (e.g., [25]), and further optimizing mCART for smaller trials (as it did not detect the two subgroups in the N = 200 oncology trial).

Conclusions
mCART could be an interpretable and rigorous tool for identifying HTE subgroups after the conclusion of a clinical trial, and may help identify subgroups balanced on potential prognostic baseline variables that also differ in treatment effects. Perhaps most importantly, mCART may help prevent the wasteful false discovery of HTE subgroups in secondary analyses of randomized trials.