### Multivariable risk-based ranking of trial individuals

In the following we consider RCT data of the form \(\{x_{i},y_{i},t_{i}\}_{i=1}^{N}\), where *x*_{i} is a vector of baseline patient covariates for the *i*th patient, and *y*_{i} is their observed outcome after receiving a randomised treatment allocation indicated by *t*_{i}. We are interested in pairwise comparisons between two treatment arms *T*_{0} and *T*_{1}.

We assume that it is possible to construct a priori a ‘risk quantile mapping’ *Q*:*X*→[0,1] for the outcome *y*_{i}. This function *Q* maps each subject in the trial to their corresponding empirical risk quantile, agnostic of treatment. Thus, *Q*(*x*_{i})=0 denotes the subject least at risk of the negative outcome, and *Q*(*x*_{j})=1 the subject most at risk. In practice this mapping could be derived by estimating a function *f*:*X*→*Y* using data from a different source (this could include observational data, as the risk is agnostic of the treatment received); computing *f*(*x*_{i}) for each subject; and then mapping *f*(*x*_{i}) onto the empirical risk distribution for the *N* subjects in the trial. For most conditions, there will exist either an already validated risk score or external data on which to build a risk-based ranking [1]. If this is not the case, it is also possible to build an internal risk model by ‘retrodiction’: fitting the function *f* to the trial data at hand. Simulation studies suggest that these internal models introduce little bias into the procedure [14]. We note that a risk quantile mapping can be based on almost any type of outcome data. For example, a proportional hazards (Cox) model fit to time-to-event data can produce a risk quantile mapping by using the estimated linear combination of predictors.

We can then use the risk quantile mapping to construct a reference class. The risk mapping removes the need for multiple testing of interactions between the randomised treatment and single baseline covariates. The use of risk-based reference classes for exploring HTEs has been advocated previously, but it was limited to quintile or quartile subgroups [3–5]. We now consider a general approach using the risk-ranked individuals to estimate ITEs and CATEs. The approach advocated here relies critically on the quality of the risk mapping: the better the quantile mapping (i.e. the better it is at discriminating between low- and high-risk individuals), the better it will be for visualising HTEs if they are present.

In the following, for simplicity we assume that the subject index *i* has subsequently been sorted according to the risk prediction, with *Q*(*x*_{1})=0 and *Q*(*x*_{N})=1. In general, and in the absence of ties, \(Q(x_{i}) = \frac {i-1}{N-1}\).

### Local smoothing estimation of a risk-based ITE using reference class forecasting

By ranking trial subjects from those ‘least at risk’ to those ‘greatest at risk’, it is possible to use a sliding window approach to estimate the ITE for each subject. For subject *i*, the set of risk-adjacent subjects determines the reference class used to predict the ITE for subject *i*. This generalises the concept of partitioning subjects into quintile or quartile subgroups [11].

Mathematically, the adjacency can be quantified using localised reweighting kernels. Local kernels target a specific individual focussed at their quantile of risk *q*_{i} for the *i*th subject by considering the treatment outcome of other individuals in a local neighbourhood of risk-adjacent individuals, with *q*’s close to *q*_{i}. These local reference classes are parameterised by their bandwidth (radius) *γ*∈[0,1], which defines the proportion of subjects in the window ‘close’ to subject *i*, which are used to estimate the ITE of the *i*th subject. This in turn characterises the effective sample size of the reference class forecasting method. The simplest local reference class forecasting weighting scheme uses the ‘boxcar’ function that gives equal weight to all subjects in the local window when estimating the ITE (see Fig. 1 for an example). For a window of width *γ*, we can estimate the ITE of treatment *T*_{1} versus *T*_{0} in subject *i* as:

$$ {}\text{ITE}(q_{i}) = C \left\{\sum_{k} w_{k} y_{k} \mathbbm{1}(t_{k}=T_{1}) - \sum_{k} w_{k} y_{k} \mathbbm{1}(t_{k}=T_{0}) \right\} $$

(1)

where \(C = (\sum _{k} w_{k})^{-1}\) and *w*_{k}=1 for subjects in the window around subject *i* and *w*_{k}=0 outside of the window, i.e.:

$$ w_{k} = \left\{\begin{array}{ll} 1, & \text{for}\quad i-\lfloor{\gamma/N}\rfloor \leq k \leq i + \lfloor{\gamma/N}\rfloor\\ 0, & \text{otherwise} \end{array}\right. $$

(2)

Equation (1) defines the ITE of subject *i* using outcome data from all subjects whose baseline risk quantile is inside an interval of width 2*γ* centred around the *i*th subject. For the subjects in the lowest and highest risk quantiles, for whom there are not ⌊*γ*/*N*⌋ risk-adjacent subjects on each side, we take the convention to define the ITE as that of the subjects ⌊*γ*/*N*⌋ and *N*−⌊*γ*/*N*⌋, respectively. This convention preserves symmetry at the risk ‘tails’ (highest and lowest risk individuals). We note that *γ* must be large enough such that for each subject *i* there is at least one risk-adjacent subject inside the risk quantile interval of width 2*γ* for both treatment arms (at least one received *T*_{0} and at least one received *T*_{1}). Otherwise, the treatment effect is not estimable.

The boxcar kernel is known to be problematic, as it varies in a non-smooth way as subjects enter into and leave the kernel, as illustrated in supplementary Figure S1. A better approach is to use a window that gradually down weights the influence of subjects in the estimate as subjects move away from the prediction point at *q*_{i}. A smoother, improved reference class forecasting method uses the Epanechnikov kernel, again defined on the risk quantiles of radius of width *γ* around the subject *i*, with the weight given as:

$$ {}w_{k} \,=\, \left\{\begin{array}{ll} \frac{3}{4} \left(1 \,-\, (\frac{|i-k|}{\lfloor{\gamma N}\rfloor})^{2}\right), & \text{for}\quad \!\!\!\!\!i-\!\lfloor{\gamma/N}\rfloor \!\leq k \leq i + \lfloor{\gamma/N}\rfloor\\ 0, & \text{otherwise} \end{array}\right. $$

(3)

We use the same convention at the edges of the risk distribution for the subjects *i*<⌊*γ*/*N*⌋ and *i*>*N*−⌊*γ*/*N*⌋. Under an Epanechnikov reweighting scheme, the weights slowly decay as a function of the distance from the *i*th datapoint. Both the boxcar and Epanechnikov kernels, centred around the 25% risk quantile, are illustrated in Fig. 1.

These ‘local’ reference class forecasting methods are symmetrical around the prediction for the subject *i*; i.e. they use an equal number of datapoints each side of *i*. However, they both can be adapted so that the bandwidth varies, exploiting the maximum possible information around the subject *i* and preserving symmetry. For example, at the median risk quantile, a varying bandwidth method would use all the data. We denote these maximal bandwidth local reference classes, and define the size of the window of information around the *i*th subject as min(*i*,*N*−*i*), where the parameter *γ* now specifies the minimal value that this window can take.

### Estimation of a risk-based CATE using reference class forecasting with exponential tilting

Local reweighting schemes provide a principled approach for determining an ITE for a given subject in the trial up to a certain accuracy, with a certain bias-variance trade-off (see the next section). A different goal is to estimate population ATEs but in populations with a different risk distribution to that of the original trial. Often external populations for which the intervention is intended may differ to those of the trial due to issues such as non-representative inclusion criteria, selection bias or geographical clustering. We denote the estimation of the expected effect under a different population as a conditional average treatment effect (CATE). One interesting, and identifiable, external population can be made by tilting the original sample set through reweighting the contribution from each subject. Exponential tilting of the population weights is one example. Under this scheme we can consider estimating the ATE in an external tilted population that contains more higher risk subjects (or more lower risk ones) as compared with the original trial. In this scheme the *i*th subject with baseline covariates *x* is attributed a weight proportional to *e*^{λQ(x)} in the estimate of the external ATE, where the free parameter *λ* determines the overall effective sample of the scheme and how far ‘tilted’ the weights are to the highest risk subjects (*λ*>0) or the lowest risk subjects (*λ*<0). The choice of *λ*=0 recovers the original ATE. The ratio of the relative weight *w*_{1} (the lowest risk subject) to the relative weight *w*_{N} (the highest risk subject) is thus *e*^{−λ}. This is akin to estimating the ATE in a population whose participants are recruited with probability *e*^{λQ(x)} relative to the original trial population.

A CATE could be directly targeted at a population of interest, for example, the set of all screened but excluded subjects (e.g. exclusion due to co-morbidities). One could directly construct the set of weights that target this external population, as the internal validity of the RCT may not apply to the excluded population. We can target the external population by selecting a set of weights such that the weighted distribution of risk in trial participants approximates as best possible the distribution of risk in the external population.

### Effective sample size and bias-variance trade-off

Reference class estimators using reweighting schemes—whether they are global or local—provide unbiased estimators of the targeted treatment effect in the ‘local’ or ‘tilted’ population, but have increased variance with respect to that of the ATE estimated from the original RCT. This is a consequence of the reduced effective sample size within the reference class. The effective sample size can be thought of as the number of subjects (each given weight 1) required to obtain the same accuracy of estimation as in the reweighted population. As a function of the weights *w*_{i}, the effective sample size is given by \((\sum _{i=1}^{N} w_{i})^{2} / \sum _{i=1}^{N} w_{i}^{2}\). The effective sample size is equal to *N* when all weights *w*_{i} are equal to 1 and is strictly less than *N* otherwise. The effective sample size is directly related to the power to detect HTEs using a reweighted reference class. The more distinct the class, the lower the effective sample size, and thus the lower the power to reject the null hypothesis for any given HTE size.

For local reference classes, the effective sample size decreases with decreasing bandwidth of the kernel. This relates to a bias-variance trade-off in estimating the ITE at a reference quantile. The more localised the kernel, the lower the bias to estimate the target ITE, but the greater the standard error of the estimate, which is a function of the square root of the effective sample size. For instance, a kernel that only includes *x*_{i} at a reference point has zero bias for the unique ITE but infinite variance of the estimate, as only one outcome is observed.

### Properties of reweighting schemes under no heterogeneity

It is interesting to note that, under an assumption of ‘no treatment effect heterogeneity’, any weighted average of the outcomes is an unbiased estimator of the ATE, albeit with increased variance. If we consider the event ‘no HTE’ as a null hypothesis, then, under this null, the reweighted reference class ITEs will be distributed around the ATE with a variance determined by the effective sample size. This provides for a formal testing framework able to reject this null hypothesis at a certain level of significance, *α*, should there be an HTE under an alternative hypothesis.