The MAMS design
Suppose K experimental arms are to be compared to a common control over a maximum of J stages. In the first J−1 stages, experimental arms are compared to the control on an intermediate outcome, I, the requirements of which have been described previously [2]. Experimental arms that pass all J−1 interim analyses are then compared to the control on D at the end of stage J. It is also possible for the I and D outcomes to be the same. For example, a phase II trial is unlikely to consider the efficacy of a treatment on a long-term endpoint that would normally form the D outcome in a phase III study (e.g. overall survival) but will instead focus only on a single short-term endpoint throughout the study, which could be an indicator of long-term efficacy (e.g. failure-free survival).
Denote by \(\theta_{jk}\) the underlying effect of experimental arm k relative to control on the outcome in stage j (\(j=1,\dots,J\); \(k=1,\dots,K\)). Without loss of generality, assume that a negative value of \(\theta_{jk}\) indicates a beneficial effect for arm k. Note that a MAMS design currently requires the same null and alternative hypotheses to be used for all arms in the trial, thus allowing each arm to be assessed simultaneously against the control at each interim analysis [3]. Therefore, the null (\(H^{0}_{jk}\)) and alternative (\(H_{jk}^{1}\)) hypotheses for \(\theta_{jk}\) can be written
$$ \begin{aligned} H_{jk}^{0}&: \theta_{jk} \geq {\theta_{j}^{0}}, \\ H_{jk}^{1}&: \theta_{jk} < {\theta_{j}^{0}}, \end{aligned} \qquad j=1,\dots,J; k=1,\dots,K $$
for some pre-specified null effects \({\theta _{j}^{0}}\). If I≠D then \({\theta _{j}^{0}}\) is the assumed null value for the effect on D and will be denoted by \({\theta _{D}^{0}}\). Likewise, the null effect for the interim stages (j<J) will be denoted by \({\theta _{I}^{0}}\). If I=D then \({\theta _{j}^{0}}={\theta _{D}^{0}}\) for all j. In practice, \({\theta _{I}^{0}}\) and \({\theta _{D}^{0}}\) are commonly taken to be 0 to represent no difference [e.g. for log hazard ratios (HRs)]. We will also apply similar notation to the underlying treatment effects for each experimental arm: when I=D, \(\theta_{jk}=\theta_{Dk}\) for all j, while in I≠D designs, \(\theta_{jk}=\theta_{Ik}\) for all \(j=1,\dots,J-1\) and \(\theta_{jk}=\theta_{Dk}\) for j=J. When K=1, we will drop the subscript k.
The current procedure for designing a MAMS trial is as follows [2]:
-
Choose the number of experimental arms, K, and stages, J, in the trial.
-
Define the null values \({\theta _{D}^{0}}\) and, if applicable, \({\theta _{I}^{0}}\) for the effects on the D and I outcomes, respectively, and specify any corresponding nuisance parameters (e.g. control event rates for binary outcomes, variances for continuous outcomes etc.).
-
Choose the allocation ratio A, that is the number of patients to allocate to each experimental arm for every patient allocated to the control. A=1 represents equal allocation while A<1 means that fewer patients will be allocated to each experimental arm than the control.
-
For each stage, choose the one-sided significance level, \(\alpha_{j}\), and power, \(\omega_{j}\), for all pairwise comparisons in that stage (\(j=1,\dots,J\)). Rough guidelines for choosing \(\alpha_{j}\) and \(\omega_{j}\) are described in [2].
-
Choose the minimum target differences \({\theta _{I}^{1}}\) and \({\theta _{D}^{1}}\) that one would like to detect on the I and D outcomes, respectively.
-
Calculate the required sample size (or number of events for time-to-event outcomes), the timing of each interim analysis, and the overall type I error rate (see below) and power. Dedicated software is available in Stata for designing MAMS trials with time-to-event outcomes (nstage) [9, 10] and binary outcomes (nstagebin).
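As a rough illustration of the inputs listed above (not the nstage syntax, and with purely hypothetical values rather than a worked design), the design parameters might be collected as follows:

    # Illustrative MAMS design inputs (hypothetical values, not nstage syntax)
    design <- list(
      K      = 2,                    # number of experimental arms
      J      = 2,                    # maximum number of stages
      A      = 1,                    # allocation ratio (experimental : control)
      theta0 = c(I = 0, D = 0),      # null log HRs on the I and D outcomes
      theta1 = c(I = log(0.75), D = log(0.75)),  # target log HRs to detect
      alpha  = c(0.5, 0.025),        # one-sided stagewise significance levels
      omega  = c(0.95, 0.90)         # stagewise powers
    )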
The analysis at the end of each stage occurs when the required sample size in the control arm has completed follow-up or, for time-to-event outcomes, when the required number of events has been observed in the control arm. At each interim analysis (end of stages \(1,\dots,J-1\)), recruitment is stopped to all experimental arms with observed treatment effects on I that are statistically non-significant at level \(\alpha_{j}\), while recruitment to other arms continues into the next stage of the study. Experimental arms that reach the end of the final stage of the study are compared to the control on D at level \(\alpha_{J}\) and recruitment to the trial is terminated.
Pairwise type I error rate
The PWER is the probability of wrongly rejecting the null hypothesis for D, \({H_{D}^{0}}\), for a particular experimental arm. Since \({H_{D}^{0}}\) can only be rejected at the end of the final stage of a study, a type I error may only be made at that point (note that this MAMS design can be easily amended to accommodate stopping rules for extreme efficacy on D, which will have a negligible impact on the PWER [6]). Furthermore, a type I error cannot be made on the I outcome since this is not the primary outcome of the study. For a MAMS trial with J stages, Royston et al. [2] state that the PWER is given by
$$ \alpha = \Phi_{J}(z_{\alpha_{1}},\dots,z_{\alpha_{J}};R), $$
(1)
where \(\Phi_{J}\) is the J-dimensional normal distribution function with correlation matrix R. The (j,j′)th entry of R is the correlation between the treatment effects in stages j and j′ under the null hypotheses of I and D. Calculation of these correlations is described in [2] for time-to-event outcomes, in [3] for binary outcomes and in [11] for a single normally distributed outcome. The overall pairwise power is calculated in a similar manner, replacing the stagewise significance levels (\(\alpha_{j}\)) in Eq. 1 with the corresponding stagewise powers (\(\omega_{j}\)).
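As a minimal sketch of Eq. 1 (not the nstage implementation), the PWER can be evaluated with the mvtnorm package in R once the stagewise significance levels and the between-stage correlation matrix R are available; the correlation value used below is an assumption for illustration, not one derived as in [2, 3, 11].

    library(mvtnorm)

    # PWER of Eq. 1: Phi_J(z_{alpha_1}, ..., z_{alpha_J}; R), where 'alphas' holds
    # the one-sided stagewise significance levels and 'R' is the between-stage
    # correlation matrix of the treatment effect estimates.
    pwer <- function(alphas, R) {
      pmvnorm(lower = rep(-Inf, length(alphas)),
              upper = qnorm(alphas),   # lower-tail critical values z_{alpha_j}
              corr  = R)[1]
    }

    # Hypothetical two-stage example with an assumed between-stage correlation of 0.4
    R2 <- matrix(c(1, 0.4, 0.4, 1), nrow = 2)
    pwer(alphas = c(0.5, 0.025), R = R2)

Replacing alphas with the stagewise powers \(\omega_{j}\) gives the overall pairwise power in the same way.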
Influence of an underlying effect on I on the PWER
When I≠D, the calculation of α described in [2] is made under the assumption that \({H_{D}^{0}}\) and the null hypothesis for I, \({H_{I}^{0}}\), are true. However, in practice it is possible for an experimental arm to have a beneficial effect on I and yet remain ineffective on D. Rejecting \({H_{D}^{0}}\) at the end of the study would still constitute a type I error, yet the experimental arm will have a higher chance of reaching that point due to its effectiveness on I (i.e. it is more likely to pass the interim stages). Consequently, the PWER for such an arm will be higher than the value calculated in Eq. 1.
If the experimental arm is sufficiently effective on I that it would always pass all interim analyses, then the first J−1 stages effectively become redundant. Under such a scenario, the PWER for the experimental arm would be maximised and equal to the final-stage significance level, \(\alpha_{J}\). To illustrate this, Fig. 1 shows the PWERs of two 2-stage I≠D trial designs with time-to-event outcomes in which the underlying log HR \(\theta_{I}\) varies and \(\theta_{D}=0\) (i.e. the underlying HR on D is 1). The first-stage significance levels are \(\alpha_{1}=0.5\) in design (a) and \(\alpha_{1}=0.2\) in design (b). In both designs, the final-stage significance level is \(\alpha_{2}=0.025\), an equal allocation ratio is used (A=1) and \({\theta _{I}^{0}}=0\). Using Eq. 1 to estimate the PWER under the assumption that the experimental arm is ineffective on I gives α=0.0201 for design (a) and α=0.0165 for (b). To calculate type I error rates for other underlying log HRs on I, we simulated trial-level data under each design scenario using the procedure described in [9].
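A highly simplified Monte Carlo sketch of this behaviour (not the trial-level simulation procedure of [9]) draws correlated stagewise z-statistics, shifts the stage-1 mean by an assumed noncentrality for the effect on I, and counts how often the arm passes the interim analysis and is then significant on D; both the noncentrality delta_I and the between-stage correlation rho are assumed inputs here.

    library(mvtnorm)

    # Monte Carlo PWER for a two-stage I != D design when the arm is ineffective
    # on D (theta_D = theta_D^0) but may be effective on I.
    # delta_I: assumed noncentrality (expected z-value) of the stage-1 test on I,
    #          with negative values indicating benefit; rho: assumed between-stage correlation.
    sim_pwer <- function(alpha1, alpha2, delta_I, rho, nsim = 1e5) {
      Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
      z <- rmvnorm(nsim, mean = c(delta_I, 0), sigma = Sigma)
      # Type I error: pass stage 1 on I (significant at alpha1) and then
      # be significant on D at alpha2 despite no true effect on D.
      mean(z[, 1] < qnorm(alpha1) & z[, 2] < qnorm(alpha2))
    }

    # Design (a): as delta_I becomes more negative (stronger benefit on I),
    # the estimated PWER rises towards alpha2 = 0.025.
    sapply(c(0, -0.5, -1, -2), function(d) sim_pwer(0.5, 0.025, d, rho = 0.4))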
As expected, when \(\theta_{I}=0\) (i.e. when \(\theta _{I}={\theta _{I}^{0}}\)), the PWER for both designs is equal to the corresponding value of α (Fig. 1). As the effectiveness of the experimental arm on I increases (i.e. as \(\theta_{I}\) decreases), the PWER eventually plateaus at a level equal to the final-stage significance level (\(\alpha_{2}=0.025\)), with this value being practically reached even for modest effects on I. The increase in the type I error rate is greater for design (b) and this will generally be the case when the difference between α and \(\alpha_{J}\) is larger. This occurs when using more stages or smaller significance levels in the intermediate stages.
Controlling the PWER
Although it is highly unlikely that a treatment arm would have such an effect on I and D that the maximum PWER is achieved (particularly if I is appropriately chosen), Fig. 1 shows that the inflation in the PWER above the value calculated in Eq. 1 is large even for arms with modest effects on I. To help guard against this possibility, one could choose an I outcome that has high sensitivity for D, since then if there is no effect on D it will be highly likely that there is also no effect on I. However, this will not guarantee strong control of the PWER. Therefore, if strong PWER control is required, we recommend setting \(\alpha_{J}\) equal to the desired maximum value, \(\alpha^{*}\), when designing a MAMS trial to ensure that the PWER cannot exceed this value under any circumstance.
When the maximum type I error rate in I≠D designs is controlled using \(\alpha_{J}\), the stopping boundaries for the interim analyses can be considered non-binding. In other words, recruitment to an experimental arm does not strictly have to be stopped at the jth interim analysis if its observed treatment effect is statistically non-significant at level \(\alpha_{j}\). This flexibility is advantageous as it may not be desirable to drop arms that are performing no better than the control on I if they are showing promising effects on some other important outcome measures. Recruitment to such arms can, therefore, be continued to the next stage without inflating the maximum PWER, although the number of patients recruited will be higher than if the stopping guidelines were strictly followed.
When I=D, the PWER depends only on the underlying effect on a single outcome (D) and so it can be accurately estimated using Eq. 1. In contrast to the I≠D case, all stagewise significance levels contribute to this maximum value and so stopping boundaries must be binding (i.e. strictly adhered to) to avoid inflating α. If this is likely to be impractical for the reasons above, then the maximum PWER can instead be controlled in a similar manner to the I≠D case by setting \(\alpha_{J}=\alpha^{*}\) to allow stopping boundaries to be non-binding. Note, however, that this will come at the expense of an increase in the sample size for the final stage of the study due to the use of a smaller significance level in that stage.
Familywise error rate
When evaluating more than one experimental arm in a single study, the probability of at least one false-positive result, the FWER, will be higher than the PWER [12]. In many multi-arm settings, it may, therefore, be more desirable to control the type I error rate for the trial as a whole at some conventional level rather than for each individual treatment comparison.
In a MAMS design, the FWER can be calculated using a generalisation of a simulation procedure proposed by Wason and Jaki [11] for MAMS trials with a single outcome and equally spaced interim analyses. The procedure works by simulating the joint distribution of the z-test statistics for each arm at each stage of the study, accounting for the between-arm and between-stage correlations of the treatment effects. For MAMS designs with I=D, the maximum FWER occurs under the global null hypothesis (i.e. when \({H_{D}^{0}}\) is true for all experimental arms) [13, 14]. When I≠D, the FWER is maximised when all experimental arms are sufficiently effective on I that they would always pass all interim analyses but are all ineffective on D, i.e. when \(\theta _{Ik}=-\infty \) and \(\theta _{Dk}={\theta _{D}^{0}}\) for all k [9]. In this case, the interim stages effectively become redundant and the design reduces to a one-stage trial with the PWER equal to the final-stage significance level, \(\alpha_{J}\) (i.e. the maximum PWER). The maximum FWER can, therefore, be computed more quickly using the Dunnett probability [15]:
$$ \text{FWER} = 1-\Phi_{K}(z_{1-\alpha_{J}},\dots,z_{1-\alpha_{J}};C), $$
(2)
where C is the K×K between-arm correlation matrix with off-diagonal entries equal to A/(A+1).
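As a small sketch of Eq. 2 (not the nstage implementation), the maximum FWER can be evaluated with mvtnorm::pmvnorm, building C from the allocation ratio:

    library(mvtnorm)

    # Maximum FWER of Eq. 2: 1 - Phi_K(z_{1-alpha_J}, ..., z_{1-alpha_J}; C),
    # where C has off-diagonal entries A/(A+1) induced by the shared control arm.
    max_fwer <- function(K, alpha_J, A = 1) {
      C <- matrix(A / (A + 1), nrow = K, ncol = K)
      diag(C) <- 1
      1 - pmvnorm(lower = rep(-Inf, K),
                  upper = rep(qnorm(1 - alpha_J), K),
                  corr  = C)[1]
    }

    max_fwer(K = 2, alpha_J = 0.025)  # approx. 0.045, the value quoted below for designs (a) and (b)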
Influence of the underlying effects on I on the FWER
To illustrate how quickly the maximum value of the FWER is reached as the true treatment effects on I vary, we calculated the FWER for designs (a) and (b) described in the previous section when two experimental arms are compared to the control. In both two-stage designs, we assumed \(\theta _{Dk}={\theta _{D}^{0}}\) (i.e. \(\theta_{Dk}=0\), k=1,2) while the underlying effects on I in one or both experimental arms varied. For both designs, the maximum FWER (calculated in nstage using Eq. 2) is 0.045. Note that this maximum value is the same for both designs as they have identical numbers of experimental arms (K=2), allocation ratios and final-stage significance levels. Assuming the null hypothesis on I holds for both arms (i.e. the log HRs on I are 0), then the FWER is estimated using nstage to be 0.0372 and 0.0305 for designs (a) and (b), respectively. In this case, the FWER is lower for design (b) as it uses a lower significance level in the first stage.
To calculate the FWER when the underlying effects on I in one or both experimental arms vary, we used the simulation procedure described in [9]. The results presented in Fig. 2 show that when both experimental arms are even modestly effective on I (e.g. HR=0.8), the maximum FWER is practically reached. The rate of inflation in the FWER as the underlying effects on I increase is again greater for design (b), as was the case for the PWER. When only one experimental arm is effective on I, the FWER is still substantially higher than under the global null hypothesis on I, although the inflation is only about half of that seen when both arms are effective on I.
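The following is a rough sketch of such a calculation for two arms and two stages, not the exact procedure of [9] or [11]: the four z-statistics (two arms by two stages) are drawn from a multivariate normal whose correlation is assumed to factor into a between-stage part (rho, an assumed input) and the between-arm part A/(A+1); the stage-1 means are shifted by assumed noncentralities for the effects on I, and the proportion of replicates with at least one false-positive result on D is reported.

    library(mvtnorm)

    # Monte Carlo FWER sketch for a 2-arm, 2-stage I != D design in which both
    # arms are ineffective on D but may be effective on I.
    # delta_I: length-2 vector of assumed stage-1 noncentralities (effects on I);
    # rho: assumed between-stage correlation; between-arm correlation is A/(A+1).
    sim_fwer <- function(alpha1, alpha2, delta_I, rho, A = 1, nsim = 1e5) {
      R_stage <- matrix(c(1, rho, rho, 1), 2, 2)
      R_arm   <- matrix(c(1, A / (A + 1), A / (A + 1), 1), 2, 2)
      Sigma   <- kronecker(R_arm, R_stage)       # assumed product correlation structure
      mu      <- c(delta_I[1], 0, delta_I[2], 0) # order: arm 1 stages 1-2, arm 2 stages 1-2
      z       <- rmvnorm(nsim, mean = mu, sigma = Sigma)
      reject  <- (z[, 1] < qnorm(alpha1) & z[, 2] < qnorm(alpha2)) |
                 (z[, 3] < qnorm(alpha1) & z[, 4] < qnorm(alpha2))
      mean(reject)                               # P(at least one false positive on D)
    }

    # Design (a) with both arms modestly effective on I (illustrative noncentralities)
    sim_fwer(alpha1 = 0.5, alpha2 = 0.025, delta_I = c(-1, -1), rho = 0.4)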
Controlling the FWER
When I≠D, the FWER as well as the PWER can be controlled in the strong sense using the final-stage significance level alone. To find the value of \(\alpha_{J}\) corresponding to the desired FWER, a search procedure over \(\alpha_{J}\) can be used. For example, to find the required value of \(\alpha_{J}\) that controls the maximum FWER at the one-sided 2.5 % level in designs (a) and (b), we used nstage iteratively to calculate the maximum FWER of the designs using values of \(\alpha_{J}\) between 0.0125 and 0.025 (the minimum and maximum possible values of \(\alpha_{J}\) that can correspond to the maximum FWER) in increments of 0.0001. The final-stage significance level that most closely corresponded to a FWER of 0.025 without exceeding it was 0.0135. Alternatively, the qmvnorm function in R can also be used to compute the required values of \(\alpha_{J}\).
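As a minimal sketch of that alternative (assuming the mvtnorm package), Eq. 2 can be inverted directly for \(\alpha_{J}\) with qmvnorm:

    library(mvtnorm)

    # Final-stage significance level controlling the maximum FWER at 'fwer'
    # for K experimental arms and allocation ratio A, obtained by inverting Eq. 2.
    alphaJ_for_fwer <- function(K, fwer, A = 1) {
      C <- matrix(A / (A + 1), nrow = K, ncol = K)
      diag(C) <- 1
      # Equicoordinate quantile q such that Phi_K(q, ..., q; C) = 1 - fwer
      q <- qmvnorm(1 - fwer, corr = C, tail = "lower.tail")$quantile
      1 - pnorm(q)
    }

    alphaJ_for_fwer(K = 2, fwer = 0.025)  # approx. 0.0135, matching the grid search above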
When I=D, it is more difficult to find designs that control the FWER since a search procedure over all stagewise significance levels is required. Since I=D designs are also likely to be used in practice, a method for controlling the FWER in the I=D case is needed and is an area of ongoing research. However, if researchers wish to have the flexibility of non-binding stopping guidelines, then the maximum FWER can be controlled in the same manner as for an I≠D design and so the methods described above can be applied.