How do multi-stage, multi-arm trials compare to the traditional two-arm parallel group design – a reanalysis of 4 trials

Background To speed up the evaluation of new therapies, the multi-arm, multi-stage trial design was suggested previously by the authors. Methods In this paper, we evaluate the performance of the two-stage, multi-arm design using four cancer trials conducted at the MRC CTU. The performance of the design at fictitious interim analyses is assessed using a conditional bootstrap approach. Results Two main aims are addressed: the error rate of correctly carrying on/stopping the trial at an interim analysis as well as quantifying the gains in terms of resources by employing this design. Furthermore, we make suggestions for the best timing of this interim analysis. Conclusion Multi-arm, multi-stage trials are an effective way of speeding up the therapy evaluation process. The design performs well in terms of the type I and II error rates.


Background
Multi-arm, multi-stage (MAMS) trials employing an intermediate outcome in the early stages of a multi-stage trial with multiple research arms have been proposed as means of speeding up the evaluation of new therapies [1]. The design itself is based on discontinuing randomisation to 'poor' treatments at an early stage, and allowing through to the further stages only those treatments which show a predefined degree of advantage against the control treatment. In the early stages, the experimental arms are compared pairwise with the control according to an intermediate outcome measure. The advantage of using intermediate endpoints at interim analyses in general has been examined by Goldman et al. in simulation studies [2]. Treatment arms that survive this comparison then enter a further stage of patient accrual which ultimately culminates in pairwise comparisons against the control based on the primary endpoint. Thus, the trial is only stopped for lack of benefit, not evidence of a benefit as in many other designs [3,4]. A major issue for these designs is to assess their operating characteristics and in particular to preserve pairwise power at each stage and for the trial overall, whereby power in the second stage is conditional upon a treatment jumping the 'hurdle' at the first stage analysis.
The aim of this study was to determine whether trials which had been conducted as standard parallel group trials would have benefitted from being conducted as a multi-stage trial using the MAMS methodology. To achieve this, data from four different trials was used. There were two main objectives to this study: Firstly to assess whether using the MAMS methodology, trials would have been correctly carried on/stopped at interim analyses and secondly to quantify gains made in terms of use of resources by employing the MAMS methodology. Hence, we were concerned about the error rate at the first stage given a particular outcome at the final stage analysis. An error is defined in this context as either stopping early a trial where the final analysis would have shown an effect or carrying on a trial at the interim analysis when the final analysis would have shown no evidence of an effect. To accomplish this, a series of analyses of the intermediate outcome measure were mimicked for each of the trials using both the original trial data and bootstrapped samples from the data [5].

Trials
Four trials comparing treatments in different cancer sites were included in the analysis. Two of these had a positive outcome in favour of the new therapy, one was 'negative', showing no evidence of a difference between research and control arm, and one was a multi-arm trial with 'negative' and 'positive' results.
The first trial, RE01, showed considerable evidence of a treatment difference in terms of overall survival [6]. In this trial patients with metastatic renal cancer were randomised equally to subcutaneous alpha-interferon (experimental) or oral medroxyprogesterone acetate (control). A comparison of overall survival in the two groups showed a 28% reduction in the mortality rate in the alpha-interferon group (Hazard ratio = 0.72, 95% Confidence Interval 0.55-0.94). This trial used a triangular sequential design which allowed for early stopping as soon as results were conclusive.
Two trials in ovarian cancer were also re-analysed. In ICON3 [7], patients were randomised 1:2 to a research arm of paclitaxel plus carboplatin against a control arm of single agent carboplatin or CAP (cyclophosphamide, doxorubicin, cisplatin). The trial showed no evidence of an improvement of the experimental against the control arm (HR = 0.98, 95% Confidence interval 0.86 -1.10). The second ovarian cancer trial, ICON4 [8] was designed to assess whether giving paclitaxel with platinum-based chemotherapy would benefit women with relapsed 'platinum-sensitive' ovarian cancer more than conventional platinum-based chemotherapy. The trial showed a marked improvement of the experimental over the control treatment (HR = 0.78, 95% Confidence interval 0.65 -0.95).
Finally, the multi-arm trial FOCUS [9] in poor prognosis advanced colorectal cancer was also re-analysed. In this trial patients were randomised 2:1:1:1:1 to five treatment plans A, B, C, D and E. Regimen A (control) comprised single-agent fluorouracil until clinical disease progression, then single-agent irinotecan. B and C were fluorouracil then, respectively, fluorouracil/irinotecan or fluorouracil/oxaliplatin. D and E were fluorouracil/irinotecan or fluorouracil/oxaliplatin from the outset. Only the comparison of experimental arm C with control was found to be superior using the outcome measure of overall survival. All comparison results are given in table 1.

Design of multi-stage re-analysis
We re-designed all four trials as though they had been run in a two-stage design. To do this, parameters given in table 2 and based on the original trial protocols were used to calculate the necessary number of control arm events e I for the first stage using the n-stage program for Stata [10]. This program is also available from the authors upon request. For trials ICON3, ICON4 and RE01, these calculations were based on using progression free survival (PFS) as the intermediate outcome. In FOCUS, however, it was decided that using PFS as an intermediate outcome was not appropriate since the randomisation was to a package of treatment both in the first line setting and at progression. Hence, this trial re-analysis employs overall survival as the outcome at both Stages 1 and 2. While the targetted hazard ratio for the intermediate and final outcome (for the alternative hypothesis) was specified as the same value in most trials, in ICON4, the trial protocol had specified a slightly lower hazard ratio for the final outcome. This value was also used for this analysis. The lower final stage significance level used for FOCUS reflects an adjustment that was made for multiple comparisons due to the number of arms considered in the trial.
An analysis on the intermediate endpoint was conducted at the point where the target number of events for the interim analysis had been 'accrued' in real time in the control group. Patients who had not 'entered' the trial at this time point were excluded from the analysis. At this analysis a hazard ratio for progression free survival was calculated and compared with a critical value for the hazard ratio determined at the design stage. This critical hazard The number of control arm events targetted in this calculation remains the same under both the null and alternative hypothesis. Thus, this critical value gives the lower bound of a 1 -% confidence interval around the hazard ratio under H 0 of no effect [11] (p.79). This means that the one-sided 1 -% confidence interval around will just include H 0 at a power of %. Hence if an observed hazard ratio is greater than this critical value, we can reliable exclude an effect larger than under the nullhypothesis at level . Please see the Additional File 1 for a worked example. Therefore, if the hazard ratio was smaller than this critical value a trial was counted as continuing to randomise further patients and if it was larger, the trial was counted as 'stopped' in the sense that no further randomisation would be conducted to that particular experimental arm. A hazard ratio at the end of the trial using overall survival was also calculated to obtain correlation values between the test statistics on the intermediate outcome and the primary outcome. All hazard ratios were calculated using the Cox regression model with treatment the only covariate.
To calculate an error percentage estimate, 5000 trial datasets were created based on each original trial by taking bootstrap samples with replacement from the trial dataset [5] (p.82). If this scheme were employed without any further adjustments, we would obtain a number of trials in which the overall result does not match the original trial result in terms of significance at the two-sided 5% level. However, the question we wanted to answer was whether, given a final result showing evidence of an effect of the experimental treatment, this trial would have been correctly continued at the interim analysis using the intermediate outcome measure. Hence it was decided that we would discard bootstrap samples in which the treatment effect was non-significant at the 5% level. The same applied to negative trials whereby we would discard if the bootstrapped treatment effect was significant at this level. Therefore, we employed a bootstrap sampling mechanism which was conditional on the result of the final analysis.
On examining the mean of the resulting hazard ratios, we find that across the 5000 datasets, this is very close to the original trial result.  To obtain the graphical displays, patient accrual into the trial was mimicked as described above and in addition, a hazard ratio was calculated each time a new event was observed.
All analyses were conducted in Stata version 9.2.

Results of re-analysis using trial data only
In the first instance, all trials were re-analysed as described in the methods to consider what would have happened hypothetically if they had been designed as two-stage trials. evidence of a difference. The log hazard ratio is below the critical value initially, then moves above it for a short period of time after which it again moves below it. Only very much later on in the trial does it return to above the critical value again. The situation in FOCUS is similarly ambiguous in the first stages of the trial. In all comparisons the observed estimate of the hazard ratio is very close to the critical value at all times. This suggests that at best there is a very small effect of the experimental treatment over control, a long way away from the targetted hazard ratio. Tables 5 to 8 give results for a re-analysis of all trials in 2 stages with the first two tables employing PFS as the intermediate outcome. For comparisons between the control and treatment arms for ICON3, the percentage error relates to those trials where the treatment arm would have been carried on. This is the case because the final trial result showed no evidence of a difference. This error is perhaps of not such great concern as we are continuing arms which have no clear effect at the end of trial and thus it can be argued that this is a 'conservative' error. For the comparison in ICON4 for example though, percentage error refers to the number of times when the treatment arm would have been stopped at the stage 1 analysis even though the final result was positive. This is an error of greater concern to us since we are stopping a trial which would have shown a benefit.

Results of re-analysis using bootstrapped data in 2 stages
Our analyses show that if a treatment is successful on the final outcome at the end of the trial, it has a very good chance of 'jumping' the hurdle at the intermediate stages.
ICON4 and RE01 are both trials with a strong positive outcome at the end of the trial and for these two trials, the maximum error lies below 5% using PFS as the intermediate outcome and just above 6% using overall survival as  figure 1. The results for ICON3 (table 6) also illustrate that since the hazard ratio is not constant over time, the trend in the error rate is also non-monotonic and counter-intuitive. In this trial, as illustrated in figure 2, a very small effect size can be observed which is close to but not at H 0 and is also near the critical value. Therefore, the bootstrap gives a number of realisations which lie either side of the critical value. Tables 7 and 8 give the results for a bootstrap analysis using the primary outcome for all stages of the trial. Interestingly, results for the error rate are not improved (and sometimes worse) when using this outcome. This suggests that the error rate is not inflated by using a different outcome at the intermediate analysis as compared to the final analysis of the trial.
The correlation displayed is the correlation between the test statistic at stage 1 and the end of the trial, i.e. the correlation between the log hazard ratio for the intermediate outcome at the time of that analysis and the log hazard ratio for overall survival at the end of the trial. In the FOCUS reanalysis, the outcome used at both stages is the final outcome. As the results illustrate, the correlation is surprisingly low for early analyses. The reason is that at a high significance level, the number of control arm events e I which are available for analysis is small and there is a longer time interval between that analysis and the final analysis. Furthermore, the analysis is based on a smaller subset of patients.
Tables 6 and 8 additionally give information on the mean time saved if a trial was stopped at an earlier stage for those trials with a negative overall outcome. Mean time saved was calculated by subtracting the mean time required for a stage 1 analysis in the bootstrap runs from the estimated time required for a standard parallel group trial under the design assumptions. In this case, it was assumed that had the trial only been conducted with a final analysis on overall survival, this would have been conducted at a significance level of 5% two-sided and 90% power. If an unsuccessful treatment is rejected at an early stage, savings in trial time of 1.5 -2.3 years can be made.
Simultaneously considering the savings possible in total trial time if a trial is correctly stopped at an earlier stage and the error rate estimated for trials with a final analysis showing evidence of a difference, we can draw some conclusions on the best placing of the stage 1 analysis. This trade off between correctly and incorrectly stopping the trial early suggests that ideally, the stage 1 analysis should be placed at a significance level of 0.2 or 0.3. At this point, the error rate in both the RE01 and ICON4 re-analyses is negligible. At the same time, a saving in total trial time of on average 1 and 2 years could have been made in FOCUS and ICON3 respectively. However, more extensive studies would need to be conducted to obtain the optimal timing of the stage 1 analysis.
As the results in table 7 illustrate, FOCUS comparison A vs C would have been stopped early in some cases even though the overall trial result at the end of the trial could be viewed as being at the 'margins' of statistical significance. To circumvent this problem, it is anticipated that in a MAMS trial design, patients in those arms which were dropped at an earlier stage will still be followed up and the data analysed at a later stage.

Conclusion
Currently, according to FDA estimates [12], about 90% of agents entering Phase I do not succeed at Phase III. This is set against advances in our knowledge about pathways connected to cancer development and metastases and hence the availability of more agents for testing in clinical trials. After passing Phase I/II, the success rate of cancer Phase III trials is still only about 50%. However, if we for example employ a multi-arm, multi-stage design with four experimental regimens and one control, the probability of having at least one successful agent at the end of the trial increases to 87%. This calculation assumes independence of all experimental arms. Therefore, this type of trial design uses resources more efficiently.
In this design we propose to target the event rate in the control arm rather than the number of events in all arms combined. There are two reasons for this approach. Firstly, an event rate different to that anticipated for the trial overall could either arise due to a different underlying event rate in all arms or due to a hazard ratio different to that targeted initially. This level of ambiguity is removed by using the control arm event rate as the deciding factor for when to conduct the analysis. Secondly, when more than one experimental arm is recruited to, it is unlikely that we shall observe the same hazard ratio in all comparisons, giving different total numbers of events for each comparison. However, the calculation for the overall number of events assumes the same event rate in all comparisons in the experimental arms.
We have demonstrated that using the MAMS designs, significant savings can be made in terms of trial time if a treatment does not prove to be effective over the course of the trial. In this case, the trial re-analyses have demonstrated that around 50% of all bootstrapped trials would have been rejected at Stage 1, regardless of when this intermediate stage was conducted. Greater savings can thus be made if a MAMS trial is designed with three or more stages as the probability of early rejection increases.
Trials with good evidence of an effect on overall survival jump all of the intermediate hurdles with very high probability. For ICON4, this was found to be as high as 100%. However, these methods do come at a cost. As the re-analysis of FOCUS comparison A/C has shown, trials or treatment arms with very small treatment effects do risk being discontinued earlier on during the trial. To alleviate this problem in employing this design at the MRC CTU we follow patients up further on discontinued arms while randomisation to these arms is stopped. Thus, these arms will also be analysed on the primary outcome at a later date albeit with reduced power in some instances, depending on the maturity of the data, so at least an estimate of the effect size can be obtained.
Our re-analyses not only demonstrate the efficiency of the MAMS design in general, but also explore the best timing of the first stage analysis. The results suggest that the first stage is ideally placed at a significance level between 0.2 and 0.3 when the trade-off between the errors for correctly and incorrectly stopping is considered. This was the point at which the error rate in the RE01 re-analysis for example became negligible, both for PFS and overall survival as the intermediate endpoint. However, this is a practical recommendation only and does not reflect an optimal design as could be obtained from a simulation study.
A further observation in this study was the behaviour of the correlation between the test statistics on the interme-diate and primary outcome. If the first stage is conducted early on with a very small number of control arm events, this correlation coefficient is generally low, around 0.3. It increases over time and reaches values in the range of 0.5 to 0.7 for a very late first stage analysis.
This type of analysis would ideally be done as a simulation study, taking account of all possible trial scenarios. However, we believe that the much simpler conditional bootstrap approach that we employed is adequate. In this analysis, we needed to be able to distinguish between the error of stopping a trial early conditional on the final result showing good evidence of an effect, and the error of continuing a trial conditional on the final result showing no evidence of an effect. Hence we used a conditional bootstrap rather than the standard version since our interest was in the error conditional on a particular final outcome.
In this design no adjustment for multiple testing is made to the type I error at the interim stages. There are a number of reasons for this: i) early stopping for efficacy where the issue of type I error is perhaps more acute, is not incorporated in the design, ii) the significance level at each stage has a screening role only and is set close to 0.5 to ensure an early first stage look and iii) the overall significance level is bounded above by the significance level chosen for the final stage. If desired, this final stage significance level may be adjusted according to the number of experimental arms considered. In fact, as the methods we present are for stopping arms for lack of benefit it may be appropriate to adjust the type II error for multiple comparisons at each stage. However, in our experience this issue is typically not considered.
When analysing the trial at the intermediate stages, power may be increased by including covariates. Since the early stages will contain few patients, the trial population across the arms is more likely to be unbalanced in terms of potentially confounding covariates such as age. Including these known influential covariates in the analysis may increase the robustness of the results.  For details see legend of table 5