Impact of lack-of-benefit stopping rules on treatment effect estimates of two-arm multi-stage (TAMS) trials with time to event outcome

Background: In 2011, Royston et al. described technical details of a two-arm, multi-stage (TAMS) design. The design enables a trial to be stopped part-way through recruitment if the accumulating data suggests a lack of benefit of the experimental arm. Such interim decisions can be made using data on an available ‘intermediate’ outcome. At the conclusion of the trial, the definitive outcome is analyzed. Typical intermediate and definitive outcomes in cancer might be progression-free and overall survival, respectively. In TAMS designs, the stopping rule applied at the interim stage(s) affects the sampling distribution of the treatment effect estimator, potentially inducing bias that needs addressing.
Methods: We quantified the bias in the treatment effect estimator in TAMS trials according to the size of the treatment effect and for different designs. We also retrospectively ‘redesigned’ completed cancer trials as TAMS trials and used the bootstrap to quantify bias.
Results: In trials in which the experimental treatment is better than the control and which continue to their planned end, the bias in the estimate of treatment effect is small and of no practical importance. In trials stopped for lack of benefit at an interim stage, the treatment effect estimate is biased at the time of interim assessment. This bias is markedly reduced by further patient follow-up and reanalysis at the planned ‘end’ of the trial.
Conclusions: Provided that all patients in a TAMS trial are followed up to the planned end of the trial, the bias in the estimated treatment effect is of no practical importance. Bias correction is then unnecessary.


Background
The two-arm, multi-stage (TAMS) trial design described by Royston et al. [1] provides a framework for efficiently evaluating an experimental treatment regimen against a control group, by using an intermediate outcome to potentially cease the trial for lack of benefit at an early stage. Choosing appropriate and valid intermediate (I) and definitive (D) outcomes is key to the success of a TAMS trial, and Royston et al. [1] provide guidance on this choice. In this framework, we assume that both the intermediate and definitive outcomes are time-to-event outcomes. The basic assumptions are that I occurs no later than D, occurs more frequently than D and is on the causal pathway to D. If the null hypothesis is true for I, it must also hold for D. In the absence of an obvious choice for I, a rational choice of I might be D itself earlier in time. In this instance, of course, I does not occur more frequently than D. The TAMS design framework can be well suited to cancer trials. In the cancer context, typical intermediate and definitive outcomes might be progression-free survival (PFS) and overall survival (OS), respectively. Information on PFS is usually available sooner in a study, and in most cancer sites, the treatment effect on PFS is usually highly positively correlated with that on OS [1].
It is well known that stopping a trial early, for example in sequential and group sequential designs, may yield biased estimates of the treatment effect (Piantadosi [2], pp. 183, 387). By the 'treatment effect' we mean the difference on some suitable scale between the experimental and control arms; typically, for time-to-event data this will be the (log) hazard ratio between the survival distributions under proportional hazards (PH). When a trial is stopped early because accumulating evidence favors the alternative hypothesis, the maximum partial likelihood estimate (MPLE) of the treatment effect, in the context of the Cox PH model, is biased in the direction of the alternative hypothesis. The earlier a trial is stopped, the larger the potential bias [2]. Although the TAMS design framework can help (and is helping [3]) to expedite the discovery and evaluation of new and effective treatments, concerns have been raised about possible bias in the final treatment effect estimate induced by this approach, for example, hazard ratios ($HR_D$) on OS for trials with time-to-event outcomes.
In a TAMS trial, recruitment is halted at one of the interim stages if there is insufficient evidence in favor of the alternative hypothesis. Emerson [4] showed that applying any stopping rule affects the sampling distribution of the MPLE of the treatment effect (see Figure 3 in [4]) and consequently induces a potential bias. The distribution of test statistics and their P values are similarly affected by such rules. However, hypothesis testing is not the focus of the present paper as it has been already addressed in [1]. Bias is present in the estimated treatment effect whether or not a trial is stopped for lack of benefit. However, bias in treatment effect estimates in trials passing all interim lack-of-benefit assessments is more important than that in stopped trials, since such experimental treatments are much more likely to be considered worthy of further study or adoption into clinical practice.
Regardless of an interim decision on whether to stop or not, it is still important to estimate the treatment effect using all available data. Royston et al.'s [5] proposal to terminate recruitment of new patients if the experimental arm fails to show evidence of benefit, while at the same time continuing to follow up all recruited patients, was designed to make TAMS trials cost efficient but also to mitigate possible bias. However, the precise magnitude of the bias present in the final treatment effect estimate has not been rigorously explored. In this paper, we investigate the bias in the estimates of treatment effects resulting from a TAMS design. We also define the 'selection bias' in estimated hazard ratios and empirically quantify its likely magnitude in TAMS trials through simulation studies and bootstrap-based reanalyses of four completed cancer trials.
The structure of the paper is as follows. In the Methods section, we first outline how a TAMS trial is specified, noting the required design parameters and assumptions. Next, we discuss the 'selection' bias induced in TAMS trials by the use of lack-of-benefit stopping guidelines. For simplicity, we discuss this issue in a two-stage TAMS setting. We describe our simulation study intended to explore the magnitude of the bias. The simulation study is carried out in a three-stage TAMS setting. In this section, we also introduce four real trials and 'redesign' them as if they were TAMS trials. In the Results, we present simulation results. We also show the results of our bootstrap reanalyses of the example trials in an empirical assessment of bias at the definitive analysis of the treatment effect. This is followed by a discussion.

Specification of a TAMS design
In a TAMS trial, we compare one experimental arm, E, with a control arm, C. A TAMS design has $s \ge 2$ stages. The first $s - 1$ stages assess lack of benefit by comparing E with C on an intermediate outcome, I. The $s$th stage compares E with C for efficacy on the definitive outcome, D. Let $HR_I$ be the underlying hazard ratio comparing E with C on I, and $HR_D$ be the underlying hazard ratio comparing E with C on D.
We assume that proportional hazards hold between the treatment arms, and also that the times to event are exponentially distributed for both I and D outcomes, with control-arm hazard rates of $\lambda_I$ and $\lambda_D$, respectively.
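The null and alternative hypotheses for a TAMS design are:

$$H_0\ (\text{stage } i):\ HR_I = HR_I^0, \qquad H_1\ (\text{stage } i):\ HR_I = HR_I^1, \qquad i = 1, \ldots, s - 1,$$
$$H_0\ (\text{stage } s):\ HR_D = HR_D^0, \qquad H_1\ (\text{stage } s):\ HR_D = HR_D^1.$$

The primary null and alternative hypotheses, $H_0$ (stage $s$) and $H_1$ (stage $s$), concern $HR_D$, with the hypotheses on I playing a subsidiary role. However, we require design values for all the hypotheses. In practice, $HR_I^0$ and $HR_D^0$ are almost always taken as 1. In cancer trials, $HR_D^1 = 0.75$ is a common choice.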
Taking $HR_I^1 = HR_D^1$ is a conservative option; the design allows for the possibility that $HR_I^1 < HR_D^1$. For example, in cancer, if I is the earlier of progression or death and D is death, it may be realistic and efficient to take, say, $HR_D^1 = 0.75$ and $HR_I^1 = 0.7$. By definition, if E is better than C then $HR_I < HR_I^0$ and $HR_D < HR_D^0$. Let $\widehat{HR}_{I,i}$ ($i < s$) be the estimated hazard ratio comparing E with C on outcome I for all patients recruited up to and including stage $i$, and $\widehat{HR}_{D,s}$ be the estimated hazard ratio comparing E with C on D for all patients at stage $s$ (that is, at the time of the analysis of the definitive outcome). The design is specified as follows:
Applies to all stages:
1. Define the hazard rates $\lambda_I$ and $\lambda_D$, or equivalently, the median times to event.
2. Define hazard ratios $HR_I^0$, $HR_I^1$, $HR_D^0$ and $HR_D^1$. Usually, $HR_I^0 = HR_D^0 = 1$.
3. Define the allocation ratio, $A$, that is the number of patients allocated to E for every patient allocated to C. $A = 1$ represents equal allocation; with $A < 1$ relatively fewer patients are allocated to E, and with $A > 1$, relatively more patients are allocated to E.
For stages 1 to $s - 1$:
1. For stage $i$, define a one-sided significance level $\alpha_i$ and power $\omega_i$. The motivation for one-sided tests is that we are interested only in rejecting the null hypothesis in the direction of benefit of E over C, not harm. We also specify $r_i$, the expected total patient accrual rate per unit time.
2. From these inputs, the nstage software [6] reports $e_i$, the cumulative number of events to be observed in the control arm during stages 1 through $i$; $n_i$, the number of patients to be entered in the control arm during stage $i$; $A n_i$, the corresponding number of patients in the experimental arm; $t_i$, the approximate (calendar) time of the end of stage $i$ under the design assumptions; and a critical value, $\delta_i$, for rejecting $H_0$: $HR_I = HR_I^0$.
3. If the estimated hazard ratio on I at stage $i$ satisfies $\widehat{HR}_{I,i} \ge \delta_i$, the null hypothesis $HR_I = HR_I^0$ cannot be rejected at the $\alpha_i$ level, and the trial is stopped for lack of benefit of E over C. Otherwise, $\widehat{HR}_{I,i} < \delta_i$, suggesting some degree of benefit of E, and recruitment continues to the next stage.

Stage s:
The same principles apply to stage $s$ as to stages 1 to $s - 1$. Here, $e_s$ is the required number of control arm events for the D outcome, cumulative over all stages. We would typically recommend a one-sided significance level of $\alpha_s = 0.025$ at stage $s$, corresponding to a conventional two-sided 0.05 level.
If the treatment comparison survives all of the s − 1 tests at step 3 above, the trial proceeds to the final stage, otherwise recruitment is terminated early. Mathematical details of the sample size calculations are given in Section Methods of Royston et al. [1].
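To make the link between a stage's significance level, its event target and its critical hazard ratio concrete, the following Python sketch uses a standard normal approximation for the estimated log hazard ratio. It is not the nstage algorithm: the variance formula (1 + 1/A)/e for e control-arm events, the use of the same variance under both hypotheses, and the function names `control_events` and `critical_hr` are assumptions made for illustration, so the numbers will differ somewhat from those reported by nstage [6].

```python
from math import ceil, exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def control_events(alpha, power, hr0=1.0, hr1=0.75, alloc=1.0):
    """Approximate control-arm events for a one-sided test at level alpha.

    Assumes Var(log HR) ~ (1 + 1/alloc) / e under both hypotheses,
    where e is the number of control-arm events (an approximation).
    """
    z_sum = STD_NORMAL.inv_cdf(1 - alpha) + STD_NORMAL.inv_cdf(power)
    return ceil((1 + 1 / alloc) * (z_sum / (log(hr1) - log(hr0))) ** 2)

def critical_hr(alpha, events_control, hr0=1.0, alloc=1.0):
    """Critical hazard ratio delta: stop for lack of benefit if HR-hat >= delta."""
    sigma0 = sqrt((1 + 1 / alloc) / events_control)
    return exp(log(hr0) + sigma0 * STD_NORMAL.inv_cdf(alpha))

# Illustrative interim stage: one-sided alpha = 0.2, power = 0.95
e1 = control_events(alpha=0.2, power=0.95)
print(e1, round(critical_hr(0.2, e1), 3))
```

Under this approximation a stage with one-sided significance level 0.5 has a critical hazard ratio equal to the null value, and smaller significance levels move the threshold below 1 while requiring more events.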

Interim selection on a definitive outcome
Here we consider the bias induced in the estimated treatment effect in a two-stage TAMS design with I = D. A lack-of-benefit stopping rule is applied at the first (interim) stage. If the treatment comparison shows some evidence of benefit of the experimental arm, recruitment continues and the definitive analysis is performed at a later second stage. Otherwise, recruitment is terminated.
Let $\theta_D$ be the underlying log hazard ratio (log HR) comparing the experimental treatment with control. We define $\theta_D$ such that $\theta_D < 0$ denotes a true advantage of the experimental treatment over control. Let $\hat{\theta}_D$ be the MPLE of $\theta_D$. In the absence of stopping rules, $\hat{\theta}_D$ is asymptotically unbiased and approximately normally distributed in repeated realizations of the trial ([7], p. 40). No bias enters, and so:

$$E(\hat{\theta}_D) = \theta_D \qquad (1)$$

over repeated realizations of the trial. Now let $\hat{\theta}_D$ denote the estimated log HR for the data accumulated at the first stage (lack-of-benefit analysis). Recruitment stops if $\hat{\theta}_D \ge \log(\delta)$ and continues to the final stage if $\hat{\theta}_D < \log(\delta)$. The threshold $\delta$ is predefined according to a designated significance level and power. We have:

$$E(\hat{\theta}_D \mid \hat{\theta}_D < \log(\delta)) = \theta_D - B_1 \qquad (2)$$
$$E(\hat{\theta}_D \mid \hat{\theta}_D \ge \log(\delta)) = \theta_D + B_2 \qquad (3)$$

where $B_1, B_2 > 0$ are functions of $\theta_D$. $B_1$ and $B_2$ may be termed the selection bias [8] in $\hat{\theta}_D$ in the two scenarios. Expressions 2 and 3 state that under the PH assumption $\hat{\theta}_D$ is biased downwards by $B_1$ and upwards by $B_2$ in continuing and stopped trials, respectively. As an illustration, Figure 1 shows hypothetical sampling distributions (densities) of $\hat{\theta}_D$ at the first stage for treatments with $\theta_D$ negative, zero or positive. The vertical line denotes a typical lack-of-benefit threshold, $\log(\delta) < 0$. The probability of passing the lack-of-benefit threshold, that is $\Pr(\hat{\theta}_D < \log(\delta))$, is the area under the appropriate density to the left of $\log(\delta)$. Trials of the treatment for which $\theta_D < 0$ (long-dashed line) have the largest chance of passing, and those for which $\theta_D > 0$ (short-dashed line) have the largest chance of stopping.

Figure 1 Sampling distribution of $\hat{\theta}_D$, that is, estimated log hazard ratios, which are normally distributed, under different underlying effects, $\theta_D$. $\delta$ is the predefined threshold.
Selection bias $B_1$ among 'passed trials' is the largest for the treatment with $\theta_D > 0$ and smallest for that with $\theta_D < 0$. Conversely, selection bias $B_2$ among 'stopped trials' is the largest for the treatment with $\theta_D < 0$ (long-dashed line) and smallest for that with $\theta_D > 0$ (short-dashed line).
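If the first-stage estimate is treated as exactly normal, the two selection biases have closed forms given by the mean of a truncated normal distribution. The Python sketch below evaluates them for a beneficial, a null and a harmful treatment. The standard error of 0.2 and the threshold of 0.95 are arbitrary illustrative values, not taken from any design in this paper, and the function name `selection_bias` is hypothetical.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()

def selection_bias(theta, sigma, log_delta):
    """Return (P(pass), B1, B2) when theta_hat ~ N(theta, sigma^2).

    E[theta_hat | theta_hat <  log_delta] = theta - B1  (continuing trials)
    E[theta_hat | theta_hat >= log_delta] = theta + B2  (stopped trials)
    """
    z = (log_delta - theta) / sigma
    p_pass = STD_NORMAL.cdf(z)
    density = STD_NORMAL.pdf(z)
    b1 = sigma * density / p_pass
    b2 = sigma * density / (1.0 - p_pass)
    return p_pass, b1, b2

# Assumed illustrative values: sigma = 0.2, delta = 0.95
for true_hr in (0.75, 1.0, 1.25):
    p_pass, b1, b2 = selection_bias(log(true_hr), 0.2, log(0.95))
    print(f"HR={true_hr}: P(pass)={p_pass:.2f}  B1={b1:.3f}  B2={b2:.3f}")
```

In this calculation the bias among continuing trials grows, and the bias among stopped trials shrinks, as the underlying effect moves from benefit towards harm. In each case the larger bias belongs to the outcome that is less likely to occur.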

Interim selection on an intermediate outcome
We now consider the more complex scenario where we use a different outcome I at the interim stage. In Royston et al.'s [1] TAMS design it was proposed to cease/continue accrual according to the value of an intermediate outcome measure that is correlated with the definitive outcome measure. An example is selection on the basis of PFS log HRs but ultimately estimating the OS log HR. Now let $\hat{\theta}_D$ and $\hat{\theta}_I$ be treatment effect estimates on the D and I outcomes, respectively, at the interim stage. The selection bias in $\hat{\theta}_D$ given that $\hat{\theta}_I$ passed the predefined threshold $\log(\delta_I)$ could be expressed as:

$$E(\hat{\theta}_D \mid \hat{\theta}_I < \log(\delta_I)) = \theta_D - B_3$$

for some $B_3$ that depends on the underlying values $\theta_D$, $\theta_I$ and their correlation, $\rho_{\theta_I,\theta_D}$. To illustrate this we assume, as in Royston et al. [1], that $\hat{\theta}_I$ and $\hat{\theta}_D$ follow a bivariate normal distribution with correlation $\rho_{\theta_I,\theta_D}$. Figure 2 shows 1,000 log hazard ratio pairs, $(\hat{\theta}_I, \hat{\theta}_D)$, simulated from a bivariate normal distribution with mean (log(0.8), log(0.8)) and a correlation coefficient of 0.8. Dots represent values of $(\hat{\theta}_I, \hat{\theta}_D)$ in simulated trials in which $\hat{\theta}_I < \log(\delta_I)$. Because $\hat{\theta}_I$ and $\hat{\theta}_D$ are correlated, it is clear that the mean of $\hat{\theta}_D$ in trials that either continue ($\hat{\theta}_I < \log(\delta_I)$) or are stopped ($\hat{\theta}_I \ge \log(\delta_I)$) is biased with respect to $\theta_D$. In this example, the selection bias of the 'stopped' trials is larger than that of the 'continuing' trials, since the square is closer to the circle than the triangle is to the circle.
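The bivariate normal illustration can be reproduced approximately with a short Monte Carlo sketch in Python. The mean (log(0.8), log(0.8)), the correlation of 0.8 and the 1,000 pairs come from the text. The common standard error of 0.15 and the threshold of 0.9 on the hazard ratio scale are assumptions chosen for illustration, because the values used for Figure 2 are not stated.

```python
import random
from math import log, sqrt

def simulate_pairs(n, mean, sigma, rho, seed=1):
    """Draw n correlated (theta_I_hat, theta_D_hat) pairs from a bivariate normal."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        theta_i = mean + sigma * z1
        theta_d = mean + sigma * (rho * z1 + sqrt(1.0 - rho ** 2) * z2)
        pairs.append((theta_i, theta_d))
    return pairs

def mean_of(values):
    return sum(values) / len(values)

theta_true = log(0.8)
log_delta_i = log(0.9)   # assumed threshold, not taken from the paper
pairs = simulate_pairs(1000, theta_true, sigma=0.15, rho=0.8)

continuing = [d for i, d in pairs if i < log_delta_i]
stopped = [d for i, d in pairs if i >= log_delta_i]

print("true theta_D           :", round(theta_true, 3))
print("mean theta_D, continue :", round(mean_of(continuing), 3))
print("mean theta_D, stopped  :", round(mean_of(stopped), 3))
```

Selection acts on the I estimate only, but the correlation carries it over to the D estimate: the continuing trials average below the true value and the stopped trials above it.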

Simulation study
We conducted simulation studies to quantify the impact of various stopping rules on the estimates of $\theta_D$, that is $\log(HR_D)$, in three-stage TAMS designs with two interim analyses. We considered bias in two situations: (i) simulated trials with an underlying hazard ratio close to the null hypothesis, which are likely to be stopped at the first of the two intermediate stages due to apparent lack of efficacy; (ii) simulated trials with an underlying hazard ratio close to the alternative hypothesis, which are more likely to pass both intermediate stages to reach the final stage (analysis of the D outcome). To fix ideas, we took the D outcome as OS and the I outcome (used for selection) as either OS or PFS. We denote the OS hazard ratio and PFS hazard ratio as $HR_D$ and $HR_I$, respectively. When the I outcome was PFS, we generated correlated PFS and OS times to event according to the method of Royston et al. [1].
Design parameter values were based on the GOG182/ICON5 trial in advanced ovarian cancer [9]. We assumed the median time-to-event for OS and PFS outcomes to be 2 years and 1 year, respectively. (When I=D, we assumed the median time-to-event to be 1 year.) For sample size calculations, we chose the target hazard ratio to be 0.75 for efficacy and 1.0 for inefficacy on both outcome measures at all stages. Note that when I = D, TAMS designs allow the target hazard ratio(s) for efficacy at intermediate stages to be different (for example, more extreme) than the target hazard ratio at the final stage. Such designs would be even more efficient, but we adopted the conservative option of taking all target hazard ratios for efficacy to be the same across stages.
When generating simulated times to event, we applied each of four underlying hazard ratios: 1.1 and 1.0, to represent trials with an ineffective experimental treatment, and 0.88 and 0.75, to represent trials with an effective experimental treatment. The first two represent situation (i), whereas the latter two represent situation (ii) as explained above. In our simulations, 5,000 trials were replicated in each experimental condition. For trials which stopped at stage 1, we computed the mean of estimated OS log hazard ratios, that is $\log(\widehat{HR}_D)$, at that stage. For trials that reach the final stage, $\log(\widehat{HR}_D)$ is computed at that stage. In all scenarios, we report the results on the hazard ratio scale. To provide an estimate of spread, we also present the 2.5th and 97.5th centiles of the estimated OS hazard ratios. Aside from hazard ratios, we also report the absolute value (size) of percentage bias, which is defined as:

$$\text{percentage bias} = 100 \times \frac{\left| \overline{HR}_D - HR_D \right|}{HR_D}$$

where $\overline{HR}_D$ is the average estimated OS hazard ratio and $HR_D$ is the underlying (true) value.
Data were simulated with staggered patient entry at a uniform accrual rate of 250 individuals per year. Equal numbers of patients were allocated to control and experimental arms in all stages. We also carried out similar simulations with target hazard ratios for efficacy of 0.85 instead of 0.75, requiring larger numbers of I and D events and generally longer timelines.
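As a rough illustration of how such a simulation can be organized, the Python sketch below runs a much simplified two-stage version with I = D. It is not the simulation code used for this paper. It replaces the Cox MPLE with the exponential-model maximum likelihood estimate of the log hazard ratio (events divided by follow-up time in each arm), and the stage 1 target of 60 control-arm events, the threshold of 1.0, the 3-year recruitment window and the planned end at 4 years are all assumed values. The accrual rate of 250 per year, equal allocation and the 1-year median are taken from the text.

```python
import random
from math import exp, log

LAMBDA_C = log(2) / 1.0   # control-arm hazard for a 1-year median (I = D case)

def make_trial(true_hr, rng, accrual=250.0, recruit_years=3.0):
    """Patients as (entry_time, arm, survival_time); arm 1 = experimental."""
    patients = []
    for k in range(int(accrual * recruit_years)):
        arm = k % 2                                   # equal allocation
        rate = LAMBDA_C * (true_hr if arm else 1.0)
        patients.append((k / accrual, arm, rng.expovariate(rate)))
    return patients

def log_hr(patients, cutoff, entered_before=None):
    """Exponential-model MLE of the log HR, censoring at calendar time `cutoff`."""
    limit = cutoff if entered_before is None else entered_before
    events, exposure = [0, 0], [0.0, 0.0]
    for entry, arm, surv in patients:
        if entry >= limit:
            continue                                  # not yet recruited
        follow = cutoff - entry
        events[arm] += surv <= follow
        exposure[arm] += min(surv, follow)
    return log((events[1] / exposure[1]) / (events[0] / exposure[0]))

def stage1_time(patients, e1):
    """Calendar time at which the e1-th control-arm event occurs."""
    times = sorted(entry + surv for entry, arm, surv in patients if arm == 0)
    return times[e1 - 1]

def simulate(true_hr, n_trials=500, e1=60, delta1=1.0, planned_end=4.0, seed=1):
    """Return (proportion stopped, mean HR at stopping, mean HR after follow-up)."""
    rng = random.Random(seed)
    at_stop, at_end = [], []
    for _ in range(n_trials):
        trial = make_trial(true_hr, rng)
        t1 = stage1_time(trial, e1)
        est1 = log_hr(trial, t1)
        if est1 >= log(delta1):                       # stopped for lack of benefit
            at_stop.append(est1)
            at_end.append(log_hr(trial, planned_end, entered_before=t1))
    mean = lambda xs: sum(xs) / len(xs)
    return len(at_stop) / n_trials, exp(mean(at_stop)), exp(mean(at_end))

p_stop, hr_stop, hr_end = simulate(true_hr=1.0)
print(f"stopped: {p_stop:.0%}  mean HR at stage 1: {hr_stop:.2f}  after follow-up: {hr_end:.2f}")
```

Means are taken on the log scale and then exponentiated, mirroring the reporting convention described above. Only the stopped trials are summarized here; handling the trials that continue would need a second stage with its own event target.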

Bootstrap reanalysis

Trials used as examples
To evaluate selection bias in the estimated treatment effects, we also 'reanalyzed' the data from four MRC-coordinated cancer trials as though the trials were run as two-stage TAMS designs (that is, with one interim analysis). The selected trials comprise two in advanced renal cancer (RE01 [10], RE04 [11]) and two in advanced ovarian cancer (ICON3 [12], ICON4 [13]). All except for RE04 were also reanalyzed from a methodological perspective by Barthel et al. [14]. ICON3 and RE04 were 'unsuccessful' in that no conventionally statistically significant treatment effect was found. ICON4 and RE01 were conventionally 'successful' and demonstrated clear evidence of improvement in survival due to the experimental therapy. Some details of the trial results are given in Table 2. Figure 3 shows the Kaplan-Meier plots of OS in the trials, truncated at 5 years. There is a suspicion in Figure 3 that the survival curves of the two treatment groups may cross in the RE04 trial, suggesting possible nonproportional hazards. However, this was not confirmed by Grambsch-Therneau tests [15].

Design
We 'redesigned' all four example trials as two-stage TAMS designs. Design parameters are given in Table 3 and are based on values in the original trial protocols. We used the nstage program [6] to compute the required number of control arm events for the I outcome at stage 1 (interim analysis for lack of efficacy) and the D outcome at stage 2 (final analysis of the definitive outcome). We took the D outcome to be OS, and the I outcome to be OS or PFS in separate analyses. We studied five one-sided significance levels α 1 = (0.1, 0.2, 0.3, 0.4, 0.5) at stage 1, providing progressively earlier looks at the accumulating data. Stage 2 one-sided significance level was α 2 = 0.025 in all the scenarios.
We 'entered' patients one by one in the same order as they had presented in the original trial. Stage 1 analysis was conducted when the target number of I events had accrued (see Tables 4 and 5). Patients who had not entered by the time of the stage 1 analysis were excluded from the interim analysis. Trials were either 'stopped' for lack of efficacy at stage 1 or continued recruitment to the final analysis at stage 2. Similar to our simulation studies, the number of replicates was 5,000 in our bootstrap analysis of example trials. In each replicate, the two types of selection bias (in stopped 'unsuccessful' trials, and in 'successful' trials) were investigated exactly as in the simulation study. Means of OS log hazard ratios at stage 1 and at the planned end of the trials were calculated separately. In all scenarios, we report the results on the hazard ratio scale.
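The resampling step is not spelled out in detail, so the following Python sketch shows only one plausible way to organize such a bootstrap reanalysis. Its assumptions are that patients are resampled with replacement and keep their original entry times, that the interim analysis takes place at a fixed calendar time rather than at a target number of events, and that the exponential-model estimate of the log hazard ratio stands in for the Cox MPLE. The records are synthetic stand-ins generated inside the block, not data from any of the four trials, and all function names are hypothetical.

```python
import random
from math import exp, log

def log_hr(records, cutoff):
    """Exponential-model MLE of the log HR, censoring at calendar time `cutoff`.

    Each record is (entry_time, arm, survival_time); arm 1 = experimental.
    """
    events, exposure = [0, 0], [0.0, 0.0]
    for entry, arm, surv in records:
        if entry >= cutoff:
            continue
        follow = cutoff - entry
        events[arm] += surv <= follow
        exposure[arm] += min(surv, follow)
    return log((events[1] / exposure[1]) / (events[0] / exposure[0]))

def bootstrap_stopped(records, t1, t_end, delta1=1.0, n_boot=500, seed=1):
    """Log HRs at the interim look and after follow-up, for replicates that stop."""
    rng = random.Random(seed)
    n = len(records)
    interim, final = [], []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]
        est1 = log_hr(sample, t1)
        if est1 >= log(delta1):                      # replicate 'stopped' at stage 1
            recruited = [r for r in sample if r[0] < t1]
            interim.append(est1)
            final.append(log_hr(recruited, t_end))   # same patients, longer follow-up
    return interim, final

def summarize(log_hrs):
    """Mean and 2.5th/97.5th centiles, reported on the hazard ratio scale."""
    xs = sorted(log_hrs)
    lo = xs[int(0.025 * (len(xs) - 1))]
    hi = xs[int(0.975 * (len(xs) - 1))]
    return exp(sum(xs) / len(xs)), exp(lo), exp(hi)

# Synthetic stand-in data (NOT real trial data): 600 patients over 3 years, no true effect
data_rng = random.Random(7)
toy = [(k / 200.0, k % 2, data_rng.expovariate(log(2))) for k in range(600)]

interim, final = bootstrap_stopped(toy, t1=1.5, t_end=5.0)
if interim:
    print("stopped replicates:", len(interim))
    print("HR at interim     :", tuple(round(v, 2) for v in summarize(interim)))
    print("HR after follow-up:", tuple(round(v, 2) for v in summarize(final)))
```

Comparing the two summaries for the same set of stopped replicates is what allows the effect of continued follow-up on the selection bias to be read off directly.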

Simulation results
The simulation results are summarized in Tables 4 and 5. Table 4 gives the results for the trials that stop at stage 1. The percentage of simulated trials in which the estimated log hazard ratio exceeds the stage 1 threshold log(δ 1 ) is given. This is identified by '%Stop at stage 1' in Table 4. According to the TAMS design, recruitment to such trials is ceased at the interim stage. Table 4 presents the average treatment effect on the D outcome, that is HR D , together with the 2.5th and 97.5th centiles for the trials that stopped at stage 1. We also followed up the individuals in the same stopped trials to the original planned end and computed the estimates of OS hazard ratios then. Table 4 also shows the average of treatment effect on the D outcome at the end of the follow-up period in those trials that stopped at stage 1 for lack of efficacy.
The average treatment effect for trials stopped at stage 1 is biased in all experimental conditions. This bias increases as the underlying hazard ratio moves from 1.1 to 0.75. However, the smaller the underlying hazard ratio, the less likely a trial is to stop at stage 1 (see %Stop in Table 4). The bias is smaller in Design 2 because the lower significance level in stage 1 increases the required number of events and makes the data more mature at this point than in Design 1.
The results in Table 4 indicate that the true (underlying) hazard ratio is overestimated at stage 1 in all scenarios. A key finding is that when follow-up of patients in stopped trials is continued to the planned end of the final stage, the bias is much reduced. For instance, in Design 1 when the target hazard ratio $HR_D^1 = 0.75$ with an underlying $HR_D$ of 1.1 (the first row in the left panel of Table 4), the percentage bias in the average treatment effect for the trials which stopped at stage 1 is 8%, that is 100 × (1.19 − 1.10)/1.10. This decreases to 4% after follow-up to the planned end of the trial. In all cases, after follow-up of patients to the original planned end of the trial, the bias is generally minimal (mostly less than 6%) if the underlying effect is in the direction of the null hypothesis. In the dropped trials, the bias is slightly smaller when the intermediate outcome, that is PFS, is used for selection at the interim stages.
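The worked figure above can be checked with a few lines of Python. The function name `percent_bias` is hypothetical; the inputs 1.19 and 1.10 are the values quoted from Table 4.

```python
def percent_bias(mean_hr, true_hr):
    """Size of the percentage bias of an average hazard ratio estimate."""
    return 100 * abs(mean_hr - true_hr) / true_hr

# Design 1, underlying HR of 1.1, trials stopped at stage 1
print(round(percent_bias(1.19, 1.10), 1))   # about 8.2, reported as 8%
```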
We also calculated the average of treatment effect on the D outcome for trials that stopped at either of the two interim stages (data not shown). The selection bias is very similar to the corresponding values in Table 4 when the I outcome is PFS, and the bias becomes smaller when the I outcome is OS. Table 5 presents the average treatment effect on the D outcome at the final stage for the trials that pass both interim stages. The bias at the final stage is generally smaller when the target hazard ratio $HR_D^1 = 0.85$, compared with the corresponding values when the target hazard ratio $HR_D^1 = 0.75$. However, the main result from this table is that in the trials that reach the final stage, the selection bias in the average treatment effect is very small provided that the underlying effect is closer to the alternative hypothesis. There is some bias in the average treatment effect when the underlying effect is closer to the null hypothesis, but in such scenarios the chance that the research arm is dropped at the interim stages is large (see %Pass in Table 5).

Bootstrap results
The results of the bootstrap reanalyses of the 'unsuccessful' trials (ICON3 and RE04) are summarized in Tables 6 and 7. Table 6 shows the average treatment effect on the D outcome for the trials that stopped at stage 1 together with their corresponding 2.5th and 97.5th centiles. For the left side of the table OS was used at the interim stage to select trials. On the right PFS was used at the interim stage to select trials. The number of replications was 5,000 in all experimental conditions. The results in Table 6 indicate that bias is present in the trials that did not pass stage 1. For example, the original OS hazard ratio in ICON3 is 0.97 (95% bootstrap CI: 0.87-1.09). For this trial, 1907 (38%) of the 5,000 replicated trials stopped at stage 1 when I = D and $\alpha_1 = 0.50$ (the first line in the left panel of Table 6). The average treatment effect for the stopped trials is 1.12. However, after the follow-up, the average treatment effect reduces to 0.96 and the selection bias nearly disappears. In general, the bias decreases with decreasing $\alpha_1$ and $\delta_1$. This is due, as before, to the increasing amount of information (that is, patient events) that is required at the first interim for these design parameters. In all example trials, the bias is very small after the follow-up if the stage 1 significance level $\alpha_1$ is chosen to be smaller than 0.40.
Furthermore, for the scenarios presented in Table 6, we also computed the average treatment effect on the D outcome at the final stage for the trials which passed the interim stage. The results, presented in Table 7, show that the bias in the average treatment effect in those trials is very small in most scenarios. Results for RE04 show some bias in some scenarios, but the chance of passing the interim stage is relatively small in those conditions (see %Pass).
The results of the bootstrap reanalyses of the 'successful' trials (ICON4 and RE01) are summarized in Tables 8 and 9. Table 8 shows the results for the trials that reach the final stage. For ICON4, 99% of trials reached the final stage when $\alpha_1 = 0.5$ and the I outcome was OS. In contrast to the unsuccessful trials, the results for the successful trials show that there is almost no bias in the estimated hazard ratio on OS at the final stage.
In ICON4 and RE01, we also computed the average treatment effect for the stopped trials at the interim stage. The results in Table 9 reaffirm that follow-up decreases the amount of bias in most scenarios. It should be noted that unlike our simulation studies where (under the proportional hazards assumption) the treatment effect is assumed to be constant over time, in real trials the effect may not be constant over time. With real trial data we will not know whether the underlying process that created it satisfies the PH assumption or not. Even if the underlying data generating model did satisfy the PH assumption, it is still possible for a single realization of this process (that is, one trial's worth of data) to empirically depart from PH. In fact, as Figure 4 demonstrates, the estimate of treatment effect in ICON4 fluctuates (in some parts markedly) early in the course of the trial. The final overall estimate for $HR_D$ is 0.82 (red dashed line). However, the mean bootstrapped estimate, that is the mean of OS hazard ratios for all 5,000 replicated trials, takes the values 0.83, 0.74, 0.70, 0.73 and 0.76 when $\alpha_1$ is 0.5, 0.4, 0.3, 0.2 and 0.1, respectively. The corresponding time points of interim analysis for these $\alpha_1$ values are 2.4, 2.9, 3.4, 4.1 and 4.9 years after the start of the trial, respectively. This is the reason for the (relatively large) bias in the average effect of stopped trials in some scenarios presented in Table 9. However, it can be argued that a larger bias in these situations is not so important since we are not claiming that the experimental treatment is effective.

Discussion
In this paper, we have assessed the validity of the estimates of treatment effects resulting from a TAMS design, with a specific focus on bias. By defining the 'selection bias' in selected and dropped treatments, we have quantified its likely magnitude via a simulation study and bootstrap reanalysis of existing trials. Our results highlight that the amount of selection bias is generally small and its degree depends on the design parameters and the unknown true (underlying) effect values. In the TAMS design, the bias generally tends to be larger when selecting 'early', that is, when the decision is based on a relatively small number of events. The results also show that, as pointed out by Royston et al. [5], under some assumptions bias in treatment effects on the definitive outcome can be markedly reduced by following all patients up to the planned end of the trial and performing analyses then, irrespective of whether recruitment was stopped early for lack of benefit. (Follow-up can also help in capturing the relevant information on safety endpoints.) Of course, it can be argued that, by definition, no claim that the experimental treatment is better than the control is made for arms that have stopped early, so the fact that the treatment effect is biased is less important.
In our analyses, by choosing different significance levels for the first interim stage we also explored the timing of the first interim stage analysis at which the bias will be small. Our investigations suggest that the bias will be minimal if the first interim stage is placed at a significance level of 0.3 or less. As a trade-off between the amount of bias and efficiency, we suggest that the first interim stage be defined by a significance level between 0.2 and 0.3. This suggestion accords with the recommendations made by Barthel et al. [14]. However, this is only a practical recommendation with respect to bias and does not reflect an optimal design which could be obtained from a simulation study or theoretical calculation. Furthermore, we have shown that the bias in the treatment effect would become negligible if the TAMS trials were powered at small effect sizes while investigating treatments with truly large effect sizes. This, however, would in practice increase the number of events required and so the cost and duration of the trial. Finally, our simulation results showed that using an intermediate outcome measure reduces the selection bias in the estimates of treatment effects in both selected and dropped arms, provided that the chosen intermediate outcome measure satisfies the conditions set out by Royston et al. [1].
However, we emphasize that the selection bias in the estimate of treatment effect of trials that reach the final stage is a more important consideration than that in stopped trials. An effective experimental arm is very likely to reach the final stage of a TAMS trial, and the results of such trials are more likely to be adopted into clinical practice. Our empirical studies showed that the size of selection bias for the trials that reach the final stage is generally small. In fact, the bias is negligible if the experimental arm is truly effective.
For a dropped treatment arm, the estimate of the treatment effect is generally on the extremes of its sampling distribution (see Figure 2 and also Figure 2 in [8]). The estimate, as suggested by Goodman [16] and Freidlin and Korn [17], is generally on a random high (or low, depending on the direction of efficacy). Freidlin and Korn [17] argued that one should take this into consideration, and compare the average effect in the dropped arm with the average effect of a 'similar' fixed sample size trial, which is on the random high (see [16,17] for their definition of 'similar'). Their proposed fixed sample size comparator is hypothetical and quite complicated. In our simulation studies, we also compared the average effect in the dropped arm of a TAMS design with their proposed comparator (results not shown). Our findings showed that after the follow-up the average effect in the dropped arm is almost identical to their proposed comparator. Freidlin and Korn [17] concluded that in trials with a well-designed interim-monitoring plan, the selection bias is negligible if one compares the average effect in the dropped arms to their fixed sample size comparator. Therefore, our conclusions about the TAMS designs, although in a slightly different context, agree in principle with the findings of Freidlin and Korn [17].
Several unbiased estimators of the treatment effect have been proposed to correct for bias inherent in two-stage designs of the TAMS type, although they were originally developed in a different context for trials with continuous, conditionally normal outcome variables. The formulae of Cohen and Sackrowitz [18] and Bowden and Glimm [19] can be applied to the definitive endpoint at the end of a two-stage trial when the definitive endpoint has been used to decide on continuing/dropping the research arm at the interim analysis. Sill and Sampson [20] extended Cohen and Sackrowitz's unbiased (UMVCUE) estimator to the case where the interim decision is based on an intermediate outcome. We chose not to include a thorough comparison of these bias-adjusted estimators in our paper for several reasons. First and foremost, we are dealing with (censored) time-to-event data and Sill and Sampson's [20] formulae do not naturally extend to such a case. Second, in our situation the bias in the standard treatment effect estimates at the end of the trial was shown to be small. Third, the aforementioned formulae are presently only available for two-stage trials and are inapplicable to TAMS designs with more than two stages. This is a topic for further research. Finally, even if an unbiased estimator were available, it might not be preferred to the slightly biased standard (ML) estimator because its mean square error is likely to be larger [20,21].

Conclusions
Our empirical studies show that the estimated treatment effect on the definitive outcome has a small bias at the time of ceasing recruitment to an arm. However, if follow-up is continued to the planned end of the trial, even this small bias decreases markedly. Our results also suggest that in trials with a truly efficacious experimental arm that continue to the planned end, the bias is very small and of no practical importance.