Bayesian adaptive designs for multi-arm trials: an orthopaedic case study

Background Bayesian adaptive designs can be more efficient than traditional methods for multi-arm randomised controlled trials. The aim of this work was to demonstrate how Bayesian adaptive designs can be constructed for multi-arm phase III clinical trials and assess potential benefits that these designs offer. Methods We constructed several alternative Bayesian adaptive designs for the Collaborative Ankle Support Trial (CAST), which was a randomised controlled trial that compared four treatments for severe ankle sprain. These designs incorporated response adaptive randomisation (RAR), arm dropping, and early stopping for efficacy or futility. We studied the operating characteristics of the Bayesian designs via simulation. We then virtually re-executed the trial by implementing the Bayesian adaptive designs using patient data sampled from the CAST study to demonstrate the practical applicability of the designs. Results We constructed five Bayesian adaptive designs, each of which had high power and recruited fewer patients on average than the original designs target sample size. The virtual executions showed that most of the Bayesian designs would have led to trials that declared superiority of one of the interventions over the control. Bayesian adaptive designs with RAR or arm dropping were more likely to allocate patients to better performing arms at each interim analysis. Similar estimates and conclusions were obtained from the Bayesian adaptive designs as from the original trial. Conclusions Using CAST as an example, this case study shows how Bayesian adaptive designs can be constructed for phase III multi-arm trials using clinically relevant decision criteria. These designs demonstrated that they can potentially generate earlier results and allocate more patients to better performing arms. We recommend the wider use of Bayesian adaptive approaches in phase III clinical trials. Trial registration CAST study registration ISRCTN, ISRCTN37807450. Retrospectively registered on 25 April 2003.


Background
The traditional phase III trial design generally involves randomising patients to one of two arms, often with equal probability of allocation and using fixed sample sizes. The sample size is calculated using frequentist methods, which involve assuming a particular treatment effect and type I error rate to achieve a particular level of power. Phase III trials generally require large sample sizes, have long duration, and many are declared "unsuccessful" due to a perceived lack of difference between treatment arms [1]. For decades, statisticians have been developing more efficient methods for designing clinical trials, yet the majority of trials continue to use traditional methods.
Adaptive trial designs have the potential to allow trials to answer their questions more efficiently, particularly for multi-arm trials, by enabling design components to be altered based on analyses of accumulated data. Adaptive designs have been encouraged by regulatory bodies (e.g. [2]) and a Consolidated Standards of Reporting Trials (CONSORT) extension for adaptive designs is being developed [3]. All possible decisions and adaptations must be specified before the trial commences, as well as the decision criteria. Potential adaptations in multi-arm trials include: stopping early for high probability of efficacy or futility; arm dropping; and altering the randomisation probabilities between arms, known as outcome or response adaptive randomisation (RAR).
RAR methods are increasingly being proposed as an alternative to equal randomisation (ER) for comparative trials since they allow the treatment allocation probabilities to be updated at each interim analysis based on the accrued outcome data. For instance, the probability of being assigned to an arm could increase when the accumulated outcome data suggest that the treatment arm is superior, and thus maximises the number of patients receiving the better treatment. Advocates of RAR consider it to be more ethical than ER since it can allow more patients to be treated with superior treatments [4][5][6] whilst providing information about treatment efficacy. However, the use of RAR in phase III trials is controversial, particularly for two-arm trials where it may be inefficient [7,8].
Arm dropping may be performed in multi-arm trials to remove an arm that does not appear to be effective (e.g. [9]). There is no globally optimal method for patient allocation in multi-arm trials and the choice of method depends on the aims and setting of the trial, as some allocation methods may be more practical than others. It is also advantageous to have planned interim analyses so that if the treatment effect is large and there is a high probability of claiming superiority, or conversely, if the treatment effect is very small or nonexistent, then the trial can be stopped early.
Adaptive designs have often been constructed and applied in phase III trials using frequentist approaches (e.g. [10,11]). Further advantages to trial design and analysis can be gained by using Bayesian methods. The Bayesian approach allows previous information on the treatment effect or response to be incorporated into the design via the prior distribution. The prior distribution is updated as data are observed in the trial to become a posterior distribution. The posterior distribution provides probabilistic statements about the values of various measures of interest, such as the treatment effect, adverse event rates, or arm with the maximum response. For instance, one could obtain from the posterior distribution the probability that the relative risk is less than 1. The prior and posterior distributions also account for uncertainty in the unknown values of the measures of interest. Bayesian approaches may be used for fixed or adaptive designs. The posterior distribution may be updated at any time to incorporate current information and can be used to drive the decisions at the interim analyses, in what we refer to as a "Bayesian adaptive design".
Bayesian adaptive designs have often been used in early-phase trials, but there are few published phase III trials that have used a Bayesian adaptive approach from the design phase (e.g. [12][13][14]). In this work we will explore how Bayesian adaptive designs could be constructed for an emergency medicine (orthopaedic) multiarm trial and examine the potential benefits that these designs may offer.

Case study
The Collaborative Ankle Support Trial (CAST; [15][16][17]) was a phase III pragmatic, individually randomised controlled trial (RCT) that compared the effectiveness of three types of mechanical ankle support with tubular bandage (control) for patients with severe ankle sprains. The three interventions were the Aircast® ankle brace, the Bledsoe® boot, and a below-knee cast. Patients above 16 years of age with an acute severe ankle sprain who were unable to bear weight, but had no fracture, were recruited from eight emergency departments in England. The primary outcome was the quality of ankle function at 12 weeks post-randomisation as measured by the foot-and anklerelated quality of life (QoL) subscale of the Foot and Ankle Outcome Score (FAOS) [18]. The FAOS QoL subscale ranges from 0 (extreme symptoms) to 100 (no symptoms). Randomisation occurred 2-3 days after the initial visit to the emergency department at a follow-up clinical visit.
The CAST study was designed using frequentist methods and initially planned to have a fixed-sample design, but the sample size was subsequently altered using adaptive sample size re-estimation. A pragmatic approach to estimating the sample size was used, where the Data Monitoring Committee (DMC) reviewed the assumptions regarding the baseline pooled standard deviation of the primary outcome [15]. No comparison of between-group differences was performed during the trial in the original CAST study and no alpha was spent during the study (until the final analysis).
Originally a target sample size of 643 patients was required to provide more than 90% power to detect an absolute difference of 10 in the FAOS QoL, assuming a two-sided type I error rate of 5%, a small to moderate effect size and 20% loss to follow-up [16,17]. The sample size calculation was based on a standard sample size calculation for a two-sample t test with equal variances [16]. The minimal clinically important difference (MCID) in the FAOS QoL subscale was specified as a change between 8 and 10. The aim of this trial was to identify the best arm for treatment of severe ankle sprains to assist in recovery. A limited number of comparisons between the treatment arms were pre-specified in a hierarchical order to protect against the consequences of multiple testing.
After reviewing the underlying assumptions of the sample size calculation, a revised sample size was calculated by the DMC after 100 participants were recruited and an estimated target of 480-520 participants provided at least 80% power to detect the MCID, assuming a two-sided type I error rate of 5% [17].
The CAST study randomised 584 patients: 144 to tubular bandage, 149 to Bledsoe® boot, 149 to Aircast® brace, and 142 to below-knee cast. At 12 weeks postrandomisation, the FAOS QoL was estimated to be 53.5 (95% confidence interval (CI) 48.4-58.6) for the tubular bandage arm. Clinically important benefits were found at 12 weeks in the FAOS QoL with the below-knee cast compared to the tubular bandage (mean difference 8.7; 95% CI 2.4-15.0) and with the Aircast® brace compared to the tubular bandage (mean difference 8; 95% CI 1. 8-14.2). The Bledsoe® boot did not offer a clinically important difference over the tubular bandage (mean difference 6.1; 95% CI 0-12.3). These estimates were adjusted for baseline FAOS QoL (standardised using the median as the centre), as well as age and sex.

Potential adaptations for Bayesian designs
In our Bayesian adaptive designs we want to quickly identify the best performing intervention arm. A secondary aim is to deliver the best therapy to patients within the trial. Our designs will reward better performing arms and remove poorly performing arms. The Bayesian adaptive designs were constructed as one-sided superiority studies as we were interested in demonstrating improvement over control.
To achieve this, the following types of adaptations will be explored: RAR, arm dropping and early stopping for either efficacy or lack of benefit (futility). Below we describe how these adaptive features have been incorporated into the Bayesian designs, as well as the rules with which these adaptations could be implemented. The rules for implementing these adaptations were determined based on the input of clinicians, criteria used in previous studies (e.g. [5,19]) and the results of simulations which explored a range of clinically relevant values. Decision thresholds (stopping boundaries, arm dropping thresholds, trial success criteria) were also chosen to optimise the probability of trial success, the average number of patients randomised, and the proportion of patients randomised to the best therapy. Stopping boundaries and final analysis success criteria were also chosen to ensure that practically relevant values were used and that the simulated one-sided type I error rate was <2.5%.
The Bayesian adaptive designs were constructed by a statistician (EGR) who was independent of CAST and who was blind to the data and results of the trial until the operating characteristics of the designs had been simulated. The designs were constructed using the CAST protocol, and discussions were held with CAST investigators (SEL and EW) to derive the design parameters, using as similar values to the original study as possible, and to determine how the adaptive features could be incorporated to ensure the designs were practically feasible.

Interim analysis schedules and candidate designs
We investigated a range of interim analysis schedules where adaptations could be performed every 50, 100 or 200 patients due for their primary outcome assessment (12 weeks post-randomisation). We note that, operationally, fewer interim analyses are typically preferred. We found that performing RAR or arm dropping more frequently increased the probability of trial success and decreased the average sample size (results not shown), and so we only present the adaptive designs that performed RAR or arm dropping every 50 patients. Assessment of early stopping for efficacy or futility was performed every 200 patients due for their primary outcome assessment in each adaptive design. This was performed less frequently than RAR/arm dropping to control the type I error and reduce operational complexity, particularly for the monitoring committees who may not need to meet for randomisation probability updates or arm dropping decisions. A fixed Bayesian design was also investigated for comparative purposes. For each adaptive design, the maximum sample size was specified to be the same as the original planned sample size (N = 643). The Bayesian designs explored are described in Table 1. We note that an interim analysis at 600 patients due for their primary outcome assessment may not provide much additional benefit, unless recruitment is slow, since the maximum sample size may have been randomised by this time. Wason et al. [20] discuss the importance of considering the recruitment rate and follow-up duration when planning the timing of interim analyses in adaptive designs.

Response adaptive randomisation
ER was used prior to the first interim analysis. We wanted to use RAR so that more allocations could be given to the better dose. A number of methods have been proposed for calculating the trial arm allocation probabilities for RAR (e.g. [4,5,19,21,22]), depending on the aims of the trial. We use the approach given in Equation 2 of [22]. At each interim analysis the randomisation probabilities for the intervention arms were updated to be proportional to the posterior probability that the arm was the best intervention arm: where π t is the probability that intervention arm t is the best arm and π boot , π brace , π below − knee cast are the probabilities that each of the intervention arms are the best. This probability was raised to the power γ to avoid extreme randomisation probabilities. We chose γ = 0.6 based on the operating characteristics it produced. The randomisation probabilities were then adjusted to sum to 1. Enrolment was suspended to arms that had a randomisation probability <0.1 (and the randomisation probabilities were re-adjusted to sum to 1). The suspended arm(s) could re-enter the randomisation allocation at later interim analyses if the randomisation probabilities crossed above the threshold. Similar to Viele et al. [23], we explored designs that employed different approaches for control arm allocation in RAR. First, we simulated trials in which the control allocation was matched to the intervention arm with the highest probability of allocation. This maximises the power for the comparison of the best arm to the control. We then assumed a fixed control allocation of approximately 40%, which may be preferred for logistical reasons. Various fixed allocations for the control were explored via simulation and the allocation of 40% was chosen based on the resulting power it produced (results not shown). A similar optimal control allocation has been previously found [23,24]. Finally, we explored a design in which the control arm (tubular bandage) allocation varied according to its probability of being the best arm. In this design, all arms were considered as interventions, and recruitment to the tubular bandage arm could be suspended if it had a low probability of being the best arm (as for the other arms).

Arm dropping
We also investigated the use of permanent arm dropping, where an arm could be dropped if it had a low posterior probability (<10%) of being the best arm at an interim analysis. In the arm dropping designs, the control arm could not be dropped, but any intervention arm could be dropped. If an arm was dropped, the randomisation block size was reduced, but the overall maximum sample size was kept the same. Equal allocation was used for the remaining arms.

Early stopping for efficacy or futility
Early stopping for efficacy and futility was assessed at interim analyses performed when 200, 400 and 600 patients were due for their primary outcome assessment visit (12 weeks post-randomisation) in all adaptive designs.
For most of the adaptive designs explored (designs 2-5; Table 1), we allowed early stopping for efficacy if there was a fairly large posterior probability of there being an MCID of 8 between the best intervention arm and the tubular bandage in the primary outcome (Eq. 2) and if there was a high probability (>90%) that the arm is the best arm (Eq. 3): and where θ Best and θ tubular bandage are the FAOS QoL scores at 12 weeks for the best intervention arm and the tubular bandage, respectively, and S i is the stopping boundary for efficacy at interim analysis i for the comparison of the best arm to the tubular bandage.
Both criteria in Eqs. 2 and 3 must be met for the trial to stop early for efficacy. The S i values used were 0.75, 0.7 and 0.6 for interim analyses performed at 200, 400 and 600 patients due for their primary outcome visit, respectively. These values were used for designs 2-5 ( Table 1). The stopping boundaries were chosen to ensure acceptable power and were clinically relevant values. We also defined success criteria for the trial at the final analysis to enable the type I error and power to be calculated and compared across the designs. At the final analysis, the trial was declared successful for designs 1-5 if: If this criterion was not met, then the trial was declared unsuccessful.
For designs 2-5, early stopping for statistical futility was based on having a small posterior probability that the best arm is better than the tubular bandage: Design 6 ( Table 1) used RAR where allocation to the tubular bandage arm could vary according to its probability of being the best arm. This design focussed on identifying the best arm overall with a high probability rather than looking for an MCID between intervention arms and the tubular bandage arm. The motivation behind design 6 was to reduce allocation to poorly performing arms, including the tubular bandage arm. Early stopping for efficacy or futility was based on the probability of being the best arm, evaluated at the best arm: where t is the best arm. If this probability was <0.1 at interim analyses performed at 200, 400 or 600 patients, then the trial was stopped early for futility. If this probability was >0.975 at 200 patients, >0.95 at 400 patients, or >0.925 at 600 patients, then the trial was stopped early for efficacy. The trial was deemed to be successful at the final analysis if this probability was >0.9. These stopping boundaries were chosen to produce high power and (1-sided) type I error <2.5%.

Simulation settings
Simulations of the designs were performed in the Fixed and Adaptive Clinical Trial Simulator (FACTS; version 6.2) [25] software so that the operating characteristics of each design could be studied. We used a recruitment rate of 5 patients/week and assumed it took 12 weeks to reach this recruitment rate. We also explored recruitment rates of 25 and 56 patients/week (assuming it took 12 weeks to reach these recruitment rates). We used the same dropout rate that the original study design assumed (20%). The posterior distribution was estimated for each treatment arm, and the FAOS QoL estimates at 12 weeks were adjusted for the baseline scores using a linear model. The (unadjusted) mean response for each arm was assumed to be normally distributed with a mean FAOS QoL of 50 and a standard deviation of 20. The variance of the FAOS QoL was modelled using an inverse-gamma distribution, where the central variance value was assumed to be 20 2 and a weight of 1 was used (giving α = 0.5, β = 200). There was little previous information available at the time that the CAST study was designed and so we relied upon the opinions of clinicians in forming the prior distributions. Further details on the model and priors used are given in Additional file 1.
Prior to the start of the CAST study there was uncertainty regarding the effect size and FAOS QoL values, and so we simulated a range of different true effect size scenarios for each design. The different scenarios explored for the primary outcome in each arm are given in Table 2.
We simulated 10,000 trials for each scenario in Table 2 for each design. The type I error was estimated using the proportion of simulations that incorrectly declared the trial to be successful when no difference was present in the true primary outcome scores (null scenario above). The power was calculated as the proportion of simulations that correctly declared the trial to be successful, when at least one treatment was superior in the true FAOS QoL score.
We wanted to accurately estimate the response of the arm that was chosen to be the best. Some studies have shown that RAR can lead to a larger estimation bias compared to ER (e.g. [8]). To quantify bias in the estimates of the best arm responses, we use the mean square error (MSE) of estimation where the expectation is taken over the space of successful trials since estimation of the best arm is only important in this scenario.

Virtual re-execution of designs
A virtual re-execution of the CAST study was performed by implementing the Bayesian designs using the CAST data to illustrate the application and potential benefits of the Bayesian adaptive designs on a real-world trial. We maintained the original enrolment dates for the CAST patients in the re-execution. Since designs 3-6 incorporated arm dropping or RAR every 50 patients, the required allocations for these designs are unlikely to match the allocations that actually occurred in the CAST data. Therefore, at each interim analysis we used the updated randomisation probabilities to obtain allocations for the next 50 patients and then randomly sampled (with replacement) a CAST patient for the re-execution dataset that had a matching treatment allocation and was randomised into the original CAST study within ±6 weeks of the reexecution enrolment date. To avoid bias, for each design the trial was virtually re-executed 1000 times by drawing data from the CAST dataset and performing the interim analyses. A flow diagram of the re-sampling and interim analysis process for designs 3-6 is given in Fig. 1. Further details are given in Additional file 1.
Designs 1 and 2 had fixed arm allocation probabilities throughout the trial, and so we could use the actual CAST data in the virtual executions of these designs without the need for re-sampling. We also used a simplified version of the process described in Fig. 1 to re-sample many datasets from the CAST data to virtually execute designs 1 and 2 so that their results were more comparable to those from designs 3-6. This also enabled us to examine potential gains in efficiency over a range of datasets.
Since the CAST study only recruited 584 patients, we were unable to perform all planned interim analyses. The last interim analysis for early stopping for efficacy/futility occurred at 400 patients. The final analysis occurred once follow-up data had been collected for the 584 patients. The re-executions were performed in R (version 3.5.0; R Foundation for Statistical Computing) and the JAGS package [26] was used to perform the Bayesian analyses. We used a similar approach to Luce et al. [27] to perform the virtual re-executions and re-sampling of patients.

Operating characteristics for Bayesian designs
Select operating characteristics for the Bayesian designs are presented in Table 3 and Fig. 2. Further operating characteristics are given in Additional file 2. Boxplots of the distribution of the allocations to the control/tubular bandage and true best arm for each scenario across the 10,000 simulations are presented in Fig. 3. The effect of using a faster recruitment rate is summarised in Additional file 3.
The Bayesian adaptive designs generally offered a decreased average sample size and increased power/probability of trial success across the scenarios explored, compared to the Bayesian fixed design (design 1). The Bayesian adaptive designs only offered small savings in the average sample size for the null scenario (N average = 637-642 compared to N = 643 in the fixed design) since we used stringent futility stopping rules. For designs 1-5, which used efficacy criteria based on the probability of an MCID, the simulated type I error was approximately 0. Whilst the efficacy stopping boundaries could have been lowered to produce a type I error closer to 2.5%, we felt that lower thresholds for efficacy stopping would not have been practically sensible nor accepted by the clinical community. Designs 2-5 offered modest reductions in the average sample size when a difference of 5 was assumed between the tubular bandage and the best intervention arm, with design 2 producing the lowest average sample size (N average = 617) and highest probability of trial success (14.54%).
Designs 4 and 5, which performed RAR, tended to produce the lowest average sample sizes and highest power for the scenarios where one arm was clearly performing best and had an MCID, in other words "One works, 10 more", "Better, best", and "One worse, others work" scenarios. Based on the average sample sizes, these designs offered savings of 142-193 patients across the abovementioned scenarios whilst maintaining >84% probability of having a successful trial. Designs 2 and 3 were only slightly less efficient for these scenarios. For the scenario where two arms offered the same MCID ("All work, two similar"), designs 2-5 offered similar savings to the sample sizes (N average = 584-589) and provided similar probability of trial success (range 89.15-91.79%).
Bayesian design 6, which used RAR and allocated all arms according to their probability of being the best arm, had an acceptable type I error of 2.3%. Design 6 offered large sample size savings for the "One works, 10 more", "Better, Best" and "One worse, others work" scenarios where the average sample sizes ranged from N average = 379 to N average = 473 across these scenarios. The probability of trial success was ≥94% for design 6 for these three scenarios. This design offered moderate gains in efficiency for the "One works, 5 more" and "All work, two similar" scenarios, with average sample sizes of N average = 589 and N average = 592, respectively, and probabilities of trial success of 68.53% and 67.88%, respectively.
We also simulated a scenario where all the intervention arms were inferior to the tubular bandage arm (mean FAOS QoL 50, 45, 45, and 45 for tubular bandage, boot, brace, and below-knee cast, respectively; standard deviation = 20 for each arm). In designs 1-5, all of the simulated trials were declared to be unsuccessful at the final analysis for this scenario and 41.72-58.91% of the simulated trials stopped early for futility (designs Null 50 (20) 50 (20) 50 (20) 50 (20) One works, 10 more 50 (20) 50 (20) 50 (20) 60 (20) One works, 5 more 50 (20) 50 (20) 50 (20) 55 (20) Better, Best 50 (20) 55 (20) 60 (20) 65 (20) One worse, others work 50 (20) 45 (20) 55 (20) 60 (20) All work, two similar 50 (20) 55 (20) 60 (20) 60 (20) FAOS Foot and Ankle Outcome Score, QoL quality of life, SD standard deviation 2-5). For this scenario, design 6 had similar results to the "One arm works, 5 more" scenario since it did not consider the tubular bandage to be a control arm and considered one arm to be superior by an FAOS of 5. A faster recruitment rate was found to decrease the efficiency of the adaptive designs (Additional file 3). Due to the lack of successful trials in the null and "one arm works, 5 more" scenarios for the majority of designs, the MSE was not calculated for these scenarios. The adaptive designs tended to have slightly higher MSE than the fixed design, apart from design 6 which had lower MSE. RAR and arm dropping designs had lower MSE compared to the design that just had early stopping for efficacy or futility (design 2).
Across the designs, the correct selection of the best arm was made in 94-100% of the simulated trials, where at least one arm was superior to control by an MCID (see Additional file 2). From Table 3 and Fig. 3, it can be seen that, on average, more allocations were given to the best arm under designs that incorporated RAR or arm dropping when at least one arm was superior. Equal allocation to the treatment arms was achieved in the null scenario for these designs. Design 6 tended to allocate the highest proportion of patients to the best arm. Designs 3-5  tended to have similar allocations. The designs with RAR or arm dropping (designs 3-6) had a fairly large variation in their allocations to the best arm and the control, and were quite often skewed in their distribution. For design 3, the proportion of arm drops was low for the best arm and high for the other arms (Additional file 2).
Virtual re-execution of designs Table 4 presents a summary of the virtual re-execution of the CAST study under each Bayesian design across the 1000 trials that re-sampled the CAST study data. The results of the re-executions show that the Bayesian adaptive designs recommended early stopping for efficacy in 7.6-25.9% of trial re-executions, with the most frequent early stopping occurring in design 2 which had fixed allocations and only allowed for early stopping of the trial. None of the trial re-executions recommended early stopping for futility since all of the interventions performed better than the tubular bandage. At the final analysis for designs 1-5, 83.5-89.4% of the trials were declared successful. Design 6, where decisions were based on having a high probability of being the best arm, had a low proportion (23%) of trials that were declared successful at the final analysis. This is due to the fact that the brace and below-knee cast had similar primary outcome scores, and both performed well compared to the other arms. Thus, one arm was not often declared superior with a high probability. For each of the Bayesian designs, the below-knee cast was most frequently declared the best arm at the final analysis in the re-executions and thus had the same conclusion as the original trial.
The medians of the posterior estimates for the treatment effects over the 1000 re-executions were generally similar to the original frequentist analysis estimates. Designs 4 and 5 (RAR with control allocation matched to best arm and RAR with fixed control allocation, respectively) had slightly lower estimates of the mean difference between Bledsoe boot and tubular bandage. Design 6 had slightly higher estimates of the mean difference between the ankle brace and tubular bandage, and also between the below-knee cast and tubular bandage. One should also bear in mind that the re-executions were performed on re-sampled data from the original dataset, and so the estimates are likely to vary slightly.  Fig. 2 Average sample sizes (a, c, e, g, i, k) and probability of trial success (Pr (Success); b, d, f, h, j) for each design. Each row represents a different scenario: a, b "Null" scenario; c, d "One works, 10 more"; e, f "One works, 5 more"; g, h "Better, Best"; i, j "One worse, others work"; k, l "All work, two similar". The type I error is represented in b; The power is given in d, f, h, j, l Further summaries of the results and randomisation allocations at each interim analysis for each adaptive design are given in Additional file 4, as well as the results for the re-executions of designs 1 and 2 where no resampling of the data was performed. These results show that the randomisation probabilities differed between Bayesian designs 4-6 at each interim analysis, and that these RAR designs often had quite different allocations to the CAST study, depending on which arm was "the best" at that interim analysis.

Summary
In this study we have demonstrated how Bayesian adaptive designs can be constructed for phase III multi-arm RCTs. Using an orthopaedic trial as a case study, we Allocations (Prop Alloc) across 10,000 simulated trials for the tubular bandage arm and true best arm. Each design is represented on the x axis. a "One works, 10 more" tubular bandage allocation; b "One works, 10 more" true best arm allocation; c "One works, 5 more" tubular bandage allocation; d "One works, 5 more" true best arm allocation; e "Better, Best" tubular bandage allocation; f "Better, Best" true best arm allocation; g "One worse, others work" tubular bandage allocation; h "One worse, others work" true best arm allocation; i "All work, two similar" tubular bandage allocation; j "All work, two similar" true best arm allocation outline the process involved in constructing the designs, describe the adaptive schemes and stopping rules employed, and demonstrate the behaviour of the designs through their operating characteristics across a range of scenarios. We also performed virtual executions of the Bayesian designs using data from the CAST study to demonstrate the decisions that would be made using the Bayesian designs and the trial data. Through use of the Bayesian adaptive approach we were able to make decisions about whether to stop the trial early based on the probability of having an MCID, update the randomisation allocations according to the probability of being the best arm, and suspend recruitment to arms that had a low probability of being the best.
Based on the operating characteristics, the use of Bayesian adaptive designs for this case study generally increased the power and decreased the average sample size compared to a fixed design. The use of RAR generally offered slightly increased power and slightly smaller average sample sizes compared to adaptive designs that employed equal randomisation allocations at each interim analysis (with or without arm dropping) when it was assumed that one arm offered an MCID. Small sample size savings were obtained when no effect or a small effect was assumed to occur, and when two arms were assumed to have an MCID. All designs had low type I error and high probabilities to detect an MCID in at least one arm when it was assumed that one arm was superior and had an MCID. The correct selection of the best arm was made in 94-100% of the simulated trials where at least one arm was superior to control with an MCID. Use of RAR or arm dropping produced simulated trials that gave more allocations to the best arm when at least one arm was superior. Equal allocation occurred when the arms had approximately the same primary outcome scores.
Design 6, the decisions of which were made based on the probability of being the best arm, showed that it could potentially produce large savings in sample size for scenarios where one arm was clearly superior and had an MCID, whilst maintaining high power. However, this design was less efficient when two arms showed a similar improvement compared to the other arms since it was unable to declare a single arm as superior with a high probability. Design 6 had different objectives and decision criteria to the other Bayesian designs, and so care should be taken when choosing a preferred design since the designs are tailored to the aims of the investigators. Criteria such as those used in Design 6 are useful for multi-arm studies in which the investigators want to order the treatments by effectiveness. The virtual executions of the Bayesian designs using the CAST data showed that early stopping for efficacy only occurred in a small proportion of trials and that no trials stopped early for futility. At the final analysis, >80% of the trials were declared successful in the 1000 executions of designs 1-5. When design 6 was executed 1000 times using the resampled trial data, only 23% of the trials were declared successful at the final analysis since both the brace and below-knee cast performed similarly well and a "best arm" was not declared with a high probability. A benefit of design 6 was that the tubular bandage arm, which was the control arm in the other designs, had smaller allocation probabilities which allowed more allocations to better performing arms. The below-knee cast was most often declared the best arm at the final analysis in the re-executions, and so the Bayesian designs led to the same conclusion as the original trial. If we had known a priori that two arms were likely to perform similarly well, then we would have chosen different success criteria. These results also reflect the problem of dichotomy at a final analysis-if we just reported posterior probabilities of a treatment benefit or MCID then the trial would likely have been viewed more optimistically.
The decisions made at the interim and final analyses of the Bayesian designs were driven by the primary outcome. We did not incorporate other outcomes and are not intending that the conclusions generated in this reexecution be used to inform clinical practice or to alter the conclusions of the original study.
Recruitment can often be challenging in clinical trials, causing delays in their delivery. Approaches which reduce the sample size whilst maintaining high power to determine the effect of interventions should be welcomed by study teams to assist them in completing recruitment on time and within budget.

Limitations
Adaptive designs have great promise for producing trials with better operating characteristics but present a number of practical challenges. Korn and Freidlin [28] provide a summary of some of the advantages and disadvantages of different adaptive design elements. Wason et al. [20] provide a discussion around the situations in which adaptive designs are and are not useful, and some of the logistical challenges they present.
Adaptive designs require a larger amount of expertise and work to build and evaluate potential designs compared to fixed designs, often involving extensive simulations, and may take more effort to obtain approval from review boards. However, the use of the simulations forces the study team to consider the effects of faster/ slower recruitment, follow-up length, smaller effect sizes than anticipated, or higher/lower response rates than anticipated on the operating characteristics of the adaptive designs. Thus, the simulations required by adaptive designs allow study teams to anticipate the effects of differing trial conditions, which are often not considered when using traditional designs.
Adaptive designs can also be more complicated to implement. Performance of the interim analyses and making the required adaptations is dependent on being able to collect, enter, clean and analyse data in a timely manner, and alter the randomisation system with ease. This requires the trial management team, statisticians, programming teams and trial treatment providers/intervention suppliers to be responsive to changes that need to be made. Otherwise, the adaptive designs may lose their gains in efficiency. Timely data entry may be difficult for orthopaedic studies where primary outcomes may be obtained from patientcompleted questionnaires that are collected within a 2-to 4-week window of a long follow-up period. The rapid changes required may not be possible in all trial settings.
The interim analyses also need to be adequately spaced to allow time for DMCs and Trial Steering Committees (TSCs) to meet. Statistically, more frequent interim analyses generally produce better operating characteristics for designs that use RAR or arm dropping (e.g. [29]), but frequent interim analyses may not always be practical. The DMC/TSC may not necessarily need to meet for every interim analysis, for example for RAR adaptations, but would need to meet for stopping decisions.
The types of adaptations that can be made to multi-arm trials are situation-dependent. RAR presents difficulties in being able to anticipate and arrange for the delivery of treatments. The original CAST study design, which had fixed allocations, allowed the supply of treatment arms (including the supply of staffing) to be planned more easily than a design with RAR. RAR may not always be possible due to restrictions on resources for delivering the treatments or delays in collecting the primary outcome data. Closure of arms may be practically easier to achieve, particularly for a trial such as CAST for which there need to be sufficient supplies of each treatment available as well as staff proficient in their administration. Whilst early stopping of trials may have benefits for funding agencies, academic trial investigators often do not wish to terminate trials early due to potential loss of research income and staff retention. Changes in funding models are likely to be required to fully take advantage of innovation in trial design, such as a minimum study time funded with a mechanism to release funding if full study time is required. Additionally, trials that stop early may have little information on the long-term effects of treatment, on secondary outcomes, or on cost-effectiveness. They are also likely to produce less precise estimates of the treatment effects. Gallo [30] provides further discussion on some of the operational challenges in adaptive design implementation.
Multi-arm, multi-stage (MAMS) designs are another method for improving the efficiency and ethics in multiarm trials (with a common control) where experimental arms may be dropped at pre-planned analysis points if they show insufficient evidence of effectiveness. Wason and Trippa [6] showed that Bayesian designs with RAR are more efficient than MAMS designs when there is a superior experimental arm, but that MAMS designs perform slightly better if none of the experimental arms are effective. They also showed that the operating characteristics for the RAR designs were less sensitive than MAMS designs to changes in the amount of primary outcome data available at the interim analyses to the original planned number.
The use of RAR remains controversial and some of its properties are not well understood by clinicians. RAR has its greatest potential in multi-arm trials but has limited usefulness in two-armed trials [7,31]. Adaptive designs are more susceptible to changes in patient population over time. Designs with RAR have been shown to be robust to moderate changes in patient population, and certain RAR rules have been shown to be effectively unaffected by time trends [32,33], but adaptive designs are not appropriate if the patient population changes dramatically during the trial. When evaluating adaptive designs, simulation is required to illustrate the operating characteristics and potential benefits, and investigate potential biases introduced by each adaptive feature.
Fairly short follow-up times, relative to the planned recruitment duration, are required for adaptive designs to offer improved efficiency. Adaptive designs are difficult to implement for very fast recruitment rates, particularly for studies that have relatively longer follow-up periods since less information will be available at each interim analysis [6,20]. We also found that a faster recruitment rate decreased the efficiency of the adaptive designs. This poses difficulties for phase III trials, such as those performed in orthopaedics/rehabilitation, since the primary outcome is often based on long-term measures, and it may be difficult to design adaptive trials without extending the time frame of recruitment to allow for the interim analyses and potential adaptations to occur. Thus, there may be a trade-off in reduced sample size but increased recruitment time (at a slower recruitment rate) for some adaptive trial design contexts.
In this work we virtually executed each of the proposed Bayesian designs using trial data to illustrate their practical applicability. However, in reality, one design would have been chosen and implemented, depending on its operating characteristics, practical restraints and the aims of the trial. Although we tried to ensure that the statistician (EGR) remained blind to the trial results until the design operating characteristics had been obtained via simulations, the study clinicians were involved in discussions around the prior distributions and stopping criteria. It is difficult to completely remove hindsight bias in these historical case studies.
When virtually executing the designs that incorporated arm dropping or RAR, re-sampling from the original trial data was required to obtain the required randomisation allocations. This may lead to an underestimation of the uncertainty in the results [5]. We addressed this by re-executing the CAST study 1000 times and re-sampled patients within each trial. If different datasets had been used, different conclusions may have been obtained using these designs.
We did not simulate the decision making process of a DMC/TSC. We have assumed that the decision-making process was driven by the primary outcome, but the DMC/TSC would also examine safety data and any relevant external evidence. Whilst the role of these committees is to ensure that the study protocol is accurately followed, they may also need to make deviations to ensure patient safety. For example, RAR may recommend increasing the allocation probability to an arm that has a higher rate of adverse events-an event that was not accounted for in the RAR algorithm. Alterations to the previously defined adaptations can lead to unknown operating characteristics.
The Bayesian adaptive designs were constructed as onesided superiority studies, whereas the original CAST study was a two-sided trial. We were interested in demonstrating improvement over a much cheaper control and felt that a DMC would be unlikely to continue enrolment into a poorly performing comparator just to show it is worse. Under most of our Bayesian adaptive designs, if an intervention arm performed poorly it would be dropped or have a very low probability of allocation. Harm may or may not be reflected in the FAOS QoL score, but the DMC could intervene if any arms were causing harm.
The designs presented here are situation-specific and have been tailored to the clinical situation and aims of the CAST study. The definition of a successful trial and the level of sufficient evidence required to make decisions will differ between researchers and stakeholders, and will depend on the consequences of the actions that may be taken. The designs and findings from this work will not generalise to all phase III RCTs, but similar approaches can be used to construct Bayesian adaptive designs. We recommend that simulations are used to study the impact of each type of adaptive component on the operating characteristics when constructing Bayesian adaptive designs for multi-arm trials.
One of the potential barriers to using Bayesian adaptive designs in practice is the computational time and resources that are required to construct the designs. Trialists or statisticians less familiar with Bayesian methods may not have the time or knowledge to program their own Bayesian adaptive designs, and commercial solutions such as FACTs may not be available to all. A review of available software and code for adaptive clinical trial designs is provided by Grayling and Wheeler [34].

Conclusions
To enable phase III trials to achieve their aims, more efficient methods are required. Innovation in clinical trial design is extremely important as it can potentially improve the efficiency, quality of knowledge gained, cost and safety of clinical trials. In this work we have demonstrated how Bayesian adaptive trials can be designed and implemented for multi-arm phase III trials. Using a published example from orthopaedic medicine, we highlight some of the benefits of these designs, particularly for multi-arm trials.