The International Rare Cancers Initiative (IRCI) was set up in 2011 and comprises organisations involved in trials in rare cancers from several countries, including the UK, USA, Canada, and France [8]. One of the IRCI’s objectives is to promote the development of innovative methodologies for research in rare cancers. This paper contributes to this objective for multi-stage trials where the first stage consists of choosing from two or more promising treatments. Formally incorporating a margin of practical equivalence in the design, and calculating the sample size accordingly, allows researchers to determine when the choice of therapy should be made only on the efficacy outcome measure, or when it can be done on the basis of other factors (e.g. toxicity) because efficacies are considered similar. This design provides a more flexible and realistic approach when deciding which treatment among several should be investigated further.

In the randomised selection design of Simon et al. [4], the sample size is calculated so that there is a high probability of correctly selecting a superior treatment for further testing in phase III, if such a treatment exists. However, if the treatments truly have equal efficacy, the treatment with the highest observed response rate will still be declared superior, even though it is not. In other words, there is no control of the false positive rate. Another practical limitation is that the situation where the observed response rates are equal is not considered, although this can happen, especially with small sample sizes and binary endpoints. Even when a margin of practical equivalence is used, the smallest difference in the number of responders between arms that leads to a selection on efficacy may be just one patient. For example, this would be the case with 19 patients per arm and a margin of 5 percentage points, since one patient out of 19 represents 5.3%. Nonetheless, the advantage of the proposed approach is that it allows researchers to formalise, at the start of the study, how the extra considerations (cost, QoL, etc.) might be used in the decision rule if the treatments have equal response rates.

To overcome these limitations, recent selection trials have included in their protocols rules based on toxicity, QoL, and survival to choose a treatment when the observed response rates are equal or similar. Examples of such ongoing trials are InterAACT [5], COSMIC [9], and NEOSCOPE [10]. In the InterAACT trial, toxicity and QoL will be used to make the selection if the response rates are equal. However, if response rates are very similar but not equal, the treatment with the highest observed rate will be selected even if the difference might be due to chance. In the COSMIC trial of chemotherapy plus ofatumumab at standard or high dose in chronic lymphocytic leukaemia, the protocol stipulates that ‘if there are less than 3 responses (8%) difference observed between the arms, the trial will be declared statistically ambiguous, and alternative selection criteria will be used to select the schedule for further investigation. Otherwise, the schedule with the better observed response rate will be recommended to be taken forward’. Interestingly, the additional criteria to make the selection (if a difference of fewer than three responses is observed between arms) are not specified. In NEOSCOPE, a randomised trial of induction chemotherapy followed by either oxaliplatin/capecitabine- or paclitaxel/carboplatin-based chemoradiation as a pre-operative regimen for resectable oesophageal cancer, the protocol includes specific rules based on survival and toxicity to make the selection if efficacy is comparable between arms. Effectively, these three trials employ a margin of practical equivalence in their decision rule. However, the sample size calculations were performed without taking the margin into account, potentially resulting in lower power than originally targeted. In InterAACT and COSMIC it was assumed that one of the two treatments had a higher efficacy than the other, while in NEOSCOPE the sample size was calculated as if the two arms were independent single-arm phase II studies.
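
The kind of decision rule these protocols describe can be sketched in a few lines of Python. The sketch is illustrative only: the function name `select_arm`, its arguments, and the convention that a difference exactly equal to the margin counts as practical equivalence are assumptions made here, and are not taken from any of the three protocols.

```python
def select_arm(responses_a, responses_b, n_per_arm, margin):
    """Hypothetical decision rule with a margin of practical equivalence.

    Returns "A" or "B" when the observed response rates differ by more
    than `margin` (a proportion, e.g. 0.05). Otherwise returns
    "practical equivalence", in which case pre-specified non-efficacy
    criteria (toxicity, QoL, ...) would drive the choice.
    """
    diff = (responses_a - responses_b) / n_per_arm
    if abs(diff) <= margin:  # assumption: a difference equal to the margin is "equivalent"
        return "practical equivalence"
    return "A" if diff > 0 else "B"


# Illustrative numbers only (not data from any trial)
print(select_arm(8, 7, 19, 0.05))  # one responder apart: 1/19 exceeds 0.05
print(select_arm(8, 8, 19, 0.05))  # equal observed rates
```

Writing the rule down in this form before the trial starts makes explicit which observed outcomes trigger the non-efficacy criteria.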

Sample size calculations based on traditional phase III trial designs often lead to unfeasibly large sample sizes in rare cancers. Recruitment is a major challenge, given the low incidence and the geographical spread of patients with rare cancers across countries. In InterAACT, 388 patients and 25 years of recruitment would be required to demonstrate an increase in TRR from 40 to 50% using carboplatin and paclitaxel compared to cisplatin with 5-fluorouracil, with the traditional superiority design and 80% power at a 5% two-sided significance level [6]. With the design of Simon et al., 36 patients per arm were needed for the same increase from 40 to 50%, giving an 80% chance of selecting the superior treatment at the end of the trial (the researchers increased this number to 40 per arm for logistical reasons). If a margin of practical equivalence of 5 percentage points had been used, 38 patients per arm would have been required, but with the advantage of a more flexible and pre-planned strategy for choosing the ‘best’ treatment if the treatments demonstrate equal or similar efficacy in the study. In general, the sample size required for a selection trial is larger when a margin is incorporated than when it is not, and this difference in study size grows as *d* increases. Keeping all other input parameters fixed, the smallest sample size is reached when the margin *d* is fixed at zero (corresponding to the design of Simon et al.). In other words, there is a trade-off between sample size and the flexibility introduced by the margin.
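
For orientation, the order of magnitude of the superiority calculation can be checked with a textbook normal-approximation formula for comparing two proportions. The sketch below is an assumption about the kind of formula involved, not the calculation reported in [6], and the function name `n_per_arm_superiority` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm_superiority(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided test of two proportions
    (normal approximation, pooled variance under the null hypothesis)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


print(n_per_arm_superiority(0.40, 0.50))  # 388
```

This particular formula returns a number per arm. The text does not state whether the 388 patients quoted for InterAACT are per arm or in total, so the agreement should be read with caution. Either way, the contrast with 36 to 38 patients per arm for the selection designs is clear.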

In practice, for a given sample size, if the true difference in TRR between treatments is not as large as was expected in the sample size calculation, the chance of selecting the most efficacious treatment is reduced. For example, 19 patients are required per arm if we use a TRR of 20% for the most efficacious treatment, a *delta* of 10 percentage points (i.e. the TRR of the other treatment is expected to be 10%), a *P*_{most} of 80% and a margin of 5 percentage points; see Table 1. If the true difference is less than 10 percentage points, the chance of selecting the most efficacious treatment on efficacy (which happens if there is a difference of at least one patient between the two arms) is reduced to less than 80%. If both treatments have a TRR of 20%, each treatment would be chosen based on efficacy alone 42% of the time, and 16% of the time the study would be in a situation of practical equivalence. Assuming that both treatments have the same or comparable profiles on the non-efficacy considerations, the study has a 100% chance of making an appropriate choice, as both treatments are equal. However, if one treatment is noticeably better in terms of toxicity and/or QoL, then it would be chosen only slightly more than 50% of the time. Indeed, it would be chosen based on efficacy alone 42% of the time, and at least 50% of the time in situations of practical equivalence (16% chance of such a situation happening). This example demonstrates the importance of determining the *delta* realistically, as well as the importance of setting the margin *d* in such a way that making a decision based on efficacy considerations only is medically acceptable.
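
The 42% and 16% figures in this example can be checked by enumerating all possible pairs of responder counts under independent binomial distributions. The following is a minimal sketch rather than the software used for Table 1. The function names are hypothetical, and it assumes that practical equivalence is declared when the absolute difference in observed response rates is at most *d*.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k responders among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def selection_probabilities(n, p_a, p_b, d):
    """Exact probabilities of the three possible outcomes of a two-arm
    selection trial with n patients per arm and margin d (a proportion).

    Assumption: the arms are in practical equivalence when the absolute
    difference in observed response rates is at most d.
    """
    pick_a = pick_b = equivalence = 0.0
    for i in range(n + 1):      # responders in arm A
        for j in range(n + 1):  # responders in arm B
            prob = binom_pmf(i, n, p_a) * binom_pmf(j, n, p_b)
            if abs(i - j) / n <= d:
                equivalence += prob
            elif i > j:
                pick_a += prob
            else:
                pick_b += prob
    return pick_a, pick_b, equivalence


# Both arms with a true TRR of 20%, 19 patients per arm, 5-point margin
a, b, eq = selection_probabilities(19, 0.20, 0.20, 0.05)
print(f"A on efficacy: {a:.2f}, B on efficacy: {b:.2f}, equivalence: {eq:.2f}")
# A on efficacy: 0.42, B on efficacy: 0.42, equivalence: 0.16
```

Changing `p_b` to 0.10 gives the scenario assumed in the sample size calculation, and intermediate values show how quickly the chance of a correct selection on efficacy falls as the true difference shrinks.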

Nonetheless, it is possible that a treatment is considered superior based on efficacy considerations only at the end of the study, but only by a very small margin. In such circumstances, the non-efficacy considerations may still be taken into account in the final decision, especially if crucial non-efficacy considerations emerged during the trial. The decision rule might then be seen as a guideline or starting point for making a judgment on which treatment should be taken for further testing. Additionally, the determination of the margin *d* may take into account potential prior information on non-efficacy considerations. In general, the design assumes that the treatments are equal with respect to non-efficacy considerations, prior to starting the study. However, if this is not the case, the margin may be made wider at the calculation stage to reflect the additional gain in efficacy that the treatment eventually taken further should demonstrate to compensate for the slightly reduced QoL or increased toxicity, for example.

We note that the sample size depends on the absolute values of the TRRs, which should be determined using the best available evidence to date. For the same *delta* of 10 percentage points in Table 1, the sample size increases as the TRRs approach the middle of the [0,1] interval and decreases towards its boundaries. This is due to the bounded nature of the binomial distribution, whose variance is largest when the response rate is 50%.
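
A rough way to see this dependence is to look at the variance of the difference in observed response rates with *n* patients per arm. This uses a normal-approximation argument, which is an assumption made here for illustration and is not part of the exact design calculation.

```latex
\[
\operatorname{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1-p_1) + p_2(1-p_2)}{n}
\]
% Same delta of 0.10, different absolute TRRs:
%   p_1 = 0.20, p_2 = 0.10:  0.20(0.80) + 0.10(0.90) = 0.25
%   p_1 = 0.50, p_2 = 0.40:  0.50(0.50) + 0.40(0.60) = 0.49
```

For the same difference of 10 percentage points, the numerator is almost twice as large near the middle of the interval, so roughly twice as many patients are needed to separate the arms with the same reliability.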

The proposed design is based on binary outcome measures; hence, a limitation is that it does not accommodate other types of endpoint, such as continuous or time-to-event endpoints. However, binary outcomes can be used for either short-term (e.g. TRR as in our example) or longer-term measures (e.g. 1- or 2-year progression-free or overall survival). Another potential use of the proposed design is in biomarker-directed early phase treatment trials, in which arms could be compared with a control therapy.

The user-friendly online application should help promote the randomised selection design with a margin of practical equivalence. Users currently have the choice between two- or three-arm trials, a margin *d* of 2.5 or 5 percentage points, and a *delta* of either 10 or 15 percentage points. Other options may be added in the future; in the meantime, users can contact us if they wish to specify values for the input parameters that are not currently available within the web application.

Although our design was developed primarily for treatment trials in uncommon cancers, the methods can be applied to other cancers too, particularly when a smaller efficacy study is considered appropriate or more feasible as an initial assessment of a therapy. Moreover, this method may be applicable for randomisation purposes at the end of a phase I dose-finding trial when there remains considerable uncertainty in the selection of the dose to take forward for further testing.