For most diseases there are multiple new treatments at the same stage of clinical development. For example, in oncology there are over 1,500 treatments in the clinical pipeline [1]. With limited resources and patients available, alternative trial designs are needed to maximise the number of treatments tested. Multi-arm designs are an important example of an alternative trial design that substantially improves efficiency over the traditional two-arm randomised controlled trial (RCT).
Multi-arm trials vary considerably in design and objective, but have in common that more than two treatment arms are included in the same trial protocol. They evaluate multiple research questions that would otherwise require several trials, and have two main advantages in comparison to separate trials: 1) a reduction in administrative burden; 2) improved efficiency by using shared information. The improved efficiency can be used to reduce the sample size required for a given power, or to maintain the sample size whilst increasing the power to show that one of the experimental treatments is better than control [2]. A common multi-arm design that provides increased efficiency is one that tests multiple experimental arms against a shared control arm. The shared control arm is used for testing the effect of each experimental treatment, reducing the total number of patients needed (see Figure 1). Some other multi-arm trial designs also have this advantage, for example when a single experimental treatment is compared to placebo and an active control [3]. A recent review by Baron et al. [4] found that 17.6% of published randomised controlled trials in 2009 were multi-arm.
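The saving from a shared control arm can be illustrated with simple arithmetic. The sketch below is illustrative only: the function name `total_patients` is hypothetical, and it assumes equal allocation, with the same number of patients per arm in the multi-arm trial as in each separate two-arm trial and no adjustment of the sample size for multiple-testing.

```python
def total_patients(num_experimental_arms: int, patients_per_arm: int) -> dict:
    """Compare total sample size for separate two-arm trials and one shared-control trial.

    Assumes equal allocation and the same per-arm sample size in both designs.
    """
    k, n = num_experimental_arms, patients_per_arm
    separate = 2 * k * n   # each two-arm trial recruits its own control arm
    shared = (k + 1) * n   # a single control arm serves every comparison
    return {"separate": separate, "shared": shared, "saved": separate - shared}


# Three experimental treatments, 100 patients per arm (illustrative numbers only)
result = total_patients(num_experimental_arms=3, patients_per_arm=100)
print(result)
```

Under these assumptions, three separate trials would need 600 patients, whereas one four-arm trial would need 400. With a single experimental arm the two designs coincide.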
Despite their common use, there is no consensus in the literature about whether a trial with multiple arms should make a statistical correction for the fact that multiple primary hypotheses are tested in the analysis. In this paper we provide a summary of different viewpoints on the subject, and conduct a literature review to investigate how often recently published multi-arm trials in major medical journals include an adjustment for multiple-testing.
What is multiple-test correction, and is it necessary for multi-arm trials?
A multi-arm trial has multiple null-hypotheses, each representing a different primary research question. This creates an additional layer of complexity over a trial with one primary null-hypothesis, such as a two-arm RCT. When there is a single null-hypothesis, the significance level (or type-I error rate) is the probability of rejecting the null-hypothesis when it is true. In a multi-arm trial, there are more ways in which a false-positive finding can be made: rejecting any true null-hypothesis means that the trial makes a false-positive finding. For example, if four independent true null-hypotheses are each tested at the 5% significance level, the chance of at least one false-positive is approximately 19% (1 − 0.95^4 ≈ 0.185).
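The figure for four hypotheses follows from the complement rule: if the tests are independent and every null-hypothesis is true, the chance of no false-positive is (1 − α)^k, so the chance of at least one is 1 − (1 − α)^k. A minimal sketch of this calculation is given below. The function name `fwer_independent` is hypothetical, and independence of the tests is an assumption taken from the example above.

```python
def fwer_independent(alpha: float, num_hypotheses: int) -> float:
    """Chance of at least one false-positive among independent true null-hypotheses.

    Assumes every null-hypothesis is true and each is tested at level alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_hypotheses


# How the unadjusted chance of a false-positive grows with the number of hypotheses
for k in range(1, 5):
    print(k, round(fwer_independent(0.05, k), 4))
```

For k = 4 this gives roughly 0.185, which is the figure of about 19% quoted above.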
A multiple-testing procedure is a statistical method of adjusting the significance level used for testing each hypothesis so that the chance of making a type-I error is controlled. There are various characteristics that the testing procedure can have. Amongst the strictest is strong control of the family-wise error rate (FWER). The FWER is the probability of making at least one type-I error, and strong control means that the maximum possible FWER is controlled at a pre-defined level. For example, a procedure testing four null-hypotheses would strongly control the FWER at level 0.05 if the maximum possible chance of rejecting at least one true null-hypothesis is less than or equal to 0.05, whichever of the null-hypotheses are true. Weak control of the FWER is similar, but only controls the FWER in the situation where all null-hypotheses are true, known as the global null-hypothesis. A procedure that controls the FWER strongly will control it weakly, but not necessarily vice versa. Another commonly considered quantity is the false discovery rate (FDR), which is the expected proportion of rejected null-hypotheses that are in fact true. A procedure controlling the FDR would permit true null-hypotheses to be rejected, as long as the expected proportion of rejections that are false-positives is below a target level; a procedure controlling the FWER would instead control the probability of rejecting at least one true null-hypothesis. A procedure that controls the FWER will also control the FDR at the same level.
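To make the distinction between these error rates concrete, the sketch below implements three well-known procedures that are not discussed by name in the text: the Bonferroni and Holm procedures, which strongly control the FWER, and the Benjamini-Hochberg procedure, which controls the FDR when the tests are independent or positively dependent. The function names and the p-values are hypothetical and chosen only for illustration; this is not a recommendation of any particular method for multi-arm trials.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m (strong FWER control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure (strong FWER control, never less powerful than Bonferroni)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p-value first
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (FDR control under independence)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_rank = rank  # keep the largest rank meeting its threshold
    reject = [False] * m
    for idx in order[:largest_rank]:
        reject[idx] = True
    return reject


# Illustrative p-values for four experimental arms versus control (not real data)
p_values = [0.010, 0.015, 0.030, 0.040]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
print("BH (FDR):  ", benjamini_hochberg(p_values))
```

With these illustrative p-values, Bonferroni rejects one null-hypothesis, Holm rejects two, and Benjamini-Hochberg rejects all four. This reflects the fact that FDR control is a less strict requirement than FWER control.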
Multiple-testing arises in many areas of biology, not just in clinical trials. For example, many advances in multiple-testing methodology have been motivated by genomics [5], where studies routinely test many thousands of hypotheses in a single study. Some authors, for example Rothman [6], claim that multiple-test corrections should never be used in scientific experiments. Rothman argues that advocating multiple-testing adjustment assumes that all null-hypotheses are true, and when that is not the case, it will reduce the power to find genuine associations. However, other authors have subsequently argued that multiple-testing correction is necessary in different clinical trial scenarios. An example of a paper that argues against Rothman’s view is Bender and Lange [7], which provides a discussion of multiple-testing in biomedical and epidemiological research and an overview of methods used to correct for multiple-testing.
In clinical trials, multiple primary hypotheses can arise in several ways, not only due to considering more than two arms. For example, clinical trials commonly assess the performance of a new treatment by recording several outcomes. If the treatment would be declared effective when there is a significant difference in any of the outcomes, then there is the potential for an increased type-I error rate. Feise [8] provides a balanced consideration of whether a multiple-testing correction is required in a trial using multiple outcomes, and recommends the use of composite measures or the selection of a single primary outcome measure in order to avoid the problem entirely. If multiple primary outcomes are used in a confirmatory clinical trial, and any significant result would be grounds for licensing the treatment, then regulators are clear that a multiple-testing adjustment is required [9, 10]. Another situation in which a multiple-testing correction is routinely used is a trial where the same hypothesis is tested at multiple interim analyses. Again, it is fairly well accepted that a multiple-testing correction is required in this case, and there is an extensive literature on group-sequential designs that control the type-I error rate and power when interim analyses are used (see Jennison and Turnbull [11] for an extensive summary of methods).
For multi-arm trials, the context of the trial influences whether multiple-testing correction is desirable. If the trial is exploratory, and any findings will be tested in further trials, then there is less need for a multiple-testing correction, as any false-positive findings will not change practice. In fact, recent evidence has shown that from an efficiency standpoint, exploratory multi-arm studies should use high significance levels when they are followed by a confirmatory trial [12].
In confirmatory settings, when the multi-arm trial is designed to provide a definitive answer to the hypotheses being tested, there are conflicting views about the necessity of multiple-test correction. Cook and Farewell [13] argue that if the different hypotheses represent distinct research questions (for example, the effect of distinct experimental treatments in comparison to the control treatment), then it is reasonable not to apply a procedure that strongly controls the FWER. In Bender and Lange [7], a section on experiments with multiple treatments argues that it is mandatory to control the FWER when multiple significance tests are used for primary hypotheses in a confirmatory setting. Hughes [14] argues that multiple-testing adjustment is not necessary when several experimental arms are compared to a control group, as that adjustment would not be needed if the treatments were tested in separate trials. This argument treats each comparison as a separate test of an independent treatment, rather than addressing the question of whether any of the treatments is beneficial. However, multi-arm trials are usually reported in a single paper and the treatment effects are often discussed and interpreted relative to each other. Freidlin et al. [15] refine the view of Hughes, arguing that a multiple-testing adjustment is necessary when several doses or schedules of the same treatment are tested against a common control, but not when the treatments are distinct and the multi-arm trial is conducted for efficiency reasons. This distinction is made because, when the arms are doses or schedules of the same treatment, any rejected null-hypothesis will result in the new treatment being recommended. Proschan and Waclawiw [16] consider many sources of multiplicity in clinical trials, including multiple experimental arms.
Proschan and Waclawiw state that multiple-testing adjustment is more necessary when: 1) the hypotheses being tested are more related; 2) the number of comparisons is higher; 3) the degree of controversy is higher (that is, whether the trial is aiming to definitively answer a question that has had conflicting results in the literature); 4) one party stands to benefit from the multiple-testing (for example, several of the treatments in the trial are produced by a single manufacturer). Wason et al. [17] argue that for a multi-arm trial, the FWER should be strongly controlled in confirmatory trials, and reported in exploratory trials. This argument was based on the fact that the two main regulatory bodies for pharmaceutical trials currently provide advice suggesting that adjustment is required for definitive trials. The European Medicines Agency (EMEA) guidance on multiplicity [10] states that any confirmatory trial with multiple primary null-hypotheses should control the maximum probability of making a type-I error. The Food and Drug Administration (FDA) draft guidance on adaptive designs [18] states that the total study-wise error rate should be controlled in all confirmatory trials, although it does not explicitly mention multi-arm trials. To our knowledge, there are no official guidelines on this issue for non-pharmaceutical trials.
Thus there is no unanimous view on the issue of multiple-testing corrections in confirmatory multi-arm trials. There are indications that it is a regulatory requirement, but this would only be relevant for trials that aim to gather evidence to support registration of a drug. There is little evidence about whether correction is done in practice. Baron et al. [4] found that around 40% of multi-arm trials published in 2009 adjusted for multiple-testing, although they did not distinguish between exploratory and confirmatory trials.
In the next section we investigate what proportion of recently published multi-arm clinical trials corrected for multiple-testing.