Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study
© Teare et al.; licensee BioMed Central Ltd. 2014
Received: 13 December 2013
Accepted: 20 June 2014
Published: 3 July 2014
External pilot or feasibility studies can be used to estimate key unknown parameters to inform the design of the definitive randomised controlled trial (RCT). However, there is little consensus on how large pilot studies need to be, and some suggest inflating estimates to adjust for the lack of precision when planning the definitive RCT.
We use a simulation approach to illustrate the sampling distribution of the standard deviation for continuous outcomes and the event rate for binary outcomes. We present the impact of increasing the pilot sample size on the precision and bias of these estimates, and predicted power under three realistic scenarios. We also illustrate the consequences of using a confidence interval argument to inflate estimates so the required power is achieved with a pre-specified level of confidence. We limit our attention to external pilot and feasibility studies prior to a two-parallel-balanced-group superiority RCT.
For normally distributed outcomes, the relative gain in precision of the pooled standard deviation (SDp) is less than 10% (for each five subjects added per group) once the total sample size is 70. For true proportions between 0.1 and 0.5, we find the gain in precision for each five subjects added to the pilot sample is less than 5% once the sample size is 60. Adjusting the required sample sizes for the imprecision in the pilot study estimates can result in excessively large definitive RCTs and also requires a pilot sample size of 60 to 90 for the true effect sizes considered here.
We recommend that an external pilot study has at least 70 measured subjects (35 per group) when estimating the SDp for a continuous outcome. If the event rate in an intervention group needs to be estimated by the pilot, then a total of 60 to 100 subjects is required. Hence, if the primary outcome is binary, a total of at least 120 subjects (60 in each group) may be required in the pilot trial. It is much more efficient to use a larger pilot study than to guard against the lack of precision by using inflated estimates.
Keywords: sample size, feasibility studies, pilot studies, binary outcomes, continuous outcomes, RCTs
In 2012/13, the National Institute for Health Research (NIHR) funded £208.9 million of research grants across a broad range of programmes and initiatives to ensure that patients and the public benefit from the most cost-effective up-to-date health interventions and treatments as quickly as possible. A substantial proportion of these research grants were randomised controlled trials (RCTs) to assess the clinical effectiveness and cost-effectiveness of new health technologies. Well-designed RCTs are widely regarded as the least biased research design for evaluating new health technologies, and decision-makers, such as the National Institute for Health and Care Excellence (NICE), are increasingly looking to the results of RCTs to guide practice and policy.
RCTs aim to provide precise estimates of treatment effects and therefore need to be well designed to have good power to answer specific clinically important questions. Both overpowered and underpowered trials are undesirable and each poses different ethical, statistical and practical problems. Good trial design requires the magnitude of the clinically important effect size to be stated in advance. However, some knowledge of the population variation of the outcome or the event rate in the control group is necessary before a robust sample size calculation can be done. If the outcome is well established, these key population or control parameters can be estimated from previous studies (RCTs or cohort studies) or through meta-analyses. However, in some cases finding robust estimates can pose quite a challenge if reliable data, for the proposed trial population under investigation, do not already exist.
A systematic review of published RCTs with continuous outcomes found evidence that the population variation was underestimated (in 80% of reported endpoints) in the sample size calculations compared to the variation observed when the trial was completed. This study also found that 25% of studies were vastly underpowered and would have needed five times the sample size if the variation observed in the trial had been used in the sample size calculation. A more recent review of trials with both binary and continuous outcomes found that there was a 50% chance of underestimating key parameters. However, its authors too found large differences between the estimates used in the sample size calculation and the estimates derived from the definitive trial. This suggests that many RCTs are indeed substantially underpowered or overpowered. A systematic review of RCT proposals reaching research ethics committees found that more than half of the studies included did not report the basis for the assumed values of the population parameters. So the values assumed for the key population parameters may be the weakest part of the RCT design.
A frequently reported problem with publicly funded RCTs is that the recruitment of participants is often slower or more difficult than expected, with many trials failing to reach their planned sample size within the originally envisaged trial timescale and trial-funding envelope. A review of a cohort of 122 trials funded by the United Kingdom (UK) Medical Research Council and the NIHR Health Technology Assessment programme found that less than a third (31%) of the trials achieved their original patient recruitment target, 55/122 (45.1%) achieved less than 80% of their original target and just over half (53%) were awarded an extension. Similar findings were reported in a recently updated review. Thus, many trials appear to have unrealistic recruitment rates. Trials that do not recruit to the target sample size within the time frame allowed will have reduced power to detect the pre-specified target effect size.
Thus the success of definitive RCTs is mainly dependent on the availability of robust information to inform the design. A well-designed, conducted and analysed pilot or feasibility trial can help inform the design of the definitive trial and increase the likelihood of the definitive trial achieving its aims and objectives. There is some confusion about terminology and what is a feasibility study and what is a pilot study. UK public funding bodies within the NIHR portfolio have agreed definitions for pilot and feasibility studies. Other authors have argued against the use of the term ‘feasibility’ and distinguish three types of preclinical trial work.
Distinguishing features of pilot and feasibility studies
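NIHR guidance states that feasibility studies are used to estimate important parameters that are needed to design the main study, for instance: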
standard deviation of the outcome measure, which is needed in some cases to estimate sample size;
willingness of participants to be randomised;
willingness of clinicians to recruit participants;
number of eligible patients over a specific time frame;
characteristics of the proposed outcome measure and in some cases feasibility studies might involve designing a suitable outcome measure;
follow-up rates, response rates to questionnaires, adherence/compliance rates, intracluster correlation coefficients in cluster trials, etc.
Feasibility studies for randomised controlled trials may themselves not be randomised. Crucially, feasibility studies do not evaluate the outcome of interest; that is left to the main study.
If a feasibility study is a small RCT, it need not have a primary outcome and the usual sort of power calculation is not normally undertaken. Instead the sample size should be adequate to estimate the critical parameters (e.g. recruitment rate) to the necessary degree of precision.
Pilot trials are a version of the main study that is run in miniature to test whether the components of the main study can all work together. It will therefore resemble the main study in many respects, including an assessment of the primary outcome. In some cases this will be the first phase of the substantive study and data from the pilot phase may contribute to the final analysis; referred to as an internal pilot. Or at the end of the pilot study the data may be analysed and set aside, a so-called external pilot.
For the purposes of this paper we will use the term pilot study to refer to the pilot work conducted to estimate key parameters for the design of the definitive trial. There is extensive but separate literature on two-stage RCT designs using an internal pilot study [11–14].
There is disagreement over what sample size should be used for pilot trials to inform the design of definitive RCTs [15–18]. Some recommendations have been developed although there is no consensus on the matter. Furthermore, the majority of the recommendations focus on estimating the variability of a continuous outcome and relatively little attention is paid to binary outcomes. The disagreement stems from two competing pressures. Small studies can be imprecise and biased (as defined here by comparing the median of the sampling distribution to the true population value), so larger sample sizes are required to reduce both the magnitude of the bias and the imprecision. However, in general participants measured in an external pilot or feasibility trial do not contribute to the estimation of the treatment effect in the final trial, so our aim should be to maintain adequate power while keeping the total number of subjects studied to a minimum. Recently some authors have promoted the practice of taking account of the imprecision in the estimate of the variance for a continuous outcome. Several suggest the use of a one-sided confidence interval approach to guarantee that power is at least what is required more than 50% of the time [15, 18, 19].
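The one-sided confidence interval idea mentioned above can be sketched as follows. This is a minimal illustration, not the cited authors' code: it assumes the usual chi-square upper confidence limit for an SD, and it uses the Wilson-Hilferty approximation for the chi-square quantile so that only the Python standard library is needed. The function names, the 80% confidence level and the example pilot size are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile(p, df):
    """Approximate p-th quantile of a chi-square(df) variable (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def inflated_sd(sd_hat, df, confidence=0.8):
    """One-sided upper confidence limit for the SD at the given confidence level."""
    # The upper limit for sigma uses the LOWER chi-square quantile.
    return sd_hat * sqrt(df / chi2_quantile(1.0 - confidence, df))


# Hypothetical pilot: 20 subjects per group (df = 38), observed pooled SD of 1.0
factor = inflated_sd(1.0, df=38, confidence=0.8)
print(f"inflated SD: {factor:.3f}; sample size multiplier: {factor ** 2:.3f}")
```

Because the required sample size is proportional to the variance, the multiplier for the definitive trial is the square of the SD inflation factor, which is why small pilots combined with inflation can lead to large planned trials.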
This paper aims to provide recommendations and guidelines with respect to two considerations. Firstly, what is the number of subjects required in an external pilot RCT to estimate the uncertain critical parameters (SD for continuous outcomes; and consent rates, event rates and attrition rates for binary outcomes) needed to inform the design of the definitive RCT with a reasonable degree of precision? Secondly, how should these estimates from the pilot study be used to inform the sample size (and design) for the definitive RCT? We shall assume that the pilot study (and the definitive RCT) is a two-parallel-balanced-group superiority trial of a new treatment versus control.
For the purposes of this work we assume that the sample size of the definitive RCT is calculated using a level of significance and power argument. This is the approach that is currently commonly employed in RCTs; however, alternative methods to calculate sample size have been proposed, such as using the width of confidence intervals and Bayesian approaches to allow for uncertainty [21–23].
Our aim is to demonstrate the variation in estimates of population parameters taken from small studies. Though the sampling distributions of these parameters are well understood from statistical theory, we have chosen to present the behaviours of the distributions through simulation rather than through the theoretical arguments as the visual representation of the resulting distributions makes the results accessible to a wider audience.
Randomisation is not a necessary condition for estimating all parameters of interest. However, it should be noted that some parameters of interest during the feasibility phase are related to the randomisation procedure itself, such as the rate of willingness to be randomised, and the rate of retention or dropout in each randomised arm. In addition, randomisation ensures the equal distribution of known and unknown covariates on average across the randomised groups. This ensures that we can estimate parameters within arms without the need to worry about confounding factors. In this work we therefore decided to allow for the randomisation of participants to mimic the general setting for estimating all parameters, although it is acknowledged that some parameters are independent of randomisation.
We first consider a normally distributed outcome measured in two groups of equal size. We considered study groups of 10 to 80 subjects, in increments of five per group. For each pilot study size, 10,000 simulations were performed. Without loss of generality, we assumed the true population mean of the outcome is 0 and the true population variance is 1 (and that these are the same in the intervention and control groups). We then use the estimate of the SD, along with other information, such as the minimum clinically important difference in outcomes between groups and the Type I and Type II error levels, to calculate the required sample size (using the significance thresholds approach) for the definitive RCT.
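The simulation set-up described above can be sketched in a few lines. This is a minimal sketch, not the authors' Stata or R code: it draws both groups from a standard normal distribution, computes a pooled SD, and summarises the spread of the estimates. The seed, the three group sizes shown and the reduced number of replicates (1,000 rather than 10,000, to keep the run short) are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import mean, quantiles, variance


def pooled_sd(group_a, group_b):
    """Pooled SD from two groups, weighting each sample variance by its df."""
    df_a, df_b = len(group_a) - 1, len(group_b) - 1
    pooled_var = (df_a * variance(group_a) + df_b * variance(group_b)) / (df_a + df_b)
    return sqrt(pooled_var)


def simulate_pooled_sd(n_per_group, n_sims, rng):
    """Simulate pilot trials with true mean 0 and true SD 1 in both groups."""
    estimates = []
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        estimates.append(pooled_sd(a, b))
    return estimates


rng = random.Random(2014)  # arbitrary seed, for reproducibility only
summary = {}
for n in (10, 35, 80):
    est = simulate_pooled_sd(n, n_sims=1000, rng=rng)
    cuts = quantiles(est, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    summary[n] = (mean(est), cuts[0], cuts[-1])
    print(f"n per group={n}: mean={summary[n][0]:.3f}, "
          f"2.5%={cuts[0]:.3f}, 97.5%={cuts[-1]:.3f}")
```

The interval between the 2.5% and 97.5% points of the simulated estimates narrows as the group size grows, which is the behaviour the simulation is designed to display.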
The target difference or effect size that is regarded as the minimum clinically important difference is usually the difference in the means when comparing continuous outcomes for the intervention with those of the control group. This difference is then converted to a standardised effect size by dividing by the population SD. More details of the statistical hypothesis testing framework in RCTs can be found in the literature [24, 25].
For a two-group pilot RCT we can use the SD estimate from the new treatment group or the control/usual care group, or combine the SD estimates from the two groups and use a pooled standard deviation (SDp) estimated from the two group-specific sample SDs. For sample size calculations, we generally assume the variability of the outcome is the same in both groups, although this assumption can be relaxed and methods are available for calculating sample sizes assuming unequal SDs in each group [26, 27]. This is analogous to using the standard t-test with two independent samples (or multiple linear regression), which assumes equal variances, to analyse the outcome data compared with using versions of the t-test that do not assume equal variances (e.g. Satterthwaite’s or Welch’s correction).
We assume binary outcomes are binomially distributed and consider a number of different true population proportions, as the variation of the proportion estimator is a function of the true proportion. When estimating an event rate, it may not always be appropriate to pool the two arms of the study, so we study the impact of estimating a proportion from a single arm where the study size increases in steps of five subjects. We considered true proportions in the range 0.1 to 0.5 in increments of 0.05. For each scenario and sample size, we simulated the feasibility study at least 10,000 times depending on the assumed true proportion. For the binary outcomes, the number of simulations was determined by requiring the proportion to be estimated within a standard error of 0.001. Hence, the largest number of simulations required was 250,000 when the true proportion was equal to 0.5. Simulations were performed in Stata version 12.1 and R version 13.2.
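The figure of 250,000 simulations follows from the binomial standard error, sqrt(p(1 − p)/N): setting this equal to 0.001 and solving for N gives N = p(1 − p)/0.001². A short sketch of that arithmetic is given below; the function name is hypothetical, and the rounding step is only there to absorb floating-point noise.

```python
from math import ceil


def simulations_needed(true_p, target_se=0.001):
    """Number of replicates so that a simulated proportion has the target SE."""
    raw = true_p * (1.0 - true_p) / target_se ** 2
    # Round first so that tiny floating-point excess does not add a replicate.
    return ceil(round(raw, 6))


for p in (0.1, 0.25, 0.5):
    print(p, simulations_needed(p))
```

The requirement is largest at a true proportion of 0.5 because p(1 − p) is maximised there.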
Normally distributed outcomes
To quantify the relative change in precision, we compared the average width of the 95% confidence intervals, WCI(2n), for the SDp for study sizes of 2n with the average width when the study size was increased to 2(n + 5). We use the width of the confidence interval as this provides a measure of the precision of the estimate.
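One plausible way to compute such a relative change is sketched below. It is an assumption-laden illustration: the exact formula used for the relative gain is not reproduced in this text, so the sketch takes it to be 1 − W(2(n + 5))/W(2n), uses the standard chi-square interval for an SD with the observed SD fixed at 1 rather than averaging over simulated pilots, and approximates the chi-square quantiles with the Wilson-Hilferty formula. Its numbers therefore need not match the thresholds reported in the paper.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile(p, df):
    """Approximate p-th quantile of a chi-square(df) variable (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def sd_ci_width(n_per_group, sd_hat=1.0, level=0.95):
    """Width of the two-sided CI for a pooled SD with 2n - 2 degrees of freedom."""
    df = 2 * n_per_group - 2
    tail = (1.0 - level) / 2.0
    lower = sd_hat * sqrt(df / chi2_quantile(1.0 - tail, df))
    upper = sd_hat * sqrt(df / chi2_quantile(tail, df))
    return upper - lower


def relative_gain(n_per_group, step=5):
    """Assumed definition: proportional reduction in CI width after adding `step` per group."""
    return 1.0 - sd_ci_width(n_per_group + step) / sd_ci_width(n_per_group)


gains = {n: relative_gain(n) for n in range(10, 80, 5)}
for n, gain in gains.items():
    print(f"{n} -> {n + 5} per group: {100 * gain:.1f}% narrower")
```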
Bias is assessed by subtracting the true value from each estimate and taking the mean of these differences.
To consider the impact on power and planned sample size, we need to state reasonable specific alternative hypotheses. In trials, it is uncommon to see large differences between treatments so we considered small to medium standardised effect sizes (differences between the group means) of 0.2, 0.35 and 0.5. For each true effect size of 0.2, 0.35 or 0.5, we divide by the SDp estimate for each replicate, and use this value to calculate the required sample size. For each simulated pilot study, we calculate the planned sample size for the RCT assuming either the unadjusted or adjusted SDp estimated from the pilot. Using this planned sample size (where the SDp has been estimated) we then calculate the true power of the planned study assuming that we know that the true population SDp is in fact 1.
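The two steps described above, planning a sample size from an estimated SD and then evaluating the power actually achieved when the true SD is 1, can be sketched as follows. This is a minimal sketch under a normal approximation for a two-sided, two-group comparison of means; the paper may have used a t-based calculation, and the function names and the example SD estimate of 0.8 are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def planned_n_per_group(effect, sd_hat, alpha=0.05, power=0.9):
    """Per-group sample size for a two-sided test, using the estimated SD."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z_sum * sd_hat / effect) ** 2)


def true_power(n_per_group, effect, true_sd=1.0, alpha=0.05):
    """Power achieved with n per group when the SD really is true_sd."""
    noncentrality = effect / (true_sd * sqrt(2 / n_per_group))
    return Z.cdf(noncentrality - Z.inv_cdf(1 - alpha / 2))


# A pilot that underestimates the SD (0.8 instead of 1) plans too small a trial.
for sd_hat in (0.8, 1.0, 1.2):
    n = planned_n_per_group(effect=0.5, sd_hat=sd_hat)
    print(f"SD estimate {sd_hat}: n per group = {n}, "
          f"true power = {true_power(n, effect=0.5):.3f}")
```

Applying these two functions to every simulated SD estimate gives the distribution of true power that results from planning on a pilot estimate.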
For the binary outcomes, as for the continuous outcomes, bias is assessed by subtracting the true population value from each estimate and taking the signed mean of these differences. We also report the 95% coverage probability.
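Bias and coverage for an event rate estimated from a single arm can be approximated by simulation, as in the sketch below. The interval type used in the paper is not stated in this text, so the simple Wald interval here is an assumption, as are the seed, the true proportion of 0.3, the arm size of 60 and the 5,000 replicates.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_proportion(true_p, n, n_sims, rng, level=0.95):
    """Return (mean signed bias, coverage of the Wald interval) for one arm of size n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    errors, covered = [], 0
    for _ in range(n_sims):
        events = sum(rng.random() < true_p for _ in range(n))
        p_hat = events / n
        half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        errors.append(p_hat - true_p)
        covered += (p_hat - half_width <= true_p <= p_hat + half_width)
    return mean(errors), covered / n_sims


rng = random.Random(2014)  # arbitrary seed, for reproducibility only
bias, coverage = simulate_proportion(true_p=0.3, n=60, n_sims=5000, rng=rng)
print(f"mean bias = {bias:+.4f}, coverage = {coverage:.3f}")
```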
Results and discussion
Normally distributed outcomes
Our simulated data visually demonstrate the large sampling variation that is the main weakness when estimating key parameters from small sample sizes. Small sample sizes do lead to biased estimates, but the bias is negligible compared to the sampling variation. When we examine the relative percentage gain in precision by adding more subjects to the sample, our data suggest that a total of at least 70 may be necessary for estimating the standard deviation of a normally distributed variable with good precision, and 60 to 100 subjects in a single group for estimating an event rate seems reasonable. Treatment-independent parameters may be estimated by pooling the two groups, so in many cases our recommended sample size will be the total sample size. On average, when the definitive RCT is planned using an estimate from a pilot study there will be a tendency for the planned study to be underpowered. However, if the definitive RCT is planned for a continuous outcome requiring a power of 90% then the true power will be 80% with at least 76% assurance provided the estimates come from a pilot with at least 20 subjects. We considered three realistic effect sizes of 0.2, 0.35 and 0.5 of a standard deviation to evaluate the impact of adjusting for the anticipated uncertainty in the estimate from the pilot when calculating the sample size for the planned RCT, as was recently suggested. For all of the effect sizes considered, it is not efficient to use small pilots and apply the inflation adjustment, as this will result in larger sample sizes (pilot plus main study) in total. Further, we only considered sample sizes planned when requiring 90% power, and examined the conditional power assuming we know the true alternative. On average, using imprecise estimates but requiring high power will result in acceptable power with much less ‘cost’ as measured by total sample size.
Hence, it is actually more efficient to use a large external pilot study to reduce the variation around the target power for the definitive RCT.
The implication of using estimates of key parameters from small pilot studies is the risk of both over- and underpowered studies. While overpowered studies may not seem such an acute problem, they are potentially a costly mistake and may result in a study being judged as prohibitively large. This would seem to be an argument in favour of utilising internal pilot studies, but an internal pilot requires the key design features of the trial to be fixed, so any change in measurement of the treatment effect following an internal pilot will lead to analysis difficulties.
A major and well-documented problem with published trials is under-recruitment, where there is a tendency to recruit fewer subjects than targeted. One reason for under-recruitment may well be that event rates such as recruitment and willingness to be randomised cannot be accurately estimated from small pilots, and in fact increasing the pilot size to between 60 and 100 per group may give much more reliable data on the critical recruitment parameters.
In reality, when designing external pilot trials, there is a need to balance two competing issues: maximising the precision (of the critical parameters you wish to estimate) and minimising the size of the external pilot trial, which impacts on resources, time and costs. Thus there is a trade-off between the precision (of the estimates of the critical parameters) and size (number of subjects) of the pilot study. When designing external pilot trials, researchers need to understand that they are trading off the precision of the estimates against the total sample size of the definitive study when they decide to have an external pilot study with a small sample size.
NICE: National Institute for Health and Care Excellence
NIHR: National Institute for Health Research
RCT: randomised controlled trial
MDT, SJW, AW and NS are funded by the University of Sheffield. MD is fully funded by NIHR as part of a doctoral research fellowship (DRF-2012-05-182). AH was funded by NIHR-Research Design Service and the University of Sheffield. The views expressed are those of the authors and not necessarily those of the National Health Service, the NIHR, the Department of Health or organisations affiliated to or funding them.
The authors thank the three reviewers for their detailed critical comments, which substantially improved the manuscript. We also thank members of the Medical Statistics Group at the School of Health and Related Research, University of Sheffield, for constructive discussions and input to the project. We acknowledge the University of Sheffield for supporting this research.
1. NIHR Annual Report 2012/2013. [http://www.nihr.ac.uk/publications]
2. Vickers AJ: Underpowering in randomized trials reporting a sample size calculation. J Clin Epidemiol. 2003, 56 (8): 717-720. 10.1016/S0895-4356(03)00141-0.
3. Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P: Reporting of sample size calculation in randomised controlled trials: review. BMJ. 2009, 338: b1732. 10.1136/bmj.b1732.
4. Clark T, Berger U, Mansmann U: Sample size determinations in original research protocols for randomised clinical trials submitted to UK research ethics committees: review. BMJ. 2013, 346: f1135. 10.1136/bmj.f1135.
5. McDonald AM, Knight RC, Campbell MK, Entwistle VA, Grant AM, Cook JA, Elbourne DR, Francis D, Garcia J, Roberts I: What influences recruitment to randomised controlled trials? A review of trials funded by two UK funding agencies. Trials. 2006, 7 (1): 9. 10.1186/1745-6215-7-9.
6. Sully BG, Julious SA, Nicholl J: A reinvestigation of recruitment to randomised, controlled, multicenter trials: a review of trials funded by two UK funding agencies. Trials. 2013, 14 (1): 166. 10.1186/1745-6215-14-166.
7. NIHR, Feasibility and pilot studies. [http://www.nets.nihr.ac.uk/glossary]
8. Arnold DM, Burns KEA, Adhikari NKJ, Kho ME, Meade MO, Cook DJ: The design and interpretation of pilot trials in clinical research in critical care. Crit Care Med. 2009, 37 (1): S69-S74.
9. Thabane L, Ma J, Chu R, Cheng J, Ismaila A, Rios L, Robson R, Thabane M, Giangregorio L, Goldsmith C: A tutorial on pilot studies: the what, why and how. BMC Med Res Methodol. 2010, 10 (1): 1. 10.1186/1471-2288-10-1.
10. Lee EC, Whitehead AL, Jacques RM, Julious SA: The statistical interpretation of pilot trials: should significance thresholds be reconsidered?. BMC Med Res Methodol. 2014, 14: 41. 10.1186/1471-2288-14-41.
11. Proschan MA: Two-stage sample size re-estimation based on a nuisance parameter: a review. J Biopharm Stat. 2005, 15 (4): 559-574. 10.1081/BIP-200062852.
12. Birkett MA, Day SJ: Internal pilot studies for estimating sample size. Stat Med. 1994, 13 (23–24): 2455-2463.
13. Wittes J, Brittain E: The role of internal pilot-studies in increasing the efficiency of clinical-trials. Stat Med. 1990, 9 (1–2): 65-72.
14. Friede T, Kieser M: Blinded sample size re-estimation in superiority and noninferiority trials: bias versus variance in variance estimation. Pharm Stat. 2013, 12 (3): 141-146. 10.1002/pst.1564.
15. Browne RH: On the use of a pilot sample for sample-size determination. Stat Med. 1995, 14 (17): 1933-1940. 10.1002/sim.4780141709.
16. Julious SA: Sample size of 12 per group rule of thumb for a pilot study. Pharm Stat. 2005, 4 (4): 287-291. 10.1002/pst.185.
17. Julious SA: Designing clinical trials with uncertain estimates of variability. Pharm Stat. 2004, 3 (4): 261-268. 10.1002/pst.139.
18. Sim J, Lewis M: The size of a pilot study for a clinical trial should be calculated in relation to considerations of precision and efficiency. J Clin Epidemiol. 2012, 65 (3): 301-308. 10.1016/j.jclinepi.2011.07.011.
19. Kieser M, Wassmer G: On the use of the upper confidence limit for the variance from a pilot sample for sample size determination. Biom J. 1996, 38 (8): 941-949. 10.1002/bimj.4710380806.
20. Bland JM: The tyranny of power: is there a better way to calculate sample size?. BMJ. 2009, 339: b3985. 10.1136/bmj.b3985.
21. Sahu SK, Smith TMF: A Bayesian method of sample size determination with practical applications. J R Stat Soc Ser A – Stat Soc. 2006, 169: 235-253. 10.1111/j.1467-985X.2006.00408.x.
22. O’Hagan A, Stevens JW, Campbell MJ: Assurance in clinical trial design. Pharm Stat. 2005, 4 (3): 187-201. 10.1002/pst.175.
23. Brutti P, De Santis F: Robust Bayesian sample size determination for avoiding the range of equivalence in clinical trials. J Stat Plann Inference. 2008, 138 (6): 1577-1591. 10.1016/j.jspi.2007.05.041.
24. Kirkwood BR, Sterne JAC: Essential Medical Statistics. 2003, Oxford: Blackwell Science, 2nd edition.
25. Campbell MJ, Walters SJ, Machin D: Medical Statistics: A Textbook for the Health Sciences. 2007, Chichester: Wiley, 4th edition.
26. Satterthwaite FE: An approximate distribution of estimates of variance components. Biometrics Bull. 1946, 2: 110-114. 10.2307/3002019.
27. Welch BL: The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika. 1947, 34: 28-35.
28. StataCorp: Statistical Software: Release 12. 2011, College Station, TX.
29. R Core Team: R: A Language and Environment for Statistical Computing. 2013, Vienna, Austria: R Foundation for Statistical Computing.
30. Cohen J: Statistical Power Analysis for the Behavioural Sciences. 1988, Hillsdale, NJ: Lawrence Erlbaum, 2nd edition.
31. Agresti A, Coull BA: Approximate is better than ‘exact’ for interval estimation of binomial proportions. Am Statistician. 1998, 52 (2): 119-126.
32. Burton A, Altman DG, Royston P, Holder RL: The design of simulation studies in medical statistics. Stat Med. 2006, 30 (25): 4279-4292.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.