Our aim is to demonstrate the variation in estimates of population parameters taken from small studies. Though the sampling distributions of these parameters are well understood from statistical theory, we have chosen to present the behaviours of the distributions through simulation rather than through the theoretical arguments as the visual representation of the resulting distributions makes the results accessible to a wider audience.

Randomisation is not a necessary condition for estimating all parameters of interest. However, it should be noted that some parameters of interest during the feasibility phase are related to the randomisation procedure itself, such as the rate of willingness to be randomised, and the rate of retention or dropout in each randomised arm. In addition, randomisation ensures the equal distribution of known and unknown covariates on average across the randomised groups. This ensures that we can estimate parameters within arms without the need to worry about confounding factors. In this work we therefore decided to allow for the randomisation of participants to mimic the general setting for estimating all parameters, although it is acknowledged that some parameters are independent of randomisation.

We first consider a normally distributed outcome measured in two groups of equal size. We considered group sizes from 10 to 80 subjects, in increments of five per group. For each pilot study size, 10,000 simulations were performed. Without loss of generality, we assumed the true population mean of the outcome is 0 and the true population variance is 1 (and that these are the same in the intervention and control groups). We then use the estimate of the SD, along with other information, such as the minimum clinically important difference in outcomes between groups, and Type I and Type II error levels, to calculate the required sample size (using the significance thresholds approach) for the definitive RCT.
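This final step can be illustrated with a short Python sketch (the paper's own simulations used Stata and R; the function name is ours) of the standard two-group sample size formula based on normal quantiles:

```python
from math import ceil
from statistics import NormalDist

def rct_sample_size_per_group(delta, sd, alpha=0.05, power=0.9):
    """Per-group sample size for a two-group comparison of means
    (normal approximation), given the minimum clinically important
    difference `delta` and an SD estimate `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    effect = delta / sd  # standardised effect size
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect ** 2)
```

For example, with a standardised effect of 0.5, 5% two-sided significance and 90% power, this gives 85 subjects per group; the smaller the SD estimate from the pilot, the larger the implied standardised effect and the smaller the planned trial.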

The target difference or effect size that is regarded as the minimum clinically important difference is, for continuous outcomes, usually the difference in means between the intervention and control groups. This difference is then converted to a standardised effect size by dividing by the population SD. More details of the statistical hypothesis testing framework in RCTs can be found in the literature [24, 25].

For a two-group pilot RCT we can use the SD estimate from the new treatment group or the control/usual care group, or combine the two SD estimates and use a pooled standard deviation (SD_{p}) estimated from the two group-specific sample SDs. For sample size calculations, we generally assume the variability of the outcome is the same in both groups, although this assumption can be relaxed and methods are available for calculating sample sizes assuming unequal SDs in each group [26, 27]. This is analogous to using the standard *t*-test with two independent samples (or multiple linear regression), which assumes equal variances, to analyse the outcome data, compared with using versions of the *t*-test that do not assume equal variances (e.g. Satterthwaite’s or Welch’s correction).

We assume binary outcomes are binomially distributed and consider a number of different true population proportions, as the variance of the proportion estimator is a function of the true proportion. When estimating an event rate, it may not always be appropriate to pool the two arms of the study, so we study the impact of estimating a proportion from a single arm where the study size increases in steps of five subjects. We considered true proportions in the range 0.1 to 0.5 in increments of 0.05. For each scenario and sample size, we simulated the feasibility study at least 10,000 times, with the exact number depending on the assumed true proportion. For the binary outcomes, the number of simulations was determined by requiring the proportion to be estimated to within a standard error of 0.001. Hence, the largest number of simulations required was 250,000, when the true proportion was equal to 0.5. Simulations were performed in Stata version 12.1 [28] and R version 13.2 [29].
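The number of simulations needed for each true proportion follows directly from the binomial standard error, by solving sqrt(θ(1 − θ)/N_s) ≤ 0.001 for N_s. A minimal sketch of this stopping rule (function name is ours):

```python
from math import ceil

def sims_needed(theta, target_se=0.001):
    """Number of simulations so that the Monte Carlo standard error
    of the estimated proportion is at most `target_se`."""
    return ceil(theta * (1 - theta) / target_se ** 2)
```

At θ = 0.5 the binomial variance is maximised, giving the 250,000 simulations quoted above; smaller (or larger) true proportions require fewer.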

### Normally distributed outcomes

For each simulation, sample variances were calculated for each group ($s_1^2$ and $s_2^2$) and the pooled SD was calculated as follows:

$$SD_{p} = \sqrt{\frac{(n-1)s_1^2 + (n-1)s_2^2}{2n-2}} = \sqrt{\frac{s_1^2 + s_2^2}{2}}$$

where *n* is the number of subjects per group. We also computed the standard error of the sample pooled SD, which is approximately

$$SE(SD_{p}) \approx \frac{SD_{p}}{\sqrt{2(2n-2)}}$$

To quantify the relative change in precision, we compared the average width of the 95% confidence intervals (WCI_{2n}) for the SD_{p} for study sizes of 2*n* with the average width when the study size was increased to 2(*n* + 5). We use the width of the confidence interval as this provides a measure of the precision of the estimate.

Given the sampling distribution of the SD, its lower and upper 95% confidence limits are given by:

$$\left( SD_{p}\sqrt{\frac{2n-2}{\chi^2_{2n-2,\,0.975}}},\quad SD_{p}\sqrt{\frac{2n-2}{\chi^2_{2n-2,\,0.025}}} \right)$$

where $\chi^2_{\nu,\,q}$ denotes the *q* quantile of the chi-squared distribution with $\nu = 2n-2$ degrees of freedom, and the relative percentage gain in precision is quantified as the reduction in 95% confidence interval width if the sample size is increased by five per group:

$$100 \times \frac{WCI_{2n} - WCI_{2(n+5)}}{WCI_{2n}}$$
Bias is assessed by subtracting the true value from each estimate and taking the mean of these differences.
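The chi-squared confidence limits and the precision-gain calculation can be sketched in Python (a sketch, not the paper's Stata/R code; for simplicity the chi-squared quantiles use the Wilson-Hilferty approximation rather than exact values, which a statistical package such as R's `qchisq` would provide):

```python
from math import sqrt
from statistics import NormalDist

def chi2_quantile(q, df):
    """Wilson-Hilferty approximation to the chi-squared q-quantile."""
    z = NormalDist().inv_cdf(q)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3

def sd_ci_width(sd_p, n_per_group):
    """Width of the 95% CI for the pooled SD, with 2(n - 1) df."""
    df = 2 * (n_per_group - 1)
    lower = sd_p * sqrt(df / chi2_quantile(0.975, df))
    upper = sd_p * sqrt(df / chi2_quantile(0.025, df))
    return upper - lower

def precision_gain(n_per_group, sd_p=1.0):
    """Percentage reduction in CI width when adding five per group."""
    w_now = sd_ci_width(sd_p, n_per_group)
    w_next = sd_ci_width(sd_p, n_per_group + 5)
    return 100.0 * (w_now - w_next) / w_now
```

With 10 per group the 95% interval for an SD of 1 is roughly (0.76, 1.48), and moving to 15 per group narrows it by around a fifth, illustrating how quickly the gain per extra five subjects diminishes.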

We also consider the impact of adjusting the SD estimate from the pilot as suggested originally by Browne in 1995 [15]. Here a one-sided confidence limit is proposed to give a corrected value. If we used the 50% one-sided confidence limit, this would adjust for the bias in the estimate, and this correction has also been proposed when using small pilots [17]. If we specify 50% confidence then our power will be as required 50% of the time. Sim and Lewis [18] suggest that it is reasonable to require that the sample size calculation guarantees the desired power with a specified level of confidence greater than 50%. For the sake of illustration, we will consider an 80% confidence level for the inflation factor, so we require the one-sided confidence limit associated with 80% confidence above the estimated value. Hence the inflation factor to apply to the SD_{p} from the pilot is:

$$\sqrt{\frac{2n-2}{\chi^2_{2n-2,\,0.20}}}$$

To consider the impact on power and planned sample size, we need to state reasonable specific alternative hypotheses. In trials it is uncommon to see large differences between treatments, so we considered small to medium standardised effect sizes (differences between the group means in SD units) of 0.2, 0.35 and 0.5 [30]. For each true effect size of 0.2, 0.35 or 0.5, we divide by the SD_{p} estimate for each replicate, and use this value to calculate the required sample size. For each simulated pilot study, we calculate the planned sample size for the RCT assuming either the unadjusted or adjusted SD_{p} estimated from the pilot. Using this planned sample size (where the SD_{p} has been estimated) we then calculate the true power of the planned study, given that the true population SD is in fact 1.

### Binary outcomes

We consider that the binary outcome will be measured for one homogeneous group only. The following is repeated for each true population success probability. We examined nine true success probabilities from 0.1 to 0.5 in intervals of 0.05. We considered 39 different pilot study sizes, ranging from 10 to 200 in multiples of five subjects. The subscripts *i* and *j* are used to denote the true proportion and the pilot study size, respectively. For each simulated pilot study of size *n*_{j}, the number of successes, *Y*_{ij} ~ Bin(*n*_{j}, *θ*_{i}), is counted. First, the observed proportion, $\hat{\theta}_{ij}$, for each of the nine true success probabilities was calculated by:

$$\hat{\theta}_{ij} = \frac{Y_{ij}}{n_j}$$

The associated 95% confidence interval was calculated using Wilson’s score method [21], given by:

$$\frac{\hat{\theta}_{ij} + \dfrac{z^2}{2n_j} \pm z\sqrt{\dfrac{\hat{\theta}_{ij}(1-\hat{\theta}_{ij})}{n_j} + \dfrac{z^2}{4n_j^2}}}{1 + \dfrac{z^2}{n_j}}$$

where *z* = 1.96 is the 97.5% point of the standard normal distribution.

Second, this process was repeated for *N*_{s} simulations (the number needed to estimate the true success probability to within a standard error of 0.001), and the average observed success probability for each of the nine true success probabilities (*θ*_{i}) for a given fixed pilot size was calculated as follows:

$$\bar{\theta}_{ij} = \frac{1}{N_s}\sum_{k=1}^{N_s}\hat{\theta}_{ijk}$$

where $\hat{\theta}_{ijk}$ is the observed proportion $\hat{\theta}_{ij}$ for the *k*th simulated pilot study. Third, due to the relatively small sample size of the pilot trials, we computed the mean width of the 95% confidence interval of the true success probability, averaged over the *N*_{s} simulations, using the Wilson’s score method [31] for a fixed sample size, which is given by:

$$\overline{WCI}_{n_j} = \frac{1}{N_s}\sum_{k=1}^{N_s}\left(U_{ijk} - L_{ijk}\right)$$

where *U*_{ijk} and *L*_{ijk} are the upper and lower Wilson score limits for the *k*th simulated pilot study.

The relative percentage gain in precision around the true binomial proportion per increase of five study participants is defined as before:

$$100 \times \frac{\overline{WCI}_{n_j} - \overline{WCI}_{n_j+5}}{\overline{WCI}_{n_j}}$$

As for the continuous outcomes, bias is assessed by subtracting the true population value from each estimate and taking the mean of these signed differences. We also report the 95% coverage probability [32].