 Methodology
 Open access
A sample size planning approach that considers both statistical significance and clinical significance
Trials volume 16, Article number: 213 (2015)
Abstract
Background
The CONSORT statement requires clinical trials to report confidence intervals, which help to assess the precision and clinical importance of the treatment effect. Conventional sample size calculations for clinical trials, however, only consider issues of statistical significance (that is, significance level and power).
Method
A more consistent approach is proposed whereby sample size planning also incorporates information on clinical significance as indicated by the boundaries of the confidence limits of the treatment effect.
Results
The probabilities of declaring a “definitive-positive” or “definitive-negative” result (as defined by Guyatt et al., CMAJ 152(2):169–73, 1995) are controlled by calculating the sample size such that the lower confidence limit under H_{1} and the upper confidence limit under H_{0} are bounded by relevant cutoffs. Adjustments to the traditional sample size can be directly derived for the comparison of two normally distributed means in a test of non-equality, while simulations are used to estimate the sample size for evaluating the hazard ratio in a proportional-hazards model.
Conclusions
This sample size planning approach allows for an assessment of the potential clinical importance and precision of the treatment effect in a clinical trial in addition to considerations of statistical power and type I error.
Background
The importance of confidence intervals is clearly attested by journal guidelines [1–3] as they “convey information about magnitude and precision of effect simultaneously, and keep these two aspects of measurements closely linked” [4]. For clinical trials, the CONSORT statement [5] stipulates the reporting of the “estimated effect size and its precision (such as 95% confidence interval)” and “how sample size was determined,” but traditional sample size calculations for testing scientific hypotheses consider only statistical significance and power. The precision and clinical importance of the effect, which can be depicted by confidence intervals, are ignored. Under the usual practice, one calculates the sample size needed to declare some “clinically important difference” statistically significant at the α-level with 1 − β probability. The problem is that there is substantial subjectivity in quantifying this difference, and this can turn the sample size calculation into a moot exercise for choosing a difference to justify the number of patients the study can afford [6]. Frequently, the selected difference ends up larger than what is usual, and thus many studies may display large differences but lack the precision to make them statistically significant. Such shortcomings have led some to argue for reform of current sample size conventions in order to avoid misinterpretation of completed studies and harm to scientific research [7].
What would be helpful is a sample size estimation procedure that provides information on the confidence interval, supplying users with information on the clinical significance and precision of the treatment effect in addition to power and statistical significance. Beal [8] suggested selecting sample size such that there is a high probability of the half-width of the confidence interval being less than some prescribed length, conditional on the interval containing the parameter of interest. Similarly, Liu [9] chose the sample size to yield a short confidence interval width but conditional on the rejection of the null hypothesis H_{0}. Jiroutek et al. [10] combined the two by considering the probability of attaining a certain interval width conditional on both rejection of H_{0} and inclusion of the true parameter. Cesana et al. [11,12] introduced a two-step procedure by first obtaining the sample size according to power and then iteratively increasing the sample size until the probability of obtaining confidence intervals with widths less than the expected interval width under H_{1} exceeds a specified level.
In the above methods, the user either has to designate an interval length as reference or rely on the expected interval width, which may not be clinically relevant. A more straightforward alternative is to calculate a sample size such that the confidence limits of the parameter will be bounded by designated cutoffs. Specifically, the sample size is chosen such that, according to the confidence limits, the result can be deemed “definitive-positive” if there is indeed an effect or deemed “definitive-negative” if there is none. According to Guyatt et al. [13], a “definitive-positive” result implies that the lower confidence limit (LCL) of the parameter is not only larger than zero, implying a “positive” and statistically significant study, but above a relevant non-zero threshold. Conversely, a “definitive-negative” result implies that the upper confidence limit (UCL) is below some non-zero threshold. In hypothesis testing, one does not know whether H_{1} or H_{0} is true and can only control the probabilities of making a false positive or false negative error. Likewise, in this approach, we control the probabilities of declaring a “definitive-positive” or “definitive-negative” result by calculating the sample size such that the LCL under H_{1} and the UCL under H_{0} are bounded by fixed cutoffs. The following section demonstrates these concepts first for continuous normally distributed data and then for time-to-event data.
Methods
Normally distributed data
Consider a randomized 1:1 clinical trial comparing the mean responses between the treatment and control groups. When the response (or an appropriately transformed response) can be regarded as normally distributed, the assessment of the treatment effect can be formulated as a hypothesis test of H_{0}: μ_{1} − μ_{0} = 0 versus H_{1}: μ_{1} − μ_{0} ≠ 0. The sample size per group is then given by

\( n=\frac{{\left({Z}_{1-\alpha /2}+{Z}_{1-\beta}\right)}^2{\sigma}^2}{\delta^2}, \)
(1)

where Z_{γ} is the γth quantile of the standard normal distribution, (μ_{0}, σ_{0}) and (μ_{1}, σ_{1}) are the means and standard deviations of the control and treatment groups, respectively, \( {\sigma}^2={\sigma}_0^2+{\sigma}_1^2 \), and δ = μ_{1} − μ_{0} is the clinically important difference to be detected at level α with power 1 − β.
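As a quick numerical check, the per-group sample size n = (Z_{1−α/2} + Z_{1−β})²σ²/δ² can be computed with the Python standard library (an illustrative sketch; the article's own code is in SAS):

```python
from statistics import NormalDist
from math import ceil

def sample_size_per_group(delta, sigma2, alpha=0.05, beta=0.20):
    """Per-group n for the two-sided test of H0: mu1 - mu0 = 0,
    n = (Z_{1-alpha/2} + Z_{1-beta})^2 * sigma2 / delta^2,
    where sigma2 = sigma0^2 + sigma1^2 (sum of the two group variances)."""
    z = NormalDist().inv_cdf
    return ceil((z(1 - alpha / 2) + z(1 - beta)) ** 2 * sigma2 / delta ** 2)

# The Discussion's worked example: sigma^2 = 2, delta = 1 gives n = 16 per group;
# bounding the LCL by delta/2 (k1 = 1/2) then requires n1 = n/(1 - k1)^2 = 4n = 64.
n = sample_size_per_group(delta=1.0, sigma2=2.0)
```

This reproduces the n = 16 and n_{1} = 64 figures used later in the Discussion.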
We first examine how likely the above sample size will yield a “definitive-negative” or “definitive-positive” result by calculating, respectively, the probabilities Pr(UCL < k_{0}δ | H_{0}) and Pr(LCL > k_{1}δ | H_{1}) for k_{0}, k_{1} ∈ [0,1]. Without loss of generality, assume δ > 0 and let \( \overline{D} \) be the sample estimate of the treatment difference. If σ is known, then

\( \Pr \left(\mathrm{UCL}<{k}_0\delta \mid {H}_0\right)=\Pr \left(\overline{D}+{Z}_{1-\alpha /2}\sigma /\sqrt{n}<{k}_0\delta \mid {H}_0\right)=\Pr \left(Z<{k}_0\delta \sqrt{n}/\sigma -{Z}_{1-\alpha /2}\right) \)
(2)

and

\( \Pr \left(\mathrm{LCL}>{k}_1\delta \mid {H}_1\right)=\Pr \left(\overline{D}-{Z}_{1-\alpha /2}\sigma /\sqrt{n}>{k}_1\delta \mid {H}_1\right)=\Pr \left(Z<\left(1-{k}_1\right)\delta \sqrt{n}/\sigma -{Z}_{1-\alpha /2}\right), \)
(3)

where Z is the standard normal variable. As k_{0}, k_{1} vary from 0 to 1, these two probability functions are mirror images about 1/2, with Pr(LCL > δ/2 | H_{1}) = Pr(UCL < δ/2 | H_{0}). At the boundaries of 0 and 1, Pr(LCL > 0 | H_{1}) = Pr(UCL < δ | H_{0}) = 1 − β.
Based on the derivations of equations (2) and (3), it can be shown that if the sample size is increased to \( {n}_0=n/{k}_0^2 \) then Pr(UCL < k_{0}δ | H_{0}) = 1 − β for k_{0} ∈ (0,1), and if it is increased to n_{1} = n/(1 − k_{1})^{2} then Pr(LCL > k_{1}δ | H_{1}) = 1 − β for k_{1} ∈ (0,1). For example, with k_{0} = k_{1} = 1/2 and sample size n_{0} = n_{1} = 4n, both Pr(LCL > δ/2 | H_{1}) = Pr(UCL < δ/2 | H_{0}) = 1 − β. Note that if k_{0} = k_{1} < 1/2 then n_{0} > n_{1}, and a larger sample size is required to establish a “definitive-negative” compared to a “definitive-positive” result. Conversely, if k_{0} = k_{1} > 1/2, then n_{0} < n_{1}, and a larger sample size is needed to establish a “definitive-positive” result. In general, if

\( {k}_0+{k}_1=1\kern1em \mathrm{and}\kern1em {n}_0={n}_1=n/{k}_0^2=n/{\left(1-{k}_1\right)}^2, \)
(4)

then Pr(UCL < k_{0}δ | H_{0}) = Pr(LCL > k_{1}δ | H_{1}) = 1 − β. For example, if k_{0} = 2/3, k_{1} = 1/3, and n_{0} = n_{1} = 9n/4, then Pr(LCL > δ/3 | H_{1}) = Pr(UCL < 2δ/3 | H_{0}) = 1 − β.
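These probabilities have a simple closed form under normality with known σ: Φ(c·δ√n/σ − Z_{1−α/2}), where c = 1 − k₁ for the “definitive-positive” case and c = k₀ for the “definitive-negative” case. A small helper (an illustrative Python sketch using only the standard library) evaluates both:

```python
from statistics import NormalDist
from math import sqrt

def prob_definitive(k, n, delta, sigma2, alpha=0.05, positive=True):
    """Pr(LCL > k*delta | H1) when positive=True, or Pr(UCL < k*delta | H0)
    when positive=False, for a two-sided (1 - alpha) normal-theory interval,
    known sigma^2 = sigma0^2 + sigma1^2, and per-group sample size n."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    scale = delta * sqrt(n / sigma2)      # delta * sqrt(n) / sigma
    coef = (1 - k) if positive else k     # mirror-image symmetry about k = 1/2
    return nd.cdf(coef * scale - z_a)
```

With n = 16, δ = 1, σ² = 2, the value at k = 1/2 is about 0.29 (the article's 0.288 corresponds to the unrounded n); quadrupling n to 64 restores the probability to at least 0.8, as stated above.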
Time-to-event data
We extend our proposed method to time-to-event data, and use this case to show how a simulation-based approach can be used to estimate the sample size when the validity of the normal approximation may be in doubt. In situations where a closed-form sample size formula is not readily available or is difficult to derive, simulation provides an alternative and offers greater flexibility for adapting to more complicated analyses. Briefly, the initial sample size required to detect the clinically important difference δ at power 1 − β is first calculated and then iteratively increased until Pr(LCL > k_{1}δ | H_{1}) and Pr(UCL < k_{0}δ | H_{0}) reach desired levels. The hazard ratio Δ is chosen as the parameter of interest, with its corresponding confidence limits LCL and UCL being estimated using Cox regression. In the following description, we select for simplicity and convenience a single common cutoff by letting k_{0} = k_{1} = 1/2.
Under the proportional hazards assumption, the initial total sample size N_{0} for detecting δ = log_{e}Δ at level α and power 1 − β can be estimated using Schoenfeld’s [14] formula,

\( {N}_0=\frac{{\left({Z}_{1-\alpha /2}+{Z}_{1-\beta}\right)}^2}{\left(1-{\pi}_c\right){P}_0{P}_1{\delta}^2}, \)
(5)

where π_{c} is the overall censoring proportion, and P_{0} and P_{1} are the proportions of subjects in the control and treatment groups, respectively. (Another choice is to use Freedman’s [15] formula, which gives a slightly smaller sample size.)
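Schoenfeld's formula — required events d = (Z_{1−α/2} + Z_{1−β})²/(P₀P₁δ²), inflated for censoring via N₀ = d/(1 − π_c) — is easy to sketch in Python. The rounding convention here (round up to an even total) is an assumption, so the results may differ from the article's reported N₀ by a few subjects:

```python
from statistics import NormalDist
from math import ceil, log

def schoenfeld_total_n(hr, alpha=0.05, beta=0.20, p0=0.5, p1=0.5, pi_c=0.5):
    """Initial total sample size from Schoenfeld's formula with delta = log(hr):
    events d = (Z_{1-a/2} + Z_{1-b})^2 / (p0 * p1 * delta^2), then
    N0 = d / (1 - pi_c), rounded up to an even total for equal allocation."""
    z = NormalDist().inv_cdf
    delta = log(hr)
    d = (z(1 - alpha / 2) + z(1 - beta)) ** 2 / (p0 * p1 * delta ** 2)
    return 2 * ceil(d / (1 - pi_c) / 2)
```

With the defaults, this gives approximately 1264 for Δ = 1.25 and 132 for Δ = 2.0, matching the Results section up to rounding.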
Time-to-event data are simulated from the exponential distribution since it is the most widely used model for time-to-event data under the proportional hazards assumption. Specifically, we simulate exponential survival times T_{i} and exponential censoring times L_{i} for subjects i = 1, …, N_{0}/2 in each group, and consider a subject censored whenever L_{i} < T_{i}. According to Halabi and Bahadur [16], the parameters for the survival and censoring time distributions are related by

\( {\pi}_c=\frac{1}{2}\left[\frac{\lambda_c}{\lambda_0+{\lambda}_c}+\frac{\lambda_c}{\lambda_1+{\lambda}_c}\right], \)
(6)

where λ_{0}, λ_{1} are the hazard rates of the exponential survival times for the control and treatment groups, respectively, and λ_{c} is the hazard rate for the exponential censoring time. When π_{c} = 0.5, equation (6) reduces to the simple relationship \( {\lambda}_c=\sqrt{\lambda_0{\lambda}_1} \).
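The censoring-rate relationship can be verified by Monte Carlo: with λ_c = √(λ₀λ₁) (which gives λ_c = √λ₁ when λ₀ = 1, as in step 2 below), the overall censoring proportion should come out near 0.5. A short Python check (illustrative only):

```python
import random
from math import sqrt

def censoring_proportion(lam0, lam1, lam_c, n=200_000, seed=1):
    """Monte Carlo estimate of the overall censoring proportion for two
    equally sized groups with exponential survival rates lam0, lam1 and a
    common exponential censoring rate lam_c; a subject is censored when
    the censoring time precedes the survival time (L_i < T_i)."""
    rng = random.Random(seed)
    censored = 0
    for lam in (lam0, lam1):              # equal allocation to the two groups
        for _ in range(n // 2):
            t = rng.expovariate(lam)      # survival time T_i
            c = rng.expovariate(lam_c)    # censoring time L_i
            censored += c < t
    return censored / n

pi_c = censoring_proportion(lam0=1.0, lam1=1.75, lam_c=sqrt(1.75))
```

The estimate lands within simulation error of 0.5, confirming the choice λ_c = √λ₁ used in the steps below.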
We set λ_{0} = 1 and select four values, (1.25, 1.5, 1.75, 2.0), for the hazard ratio Δ ≡ λ_{1}/λ_{0} = λ_{1}. For each value of Δ, the procedure goes through the following steps:

1. With α = 0.05, β = 0.2, P_{0} = P_{1} = 0.5, π_{c} = 0.5, and δ = log_{e}(Δ), calculate the initial total sample size N_{0} using (5);

2. Simulate N_{0}/2 independent samples of exponential survival and censoring times for the treatment and control groups with corresponding parameters λ_{0} = 1, λ_{1}, and \( {\lambda}_c=\sqrt{\lambda_1} \);

3. Compare the survival times between the treatment and control groups using Cox regression and compute the 95% confidence interval for log_{e}(Δ);

4. Repeat steps (2) and (3) for 10,000 iterations and estimate Pr(LCL > δ/2 | H_{1}) using the proportion of iterations where LCL > δ/2;

5. Set Δ = 1 and repeat steps (2) and (3) 10,000 times to estimate Pr(UCL < δ/2 | H_{0}) using the proportion of times when UCL < δ/2;

6. Replace N_{0} with a larger sample size and repeat steps (2) through (5) until the estimates for both Pr(LCL > δ/2 | H_{1}) and Pr(UCL < δ/2 | H_{0}) are greater than some desired level (for example, 0.8).
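The steps above can be sketched in Python. This is an illustrative stand-in, not the authors' SAS program: to stay dependency-free it replaces the Cox regression of step 3 with the parametric exponential estimate of log Δ, whose variance 1/d₀ + 1/d₁ (d = events per group) closely approximates the Cox partial-likelihood variance for exponential data, and it uses far fewer iterations than the 10,000 in steps 4 and 5:

```python
import random
from math import log, sqrt
from statistics import NormalDist

Z975 = NormalDist().inv_cdf(0.975)

def simulate_ci(n_total, lam1, rng):
    """One simulated trial: exponential survival times (rate 1 for control,
    lam1 for treatment) and exponential censoring with rate sqrt(lam1).
    Returns a 95% CI for log(HR), estimated parametrically as
    log((d1/x1)/(d0/x0)) with variance 1/d0 + 1/d1, where d is the event
    count and x the total follow-up time per group (stand-in for Cox)."""
    lam_c = sqrt(lam1)
    counts = []
    for lam in (1.0, lam1):
        d = x = 0.0
        for _ in range(n_total // 2):
            t, c = rng.expovariate(lam), rng.expovariate(lam_c)
            d += t <= c               # event observed before censoring
            x += min(t, c)            # follow-up time contributed
        counts.append((d, x))
    (d0, x0), (d1, x1) = counts
    est = log((d1 / x1) / (d0 / x0))
    se = sqrt(1 / d0 + 1 / d1)
    return est - Z975 * se, est + Z975 * se

def prob_bounds(n_total, hr, cutoff, iters=2000, seed=7):
    """Estimate Pr(LCL > cutoff | Delta = hr) and Pr(UCL < cutoff | Delta = 1)."""
    rng = random.Random(seed)
    p_pos = sum(simulate_ci(n_total, hr, rng)[0] > cutoff
                for _ in range(iters)) / iters
    p_neg = sum(simulate_ci(n_total, 1.0, rng)[1] < cutoff
                for _ in range(iters)) / iters
    return p_pos, p_neg
```

For example, `prob_bounds(204, 1.75, log(1.75) / 2)` returns probabilities in the 0.2–0.3 range, broadly consistent with the Results for Δ = 1.75; increasing `n_total` toward roughly 4 to 5 times N₀ pushes both past 0.8.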
The above procedure was programmed in SAS 9.2, and a sample SAS program is provided in the Appendix for reference.
Results
For comparing the means of normally distributed outcomes, Figure 1 shows that when α = 0.05 and power = 0.8, Pr(LCL > kδ | H_{1}) decreases steadily from 0.8 to 0.025 while Pr(UCL < kδ | H_{0}) increases steadily from 0.025 to 0.8 as k varies from 0 to 1. In fact, these two probability functions are mirror images about k = 1/2, where they both equal 0.288. This implies that a trial designed to detect a clinically important difference δ at the 5% significance level with 80% power will be “definitive-positive” only about 29% of the time if one wants to say with 95% confidence that the treatment effect is at least δ/2.
For time-to-event data, the initial total sample size (N_{0} = 1264) for detecting a hazard ratio Δ = 1.25 is almost 5/(1 − π_{c}), or ten times, larger than that (N_{0} = 132) for detecting Δ = 2.00 according to Schoenfeld’s [14] formula. At these initial sample sizes, the estimates of Pr(LCL > 0 | H_{1}) ranged from 0.79 to 0.81 as expected, while Pr(UCL < δ | H_{0}) ranged from 0.70 to 0.77, slightly less than 0.8. Similarly, estimates of Pr(LCL > δ/2 | H_{1}) ranged from 0.27 to 0.29, close to what is expected for normally distributed data, while estimates of Pr(UCL < δ/2 | H_{0}) were slightly lower than expected, ranging from 0.23 to 0.27. For a specific example, say Δ = 1.75; then N_{0} = 204 according to (5), and the estimates of α and β are 0.0485 and 0.2044, respectively. The β estimate implies that 79.6% of the samples have LCL > 0 under H_{1}. But the mean LCL is 0.16; thus, as shown in Table 1, only 27.7% of the samples have LCL > δ/2 = log_{e}(1.75)/2 = 0.28. Correspondingly, 95.2% of the samples under H_{0} have confidence intervals that include zero, but since the mean UCL is 0.42, only 25.4% of the samples have UCL < 0.28.
Table 1 suggests that sample sizes need to be four to five times the initial sample size before estimates of both Pr(LCL > δ/2 | H_{1}) and Pr(UCL < δ/2 | H_{0}) are above 0.8. For example, with Δ = 1.75, the mean LCL for samples under H_{1} equals 0.38 when the sample size reaches 938 (4.6 times N_{0}), and 85.0% of the samples then have LCL > δ/2 = 0.28. In addition, at this sample size, the mean UCL for samples under H_{0} equals 0.19, and 80.2% of the samples have UCL < 0.28. In terms of confidence interval width, the final sample sizes yield confidence intervals that are 0.4 to 0.5 times as wide as those at the initial sample sizes. For example, with Δ = 1.75 and a final sample size of 938, the mean confidence interval widths are 0.37 and 0.39 under H_{0} and H_{1}, respectively, 0.46 times the corresponding mean widths at the initial sample size of 204.
Discussion
Many researchers realize that a traditional sample size calculation for testing H_{0}: μ_{1} − μ_{0} = 0 versus H_{1}: μ_{1} − μ_{0} ≠ 0 with α = 0.05 and 80% power to detect a clinically important difference δ implies that: 1) 95% of its 95% confidence intervals for μ_{1} − μ_{0} will include zero when H_{0} is true, and 2) 80% of the 95% confidence intervals will exclude zero when H_{1} (that is, μ_{1} − μ_{0} = δ) is true. However, a confidence interval with an LCL that is barely larger than zero may indicate a statistically significant treatment effect but be unconvincing to investigators who desire a “definitive-positive” result [13]. In contrast, a confidence interval that includes zero and demonstrates a “statistically non-significant” effect may be more convincing as a “definitive-negative” result when its UCL is small. Therefore, we propose that information on Pr(LCL > cutoff | H_{1}) and Pr(UCL < cutoff | H_{0}) be made available to assist investigators in gauging the clinical significance of the treatment effect. For example, a plot similar to Figure 1 can be provided as a supplement to the usual sample size calculation, or the investigator can directly estimate the sample size required such that the LCL and UCL are bounded by relevant cutoffs with high probability. This offers a more consistent approach since the confidence interval becomes an important component in the design of clinical trials and not solely a tool for analysis.
One question for this method concerns how a clinically relevant cutoff can be selected. Since δ, the clinically important difference, is already defined in the original sample size calculation, a convenient choice is to specify the cutoff with respect to δ. Given the uncertainty involved in quantifying δ and the tendency to inflate it [6], we set the cutoff equal to kδ for k ∈ (0,1). This bypasses the need to additionally specify a confidence interval reference width [8–10] or calculate an expected confidence interval width [11,12]. For example, δ/2 can be used as the cutoff since it gives equal consideration to the expected precision of symmetrical intervals under H_{0} and H_{1}. However, it should be stressed that there is no requirement for intervals under H_{0} and H_{1} to be given equal emphasis or for the boundaries of the LCL and UCL to be the same. A researcher may well choose different cutoffs corresponding to a “definitive-positive” and a “definitive-negative” result; for example, LCL > 3δ/4 and UCL < δ/4, or LCL > δ/3 and UCL < 2δ/3.
Previous considerations of sample size estimation by controlling statistical power and precision often involve complex calculations even for normally distributed or binary outcomes. The current proposal is pedagogically straightforward as it simply focuses on the position of the confidence limits in relation to clinically relevant boundaries. Greenland [17] designed a method that provides high power to discriminate between the parameter values under H_{0} and H_{1}: a sample size is chosen such that the discriminatory power, min{Pr(LCL > 0 | H_{1}), Pr(UCL < δ | H_{0})}, equals a specified level. Our method also focuses on the probabilities of the lower and upper confidence limits being bounded, but the boundaries differ because Greenland was concerned not with clinically important effect sizes but with the original parameter values under H_{0} and H_{1}.
The condition LCL > k_{1}δ corresponds to the alternative hypothesis for a superiority test of H_{0}: μ_{1} − μ_{0} ≤ k_{1}δ versus H_{1}: μ_{1} − μ_{0} > k_{1}δ. However, the sample size n_{1} required to attain a “definitive-positive” result differs from the sample size for the superiority test, since the former is based on a two-sided interval while the latter is one-sided. For example, with α = 0.05, β = 0.2, σ^{2} = 2, δ = 1, and k_{1} = 1/2, equations (1) and (4) imply that n_{1} = 4 × 16 = 64, while the sample size for the superiority test, as given by

\( n=\frac{{\left({Z}_{1-\alpha}+{Z}_{1-\beta}\right)}^2{\sigma}^2}{{\left(\delta -{k}_1\delta \right)}^2}, \)

equals 50. More importantly, our method calculates not only the sample size involving LCL > k_{1}δ but also that for UCL < k_{0}δ.
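The contrast between the two calculations can be checked numerically. A Python sketch (the superiority formula is the standard one-sided normal formula; the “definitive-positive” size follows the article's convention of rounding the base n up before inflating by 1/(1 − k₁)²):

```python
from statistics import NormalDist
from math import ceil

z = NormalDist().inv_cdf

def n_superiority(delta, k1, sigma2, alpha=0.05, beta=0.20):
    """Per-group n for the one-sided superiority test of
    H0: mu1 - mu0 <= k1*delta vs H1: mu1 - mu0 > k1*delta."""
    return ceil((z(1 - alpha) + z(1 - beta)) ** 2 * sigma2
                / ((1 - k1) * delta) ** 2)

def n_definitive_positive(delta, k1, sigma2, alpha=0.05, beta=0.20):
    """Per-group n such that Pr(LCL > k1*delta | H1) = 1 - beta for the
    two-sided interval: base n of equation (1), inflated by 1/(1 - k1)^2."""
    n = ceil((z(1 - alpha / 2) + z(1 - beta)) ** 2 * sigma2 / delta ** 2)
    return ceil(n / (1 - k1) ** 2)
```

With δ = 1, k₁ = 1/2, and σ² = 2, these give 50 and 64, reproducing the comparison above.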
Conclusions
In summary, our proposed method allows the researcher to calculate the sample size for a clinical trial not only according to the specifications of statistical significance (that is, α and β) but also in terms of clinical significance as judged by the boundaries of the confidence limits. For normally distributed data, simple formulae are available and their results serve as a reference for sample size planning when analyzing other types of data. For example, to ensure that the LCL and UCL are both bounded by δ/2, the sample size needs to be increased 4-fold when comparing normally distributed means. Likewise, when evaluating the hazard ratio for time-to-event data, simulation results also suggest that sample sizes need to be 4 to 5 times larger. The results of our method indicate that sample size needs to be increased, but our intention is not to mandate larger sample sizes per se. Such an effort may be futile since in practice cost constraints force clinical trials to aim for the smallest possible sample size. What is important is that researchers be informed, for example by a graph similar to Figure 1, as to how their sample size will affect judgments of clinical significance using confidence intervals. In this respect, our proposal directs attention back to the importance of gauging effect sizes using confidence intervals, and is consistent with the predicted confidence intervals Goodman and Berlin [6] advocated to help investigators better understand the idea of statistical power when calculating sample size.
Abbreviations
CONSORT: Consolidated Standards of Reporting Trials
LCL: lower confidence limit
UCL: upper confidence limit
References
Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer Treat Rep. 1985;69:1–3.
Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med. 1988;108:266–73.
Lang T. Documenting research in scientific articles: guidelines for authors. 1. Reporting research designs and activities. Chest. 2006;130:1263–8.
Rothman K. Modern epidemiology. Boston: Little Brown; 1986.
Altman DG, Schulz KF, Moher D, Egger M, Davidoff F, Elbourne D, et al. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med. 2001;134:663–94.
Goodman S, Berlin J. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200–6.
Bacchetti P. Current sample size conventions: flaws, harms, and alternatives. BMC Med. 2010;8:17.
Beal SL. Sample size determination for confidence intervals on the population mean and on the difference between two population means. Biometrics. 1989;45:969–77.
Liu XS. Implications of statistical power for confidence intervals. Br J Math Stat Psychol. 2012;65:427–37.
Jiroutek MR, Muller KE, Kupper LL, Stewart PW. A new method for choosing sample size for confidence interval-based inferences. Biometrics. 2003;59:580–90.
Cesana BM, Reina G, Marubini E. Sample size for testing a proportion in clinical trials: a ‘two-step’ procedure combining power and confidence interval expected width. Am Stat. 2001;55:288–92.
Cesana BM. Sample size for testing and estimating the difference between two paired and unpaired proportions: a ‘two-step’ procedure combining power and the probability of obtaining a precise estimate. Stat Med. 2004;23:2359–73.
Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 2. Interpreting study results: confidence intervals. Can Med Assoc J. 1995;152:169–73.
Schoenfeld DA. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983;39:499–503.
Freedman LS. Tables of the number of patients required in clinical trials using the log-rank test. Stat Med. 1982;1:121–9.
Halabi S, Bahadur S. Sample size determination for comparing several survival curves with unequal allocations. Stat Med. 2004;23:1793–815.
Greenland S. On sample-size and power calculations for studies using confidence intervals. Am J Epidemiol. 1988;128:231–7.
Acknowledgements
None. This research was not supported by any external funding sources.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HSL conceived the study, performed the analyses, and drafted the manuscript. BJ participated in the analyses and drafted the manuscript. Both authors have read and approved the final manuscript.
Appendix
Sample SAS program to estimate the total sample size for testing H_{0}: Δ = 1 versus H_{1}: Δ ≠ 1 such that Pr(LCL > δ/2 | H_{1}) = Pr(UCL < δ/2 | H_{0}) = 1 − β. Survival and censoring times are assumed to be exponentially distributed, and the overall censoring proportion equals 0.5. The initial sample size is estimated using Schoenfeld’s [14] formula for detecting δ = log_{e}(Δ) with 80% power at the 5% significance level.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Cite this article
Jia, B., Lynn, H.S. A sample size planning approach that considers both statistical significance and clinical significance. Trials 16, 213 (2015). https://doi.org/10.1186/s13063-015-0727-9