Non-inferiority test for a continuous variable with a flexible margin in an active controlled trial: an application to the “Stratall ANRS 12110 / ESTHER” trial

Background: Non-inferiority trials are becoming increasingly popular in public health and clinical research. The choice of the non-inferiority margin is the cornerstone of such trials. Most of the time, the non-inferiority margin is fixed and constant, determined from historical trials as a fraction of the effect of the reference intervention. In some circumstances, however, there may be uncertainty around the reference treatment that one would like to account for when performing the hypothesis test. In this case, the non-inferiority margin is not fixed in advance and depends on the estimate for the reference intervention; hence, the uncertainty surrounding the non-inferiority margin should be accounted for in the statistical tests. In this work, we explore how to perform the non-inferiority test for a continuous variable with a flexible margin.

Methods: We propose two procedures for the non-inferiority test with a flexible margin for continuous endpoints, based on a test statistic and on confidence intervals, respectively. Simulations were used to assess the performance and properties of the proposed test procedures. An application was carried out on real-world clinical data, to assess the efficacy of clinical monitoring alone versus laboratory and clinical monitoring in HIV-infected adult patients.

Results: For both proposed methods, the type I error estimate did not depend on the value of the reference treatment. For the test statistic approach, the type I error rate estimate was approximately equal to the nominal value. For the confidence interval approach, the confidence interval level approximately determined the level of significance: for a given nominal type I error α, the appropriate one- and two-sided confidence intervals should have levels 1 − α and 1 − 2α, respectively.
Conclusions: Based on the type I error rate and power estimates, the proposed non-inferiority hypothesis test procedures performed well and are applicable in practice.

Trial registration: ClinicalTrials.gov NCT00301561. Registered on March 13, 2006, url: https://clinicaltrials.gov/ct2/show/NCT00301561.


Background
After developing a new health intervention (treatment or diagnostic test), the next step is to assess its effectiveness relative to the existing reference intervention. There are several strategies to do this, such as superiority trials, which test whether the new treatment is superior to another (placebo, reference, or active control treatment). However, when the active control intervention achieves maximum efficacy or the use of a placebo is unethical, it becomes difficult to statistically show the superiority of the new health intervention. Studies aimed at showing that a new intervention is not worse than the active control intervention by more than a pre-specified amount of efficacy have become increasingly common over the past decade [1]. The expression "not worse than the active control intervention by more than a pre-specified amount" means that it is acceptable to lose a "little bit" of the main effect of the active control intervention in exchange for the new intervention's benefits (fewer side effects, lower cost, better tolerability, greater safety). This acceptable loss of efficacy is expressed numerically as the non-inferiority margin. A trial showing that the new intervention is non-inferior to the active control intervention is called a non-inferiority trial [1].
The Food and Drug Administration (FDA) [2] provided general principles for an appropriate choice of the non-inferiority margin. The non-inferiority margin is placed at the upper limit of the confidence interval, so the trial is designed to show evidence of no more than this "loss of maximum efficacy." Generally, this margin is fixed, determined from historical trials as a fraction of the treatment effect. However, in some cases, the mean estimate of the reference treatment may be subject to so much variation that adopting a fixed margin is not relevant. Indeed, a fixed margin cannot take into account the variability surrounding the reference treatment estimate; in this case, the margin should be a function of the reference treatment. For binary endpoints, tests that account for non-fixed margins have been studied [3][4][5]. Most work on the non-inferiority test for continuous endpoints with a fixed or linear margin has focused on the confidence interval approach [6][7][8], which mainly consists of comparing the bounds of the treatment difference to the fixed margin. However, few studies have considered a non-fixed or variable margin for continuous endpoints. This work aims at deriving non-inferiority tests for continuous endpoints with a flexible margin in active randomized controlled trials. An application of the proposed methods is done on the Stratall ANRS 12110/ESTHER trial.

Notations
The following are the definitions of the basic notations used.
• X_R and X_N are the random variables for the continuous primary endpoint in the active control group (reference) and the new intervention group (new group), respectively.
• n_R and n_N are the sample sizes for the active control group and the new group, respectively.
• μ_R and μ_N are the means of the continuous primary endpoint for the active control group and the new group, respectively.
• σ²_R and σ²_N are the variances of the continuous primary endpoint for the active control group and the new group, respectively.
• Δ_L is the non-inferiority margin, and Δ = μ_N − μ_R is the difference of the true means.
• H_0 and H_1 are the null and alternative hypotheses, respectively.

Approach using a test statistic
Without loss of generality, assume that an increase in the endpoint corresponds to more efficacy. The non-inferiority hypotheses can be formulated as follows:

H_0 : μ_N − μ_R ≤ −Δ_L versus H_1 : μ_N − μ_R > −Δ_L. (1)

The formulation of the hypothesis test in Eq. (1) shows that non-inferiority means that the new intervention is not worse than the active control intervention by more than the margin Δ_L. When Δ_L is fixed, testing the hypotheses (1) can be viewed as a classical composite hypothesis test for a mean difference [9]; therefore, based on the central limit theorem applied on the boundary of the null hypothesis, the asymptotic test statistic Z_fixed is:

Z_fixed = (X̄_N − X̄_R + Δ_L) / √(σ²_N/n_N + σ²_R/n_R).

In effect, when Δ_L is fixed, we have Var(X̄_N − X̄_R + Δ_L) = Var(X̄_N) + Var(X̄_R) = σ²_N/n_N + σ²_R/n_R. The null hypothesis is rejected if Z_fixed > z_{1−α}, where z_{1−α} is the (1 − α) percentile of the standard normal distribution. By the Karlin-Rubin theorem, this test is the uniformly most powerful test of level α [10].
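As an illustration, the fixed-margin decision rule can be written in a few lines. Python is used for the sketch (the authors' own computations used R), and the function name and summary statistics below are purely hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_fixed(xbar_n, xbar_r, var_n, var_r, n_n, n_r, margin):
    """Fixed-margin non-inferiority statistic:
    Z = (Xbar_N - Xbar_R + Delta_L) / sqrt(sigma2_N/n_N + sigma2_R/n_R)."""
    return (xbar_n - xbar_r + margin) / sqrt(var_n / n_n + var_r / n_r)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)  # z_{1-alpha}

# Hypothetical summary statistics; H0 (inferiority) is rejected if Z > z_{1-alpha}.
z = z_fixed(xbar_n=135.0, xbar_r=140.0, var_n=130.0**2, var_r=130.0**2,
            n_n=200, n_r=200, margin=35.0)
print(z, z > z_crit)
```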
If Δ_L is not fixed, i.e., if Δ_L is a function of μ_R, then Var{X̄_N − X̄_R + Δ_L(X̄_R)} ≠ Var(X̄_N) + Var(X̄_R), and therefore Var(X̄_N) + Var(X̄_R) is not a valid variance for X̄_N − X̄_R + Δ_L(X̄_R). Under the assumption that Δ_L is a continuously differentiable function, the variance is estimated using the delta method, as discussed below.

Variance estimation using delta method
If Δ_L(·) is continuously differentiable with Δ′_L(μ_R) ≠ 0 (Δ′_L is the first derivative of Δ_L), then, using a first-order Taylor series in a neighborhood of μ_R,

Δ_L(X̄_R) ≈ Δ_L(μ_R) + Δ′_L(μ_R)(X̄_R − μ_R).

Hence,

Var{X̄_N − X̄_R + Δ_L(X̄_R)} ≈ Var(X̄_N) + (1 − Δ′_L(μ_R))² Var(X̄_R).

Thus, the variance estimate is σ²_N/n_N + (1 − Δ′_L(μ_R))² σ²_R/n_R, and the test statistic can then be expressed as:

Z_flexible = (X̄_N − X̄_R + Δ_L(X̄_R)) / √(σ²_N/n_N + (1 − Δ′_L(μ_R))² σ²_R/n_R). (6)
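A minimal Python sketch of the flexible-margin statistic, assuming known variances and using the delta-method variance above; the function name z_flexible and the example inputs are illustrative, and the margin shown is the one used later in the paper's simulations, Δ_L(μ) = μ^{1/4}.

```python
from math import sqrt

def z_flexible(xbar_n, xbar_r, var_n, var_r, n_n, n_r, margin, dmargin):
    """Flexible-margin statistic of Eq. (6): the margin Delta_L and its
    derivative are evaluated at Xbar_R (the plug-in estimate of mu_R),
    and the delta-method variance replaces var_N/n_N + var_R/n_R."""
    var_hat = var_n / n_n + (1.0 - dmargin(xbar_r)) ** 2 * var_r / n_r
    return (xbar_n - xbar_r + margin(xbar_r)) / sqrt(var_hat)

# Margin used in the paper's simulations: Delta_L(mu) = mu**(1/4)
margin = lambda mu: mu ** 0.25
dmargin = lambda mu: 0.25 * mu ** (-0.75)

z = z_flexible(xbar_n=98.0, xbar_r=100.0, var_n=25.0, var_r=25.0,
               n_n=100, n_r=100, margin=margin, dmargin=dmargin)
print(z)
```

With a constant margin (zero derivative), the statistic reduces exactly to the fixed-margin form.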

Asymptotic properties of the test statistic Z_flexible
From the central limit theorem, as n_N and n_R approach infinity, the random variable Z_flexible ∼ N(0, 1) on the boundary of the null hypothesis. In practice, μ_R is unknown, and σ²_R and σ²_N may also be unknown, so they need to be estimated. We used maximum likelihood estimation on the boundary of the null hypothesis (μ_N = μ_R − Δ_L(μ_R)). The unknown parameters are estimated considering the cases where the variances σ²_R and σ²_N are known, unknown, equal, or unequal. The maximum likelihood (ML) estimators μ̂_R, σ̂²_R, and σ̂²_N of μ_R, σ²_R, and σ²_N, respectively, are consistent. Moreover, since Δ_L is assumed continuous, Δ_L(μ̂_R) is a consistent estimator of Δ_L(μ_R). The estimator Ẑ_flexible of the test statistic Z_flexible is obtained by replacing the unknown parameters in (6) by their ML estimators. Therefore, for the test of H_0 versus H_1 (where H_0 is the boundary hypothesis), H_0 is rejected when Ẑ_flexible > z_{1−α}, where α is the nominal type I error and z_{1−α} denotes the 1 − α percentile of the standard normal distribution. The significance level of this test tends to α as n_N and n_R approach infinity.

Assume that, under the alternative hypothesis, v = μ_N − μ_R + Δ_L(μ_R) > 0, and denote by σ² the common variance of the two groups (σ² = σ²_N = σ²_R) and by δ = v/σ the standardized difference. Hence, if η is the power of the test, the power, given as a function of δ, n_N, n_R, and α, is:

η = Φ( δ / √(1/n_N + (1 − Δ′_L(μ_R))²/n_R) − z_{1−α} ), (8)

where Φ is the cumulative distribution function of the standard normal distribution. For a fixed nominal type I error α, and for any fixed μ_R and μ_N such that v = μ_N − μ_R + Δ_L(μ_R) > 0, when n_R → ∞ and n_N → ∞, it follows that η → 1. Therefore, the test Z_flexible is asymptotically convergent. From Eq. (8), it is possible to find the sample size that achieves a fixed nominal power. Denoting the nominal type II error by β and assuming that n_N = rn_R with r > 0, the sample size that achieves the nominal power 1 − β is such that:

n_R = (z_{1−α} + z_{1−β})² (1/r + (1 − Δ′_L(μ_R))²) / δ².

This formula is equivalent to the one found in [9] when the margin is fixed. Practically, δ is the analogue of the standardized difference in the comparison of means, and in this work it is called the standardized non-inferiority difference. In power and sample size calculations, one fixes δ (for example, δ = 0.05 or δ = 0.5 to detect small or large inferiority differences, respectively), and μ_R can be pre-specified from historical studies with a similar treatment.
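The power and sample-size formulas can be checked numerically. This Python sketch (the paper's own computations were in R) assumes a common variance in both arms; the parameter dmu stands for the derivative Δ′_L(μ_R), and the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

_nd = NormalDist()

def power(delta, n_n, n_r, dmu, alpha=0.05):
    """Asymptotic power (Eq. 8): delta = v/sigma is the standardized
    non-inferiority difference and dmu = Delta_L'(mu_R); a common
    variance sigma^2 in both arms is assumed."""
    se = sqrt(1.0 / n_n + (1.0 - dmu) ** 2 / n_r)
    return _nd.cdf(delta / se - _nd.inv_cdf(1 - alpha))

def sample_size(delta, dmu, r=1.0, alpha=0.05, beta=0.2):
    """Smallest n_R (with n_N = r * n_R) achieving power 1 - beta."""
    z = _nd.inv_cdf(1 - alpha) + _nd.inv_cdf(1 - beta)
    return ceil((z / delta) ** 2 * (1.0 / r + (1.0 - dmu) ** 2))

# With a fixed margin (dmu = 0) and delta = 0.5, about 50 subjects per arm
# are needed for 80% power at alpha = 5%.
print(sample_size(delta=0.5, dmu=0.0))
```

Note that with dmu = 0 and r = 1 this reduces to the classical two-sample formula n_R = 2(z_{1−α} + z_{1−β})²/δ².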
The proposed test statistic Ẑ_flexible is asymptotic; it works well for large sample sizes but is not suited to datasets with small sample sizes, which are not uncommon in practical situations. In such cases, a non-parametric test based on the percentile bootstrap confidence interval, which requires no assumptions on the sample size or the sample distribution, can be used [11].

Approach based on confidence intervals
For any test based on confidence intervals, the main interest is the confidence interval level required to achieve a desired nominal type I error. Moreover, as discussed in [9] and [12], the type I error is a controversial issue in clinical trial tests. In the framework of non-inferiority tests with a fixed margin, [13] recommended using 1 − α and 1 − α/2 for two-sided and one-sided confidence interval levels, respectively, while [7] recommended 1 − 2α for two-sided and 1 − α for one-sided confidence intervals. In [7], it is argued that the recommendation of [13] leads to a conservative test, as the estimated type I error rate would be half the nominal one; moreover, it has been argued that there would be approximately a 10% loss of power. In this section, we propose a non-parametric procedure for constructing the confidence interval (one-sided and two-sided) when the non-inferiority margin is flexible.
An intuitive confidence-interval-based procedure for the hypothesis test in Eq. (1) would be to check whether the confidence intervals of μ_N − μ_R and −Δ_L(μ_R) overlap: the null hypothesis would be rejected if the two confidence intervals do not overlap, and not rejected otherwise. However, as illustrated in [14], the intervals may overlap while the statistics are nevertheless significantly different; thus, the power of such a test would be lower. The proposed procedure instead compares the lower bound of the confidence interval (one- or two-sided, respectively) with γ% level for μ_N − μ_R + Δ_L(μ_R) with 0. The null hypothesis H_0 is rejected if this lower bound is greater than 0. Estimation of the type I error is performed using simulations and non-parametric estimation of confidence intervals on the boundary of the null hypothesis. The detailed steps are described below.
1. Given μ_R, simulate m pairs of samples X_N and X_R of respective sizes n_N and n_R from the respective normal distributions, with μ_N = μ_R − Δ_L(μ_R) (the boundary of the null hypothesis).
2. For each pair i ∈ {1, ..., m}, draw B bootstrap samples with replacement and compute the bootstrap distribution of X̄_N − X̄_R + Δ_L(X̄_R).
3. Let a_i be the lower bound of the percentile confidence interval (one- or two-sided) with γ% level obtained from the bootstrap distribution of pair i.
4. For i ∈ {1, ..., m}, H_0 is rejected when a_i > 0; thus, the level of significance is estimated by the proportion of rejections, #{i : a_i > 0}/m.

Like any other power estimation, the data are drawn under the alternative hypothesis, that is, μ_N > μ_R − Δ_L(μ_R). Since there is a wide range of possibilities under the alternative hypothesis, in practice one considers the equivalence point, that is, μ_R = μ_N. Therefore, similarly to the studies of [5] and [15], the equivalence point (μ_R = μ_N) is used for drawing the data in the power estimation.
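The simulation steps above can be sketched as follows. This is a scaled-down Python illustration (m and B far below the 10000 and 1000 used in the paper, which used R) with hypothetical inputs, shown for the one-sided percentile interval.

```python
import random

def type1_error(mu_r, margin, n_n, n_r, sd_n, sd_r,
                m=200, B=200, gamma=0.95, seed=1):
    """Monte-Carlo estimate of the type I error of the one-sided
    percentile-bootstrap test, with data drawn on the boundary
    mu_N = mu_R - Delta_L(mu_R)."""
    rng = random.Random(seed)
    mu_n = mu_r - margin(mu_r)
    rejections = 0
    for _ in range(m):
        x_n = [rng.gauss(mu_n, sd_n) for _ in range(n_n)]
        x_r = [rng.gauss(mu_r, sd_r) for _ in range(n_r)]
        boot = []
        for _ in range(B):
            bn = [rng.choice(x_n) for _ in range(n_n)]
            br = [rng.choice(x_r) for _ in range(n_r)]
            mbr = sum(br) / n_r
            boot.append(sum(bn) / n_n - mbr + margin(mbr))
        boot.sort()
        a_i = boot[int((1.0 - gamma) * B)]  # lower bound of the one-sided CI
        if a_i > 0:  # a_i > 0 -> reject H0
            rejections += 1
    return rejections / m

est = type1_error(mu_r=100.0, margin=lambda mu: mu ** 0.25,
                  n_n=30, n_r=30, sd_n=5.0, sd_r=5.0)
print(est)
```

With a 95% one-sided interval, the estimate should land near the nominal 0.05, as the simulation results below indicate.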

Performances assessment
Simulations were done to evaluate the finite-sample performance of the asymptotic test and of the confidence-interval-based test. The performance indicators used were the type I error and the statistical power, estimated using Monte-Carlo simulation techniques. In the simulations, we considered the margin Δ_L(μ_R) = μ_R^{1/4} and unknown variances σ²_R and σ²_N. Both indicators were computed for the two proposed tests as functions of the reference treatment. For the type I error, data were drawn on the boundary of the null hypothesis: for a given μ_R, μ_N was obtained such that μ_N = μ_R − Δ_L(μ_R). For the power, data were drawn under the alternative hypothesis: for a given μ_R, μ_N was obtained such that μ_N > μ_R − Δ_L(μ_R); usually, one takes μ_N = μ_R. In all cases, μ_R was assumed to vary in [1, 1000]. For the test based on the statistic, the power was estimated using formula (8), and the two cases δ = 0.05 and δ = 0.5 were considered.
In the approach based on the asymptotic test, the nominal type I error was fixed at α = 5%. For the confidence-interval-based test, we considered 95% one- and two-sided confidence interval levels; the purpose was to estimate the type I error rate for the respective confidence intervals. In all simulations, we considered balanced sample sizes (n = n_N = n_R), with n = 30, 100, and 1000 for small, medium, and large sample sizes, respectively. The number of bootstrap samples with replacement was B = 1000, and the number of simulation replications was m = 10000. The R software [16] was used to conduct the simulations, and the code is available in a separate file on request.

Application to the Stratall ANRS 12110 / ESTHER trial
This study was motivated by the randomized non-inferiority "Stratall ANRS 12110 / ESTHER" trial [17]. The main purpose was to assess an exclusively clinical monitoring strategy compared with a clinical monitoring strategy plus laboratory monitoring, in terms of effectiveness and safety, in HIV-infected patients in Cameroon. The idea was to achieve the scaling-up of HIV care in rural districts, where most people living with HIV reside but local health facilities generally have low-grade equipment. A total of 459 HIV-infected patients were included in the study and randomly allocated to two groups, one receiving exclusively clinical monitoring (intervention group, N = 238) and the other receiving laboratory and clinical monitoring (active control group (reference), N = 221). All included patients initiated antiretroviral treatment and were followed up for 24 months. Clinical monitoring alone was compared to laboratory and clinical monitoring in a non-inferiority design. The continuous primary endpoint was the increase in CD4 cell count from treatment initiation to the twenty-fourth month. Based on previous studies, the non-inferiority margin Δ_L(μ_R) was pre-specified as a linear function (25%) of the mean CD4 cell increase μ_R after 24 months of antiretroviral treatment in the laboratory and clinical monitoring group: Δ_L(μ_R) = (25/100) μ_R. Unlike other non-inferiority studies [18, 19], the non-inferiority margin in this study was variable (depending on the mean increase in CD4 in the active control (reference) group). However, the classical two-sided confidence-interval-based test with a 90% level was used to obtain a type I error (α) close to 5% [17]. Indeed, statistical test procedures for the non-inferiority test for continuous data with variable margins were not available at the time of the original paper [17]. Moreover, as discussed in [12], the relationship between the confidence interval level and the type I error can be controversial.

[Fig. 2: Power estimates as a function of the reference treatment for the test-statistic-based test (standardized non-inferiority difference δ = 0.05); from left to right, sample sizes are n_N = n_R = 20, 100, and 1000.]
[Figure: Type I error rate estimates as a function of the reference treatment for the 95% one-sided confidence-interval-level-based test; from left to right, sample sizes are n_N = n_R = 20, 100, and 1000.]
More details about the background of the study and the clinical trial process can be found in [17]. Two analyses were done according to the type of data:
1. Firstly, the increase in CD4 cell count at 24 months from baseline was considered, which implies that patients missing or lost before the end of the follow-up period were excluded from the analysis. In that case, the total number of patients in the analysis was reduced to n = 334, with n_R = 169 and n_N = 165. "Observed data" will refer to the case where data are analyzed by excluding participants with missing observations at 24 months.
2. Secondly, an analysis was done with all participants who attended at least one follow-up visit, and the last observation carried forward (LOCF) imputation method was applied to participants whose CD4 data were missing at 24 months (in this case, the number of patients analyzed is the same as at baseline: n = 459, n_R = 238, n_N = 221).
The classical parametric two-sided confidence-interval-based test with a 90% level was used in [17] to perform the non-inferiority test. The final result was that clinical monitoring alone (CLIN) was inferior to laboratory plus clinical monitoring (LAB).

Test statistic based test
The results for the approach based on a statistic are summarized in Fig. 1 for the type I error rate and in Figs. 2 and 3 for the power estimates. Whatever the sample size, the type I error rate estimates were constant and did not depend on μ_R. For small sample sizes, the type I error rate estimate was slightly above the nominal value, with a median estimate of 0.053 and an interquartile range (IQR) of [0.051, 0.054]. As the sample size increases, the type I error estimates get closer to the nominal value. In effect, for a medium sample size of n = 100, the type I error estimate was close to the nominal value, with a median estimate over μ_R of 0.051 (IQR = [0.050, 0.052]). For large sample sizes, for example n = 1000, the type I error estimate was more accurate and closer to the nominal value, with a median estimate of 0.050 (IQR = [0.050, 0.050]). The power estimates, summarized in Figs. 2 and 3, were also not μ_R-dependent. As expected, the power increased with the sample sizes for a fixed standardized non-inferiority difference δ, and larger values of δ led to higher power estimates for a fixed sample size.

Confidence interval based test
The results for the approach based on confidence intervals are summarized in Figs. 4, 5, 6, and 7. For both 95% one- and two-sided confidence interval levels, the estimated type I error rates remained around 0.05 and 0.025, respectively, and became more concentrated around those values as the sample sizes got larger. Hence, for a given nominal type I error α, the suitable confidence interval levels would be 1 − α and 1 − 2α for one- and two-sided confidence intervals, respectively. The power (at the equivalence point, μ_R = μ_N) increases with the sample sizes, but convergence to 1 seemed to require very large sample sizes, which was not the case for the test-statistic-based method. Therefore, in terms of power, the approach based on the test statistic would perform better than the confidence-interval-based approach.

The Stratall ANRS 12110 / ESTHER trial
The proposed methods were also applied to the Stratall ANRS 12110 / ESTHER trial, based on the Observed and LOCF data, with the linear margin Δ_L(μ_R) = (25/100) μ_R. The results for the approach based on the test statistic are summarized in Table 1. The p-value was calculated based on the test statistic in Eq. (6). The statistical power was computed using Eq. (8) with the same inputs as in [17], namely μ_N = μ_R = 140 and σ_N = σ_R = 130. For the Observed data, the p-value estimate was 0.02, and the null hypothesis that CLIN was inferior to LAB was rejected at the 0.05 level. On the other hand, for the LOCF data, the p-value was 0.09, and the null hypothesis that CLIN was inferior to LAB was not rejected at the 0.05 level.

[Fig. 6: Type I error rate estimates as a function of the reference treatment for the 95% two-sided confidence-interval-level-based test; from left to right, sample sizes are n_N = n_R = 20, 100, and 1000.]
[Figure: Power estimates as a function of the reference treatment for the 95% two-sided confidence-interval-level-based test; from left to right, sample sizes are n_N = n_R = 20, 100, and 1000.]
For the confidence-interval-based approach, the test was performed considering both the one- and two-sided confidence interval levels. The results are presented in Table 2. The null hypothesis that CLIN was inferior to LAB was not rejected for any of the confidence intervals when using the "LOCF data". On the other hand, when using the "Observed data", the null hypothesis of inferiority was rejected.
The two proposed methods produced consistent results on the Stratall ANRS 12110 / ESTHER trial. Moreover, based on the LOCF data, the results obtained are in line with those in [17]: clinical monitoring alone was inferior to laboratory plus clinical monitoring.

Discussion
In this study, we have proposed two non-inferiority test approaches for continuous endpoints with flexible margins: a test based on a test statistic and a confidence-interval-based test. The confidence interval approach is the more widely used in the literature and is recommended by the international guideline [2]. For the non-inferiority test with continuous endpoints and a fixed margin, studies such as [7] and [12] examined the confidence interval approach, which does not allow for explicit sample size calculation. Comparatively, our proposed test based on a statistic allows an explicit sample size and power formula.
The simulation results for the confidence-interval-based test showed that the confidence interval level approximately determined the type I error rate. The test with 95% one- and two-sided confidence interval levels led to type I errors of approximately 0.05 and 0.025, respectively. Therefore, for a given nominal type I error α = 0.05, the confidence-interval-based test would be performed with one- or two-sided confidence intervals of levels 1 − α or 1 − 2α, respectively; these findings are consistent with those in [7]. The non-inferiority hypothesis test is a one-tailed test, so when the testing procedure is performed with the classical nominal type I error α, the actual type I error would be α/2. Therefore, for a given desired nominal type I error, to avoid the conservativeness of the test, the test should be performed with twice this nominal error. However, the debate on whether one- or two-sided confidence intervals should be used in non-inferiority trials remains open, as discussed in [20]. The most important output of this study is that the type I error did not vary with the value of the reference treatment, either for the test based on a statistic or for the test based on confidence intervals. This suggests that the variability and uncertainty around the margin were accounted for without affecting the properties of the proposed tests. The proposed methods in this study can therefore be viewed as a generalization of the case where the non-inferiority margin is fixed for continuous endpoints.

Conclusions
In an active controlled non-inferiority trial, the non-inferiority margin should be a function of the reference treatment to account for the uncertainty surrounding the mean estimate of the reference treatment. This paper provides a framework for performing the non-inferiority hypothesis test with a flexible margin. Based on the type I error rate and power estimates, the proposed non-inferiority hypothesis test procedures have good performance and are applicable in practice, as illustrated by a practical application on clinical data.