The implications of outcome truncation in reproductive medicine RCTs: a simulation platform for trialists and simulation study

Background Randomised controlled trials in reproductive medicine are often subject to outcome truncation, where the study outcomes are only defined in a subset of the randomised cohort. Examples include birthweight (measurable only in the subgroup of participants who give birth) and miscarriage (which can only occur in participants who become pregnant). These outcomes are typically analysed by making a comparison between treatment arms within the subgroup (for example, comparing birthweights in the subgroup who gave birth or miscarriages in the subgroup who became pregnant). However, this approach does not represent a randomised comparison when treatment influences the probability of being observed (i.e. survival). The practical implications of this for the design and interpretation of reproductive trials are unclear however. Methods We developed a simulation platform to investigate the implications of outcome truncation for reproductive medicine trials. We used this to perform a simulation study, in which we considered the bias, type 1 error, coverage, and precision of standard statistical analyses for truncated continuous and binary outcomes. Simulation settings were informed by published assisted reproduction trials. Results Increasing treatment effect on the intermediate variable, strength of confounding between the intermediate and outcome variables, and the presence of an interaction between treatment and confounder were found to adversely affect performance. However, within parameter ranges we would consider to be more realistic, the adverse effects were generally not drastic. For binary outcomes, the study highlighted that outcome truncation could cause separation in smaller studies, where none or all of the participants in a study arm experience the outcome event. This was found to have severe consequences for inferences. 
Conclusion We have provided a simulation platform that can be used by researchers in the design and interpretation of reproductive medicine trials subject to outcome truncation and have used this to conduct a simulation study. The study highlights several key factors which trialists in the field should consider carefully to protect against erroneous inferences. Standard analyses of truncated binary outcomes in small studies may be highly biassed, and it remains to identify suitable approaches for analysing data in this context. Supplementary Information The online version contains supplementary material available at 10.1186/s13063-021-05482-4.


Background
Outcome data are usually unavailable for some participants in a randomised controlled trial (RCT). Most frequently, this is due to the loss to follow-up or patient withdrawal from the study. However, in many reproductive medicine trials, the availability of a participant's outcome data depends on their status in relation to an intermediate response variable. For example, trials of assisted reproductive technologies (ART) are generally conducted in individuals trying to become pregnant and have babies. In these trials, pregnancy outcomes such as miscarriage (occurring only in the subset of women who become pregnant) and infant outcomes such as birthweight (measurable only in participants who have births) are often of interest. These outcomes cannot be collected in all participants, even if there is no loss to follow-up, as they are not observable for everyone in the cohort. This phenomenon has been described as 'truncation (or censoring) due to death' [1], because it often arises in studies where mortality precludes measurement of the outcome variable [2]. However, since this form of missing data also occurs in populations where mortality is not a material concern, we use the more general term 'outcome truncation'.
In reproductive medicine trials, outcome data subject to truncation are frequently analysed by comparing study arms in the subset of participants who were not truncated. This is typically done by calculating standard measures of treatment effect (such as an unadjusted mean difference or odds ratio) and performing standard statistical tests (such as a t-test or chi-squared test). These approaches would be valid if the treatment had no effect on the intermediate (censoring) variable. Otherwise, various authors have pointed out that standard analyses of truncated outcome data are subject to a form of selection bias, whereby selecting on intermediate outcomes breaks randomisation and therefore biases treatment effect estimates [2][3][4][5][6][7]. Figure 1 shows the conditions under which selection bias due to outcome truncation will, in principle, arise. Outcome truncation also reduces the sample size compared to the entire randomised cohort, which is anticipated to impact the precision of the effect estimate [4,8]. Such loss in precision would need to be accounted for during the study design stage to ensure adequately powered studies are pursued.
Some authors have nonetheless argued that a comparison of outcomes in the observable study participants is the correct analysis, since this captures the effect in the only group of relevance: those who are at risk [9]. This argument misses that, to the extent that the observed difference is caused by selection-induced confounding rather than by a causal effect of treatment, the transportability [10] of the estimate will be restricted to populations with the same distribution of confounders. As a result, there is a concern that the standard approaches for analysing truncated outcomes might be misleading. Crucially, important findings in reproductive medicine hinge on analyses of this sort. For example, a recent RCT found that the choice of embryo culture medium used in IVF affected the birthweight of babies born from the treatment [11]. However, if the culture media affect conception or miscarriage rates differently, then the mean observed birthweights correspond to two different populations, which may differ with respect to exposures such as smoking. This might be problematic when applying these trial findings to other populations for which this selection does not exist or differs; for example, if an adjuvant therapy improves conception rates for all subjects, observed differences between media might no longer be applicable. Further examples can be found in systematic reviews published in Cochrane Gynaecology and Fertility, which sometimes report miscarriage rates for trials in which individuals were randomised prior to conception, using the number of women who became pregnant as the denominator [12][13][14].
Outcome truncation due to pregnancy loss has been studied in the context of harmful exposures in pregnancy and long-term outcomes of children, using simulation [3,8] and heuristic argument [15]. The impact on the study of congenital abnormalities has also recently been considered, using analysis of observational data [16]. Although these examples are informative, they are tailored to the investigation of epidemiological questions, and the scenarios they describe are likely to be less relevant for trialists working in ART, since various key parameters (magnitude of intervention effects, event rates, strength of confounding) materially differ in the latter compared to the former context. Moreover, while the relevance of treatment-confounder interactions to outcome truncation has been described [15], their importance has not been empirically evaluated in existing simulation studies. As a result, it is not currently clear whether outcome truncation substantively affects the findings of ART RCTs or their clinical interpretations. A greater understanding of the consequences of outcome truncation would assist in the design of ART RCTs and the reinterpretation of published trials where outcomes were compared in the uncensored subgroup. Additionally, a characterisation of outcome truncation would be useful for researchers developing analytic methods in this area.
To address this need, we developed a simulation platform in R which reproductive medicine trialists can use to inform study design and interpretation in the presence of outcome truncation. We used this platform to investigate the impact of outcome truncation on typical statistical analyses used in assisted reproduction RCTs. We investigated both continuous (e.g. birthweight) and binary (e.g. miscarriage) outcomes subject to truncation, using plausible ranges of parameter values informed by published ART studies.

Simulation study
We developed a simulation platform in R to investigate outcome truncation in two-arm trials where treatment is administered on a single occasion (as opposed to an ongoing regimen), an intervening selection event occurs (e.g. conception or live birth), and the study outcome is measured at a single point in time. This reflects the situation found in many reproductive medicine trials. We then used this to conduct a simulation study. The primary aim of this study was to characterise outcome truncation in relation to bias, coverage, and type 1 error of standard analyses, in realistic scenarios corresponding to ART RCTs. Code to reproduce the study, or conduct novel investigations of outcome truncation, is available at https://osf.io/gzqbr/.
We evaluated both a binary and a continuous outcome and, in the core study, considered two sets of simulations for each. In set 1, we considered a simple, additive data-generating process, without interaction terms (as depicted in Fig. 1). In set 2, we considered the impact of including an interaction between the treatment effect on the intermediate and an unmeasured confounder in the data generating process. Parameter values were informed by several ART RCTs [11,17] and a recent review of power, precision, and sample size in reproductive medicine studies [18]. Table 1 summarises the simulation parameters.

Set 1 Simulation of intermediate response
For both continuous and binary outcomes, let $i = 1, \ldots, n$ index the $i$th participant, with treatment allocation $R_i = 1$ (treatment) or $0$ (control) and intermediate response variable $S_i = 0$ or $1$. We simulated trials with $n$ participants, with $n$ taking the values of 100, 200, 500, and 1000, divided equally between two study arms. Each patient's probability of having a positive intermediate response was simulated using a logistic model, $\log(\pi_i / (1 - \pi_i)) = \log(0.2) + \alpha_R R_i + \alpha_U u_i$, and the response was then drawn from a Bernoulli($\pi_i$) distribution. The intercept value was selected to correspond to a control group event rate of 17% (odds of 0.2), which represents a typical live birth rate in an IVF RCT with an unselected population. The treatment effect on the intermediate response variable took values ranging from no effect ($\exp(\alpha_R) = 1$) to very large ($\exp(\alpha_R) = 2$), with the odds ratio (OR) increasing in increments of 0.05. Based on recent work which looked at the estimated effect sizes in reproductive medicine meta-analyses [18], we would consider odds ratios larger than 1.2 to be reasonably exceptional. However, we included values larger than this in the simulation study in order to characterise the phenomenon further, up to a value of $\exp(\alpha_R) = 5$, which corresponds to an implausibly high (in a trial context) effect of treatment on conception or survival probability. This extreme setting was included to provide intuition regarding a plausible upper bound to problems caused by outcome truncation. In particular, we reasoned that if the impact was negligible even in such an extreme scenario, then this would provide some reassurance in relation to real studies.

Fig. 1 […] pregnancy, live birth), $U$ = uncontrolled or unmeasured baseline variables, and $Y$ = study outcome. Arrows depict causal relationships. For parsimony of presentation, we label the paths with the corresponding parameters from the data generating model used in a simulation study. A box drawn around $S = 1$ indicates that we are conditioning on the intermediate variable: the analysis set includes only participants with $S = 1$. Bias will occur whenever $\alpha_R$, $\alpha_U$, and $\beta_U$ are all nonzero.
The patient-specific variable $u_i$ represents the prognostic baseline characteristics influencing both the intermediate response variable and the study outcome variable $Y_i$ and was drawn from a normal(0, 1) distribution. We set $\exp(\alpha_U)$ equal to 0.8, such that higher values of the prognostic index resulted in reduced probability of the intermediate response and therefore reduced probability of having the outcome observed.
In this context, the magnitude of the effect on the intermediate response variable determines the size and degree of size imbalance of the uncensored cohort, and this was anticipated to affect the performance of standard analyses. We purposefully did not correct this since this is part of the phenomenon under study.
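The data-generating step for the intermediate response can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R platform; the function name `simulate_intermediate`, its argument names, and the alternating allocation used to obtain equal arms are assumptions made for the example.

```python
import math
import random


def simulate_intermediate(n, or_treatment, rng, or_confounder=0.8, baseline_odds=0.2):
    """Simulate (R_i, u_i, S_i) for one trial of n participants.

    or_treatment  = exp(alpha_R), treatment effect on the intermediate response
    or_confounder = exp(alpha_U), effect of the unmeasured confounder u_i
    baseline_odds = exp(intercept); 0.2 gives a control event rate of about 17%
    """
    alpha_r = math.log(or_treatment)
    alpha_u = math.log(or_confounder)
    participants = []
    for i in range(n):
        r = i % 2                      # assumed: alternate arms so both arms are equal in size
        u = rng.gauss(0.0, 1.0)        # unmeasured prognostic index, normal(0, 1)
        logit = math.log(baseline_odds) + alpha_r * r + alpha_u * u
        pi = 1.0 / (1.0 + math.exp(-logit))
        s = 1 if rng.random() < pi else 0   # Bernoulli(pi_i) draw
        participants.append((r, u, s))
    return participants


if __name__ == "__main__":
    rng = random.Random(2021)
    trial = simulate_intermediate(1000, or_treatment=1.2, rng=rng)
    for arm in (0, 1):
        observed = sum(s for r, _, s in trial if r == arm)
        print(f"arm {arm}: {observed} of 500 participants have S = 1")
```

Running the example with different values of `or_treatment` shows how the size of the uncensored cohort, and the imbalance between arms, changes with the treatment effect on the intermediate response.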

Continuous outcomes
For the study of continuous outcomes, we simulated $Y_i$ from a normal($\mu_i$, $580^2$) distribution, with $\mu_i = 3300 + \beta_R R_i + \beta_U u_i$. The values for the standard deviation (SD) and intercept were based on the point estimates for the SD and the mean for birthweight (in grammes) in a recent trial of embryo culture media [11]. We set $\beta_U$ to $-116$, corresponding to $-0.2$ standard deviations in the outcome. This represents lower outcome values (e.g. reduced birthweight) for participants with higher values of the confounder $u_i$. The coefficient $\beta_R$ corresponds to the effect of treatment allocation on the outcome, excluding any selection effects arising due to a treatment effect on the intermediate response. We considered values for $\beta_R$ ranging from 0 to 2 SDs in increments of 0.1 SDs, with an additional setting of 5 SDs representing an extreme test case. The outcome measurements for the uncensored cohort were then selected by excluding participants who had $S_i = 0$.
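As an illustration of the continuous outcome model and the standard analysis restricted to the uncensored cohort, the following Python sketch generates one trial and returns the unadjusted difference in means among participants with S = 1. It is a simplified stand-in for the authors' R code; the function name `naive_mean_difference` and the alternating allocation are assumptions, and no test or confidence interval is computed here.

```python
import math
import random
import statistics


def naive_mean_difference(n, or_treatment, beta_r, rng,
                          or_confounder=0.8, beta_u=-116.0, sd=580.0):
    """Difference in mean outcome (treatment minus control) among participants with S = 1.

    Returns None if either arm has no observed outcomes.
    """
    alpha_r = math.log(or_treatment)
    alpha_u = math.log(or_confounder)
    observed = {0: [], 1: []}
    for i in range(n):
        r = i % 2                                   # assumed equal allocation
        u = rng.gauss(0.0, 1.0)                     # unmeasured confounder
        logit = math.log(0.2) + alpha_r * r + alpha_u * u
        pi = 1.0 / (1.0 + math.exp(-logit))
        if rng.random() < pi:                       # S_i = 1: outcome is observable
            mu = 3300.0 + beta_r * r + beta_u * u
            observed[r].append(rng.gauss(mu, sd))
        # S_i = 0: outcome truncated, participant excluded from the analysis set
    if not observed[0] or not observed[1]:
        return None
    return statistics.fmean(observed[1]) - statistics.fmean(observed[0])


if __name__ == "__main__":
    rng = random.Random(7)
    # True effect of 0.1 SD (58 g); treatment OR of 1.2 on the intermediate.
    print(naive_mean_difference(1000, 1.2, 58.0, rng))
```

Repeating the call many times for a fixed scenario and averaging the estimates gives a Monte Carlo estimate that can be compared with the value of `beta_r` used to generate the data.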

Binary outcomes
For the study of binary outcomes, we simulated $p_i = \Pr(Y_i = 1)$ using a logistic model, $\log(p_i / (1 - p_i)) = \log(0.1) + \beta_R R_i + \beta_U u_i$, and then drew $Y_i$ from a Bernoulli($p_i$) distribution. We set $\exp(\beta_U) = 1.2$, so that a higher value of the confounder $u_i$ corresponded to an increased chance of having the outcome. Recall that we set increasing values of the prognostic index to result in a lower probability of the intermediate response occurring; this scenario was chosen to reflect the case where the intermediate response is pregnancy, and the outcome is an adverse pregnancy outcome, such as miscarriage. There may be patient characteristics which make pregnancy less likely, while also reducing the chance that the pregnancy will be carried to term (meaning that miscarriage occurs). The intercept corresponds to an event rate of 9%. The treatment effect $\beta_R$ took values ranging from no effect ($\exp(\beta_R) = 1$) to very large ($\exp(\beta_R) = 2$), increasing in increments of 0.05 on the OR scale. A value of 5 was included as a test case. Once again, the outcome measurements for the uncensored cohort were then selected by excluding participants who had $S_i = 0$.

Table 1 (fragment). Set 2 (differences from set 1), generation of intermediate variable: interaction between treatment and unmeasured confounding, $\alpha_{RU} = \log_e(0.8)$.
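The binary outcome model can be sketched in the same way. The Python example below (standard library only) generates one trial under the set 1 parameter values and tabulates events and non-events per arm among participants with S = 1. The function names `expit` and `simulate_binary_table` and the alternating allocation are assumptions for illustration, not part of the authors' platform.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_binary_table(n, or_intermediate, or_outcome, rng):
    """Return {arm: [events, non_events]} among participants with S = 1.

    Set 1 values: exp(alpha_U) = 0.8, exp(beta_U) = 1.2,
    intercept odds 0.2 (intermediate) and 0.1 (outcome).
    """
    alpha_r, alpha_u = math.log(or_intermediate), math.log(0.8)
    beta_r, beta_u = math.log(or_outcome), math.log(1.2)
    table = {0: [0, 0], 1: [0, 0]}
    for i in range(n):
        r = i % 2                       # assumed equal allocation
        u = rng.gauss(0.0, 1.0)         # unmeasured confounder
        s = rng.random() < expit(math.log(0.2) + alpha_r * r + alpha_u * u)
        if not s:
            continue                    # outcome truncated: excluded from the table
        y = rng.random() < expit(math.log(0.1) + beta_r * r + beta_u * u)
        table[r][0 if y else 1] += 1
    return table


if __name__ == "__main__":
    rng = random.Random(11)
    print(simulate_binary_table(100, 1.2, 1.5, rng))
```

With n = 100, only a handful of participants per arm typically contribute an outcome, so tables with zero events in one arm are easy to produce by changing the seed.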

Set 2
Set 2 was as for set 1, but with an interaction term $\alpha_{RU}$ between treatment and $u_i$ in the data generating model for the intermediate variable. For both the continuous and binary outcome studies, we simulated $S_i$ from a Bernoulli […]

The simulations were computationally cheap, allowing us to simulate and analyse a relatively large number of datasets corresponding to each tested scenario. The number of iterations per scenario was set to 10,000. Simulations were conducted in R [19], and ggplot2 [20], ggpubr [21], and ggthemes [22] were used for visualisation. Random seeds were obtained from random.org.

Sensitivity analyses
We conducted a number of sensitivity analyses, each of which involved making a uniform change to both sets 1 and 2. These explored the impact of (A) increasing the strength of confounding between the intermediate and outcome variables, (B) changing the direction of the treatment effect on the intermediate variable, and (C) increasing event rates for the intermediate and binary outcome. In (A), we increased the strength of the effect of the confounder $u_i$ on the intermediate variable to $\alpha_U = \log(0.5)$. We increased the effect of the confounder on the continuous outcome to $\beta_U = -1$ SD and the effect of the confounder on the binary outcome to $\beta_U = \log(1.5)$. In (B), we considered $\alpha_R = \log_e(1/\mathrm{OR})$ for OR = 1, 1.05, 1.1, …, 2, 5, so that the treatment effect on the intermediate was reversed. This was done to check that different influences were not operating in opposing directions, cancelling each other out, and obscuring performance issues. In (C), we increased the intercepts in the intermediate and binary outcome submodels to $\log(1)$, corresponding to a substantially elevated event rate of 50%.
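The grid of treatment effects on the intermediate variable, and its reversal in sensitivity analysis B, can be written out explicitly. The short Python sketch below assumes that B takes the reciprocal of each core odds ratio (equivalently, changes the sign of the log odds ratio); the variable names are illustrative only.

```python
import math

# Core study: odds ratios from 1 to 2 in steps of 0.05, plus the extreme test case of 5.
core_ors = [round(1 + 0.05 * k, 2) for k in range(21)] + [5.0]

# Log odds ratios (alpha_R) used in the core scenarios.
alpha_r_core = [math.log(o) for o in core_ors]

# Sensitivity analysis B, read here as reversing the direction of each effect:
# alpha_R = log(1 / OR) = -log(OR).
alpha_r_reversed = [math.log(1.0 / o) for o in core_ors]

# Sensitivity analysis C: an intercept of log(1) = 0 gives odds of 1, i.e. an event rate of 50%.
intercept_c = math.log(1.0)
event_rate_c = 1.0 / (1.0 + math.exp(-intercept_c))

if __name__ == "__main__":
    print(len(core_ors), core_ors[:3], core_ors[-2:])
    print(event_rate_c)
```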

Estimand
In this context, several estimands could be considered. We compared estimates to $\beta_R$, representing the effect of treatment on the outcome variable, in the hypothetical case where no censoring would occur (a hypothetical estimand, in the terminology of recent guidance on estimands in clinical trials) [23]. We selected this because this corresponds to a common interpretation given to analyses in this context. For example, in a recent trial investigating embryo culture media, a relative decrease in birthweight associated with one medium was interpreted as demonstrating a physiological effect on the embryo and foetus [11], rather than support for the hypothesis that any increase in live birth rate might be associated with worse perinatal outcomes due to selection effects. We explore this point in more detail in the discussion.

Analysis methods
In the continuous outcome study, we evaluated the difference in the means and associated standard inferential methods (two-sample equal variance t-test and 95% confidence intervals based on the t distribution). In the binary outcome study, we evaluated the sample odds ratio and 95% confidence interval based on the profile likelihood following a logistic regression fit [24]. We also evaluated three statistical tests: a chi-squared test, an adjusted 'N−1' chi-squared test [25], and Fisher's exact test. The adjusted chi-squared test involves multiplying the test statistic by (N−1)/N, with N the overall sample size (in this case, the total sample size in the subgroup with outcome data available for analysis) and has been suggested to perform well in small samples [26].
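The relationship between the ordinary and 'N−1' chi-squared statistics for a 2×2 table can be illustrated with a short Python sketch. It assumes a Pearson statistic without continuity correction (the text does not specify one) and uses the identity that, for one degree of freedom, the upper tail probability is erfc(sqrt(x/2)). The function name `chi_squared_tests` is hypothetical.

```python
import math


def chi_squared_tests(a, b, c, d):
    """Pearson and 'N-1' chi-squared tests for the 2x2 table [[a, b], [c, d]].

    Rows are study arms; columns are event / no event.
    Returns None when any margin is zero, since the statistic is then undefined.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return None

    def upper_tail(x):
        # P(X > x) for a chi-squared variable with 1 degree of freedom.
        return math.erfc(math.sqrt(x / 2.0))

    pearson = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    adjusted = pearson * (n - 1) / n      # multiply by (N - 1) / N
    return {
        "pearson": (pearson, upper_tail(pearson)),
        "n_minus_1": (adjusted, upper_tail(adjusted)),
    }


if __name__ == "__main__":
    print(chi_squared_tests(3, 12, 1, 14))
```

Because the adjustment multiplies the statistic by a factor below 1, the adjusted test is always slightly more conservative than the unadjusted one, with the difference shrinking as N grows.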

Performance measures
In the continuous outcome study, we evaluated bias, coverage, type 1 error, model standard error (SE), and empirical SE [27] for the unadjusted difference in the means. In the binary outcome study, we evaluated bias of the log(OR), model SE, empirical SE, and coverage. We also calculated the type 1 error of the chi-squared test, adjusted chi-squared test, and Fisher's exact test. For binary outcomes, we removed separated instances (those where no participants in a treatment arm experienced the outcome event) before calculating the performance measure, including these in the total counts of instances of missing data.
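The performance measures can be computed from the replicate-level estimates along the following lines. This Python sketch makes several simplifying assumptions that differ from the analyses described above: coverage is based on a Wald-type interval (estimate ± 1.96 × SE) rather than t-based or profile likelihood intervals, model SE is taken as the root mean of the squared SEs, and non-computable replicates (for example, separated datasets) are represented by `None`. The function name `performance` is hypothetical.

```python
import math
import statistics


def performance(estimates, std_errors, true_value, z=1.96):
    """Summarise simulation replicates for one scenario.

    estimates, std_errors: per-replicate values, with None marking replicates
    where the estimate could not be computed; these are counted as missing
    and excluded from the other measures.
    """
    kept = [(e, s) for e, s in zip(estimates, std_errors)
            if e is not None and s is not None]
    ests = [e for e, _ in kept]
    covered = [abs(e - true_value) <= z * s for e, s in kept]
    return {
        "n_missing": len(estimates) - len(kept),
        "bias": statistics.fmean(ests) - true_value,
        "empirical_se": statistics.stdev(ests),          # SD of the estimates
        "model_se": math.sqrt(statistics.fmean([s * s for _, s in kept])),
        "coverage": sum(covered) / len(kept),
    }


if __name__ == "__main__":
    summary = performance([0.1, -0.2, None, 0.4], [0.3, 0.3, None, 0.3], true_value=0.0)
    print(summary)
```

Reporting `n_missing` alongside the other measures matters here, because excluding non-computable replicates means the remaining measures describe only the datasets in which an estimate was obtainable.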

Results
A summary of the simulation results is presented in Table 2.

Continuous outcome study
Missing data due to the inability to compute estimates from simulated datasets did not prove to be a material problem in the continuous outcome study as the greatest amount of missing data in any scenario was 0.05%.

Bias
In the core scenarios, bias was present but very small when there was no interaction between treatment and the unmeasured confounder in the model for the intermediate variable (set 1, Fig. 2). In set 1, bias increased as the treatment effect on the intermediate variable increased, but remained negligible even when this effect became implausibly large; Fig. 2 shows this for an OR as large as 2, but in an extreme test case (OR = 5), the bias still did not exceed −0.02 SDs. In set 2 by contrast, which included an interaction between treatment and the unmeasured confounder in the generation of the intermediate variable, bias can be seen to decrease with increasing treatment effect on the intermediate up to an OR of 2 and was very close to zero for an OR of 5.
Neither sample size (columns in Fig. 2) nor the magnitude of the effect of treatment on the outcome (colours in Fig. 2) […] Figure 2), such that even in the absence of interactions (set 1), noticeable bias could arise for larger values of the OR. The bias was still reasonably modest for these larger OR values in set 1, however (below 0.1 SDs), and small for more realistic values of the parameter (below 0.05 SDs for OR < 1.2). In the presence of an interaction (set 2), however, bias was substantial for these realistic OR values.
Changing the direction of the treatment effect on the intermediate in sensitivity analysis B showed that ORs below 1 (negative effects on the log odds scale) did not result in qualitative changes to the relationship with bias (Additional File 1, S Figure 2). Increasing the incidence of the intermediate variable in sensitivity analysis C reduced bias for the set with an interaction (set 2), since this meant that a reduced proportion of the cohort was subject to outcome truncation (Additional File 1, S Figure 3).

[…] Figure 6), and for set 1 in sensitivity analysis A, where confounding was increased (Additional File 1, S Figure 4). However, in the increased confounding scenario, coverage was reduced for set 2 with increasing sample size, and this was modified by the size of the treatment effect on the intermediate.

Type 1 error
In core scenarios, type 1 error was not adversely affected in set 1 but was noticeably increased with an interaction present (set 2) and a large sample size (Fig. 4). In sensitivity analysis A, with increased confounding, type 1 error was inflated as treatment effect on the intermediate increased in the no-interaction set (set 1) but remained close to the nominal value for lower, more plausible values (Additional File 1, S Figure 7). By contrast, in set 2 (interaction between treatment and $u_i$), type 1 error was increased when the treatment effect on the intermediate was absent or small. The inflation increased with sample size, almost doubling for n = 1000. Sensitivity analyses B and C showed that neither changing the sign of the effect on the intermediate nor increasing the event rate substantially altered the results compared to the core simulations: type 1 error remained at the nominal level in set 1 and became elevated at larger sample sizes in set 2 (Additional File 1, S Figures 8 and 9).

[…] Figures 13 and 17).

Binary outcome study
For smaller sample sizes (n = 100, 200), there were substantial amounts of missing data arising from iterations where the treatment effect (OR) was inestimable. The tested scenarios frequently result in small numbers of participants whose outcome is observed (that is, not truncated), and so there are frequently zero outcome events in at least one arm (Additional File 1, S Figure 18). Clearly, the proportion of missing data depends on the size of the treatment effect (and so is informative), which immediately suggests that the routine analysis of truncated endpoints in smaller (really, typical) trials might be problematic. The only sensitivity analysis for which this differed was C, which increased event rates, such that it was rare for no events to occur (Additional File 1, S Figures 19-21). The amount of missing data caused by an inability to calculate a test statistic was much lower than the amount caused by the inability to estimate a treatment effect (excepting sensitivity analysis C) but remained very high (in the region of 20%) for small sample sizes and modest/realistic treatment effects and was clearly related to the magnitude of the treatment effect on outcome (Additional File 1, S Figures 22-25).

Bias
Bias, expressed as a ratio of odds ratios (ROR), fell below 1, substantially so for most of the tested treatment effect sizes, for sample sizes of 100 and 200 in both sets (Fig. 5). As noted above, the smaller sample sizes were subject to informative missing data, offering one explanation for the difference between the smaller and larger trials, for which the bias was much smaller, and in the opposite direction (ratios in the region of 1 to 1.05, or in the region of 1.08 when looking at the implausibly large treatment effect on the outcome, OR = 5). A comparison of the first two columns between Additional File 1, S Figure 18 and Figure 5 actually shows that missingness is negatively correlated with the size of the bias (the scenarios with less missing data have a ratio further from unity). However, sensitivity analysis B, where the effect of the treatment on the intermediate is reversed, shows that this is due to the fact that the influence of missing data (causing overestimation of the odds ratio) and of the treatment effect on the intermediate (causing underestimation of the odds ratio) act in opposite directions in the core scenarios (Additional File 1, S Figure 27). Moreover, increasing the event rate, thereby eliminating the missing data issue (sensitivity analysis C), essentially removed the problem (Additional File 1, S Figure 28). For trial sizes of 500 and 1000, increasing the treatment effect on the intermediate variable to an extreme value (OR of 5) did result in substantial bias even in the absence of interactions; ratios around 1.35 and 1[…]

Coverage and SE
The coverage level was too high at all effect sizes for smaller trials (Fig. 6). This was at least partially attributable to missing data; the model SE was greater than the empirical SE of the computable estimates in the presence of missing data (Additional File 1, S Figures 32 and […] Figures 29 and 30), but not C where rates of missingness were minimised due to an increased event rate. For larger trial sizes, estimated coverage levels were generally close to, if not identical to, the nominal level, although discrepancies of several percentage points were apparent for large negative treatment effects on the intermediate (sensitivity analysis B, Additional File 1, S Figure 30).

Type 1 error of statistical tests
The rejection rates of the two chi-squared tests were essentially equivalent, and Fisher's test performed consistently poorly. For trial sizes of 100 and 200, both subject to substantial amounts of missing data, type 1 error fell below the nominal level for all three methods (Fig. 7). The chi-squared tests achieved the nominal level for larger sample sizes in both sets, however, regardless of the size of treatment effect on the intermediate variable.
Increasing confounding (sensitivity analysis A) and changing the sign of the treatment effect on the intermediate (sensitivity analysis B) did not alter these findings: appropriate type 1 error was observed for the chi-squared tests for larger trial sizes, but not for smaller trial sizes (Additional File 1, S Figures 40 and 41). Increasing the event rate (sensitivity analysis C) resulted in appropriate type 1 error rates for chi-squared tests at all trial sizes and an improvement in the performance of Fisher's test (Additional File 1, S Figure 42), suggesting that issues in the performance of the methods are primarily linked to informative missingness of calculated test statistics.

Discussion
In the present study, we have created a simulation platform for studying the effects of outcome truncation in reproductive medicine trials. Using this platform, we present here a simulation study to characterise the phenomenon in scenarios resembling ART RCTs, although we believe the platform may also be of use for reproductive medicine trials beyond infertility, with similar structure (for example, the study of the effect of iron supplementation prior to pregnancy on perinatal outcomes [28], where treatment precedes the intermediate variable and outcomes are truncated).
We aimed to quantify the magnitude of the problems introduced by outcome truncation in practice, by using both comparative effectiveness and meta-epidemiological research in reproductive medicine to inform the simulation parameters. Our findings in this regard are therefore contingent upon the representativeness of the parameter values [29], and so we begin our discussion by briefly reviewing the motivations for the tested values and our confidence in these selections.
We opted to consider treatment effects on both the intermediate variable and the outcome variable that ranged from no effect up to implausibly large (an OR of 5 or difference in means of 5 SDs), in order to cover all bases with respect to these parameters. We note that we expect typical effects on the intermediate variable (for example, clinical pregnancy or live birth) to be less than approximately 1.2 when expressed as an OR. Typical treatment effects on live birth in Cochrane meta-analyses of infertility therapies were in the region of a few percentage points in a recent review [18], with larger estimates tending to arise from meta-analyses containing fewer study participants (generally representing less precise estimates). Moreover, these effect estimates are expected to be inflated to an unknown degree by publication bias. These considerations suggest that modest effects on the intermediate variables are to be expected. Typical effects on outcome variables subject to truncation are harder to establish, in part precisely because they are obscured by the censoring phenomenon under scrutiny in the present study. We have considered an expansive range of values for this parameter, and have found that its magnitude appears to relate to the performance of standard methods for the analysis of truncated binary outcomes. Trialists using the platform to assist in the design of new studies are advised to consider a range of plausible values for this parameter, where plausibility is established in conjunction with clinical experts.
The magnitude of unmeasured confounding between the intermediate and outcome variables may turn out to be an important parameter. Increasing the strength of confounding in a sensitivity analysis modified the impact of increasing the treatment-on-intermediate effect, although neither the modification nor the consequences for performance were dramatic, for either continuous or binary outcomes. By contrast, an earlier study of pregnancy exposures and truncated continuous long-term health outcomes in children indicated substantial problems with naive approaches to analysis [8], and our own attempts to replicate that study suggest that this was because the tested scenarios implied stronger confounding between the intermediate and outcome variable than we have considered here. If we have understated the plausible strength of confounding to a significant degree, we may have undersold the implications of outcome truncation. Speculating about the possible extent of unmeasured confounding between two variables is challenging, and in this case, the correlation between the two cannot be directly observed (since the outcome is defined conditionally on the intermediate taking a particular value). Considering the total unexplained variation in each variable might be one way to start building intuition here, as might identification of known shared prognostic factors in the literature. This too is complicated by potential causal dependency between the intermediate and outcome variables, however, as well as by the potential for distortion by the censoring phenomenon. In RCTs, it is likely that confounding will often be reduced through trial exclusion criteria. For example, restrictions are often placed on smoking and maternal BMI, which are associated with birthweight [30,31]. These restrictions will not eliminate confounding altogether however. 
For example, in the case of smoking, recent quitters and never smokers might still differ with respect to chances of conception and live birth in addition to birthweight. The potential for confounding might be strong in trials in populations with heterogeneous causes of infertility, where there might be underlying, unmeasured genetic, endocrine, immunological, or metabolic disease in some participants. Where knowledge of the other structural parameters is relatively strong, one use of the simulation platform presented here might be to investigate the strength of unmeasured confounding that would be required to introduce substantial problems or to investigate the robustness of the study design in the event that confounding is strong.
A second function of the simulation study presented here was to elucidate the factors that affect the performance of the standard analyses used in this context. To the extent that we have allowed these factors to vary independently, these findings might be less contingent on the particular parameter values used.
The present study is subject to other potential limitations. We have considered the situation where treatment is delivered on a single occasion, and the outcome is established at a single time point. These conditions are commonplace in ART studies, but studies evaluating cumulative outcomes over extended courses of treatment exist and are becoming more popular, since it is now recognised that outcomes after repeated attempts to conceive are particularly relevant to subfertile patients [32,33]. The present study is not directly relevant to these scenarios.
Another point to consider is that we have evaluated methods against a particular estimand, corresponding to the effect on outcome in the (hypothetical) absence of censoring. We selected this on the grounds that it aligned with a common interpretation given to ART trials subject to outcome truncation, e.g. [11], and that it frequently has clinical relevance in this context. Taking the effect of embryo culture medium on birthweight as an example, it would be useful for a clinician to know whether a medium that is advantageous in terms of live birth rate resulted in reduced birthweights by adversely affecting foetal development, or whether reduced birthweights were an inevitable consequence of improving live birth rates in the population. This knowledge might influence the decision-making process undertaken by patients and clinicians. For instance, the potential harms to offspring might be considered unacceptable, or the availability of a co-intervention known to improve live birth rate might make an alternative medium that is less effective (live birth) but safer (birthweight) a more attractive choice. We would stress, however, that analysis of trial outcome data alone is unlikely to provide sufficient insight into this sort of mechanistic hypothesis and must be considered alongside biological evidence.
However, other estimands have been described in the context of competing risks [34,35], and each of these might be more or less attractive depending on the particulars of the research question under evaluation and the assumptions the study team are willing to make. Proposals include (what has been described as) a total effect of treatment, in which a composite outcome is defined for everyone regardless of their status with respect to the intermediate variable; anyone not experiencing the intermediate event is classified as not having the outcome [34,36] (see also the treatment policy strategy in [23,37]). For example, any participants who did not become pregnant would be considered not to have had a miscarriage. Under this definition, a treatment could reduce the miscarriage rate by reducing the pregnancy rate, which does not conform to any intuitive notion of therapeutic benefit. Furthermore, this definition cannot be extended to continuous outcomes. Another potential hypothetical estimand is the survivor average causal effect, the effect in patients who would have had the intermediate event under either treatment allocation [6], which raises questions about relevance to real patients. Another proposal is to consider direct and indirect separable effects, which require the analyst to postulate distinct causal pathways including and excluding the intermediate [35]. This requires the intermediate variable to be construed as a mediator. However, we largely agree with the commentary of Snowden and colleagues [38], which clarifies the role of the intermediate variable in this context: the intermediate does not mediate the effect of treatment so much as determine whether the outcome variable is defined. In light of the conceptual difficulty of interpreting the intermediate as a mediator, we have not considered a causal path from the intermediate to the outcome in the present study, but have included the option to do so in the simulation platform we provide.
We have also not considered the potential role of an interaction between treatment effect on the outcome (rather than on the intermediate) and confounding factors here. We include the option to do so in the simulation code, but urge the user to consider whether an alternative estimand might be more appropriate in the presence of such an interaction.
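To make the earlier point about the composite ("total effect") definition concrete, the short sketch below works through the arithmetic for two hypothetical arms. The pregnancy and miscarriage rates, and the helper name `composite_risk`, are invented for illustration and are not taken from any trial or from the simulation study.

```python
def composite_risk(p_pregnancy, p_miscarriage_given_pregnancy):
    """Miscarriage risk when non-pregnant participants count as 'no miscarriage'."""
    return p_pregnancy * p_miscarriage_given_pregnancy


# Hypothetical arms: the treatment halves the pregnancy rate but leaves the
# miscarriage risk among pregnant participants unchanged.
control = composite_risk(p_pregnancy=0.40, p_miscarriage_given_pregnancy=0.20)
treated = composite_risk(p_pregnancy=0.20, p_miscarriage_given_pregnancy=0.20)

print(f"control composite risk: {control:.2f}")
print(f"treated composite risk: {treated:.2f}")
print(f"composite risk ratio:   {treated / control:.2f}")
```

In this example the composite miscarriage risk falls from 8% to 4% although the risk among those who become pregnant is identical in both arms, so the apparent benefit arises entirely from fewer pregnancies.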
With these considerations in place, we turn to the findings of the simulation study. In the continuous outcome study, the impact of outcome truncation on simple analyses based on the observed difference in means was less severe than had perhaps been anticipated, with reasonable bias, coverage, and type 1 error rates for more realistic treatment effects, except in the scenario with increased confounding, when performance was notably affected in the presence of an interaction between treatment and the unmeasured confounder. These results might therefore be seen as relatively reassuring in relation to continuous outcome measures, depending on the plausible extent of unmeasured confounding and scope for interaction effects for the particular research question at hand. In particular, these results appear supportive of the finding in [11] that choice of embryo culture medium can influence the birthweight of offspring born from ART.
The situation with binary outcomes appears somewhat more nuanced. For larger trial sizes of 500 or 1000, the bias of the odds ratio was present but was relatively modest, and coverage was close to the nominal level, again provided that no interaction between treatment and the intermediate was present. These findings held when we increased the level of confounding and when we changed the direction of the treatment effect on the intermediate variable, to rule out the possibility of effects in opposing directions concealing problems. For these larger trial sizes then, our results appear to be qualitatively concordant with the conclusions of previous authors, e.g. [3,15], at least in the sense that bias was caused by outcome truncation, and was affected by increasing treatment effect on the intermediate variable, strength of unmeasured confounding, and presence of interactions. Quantitatively, however, we find that, within the parameters considered here, outcome truncation might not be so great a cause for concern (at least for large trials) as has previously been suggested. Indeed, close inspection of previous simulation results suggests that substantive performance issues have been observed only under parameter settings that would be quite extreme in the context of ART trials, e.g. large effects of exposure on intermediate [3]. The type 1 error rates for two variants of a chi-squared test were also close to the nominal level for larger trial sizes. Fisher's test performed poorly in this context, which may be attributable to the violation of the assumption of fixed margins, and this was presumably caused by varying numbers of participants entering the analysis set across simulated datasets within any given scenario.
For smaller trial sizes (n = 100, 200), outcome truncation creates serious challenges for the study of binary outcomes, and this appears to be attributable to separation (studies in which all or none of the analysable participants in a study arm have the outcome event). The current study highlights the fact that the likelihood of obtaining an effect estimate is related to the effect of the treatment on both the intermediate and outcome variables. As such, the subset of studies in which an effect estimate is calculable will not produce an unbiassed sample for the purpose of estimation. Notably, small studies in ART are commonplace [18]. Although small studies are unlikely to use a truncated response variable as a primary outcome, such variables may still be reported as secondary outcomes. There may be implications for systematic reviews here, since truncated binary secondary (as opposed to primary) outcomes, analysed in the post-randomisation subgroup, often appear in meta-analyses [12–14]. By design, meta-analyses incorporate all studies, including the smaller ones. Pooled estimates are therefore likely to be based on an informative selection process, leading to bias. This situation is subject to additional complexities compared to the usual case of meta-analysis of sparse events and appears to warrant further investigation. In the interim, we recommend following the advice set out in the Cochrane Handbook, which is to avoid meta-analysis of truncated outcomes wherever possible [5].
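The dependence of separation on trial size, and on the treatment effects on both the intermediate and the outcome, can be sketched with a simple probability calculation. The example below assumes a fixed 1:1 allocation and independent participants, and it also counts an arm with no analysable participants as separated, since no estimate is available in that case either. The function names and all rates are hypothetical and do not correspond to the scenarios in the simulation study.

```python
def arm_separation_prob(n_arm, p_intermediate, p_outcome_given_intermediate):
    """P(all or none of the analysable participants in one arm have the event).

    An arm with no analysable participants is counted as separated as well.
    """
    p, q = p_intermediate, p_outcome_given_intermediate
    # No participant both has the intermediate event and the outcome event.
    none_with_event = (1 - p * q) ** n_arm
    # No participant has the intermediate event without the outcome event.
    all_with_event = (1 - p * (1 - q)) ** n_arm
    # Nobody has the intermediate event: included in both terms above.
    nobody_analysable = (1 - p) ** n_arm
    return none_with_event + all_with_event - nobody_analysable


def trial_separation_prob(n_total, p_int, q_out):
    """P(separation in at least one arm); p_int and q_out are (control, treated)."""
    per_arm = n_total // 2
    no_separation = 1.0
    for p, q in zip(p_int, q_out):
        no_separation *= 1 - arm_separation_prob(per_arm, p, q)
    return 1 - no_separation


# Hypothetical rates: pregnancy 30% vs 35%, miscarriage 15% in both arms.
for n in (100, 200, 500, 1000):
    prob = trial_separation_prob(n, p_int=(0.30, 0.35), q_out=(0.15, 0.15))
    print(f"n = {n:4d}: P(separation in at least one arm) = {prob:.4f}")
```

Because the per-arm probability depends on the product of the intermediate and outcome probabilities, changing either treatment effect changes the chance that a given trial yields a calculable estimate, which is the source of the informative selection described above.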
Comparisons of outcomes in the subgroup have been endorsed (e.g. [39], cited in [38]) on the grounds that this represents the population at risk. A problem with this proposal in principle is that the observed difference in a trial, not representing an effect of treatment per se, will not apply wherever confounding or the effect on the intermediate differs (for example, due to differences in other aspects of the ART treatment protocol). It may, however, be a useful quantity to consider from a public health perspective, provided it is based on a representative sample, and the present study suggests that simple analyses will yield reasonable answers in many cases. Nonetheless, it remains reasonable to seek alternative analytic methods that will be robust to outcome truncation under a broader range of data-generating models. We note that unadjusted analyses, as are commonly performed in the field and as considered here, are unlikely to be optimal regardless of outcome truncation, since adjusting for prognostic covariates in a trial will improve precision [40–42]. To the extent that the adjustment variables coincide with the confounders of the intermediate-outcome relationship, it is possible that performance might be improved compared to the unadjusted approach in the truncated outcome scenario, although with small samples and binary outcomes, it is possible that covariate adjustment might exacerbate issues relating to sparse data and separation [43]. It remains to examine this empirically, as well as the most suitable approach for adjustment (e.g. regression versus inverse probability weighting approaches) and the implications for interpretation. In the presence of separation, Firth's logistic regression correction [44] has been recommended [45–47], and the performance of this approach for truncated binary outcome data warrants investigation.
Methods to estimate the survivor average causal effect have been described [6,7], as have sensitivity analyses designed for this context [48–50]. Another proposal would be to consider a joint model of the intermediate and outcome variables, although it is not clear that this would be estimable in a point treatment setting. Methods for meta-analysis of truncated outcomes with small studies appear to be another avenue for future research.

Conclusions
In general, proposed approaches to analysis in the presence of outcome truncation require substantial assumptions and relevant data (notably on sufficient confounding sets) to restore unbiased effects and statistical inferences with the correct operating characteristics. Our simulation platform provides a rapid assessment of the implications of outcome truncation given user-input parameters and can be used to assist in the design and interpretation of reproductive medicine trials, particularly in the case of small trials for binary primary outcomes or where there are expected to be strong confounders related to selection (conception or live birth) or interactions between these confounders and treatment. In relation to design, the simulation platform provides a way to estimate the power implied by the study parameters, which can be used to inform sample size for future trials. It can also inform trialists as to whether outcome truncation is likely to pose a material threat to study validity. For the interpretation of published studies, it is of course not possible to determine whether the results of any individual study are attributable to error or bias. Examining the operating characteristics of studies subject to outcome truncation may nonetheless allow researchers to understand the risks of these errors. We stress, however, that we are not yet able to recommend an optimal analytic strategy for handling outcome truncation.
Finally, since the code is freely available for modification, we hope that it may serve as a platform for future methodological research in the area.