# Table 3. Summary of pros and cons of potential statistical tests that could be used when there is a time-varying mortality difference (non-proportional hazards)

| Method | Pros | Cons |
| --- | --- | --- |
| Weighted log-rank test | Not model-based. | Need to formally pre-specify the expected mortality differences over time (functional form of the HR) for the test to have statistical validity. This may prove difficult given that differences will depend on the natural history of the cancer, screening strategy, number of screens, years of follow-up, etc. |
| | Known to improve power in situations of non-PH. | There is an associated risk of mis-specifying the form of the HR, and simulations suggest that incorrectly assuming a late effect, for example, may incur a greater penalty than assuming PHs under early or late effects [33, 47]. |
| | Most widely used and established test for non-PHs in clinical trials. | Subjects’ deaths are given a differential (and arbitrary) weighting, which may be hard to justify. A further conceptual problem with weights based on the data is that if a trial subsequently reports again, the weight allocated to each event will change, likely significantly. |
| Flexible parametric model such as the Royston-Parmar (RP) model (cubic splines) or fractional polynomial (FP) survival model (joint test of all screen arm related terms) | No need to pre-specify the specific functional form of the mortality effect. | No precedent for use as primary analysis in RCTs. |
| | Can mimic a non-PH function to an almost arbitrary degree. | Flexibility makes it easy to overfit and include random data artefacts. |
| | | Power properties not well known. Will lose power with too many model parameters. |
| | Allows one to accurately describe the hazards and their ratio over time. | Need to pre-specify the number of knots/degrees of freedom and the placement of knots for the RP model. The FP model requires a choice of powers and degree. The choice can be guided by information criteria, but is then data dependent and may reflect artefacts. |
| | Relatively easy to fit. | Test, as proposed, considers whether mortality curves are “different”. A significant result could theoretically result from crossing curves, even curves with no difference in area under the curve. |
| Weibull model (with separate shape parameters for each group) | Can reflect simple time-varying differences in mortality curves succinctly. | Unlikely to capture more complex curves sufficiently. All hazard functions must be monotonic (consistently decreasing or increasing). |
| | Easy to fit. | |
| Cox model with time-varying coefficient (TVC) | Extension of the Cox model, so perhaps more readily acceptable given prior use. | Need to pre-specify the function of time that the non-PHs apply to, usually a simple linear or log function of time. |
| | Able to incorporate non-PHs without specifying differences in mortality curves (functional form). For example, if a linear function of time is chosen, the time-varying effect could be linearly decreasing or increasing. | Interpretation not straightforward. |
| | | Awkward and (very) time-consuming to fit (splits data at each failure). |
| | | No definite agreement on test of significance. Could be similar to the joint test on 2 degrees of freedom. |
| | No need to consider baseline hazard function. | |
| Difference in restricted mean survival time (RMST) | No need to be model-based; can use non-parametric estimation. | Need to pre-specify choice of time restriction, possibly including initial time t0 as well as final time limit t1. |
| | Can reflect any time-varying difference in mortality: the estimate of the RMST difference corresponds graphically to the difference in area between the respective survival curves. | |
| | Do not need to speculate on the particular form of the time-varying difference in mortality. | However, choice of time restriction may depend on expectation of difference (HR functional form). May be time-consuming to estimate, including standard error. |
| | Gives a meaningful single summary estimate even with non-PHs. | As the test looks for differences in area under the curve, survival curves that come back together can result in a significant test result. |
| Combined test (of Cox test with a permutation test based on RMSTs on 2 df) | Simulations suggest power not much lower than Cox alone under PHs, and more powerful in more situations than the joint test [33, 47]. | Difficult to explain. |
| | | Time-consuming to fit (permutation test). |
| | | Issues of RMST (see above): choice of time restriction. |
| | Enhanced power for early effect. | Simulations suggest not powerful for late effects. |
| Joint test (of Cox proportional screen arm effect + Grambsch-Therneau non-PH test on 2 df) | Test based on results of the Cox model (screen arm effect and the Schoenfeld residuals), so perhaps more readily acceptable given prior use of the Cox model. | Simulations suggest better under late effects but not good power for early effects [33, 47]. |
| | Relatively simple test (with a degree of intuitiveness), but more powerful than just the screen arm effect under non-PHs. | |
| Combination tests such as the Versatile Test (maximum test statistic of 3 weighted tests: early, PHs, late effects) or “max-combo” (also includes “middle” effects) | Not model-based. | Appears complicated (need for reference to a correlated multivariate z-distribution for test statistic). |
| | Provides good power in all situations; covers bases with a small price in efficiency. | Not the most powerful test. |
| | Best choice if one wants to be agnostic about specifying the time-varying mortality difference. | Can feasibly reject the null hypothesis both in favour of the study arm and of the control arm using the same data. |
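
To make the RMST entry in the table concrete, the following is a minimal sketch of a non-parametric RMST difference, computed as the difference in area under two Kaplan-Meier curves up to a pre-specified time limit. It is an illustration only, not the analysis used in the source: the function names (`kaplan_meier`, `rmst`, `rmst_difference`), the parameter `tau` (playing the role of the final time limit t1, with t0 taken as 0) and the toy data are all assumptions introduced here, and no standard error or significance test is computed.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return the Kaplan-Meier curve as (death time, survival just after it) steps.

    events[i] is 1 for an observed death and 0 for a censored observation.
    """
    steps: List[Tuple[float, float]] = []
    surv = 1.0
    at_risk = len(times)
    # Walk through the distinct observed times in increasing order.
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censorings at t
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            steps.append((t, surv))
        at_risk -= leaving
    return steps


def rmst(times: Sequence[float], events: Sequence[int], tau: float) -> float:
    """Area under the Kaplan-Meier step function between 0 and tau.

    If tau lies beyond the last observed death, the last survival value is
    carried forward to tau (one possible convention, assumed here).
    """
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle under the current step
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)  # final partial rectangle up to tau
    return area


def rmst_difference(times_a, events_a, times_b, events_b, tau: float) -> float:
    """RMST of arm A minus RMST of arm B, both restricted to [0, tau]."""
    return rmst(times_a, events_a, tau) - rmst(times_b, events_b, tau)


if __name__ == "__main__":
    # Toy follow-up times in years (made up for illustration).
    screen_t, screen_e = [2, 3, 4, 5], [1, 1, 1, 1]
    control_t, control_e = [1, 2, 3, 4], [1, 1, 1, 1]
    print(rmst_difference(screen_t, screen_e, control_t, control_e, tau=4.0))
```

Changing `tau` in this sketch changes the estimate, which illustrates why the table lists pre-specification of the time restriction as a drawback of the RMST approach.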