Our paper has two components: trial design with associated significance testing, and estimation of results. We discuss these briefly in turn.
There is no particular reason that we are aware of for expecting proportional hazards of the treatment effect. It is a convenient assumption that facilitates sample size calculation in time-to-event data. Our basic design idea is to improve the power to detect a more general, that is, potentially more complex, treatment effect than PH. The motivation is the increasingly frequent occurrence of non-PH in trials, with a concern that the power of the logrank test may be low in some of these cases. The outcome could be a trial declared or regarded as ‘negative’, when in fact a clinically relevant difference in survival curves between treatments was present.
There are costs to generalizing the concept of a treatment effect. Patterns of non-PH are potentially very varied, and it is hard, if not impossible, to design a trial with a convincing prior assumption about the likely pattern. Our proposed solution is to power the trial under PH according to a two-part (‘joint’) test. By combining the usual logrank or Cox test with the Grambsch-Therneau test of non-PH, we incur a loss of power under PH, but we may gain power under non-PH.
We discussed three possible strategies for trial design: (1) power according to the logrank test, accepting a loss of power for the joint test; (2) power according to the joint test, which requires a larger sample size, obtained by designing the logrank test at a correspondingly higher power; or (3) the same strategy as (2), but relaxing the significance level of the joint test to achieve the same power as the logrank test.
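As a rough illustration of what strategies (2) and (3) involve numerically, the following Python sketch assumes that, under PH, the joint statistic follows a noncentral chi-square distribution on 2 d.f. whose noncentrality equals that of the logrank test. The design values (5 percent significance, 90 percent target power) are illustrative, the function names are ours, and the reading of strategy (3) as keeping the logrank-design number of events is an assumption.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def joint_power(ncp, alpha):
    """Power of a 2 d.f. chi-square test with noncentrality ncp (assumed model).

    Uses the Poisson-mixture form of the noncentral chi-square and the closed
    form of the central chi-square survival function for even d.f.
    """
    half_crit = -math.log(alpha)   # half of the 2 d.f. critical value -2*ln(alpha)
    lam = ncp / 2.0
    weight = math.exp(-lam)        # Poisson(lam) probability of j = 0
    term, partial, power = 1.0, 0.0, 0.0
    for j in range(400):
        partial += term                      # sum of half_crit**i / i! for i <= j
        power += weight * alpha * partial    # alpha equals exp(-half_crit)
        term *= half_crit / (j + 1)
        weight *= lam / (j + 1)
    return power

def bisect(f, lo, hi, tol=1e-10):
    """Root of an increasing function f on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha, target = 0.05, 0.90
z_alpha = Z.inv_cdf(1 - alpha / 2)
ncp_logrank = (z_alpha + Z.inv_cdf(target)) ** 2   # logrank design at 90 percent power

# Strategy (2): noncentrality needed for the joint test to reach the target power,
# the logrank power this corresponds to, and the relative increase in events.
ncp_joint = bisect(lambda n: joint_power(n, alpha) - target, 0.0, 50.0)
logrank_power_needed = Z.cdf(math.sqrt(ncp_joint) - z_alpha)
event_inflation = ncp_joint / ncp_logrank

# Strategy (3), assuming the logrank-design events are kept: relaxed significance
# level at which the joint test attains the target power.
alpha_relaxed = bisect(lambda a: joint_power(ncp_logrank, a) - target, alpha, 0.5)

print(f"strategy 2: logrank power {logrank_power_needed:.3f}, events x {event_inflation:.2f}")
print(f"strategy 3: relaxed significance level {alpha_relaxed:.3f}")
```

Because the required number of events is proportional to the noncentrality for a fixed target HR, the ratio of the two noncentralities gives the relative increase in events directly.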
We have a slight preference for strategy 1. A frequent choice in past Medical Research Council (MRC) cancer trials, for example, has been to power the logrank test at 90 percent with significance level 5 percent and a target HR of 0.75. As we have seen, such a design guarantees power of 83.5 percent for the joint test under PH, which many would consider adequate. Others may have different preferences. We indicate how to do the relatively straightforward power calculations in the present paper. Sophisticated methodology and software (for example, [12, 13]) are available for implementing complex trial designs under the logrank test. These can of course also be used with the joint test under PH.
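The power figure for this example design can be checked with a short calculation. The Python sketch below is a minimal illustration, not the software cited above. It assumes that under PH the joint statistic is noncentral chi-square on 2 d.f. with the same noncentrality as the logrank test, and it uses Schoenfeld's approximation with 1:1 allocation for the number of events. All function names are ours.

```python
import math
from statistics import NormalDist

ALPHA = 0.05           # two-sided significance level
LOGRANK_POWER = 0.90   # design power of the logrank test
TARGET_HR = 0.75       # target hazard ratio

def logrank_ncp(alpha, power):
    """Noncentrality (z_{1-alpha/2} + z_{power})^2 implied by a logrank design."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

def schoenfeld_events(hr, ncp):
    """Events required under Schoenfeld's approximation, assuming 1:1 allocation."""
    return math.ceil(4 * ncp / math.log(hr) ** 2)

def chi2_sf_even(x, k):
    """Survival function of a central chi-square with 2*k d.f. (closed form)."""
    half, term, total = x / 2.0, 1.0, 0.0
    for i in range(k):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total

def joint_test_power(ncp, alpha, n_terms=200):
    """Power of the 2 d.f. test: Poisson mixture of central chi-square tails."""
    crit = -2.0 * math.log(alpha)   # exact 2 d.f. critical value
    lam = ncp / 2.0
    weight = math.exp(-lam)         # Poisson(lam) probability of j = 0
    power = 0.0
    for j in range(n_terms):
        power += weight * chi2_sf_even(crit, j + 1)
        weight *= lam / (j + 1)
    return power

ncp = logrank_ncp(ALPHA, LOGRANK_POWER)
events = schoenfeld_events(TARGET_HR, ncp)
power_joint = joint_test_power(ncp, ALPHA)

print(f"events required (logrank design): {events}")
print(f"joint test power under PH: {power_joint:.3f}")
```

Under these assumptions the computed joint-test power should land close to the 83.5 percent quoted above.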
The Grambsch-Therneau test is based on scaled Schoenfeld residuals derived from a Cox PH model. Schoenfeld residuals are unsuitable for estimation of the quantities of substantive interest in a survival analysis of trial data. For that reason, for estimation, we suggest using a flexible parametric model (FPM) with a time-dependent treatment effect. This class of models can be pre-specified in sufficient detail in a protocol and statistical analysis plan. It provides smooth estimates of survival probabilities, hazard ratio functions, restricted mean survival times, and so on. While there is a potential risk of bias due to the FPM failing to fit the data adequately, our experience so far is that noticeable lack of fit to the survival functions is uncommon. Of course, the Cox model can also fit badly.
The time-dependent treatment effect function incorporated in the FPM is linear in the logarithm of follow-up time and is therefore of limited flexibility. The fit can be checked by inspecting a plot of smoothed Schoenfeld residuals against the failure times, which gives a ‘non-parametric’ impression of the pattern of the log hazard ratio over time. If necessary, in secondary analysis the FPM can be elaborated with further spline parameters to improve the fit.
A sensible alternative to the joint test we describe is a joint test of the two parameters θ₀ and θ₁ in the FPM. This test, also on 2 d.f., is of the treatment effect and its interaction with (log) time. The global null hypothesis is θ₀ = θ₁ = 0 (see Equation (3)). In an informal comparison using a database of 25 heterogeneous RCTs, we found good agreement and no consistent differences in the P-values of the two joint tests (data not shown). At this point, we have no empirical evidence to support recommending one test over the other. However, one theoretical consideration favouring the Cox/Grambsch-Therneau joint test is that the Grambsch-Therneau test is more general than the time-dependent function θ₁ ln t. It is conceivable, therefore, that the Cox/Grambsch-Therneau test may tend to have higher power in general than the FPM-based joint test. On the other hand, some researchers may favour congruence between the global test for a treatment effect being based on the FPM and the same FPM being used in the description and interpretation of the trial results.
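To make the FPM-based alternative concrete, the Python sketch below shows one way such a 2 d.f. test could be computed from fitted parameters. It assumes the log hazard ratio takes the form θ0 + θ1 ln t and uses a Wald statistic; a likelihood ratio test would be an equally reasonable choice. The parameter estimates and covariance matrix are hypothetical illustration values, not results from any trial.

```python
import math

def hazard_ratio(t, theta0, theta1):
    """Hazard ratio at time t, assuming log HR(t) = theta0 + theta1 * ln(t)."""
    return math.exp(theta0 + theta1 * math.log(t))

def wald_joint_test(theta, cov):
    """Wald test of theta0 = theta1 = 0 on 2 d.f.

    theta is (theta0, theta1); cov is the symmetric 2x2 covariance matrix.
    Returns the statistic and its P-value from the chi-square on 2 d.f.,
    whose survival function is exp(-w / 2).
    """
    t0, t1 = theta
    (v00, v01), (_, v11) = cov
    det = v00 * v11 - v01 * v01
    w = (t0 * t0 * v11 - 2.0 * t0 * t1 * v01 + t1 * t1 * v00) / det
    return w, math.exp(-w / 2.0)

# Hypothetical estimates, for illustration only.
theta_hat = (-0.35, 0.20)
cov_hat = ((0.010, -0.002), (-0.002, 0.008))

w, p = wald_joint_test(theta_hat, cov_hat)
print(f"Wald chi-square (2 d.f.) = {w:.2f}, P = {p:.4f}")
for t in (0.5, 1.0, 2.0, 5.0):
    print(f"HR at t = {t}: {hazard_ratio(t, *theta_hat):.2f}")
```

With this parameterization θ0 is the log hazard ratio at t = 1, so the choice of time unit affects its interpretation but not the joint test.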
A key feature of the joint test is that it is sensitive to simple and also to more ‘complex’ treatment effects. In the latter case, assuming the result is not a type 1 error, the test is indicating there is a genuine difference between the survival curves. Even if the overall treatment effect, considered over the entire follow-up time of the trial, is small, the difference between the arms may still be of clinical and/or scientific interest and importance. For example, the difference in the survival curves between the treatment arms may suggest possible mechanisms of action of the treatments.
We are not suggesting that the joint test be adopted routinely. Primarily, we suggest that the trialist choose the preferred test according to the perceived modes of action of the treatments being compared. If the modes are obviously different, for example surgery versus a more conservative approach such as watchful waiting or a non-surgical therapy, the hazard functions will probably differ markedly in shape and non-PH seems more likely. The joint test may then be a good choice. If rather similar treatments are involved, such as various chemotherapy regimens, non-PH may seem less likely and the logrank test may be best. There may be indications of the extent and nature of non-PH from earlier trials or, in cancer for example, from other cancer types in which the treatment has been evaluated. Another consideration is judging how close to PH the ensuing survival curves are likely to be. If a treatment effect is expected to emerge relatively soon after randomization, non-PH is likely to be mild and the logrank test will be the more powerful. If the effect emerges much later in follow-up, the joint test is likely to be more powerful.