"Flogging dead horses": evaluating when clinical trials have achieved sufficiency and stability? A case study in cardiac rehabilitation

Background: Most systematic reviews conclude that another clinical trial is needed. Measures of sufficiency and stability may indicate whether this is true.
Objectives: To show how evidence accumulated on centre-based versus home-based cardiac rehabilitation, including estimates of sufficiency and stability.
Methods: Systematic reviews of clinical trials of home-based versus centre-based cardiac rehabilitation were used to develop a cumulative meta-analysis over time. We calculated the standardised mean difference (SMD) in effect, confidence intervals, and indicators of sufficiency and stability. Sufficiency refers to whether the meta-analytic database adequately demonstrates that an intervention works, that is, is statistically superior to another; it is assessed from the number of studies with null results that would be required to make the meta-analytic effect statistically non-significant. Stability refers to whether the direction and size of the effect remain stable as new studies are added to the meta-analysis.
Results: The standardised mean effect difference reduced over fourteen comparisons from a non-significant difference favouring home-based cardiac rehabilitation to a very small difference favouring hospital-based rehabilitation (SMD -0.10, 95% CI -0.32 to 0.13). This difference did not reach the sufficiency threshold (failsafe ratio 0.039 < 1) but did meet the criterion for stability (cumulative slope 0.003 < 0.005).
Conclusions: The evidence points to a relatively small effect difference which was stable but not sufficient in terms of the suggested thresholds. Sufficiency should arguably be based on substantive significance and decided by patients. Research on patient preferences should be the priority. Sufficiency and stability measures are useful tools that need to be tested in further case studies.


Background
Any one clinical trial is seldom definitive by itself. Few innovative technologies have a sufficiently large effect to be adopted on the basis of a single trial (or even without one). The FDA normally authorises market access for a new drug or device on the basis of two (or more) confirmatory trials. Evidence usually accumulates in systematic reviews and meta-analyses. Public decision-making bodies such as the National Institute for Health & Clinical Excellence (NICE) rely heavily on these methods. Most systematic reviews conclude that more clinical trials are needed.
Funders of non-commercial trials [1] need to consider both the state of existing knowledge and the contribution any proposed trial would make. Cumulative meta-analysis, which shows the contribution of each trial, has been used since 1981 [2]. In 1995 Lau, Schmid and colleagues used it to show that more than 34,000 patients had been unnecessarily randomised into streptokinase trials for acute myocardial infarction [3].
Methods to aid the interpretation of cumulative meta-analysis have aimed to show when sufficient information has been obtained. Pogue and Yusuf in 1997 proposed using sequential monitoring boundaries with cumulative meta-analysis to assess when evidence is "statistically significant and medically convincing", implying additional research is not needed [4]. Such boundaries mitigate the multiplicity issues that arise from the repeated analyses of a cumulative meta-analysis. Muellerleile et al in 2006 proposed an alternative method which involves calculating indicators of sufficiency and stability. Sufficiency refers to "whether the meta-analytic database adequately demonstrates that a public health intervention works" and stability "refers to the shifts over time in the accruing evidence about whether a public health intervention works" [5]. Muellerleile et al argued that stability (whether an effect has become stable across waves in a cumulative meta-analysis) was not covered by Pogue and Yusuf, and that their method is simpler because it does not require prior specification of the optimum information size (which demands extensive knowledge of the observed results of the accumulated research before undertaking a cumulative meta-analysis). More recently, in 2008, Wetterslev et al developed Pogue and Yusuf's method by recommending ways of calculating the optimum information size for sequential monitoring boundaries [6].
This paper applies cumulative meta-analysis and Muellerleile's indicators of sufficiency and stability to twelve randomised clinical trials comparing centre-based to home-based cardiac rehabilitation between 1985 and 2007. Cardiac rehabilitation was the subject of two systematic reviews [7,8] as well as a large clinical trial in which several of the authors of this paper were involved [9,10]. We were interested in identifying the contribution of that trial to the meta-analysis and in exploring research priorities.

Objectives
1. To apply cumulative meta-analysis to trials of centre-based versus home-based cardiac rehabilitation, and
2. To explore the use of indicators of sufficiency and stability in assessing research priorities.

Identification of studies and data extraction
We included trials identified in two previous systematic reviews [7,8,11] of home-based versus centre-based cardiac rehabilitation. Data extracted from these reviews included details of the trial design, participants, interventions, outcome measures, method of measurement of exercise capacity, results for each arm, and the standardised mean difference and 95% confidence interval. All data were checked against the original articles, and the standardised mean difference and associated standard error were re-calculated to confirm they were correct, with additional details provided by the authors. The country of each trial and its funder were extracted from the original trial articles. The Cochrane review defined home-based cardiac rehabilitation as "a structured programme, with clear objectives for the participants, including monitoring, follow-up, visits, letters, telephone calls from staff, or at least self monitoring diaries" and centre-based cardiac rehabilitation as "a supervised group based programme undertaken in a hospital or community setting such as a sports centre" [7].

Outcome measures
The outcome for our analysis was exercise capacity, the only outcome common to all trials identified in the systematic reviews. As trials reported exercise capacity in different ways, following the Cochrane review we calculated the standardised mean difference in exercise capacity at follow-up for home-based rehabilitation compared to centre-based rehabilitation. Because some of the included studies were relatively small, Hedges' adjusted g, which corrects for small-sample bias, was used [8,12].
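For illustration, Hedges' adjusted g can be computed from per-arm summary data as follows. This is a minimal sketch, not the reviews' actual code; it uses the common approximation to the correction factor rather than the exact gamma-function form, and the function name is ours:

```python
import math

def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    """Standardised mean difference with Hedges' small-sample correction.

    Returns (g, se): the bias-corrected SMD and its approximate standard error.
    """
    # Pooled standard deviation across the two arms
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / sp  # Cohen's d
    # Hedges' correction factor J ~= 1 - 3 / (4 * df - 1), with df = n1 + n2 - 2
    j = 1 - 3 / (4 * (n1 + n2) - 9)
    g = j * d
    # Common large-sample variance approximation for g
    var_g = (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))
    return g, math.sqrt(var_g)
```

Supplying the home-based arm first would match the sign convention used in the results, where negative values favour centre-based rehabilitation.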

Cumulative meta-analysis
Cumulative meta-analysis involved updating the meta-analysis as each trial reported, to show how the evidence evolved over time.

Statistical analysis
As in the Cochrane systematic review, the between-group differences in exercise capacity were pooled using a random effects model because of the significant clinical and statistical heterogeneity across trials. A subgroup analysis examined those RCTs conducted in the UK.
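As a sketch, a minimal DerSimonian-Laird random-effects pooling, re-applied after each trial to form the cumulative meta-analysis, might look like the following (an illustration, not the Stata routines actually used; function names are ours):

```python
import math

def dl_pool(effects, ses):
    """DerSimonian-Laird random-effects pooled estimate with 95% CI."""
    w = [1 / se ** 2 for se in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q and the between-study variance tau^2
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if df > 0 else 0.0
    # Re-weight with tau^2 added to each within-study variance
    w_star = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled

def cumulative_meta(effects, ses):
    """Re-pool after each successive trial (inputs in publication order)."""
    return [dl_pool(effects[:k], ses[:k]) for k in range(1, len(effects) + 1)]
```

Each tuple is the pooled SMD and its 95% confidence limits given the evidence available after that wave.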
Sufficiency was assessed by calculating the failsafe ratio as each new trial joined the cumulative meta-analysis [5]. The failsafe ratio relates the number of studies with null results that would be required to make the meta-analytic result statistically non-significant to the statistical significance (weight) of the evidence already available (see appendix 1 for further details). We used Muellerleile's suggested threshold for sufficiency: a failsafe ratio exceeding 1 implies sufficient evidence that one form of rehabilitation is more effective than the other and that additional research is unlikely to change the weight of the evidence.
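A minimal sketch of the failsafe ratio at each wave, assuming the Rosenthal-based formulation described in appendix 1 (the one-tailed critical value of 1.645 and the 5k + 10 tolerance follow Rosenthal's conventions; the exact constants used by Muellerleile et al may differ):

```python
Z_CRIT = 1.645  # one-tailed critical value for alpha = 0.05

def failsafe_ratios(z_values):
    """Failsafe ratio after each wave of a cumulative meta-analysis.

    z_values: standard normal deviates of the individual trial results,
    in publication order. The failsafe N is the number of null-result
    studies needed to render the combined result non-significant; the
    ratio divides it by Rosenthal's tolerance level of 5k + 10, and a
    ratio exceeding 1 is taken to indicate sufficiency.
    """
    ratios = []
    for k in range(1, len(z_values) + 1):
        z_sum = sum(z_values[:k])
        n_fs = (z_sum / Z_CRIT) ** 2 - k  # Rosenthal's failsafe N
        ratios.append(n_fs / (5 * k + 10))
    return ratios
```

Note that the ratio can be negative when the combined result is itself non-significant, as with the value of -0.045 reported after the inclusion of Dalal.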
Stability was assessed by calculating the cumulative slope of the regression line through the cumulative meta-analysis results over time [5]. Muellerleile's suggested criterion, that the cumulative slope estimate from the linear regression be less than 0.005, was used to decide whether the meta-analysis was stable.
Publication bias was assessed using Begg funnel plots and by testing for funnel plot asymmetry using the Egger weighted regression test.
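The Egger test regresses each trial's standard normal deviate (SMD divided by its standard error) on its precision (the reciprocal of the standard error); an intercept far from zero suggests small-study asymmetry. A minimal sketch follows (not the Stata routine used in the paper; a full test would convert the t statistic to a p-value using the t distribution with k - 2 degrees of freedom):

```python
import math

def egger_test(effects, ses):
    """Egger weighted regression test for funnel plot asymmetry.

    Regresses the standard normal deviate (effect / SE) on precision
    (1 / SE); returns (intercept, t_statistic). The intercept estimates
    small-study bias.
    """
    snd = [y / s for y, s in zip(effects, ses)]
    prec = [1.0 / s for s in ses]
    n = len(snd)
    x_mean = sum(prec) / n
    y_mean = sum(snd) / n
    sxx = sum((x - x_mean) ** 2 for x in prec)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(prec, snd))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance and the standard error of the intercept
    resid = [y - (intercept + slope * x) for x, y in zip(prec, snd)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    se_int = math.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))
    return intercept, (intercept / se_int if se_int > 0 else float("inf"))
```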
All statistical analyses were performed in Stata 10 (StataCorp, College Station, TX, USA).
Results

Most of the trials included patients at low risk of another event following an acute myocardial infarction or revascularisation, excluding those with severe arrhythmias, ischaemia, or heart failure [7]. Two studies included patients with New York Heart Association class 2 or 3 heart failure [18,21]. The trials involved a wide range of cardiac rehabilitation programmes which differed in frequency, duration and session length. The centre-based programmes usually involved supervised exercise on cycles and treadmills. Home-based rehabilitation typically focused on walking, with support from a nurse or exercise specialist by telephone. Seven studies compared comprehensive programmes whereas five included exercise-only programmes (Table 1). A detailed description of the interventions included in each study can be found in Dalal et al 2010.

The twelve studies included 14 comparisons involving 1,557 patients (Table 1). The individual study results (Figure 1) varied, with six favouring centre-based cardiac rehabilitation, six favouring home-based and two favouring neither. The pooled standardised mean difference in exercise capacity was not statistically significant (random effects: SMD -0.11, 95% CI -0.35 to 0.13) (Figure 1). There was evidence of high levels of statistical heterogeneity between the study results. The funnel plot and associated Egger regression test did not indicate evidence of small-study publication bias (p-value = 0.77).

Evolution of evidence - cumulative meta-analysis
The cumulative meta-analysis of the 14 comparisons showed the effect size and confidence interval narrowing over time, with the effect size initially favouring centre-based cardiac rehabilitation and reducing over time towards the line of no difference (figure 2). All trials except Kassaian [18] contributed to the narrowing of the confidence interval. The trend over time highlights Kassaian as a potential outlier, a point also highlighted by the authors of the Cochrane review, who stated there was uncertainty, owing to a lack of detailed reporting, as to whether Kassaian compared hospital-based rehabilitation to usual care rather than to home-based rehabilitation.
Despite BRUM being the largest and latest trial, its contribution to the meta-analysis was limited to reducing the width of the confidence interval without changing the point estimate (the pooled SMD in exercise capacity was -0.11 (95% CI -0.39 to 0.17) before BRUM was published in 2007 and -0.11 (95% CI -0.35 to 0.13) afterwards). The difference of -0.11 is equivalent to approximately -0.34 of a MET. This is consistent with how the point estimate in the cumulative meta-analysis changed with the addition of new studies: BRUM was the first study after which the point estimate did not change (it was stable at -0.11 before and after the inclusion of BRUM). Stability means that further studies are unlikely to change the aggregate picture, in this case of a small difference.

Sufficiency and Stability
The sufficiency indicator in figure 3 highlights two key trials: Kassaian and Dalal. Dalal's significant result favouring home-based rehabilitation compensated for Kassaian's significant result favouring centre-based rehabilitation. The weight of evidence against the null hypothesis was strongest after the publication of Kassaian, although it was neither sufficient (failsafe ratio = 0.321 < 1) nor statistically significant, and it reduced greatly after the inclusion of Dalal (failsafe ratio = -0.045). Sufficiency in figure 3 did not achieve Muellerleile's threshold (failsafe ratio < 1 throughout), which is unsurprising given the lack of statistical significance.

Sensitivity analysis excluding Kassaian
Because of the uncertainty as to whether Kassaian included usual care or home-based rehabilitation, we conducted all analyses with and without this trial (Additional File 1 and Additional File 2). The only difference was that the cumulative meta-analysis became stable earlier (after the inclusion of Gordon-Community) and then unstable after the inclusion of Dalal (Additional File 1 - stability indicator = 0.0063, just greater than 0.005). All trials except Dalal contributed to the narrowing of the confidence interval over time, centring on zero. Both findings are perhaps unsurprising given that Dalal is the included trial with the most extreme result (except for DeBusk-Extended, whose result is slightly more extreme but less precise).

What does this study show?
This is the first attempt to apply these methods (cumulative meta-analysis and indicators of sufficiency and stability) to trials of cardiac rehabilitation. The trials included in this meta-analysis all contributed to reducing the difference and the uncertainty in exercise capacity between home-based and centre-based rehabilitation. Kassaian and Dalal were the most influential because they had the most extreme results. The standardised mean difference continued to favour centre-based over home-based rehabilitation but the size of that difference narrowed over time, from 0.27 to 0.11. The confidence interval remained wide enough to include small to moderate differences [24]. The decision by the NIHR HTA programme to commission the BRUM trial in 2000 appears justified, given that the evidence available at the time on the relative benefits of centre-based versus home-based cardiac rehabilitation was neither stable nor sufficient and included the possibility of large effect sizes (±0.8 by Cohen's criteria [24]; figure 2, cumulative SMD in 2000: -0.27, 95% CI -0.81 to 0.26). Of the six randomised controlled trials involving 517 patients available at that time (figure 1, studies before 2001) [14,17,18,20,22], only one had been conducted in the UK [14]. By 2007, just before BRUM was published, a further six similar trials had reported from six different countries [13,15,16,19,21,23]: the USA, Canada, Italy, Turkey, the UK and China. The HTA programme could not have known about these trials, since registration of clinical trials only commenced on a voluntary basis in 2004 [25]. However, even if these trials had been known of, it is not clear that they could have substituted for BRUM, owing to differences such as the form of home-based rehabilitation [7], and trial size and duration.

Comparisons with other studies
The indicators of sufficiency and stability presented here have previously been applied to five other case studies [5,26]. Two of these found the results were sufficient and stable. In the other three, the results were similar to ours: stable but not sufficient, with the estimate centring around the null effect and accumulating evidence simply narrowing the confidence interval around that null effect. In these three cases the authors concluded that carrying out further research in the area would be tantamount to "flogging a dead horse", with further studies unlikely to change the aggregate picture of a small effect.

Limitations
The methodological focus of this paper was the assessment of sufficiency and stability indicators using a case study in cardiac rehabilitation. The case study had two main limitations. The first was the heterogeneity of the types of home-based and centre-based rehabilitation included within the trials. However, these trials were combined in a meta-analysis of exercise capacity in exactly the same way as in the Cochrane systematic review, and a meta-analysis confined to the three UK trials which used the Heart Manual [27] for home-based rehabilitation [9,14,15] included too few trials for the sufficiency and stability analysis. The second limitation was the focus on a single outcome, exercise capacity, the only outcome common to all trials identified in the Cochrane review. However, exercise capacity is arguably the most plausible and key outcome for rehabilitation trials. Mortality data were only available from four studies and are likely to be confounded by drug treatment and uptake. The results from a single case study have limited generalisability. More research is needed to better understand these indicators and the usefulness of the sufficiency indicator when applied to superiority comparisons showing differences close to zero. More case studies, simulations and Bayesian methods may be useful for this.

Unanswered questions and future research
As all the trials included in this analysis were designed as superiority trials, we cannot conclude that home-based and centre-based cardiac rehabilitation are equivalent. However, the above analyses show that the difference in effect was relatively small and stable. Other factors, such as patient preferences, have been shown to be important [15]. The key question is what effect patients would consider worthwhile. Is the standardised mean difference of 0.11 sufficient for patients to choose one form of rehabilitation over the other? Only a study of patient preferences could answer this question.

Conclusions
The methods used here seem promising and have implications for researchers, treating clinicians, payers, funders, sponsors, editors, ethics boards, patients, and the public. Sufficiency and stability measures can be calculated simply and shown graphically on a cumulative meta-analysis figure. They provide useful tools for considering whether further research is needed and the impact individual trials have had on the meta-analysis. They are relatively straightforward to calculate but not yet widely used. As with all meta-analyses, only published studies available at the time searches are conducted can be included. A policy maker or funder wanting to use these methods to assess research priorities and make funding decisions would need to identify and consider ongoing studies. The thresholds suggested by Muellerleile et al are arbitrary and require further testing. In particular, rather than defining sufficiency mathematically with the focus on statistical significance, the benchmark should be based on substantive significance [28] set by patients' preferences. More case studies and further work to develop the sufficiency indicator would be helpful.

Appendix 1 -Calculation of the failsafe ratio for assessment of sufficiency
The failsafe ratio compares the number of studies with null results that would be required to make the meta-analytic result statistically non-significant (derived from the sum of the Z values of the individual study results) with a tolerance level for that number. It was derived by Muellerleile and Mullen from Rosenthal's file drawer analysis [5,29]. It provides information about the amount of evidence against the null hypothesis and whether that weight of evidence is sufficient and unlikely to be changed by additional research. It is calculated as follows:
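The formula itself does not appear in the text above. A plausible reconstruction from Rosenthal's file drawer analysis, consistent with the description above (the exact constants used by Muellerleile and Mullen are an assumption here), is:

```latex
% Rosenthal's failsafe N: the number of unpublished null-result studies
% needed to pull the combined one-tailed result below significance
% (z_{\alpha} = 1.645 for \alpha = 0.05, one-tailed), given k studies
% with individual standard normal deviates Z_i:
N_{fs} = \left( \frac{\sum_{i=1}^{k} Z_i}{z_{\alpha}} \right)^{2} - k

% Failsafe ratio against Rosenthal's tolerance level of 5k + 10;
% a ratio exceeding 1 is taken to indicate sufficiency:
\text{failsafe ratio} = \frac{N_{fs}}{5k + 10}
```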