Stroke aetiological classification reliability and effect on trial sample size: systematic review, meta-analysis and statistical modelling

Background: Inter-observer variability in stroke aetiological classification may have an effect on trial power and estimation of treatment effect. We modelled the effect of misclassification on required sample size in a hypothetical cardioembolic (CE) stroke trial.
Methods: We performed a systematic review to quantify the reliability (inter-observer variability) of various stroke aetiological classification systems. We then modelled the effect of this misclassification in a hypothetical trial of anticoagulant in CE stroke contaminated by patients with non-cardioembolic (non-CE) stroke aetiology. Rates of misclassification were based on the summary reliability estimates from our systematic review. We randomly sampled data from previous acute trials in CE and non-CE participants, using the Virtual International Stroke Trials Archive. We used bootstrapping to model the effect of varying misclassification rates on sample size required to detect a between-group treatment effect across 5000 permutations. We described outcomes in terms of survival and stroke recurrence censored at 90 days.
Results: From 4655 titles, we found 14 articles describing three stroke classification systems. The inter-observer reliability of the classification systems varied from ‘fair’ to ‘very good’ and suggested misclassification rates of 5% and 20% for our modelling. The hypothetical trial, with 80% power and alpha 0.05, was able to show a difference in survival between anticoagulant and antiplatelet in CE with a sample size of 198 in both trial arms. Contamination of both arms with 5% misclassified participants inflated the required sample size to 237 and with 20% misclassification inflated the required sample size to 352, for equivalent trial power. For an outcome of stroke recurrence using the same data, base-case estimated sample size for 80% power and alpha 0.05 was n = 502 in each arm, increasing to 605 at 5% contamination and 973 at 20% contamination.
Conclusions: Stroke aetiological classification systems suffer from inter-observer variability, and the resulting misclassification may limit trial power.
Trial registration: Protocol available at reviewregistry540.
Electronic supplementary material: The online version of this article (10.1186/s13063-019-3222-x) contains supplementary material, which is available to authorized users.


Background
Stroke is a syndrome with heterogeneous aetiologies. Grouping these aetiologies together has been effective when developing acute interventions such as intravenous thrombolysis, but improved access to imaging, rhythm monitoring and biomarkers may support a more individualised approach to treatment and research. We increasingly acknowledge the relevance of aetiology as a determinant of prognosis, as a risk factor for recurrence and as a potential treatment effect moderator.
Various classification systems have been developed to define stroke aetiology using clinical features and the results of ancillary investigations. These aetiological classification tools attempt to categorise stroke, and the subtypes usually include cardioembolism, large vessel atheroma and small vessel disease. Robust classification of aetiology is essential to guide treatment decisions when the optimal pharmacological treatment differs between aetiological groups. This situation is seen with prevention of cardioembolic (CE) versus large vessel stroke, where the differing pathologies require differing treatment strategies.
The reliability of a classification tool is a measure of the degree to which results are reproducible when repeated observations are made, either by the same physician on repeated assessments (intra-observer reliability) or between physicians (inter-observer reliability). Several factors may impair classification reliability, including inherent properties of the classification system itself (such as its complexity), properties of the patient population (such as the spectrum of disease aetiology), the quality and completeness of ancillary data and the expertise of the physicians using the classification algorithm.
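As an illustration of how inter-observer reliability is commonly quantified, the sketch below computes Cohen's kappa for two raters who classify the same patients. The ratings and the subtype labels (CE, LAA, SVD) are hypothetical and are not taken from any study in this review.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of cases given the same label by both raters.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical ratings: CE = cardioembolic, LAA = large artery, SVD = small vessel.
rater_1 = ["CE", "CE", "LAA", "SVD", "CE", "LAA", "SVD", "SVD", "CE", "LAA"]
rater_2 = ["CE", "LAA", "LAA", "SVD", "CE", "LAA", "SVD", "CE", "CE", "LAA"]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 8 of 10 patients, but kappa is lower than 0.8 because part of that agreement is expected by chance.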
Impaired reliability, regardless of cause, will result in misclassification error. This misclassification is problematic for a number of reasons. It will lead to potentially biased estimates of the prevalence of disease aetiologies and misdirected treatment decisions. Misclassification error may compromise the statistical power and efficiency of research studies, inflating their costs and reducing sensitivity. Finally, misclassification may undermine the estimate of effect size and reduce apparent efficacy and cost effectiveness [1,2].
All of these issues are particularly pertinent to the emerging literature on embolic stroke of undetermined source (ESUS) [3,4]. Ongoing and completed large trials in this area were based on a trial entry classification paradigm where patients with atrial fibrillation (AF) or ESUS receive anticoagulation and patients with other, non-cardioembolic (non-CE) aetiologies receive antiplatelet treatment. Baseline misclassification of stroke subtype and subsequent misallocation to treatment arms could compromise the power of these trials to demonstrate utility. For example, if a patient with an arterial atheromatous cause of stroke is erroneously recruited into an ESUS treatment arm, they will not benefit from the treatment given, and this will dilute the power to see a treatment effect.
We designed a programme of work to model the potential effect of aetiological misclassification on sample size in a hypothetical trial of anticoagulant in CE stroke. We first describe the reliability (inter-observer variability) of stroke classification systems and then model the effect of this misclassification on sample size required to show the effect of chosen treatments in terms of survival and stroke recurrence.

Methods
We used an iterative approach to our analyses. First, we performed a systematic review and meta-analysis to quantify potential misclassification rates across different stroke classification systems. We then performed a scoping analysis, using aggregate data from published trials to estimate the potential effect of misclassification. Finally, we used individual patient-level data to model the impact of misclassification on a hypothetical trial.

Systematic review
We followed, where appropriate, Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) best practice guidance for design, conduct and reporting of our systematic review. We worked according to a pre-defined protocol (Research Registry Unique Identifying Number reviewregistry540, accessible at https://www.researchregistry.com/). Our primary aim for the systematic review was to describe the inter-observer reliability of stroke classification systems. The metric of interest was reliability of the classification tool, i.e. agreement between observers. We did not pre-define the classification scales of interest.
We devised a sensitive search strategy using validated search terms across multidisciplinary electronic databases, from database inception to December 2018 inclusive (Additional file 1: Table S1). We used citation searching (backwards searching) and assessed all articles that had cited the index article (forward searching). There were no restrictions relating to date of publication, the number of participants or assessors. Only papers published in peer-reviewed, English language scientific journals were considered.
Titles and abstracts generated from the electronic database searches were screened for relevance; irrelevant titles and abstracts were excluded, and full-text articles were inspected to determine eligibility. Data from studies meeting our inclusion criteria were extracted to a proforma. All aspects of the title searching, assessment and data extraction were performed by two independent researchers. Decisions were made by consensus with recourse to a third arbitrator as necessary. No authors were contacted for the study.
We assessed risk of bias using Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [5]. The main characteristics analysed were stroke classification system, stroke population, assessor population, sample size calculation, sampling methods, blinding and reporting of reliability with a corresponding measure of uncertainty.
We used the summary data from the systematic review to inform estimates of potential rates of misclassification in a trial that recruited based on stroke aetiology. We used 'best case' and 'worst case' reliability estimates from the systematic review. There is no agreed approach for converting kappa values (our summary agreement metric from meta-analysis) to a measure of percentage agreement/disagreement (the measure needed for our modelling exercise). We used a conversion that has been previously described [8]. We opted for cautious estimates of misclassification (disagreement) based on the conversion (5% and 20%). We ensured that these values were broadly in keeping with the data presented in papers that describe reliability in terms of percentage agreement as well as kappa.
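To illustrate why no single conversion from kappa to percentage agreement exists, the sketch below inverts the standard definition of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. This is not the conversion described in [8]; the chance-agreement values are assumptions chosen only to show how strongly the result depends on them.

```python
def observed_agreement(kappa, chance_agreement):
    """Invert kappa = (p_o - p_e) / (1 - p_e) to recover p_o."""
    if not 0.0 <= chance_agreement < 1.0:
        raise ValueError("chance agreement must be in [0, 1)")
    return chance_agreement + kappa * (1.0 - chance_agreement)

def disagreement_rate(kappa, chance_agreement):
    """Proportion of cases on which raters disagree: 1 - p_o."""
    return 1.0 - observed_agreement(kappa, chance_agreement)

# Kappa values reported in this review; chance agreement values are assumed.
for kappa in (0.53, 0.81):
    for p_chance in (0.2, 0.5, 0.7):
        rate = disagreement_rate(kappa, p_chance)
        print(f"kappa={kappa:.2f}, assumed chance agreement={p_chance:.1f} "
              f"-> disagreement={rate:.1%}")
```

The same kappa maps to quite different disagreement rates as the assumed chance agreement changes, which is why any conversion has to be treated as approximate.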

Effect of stroke classification on trial sample size: aggregate analyses
We used our summary estimates of misclassification to inform a series of statistical models. As an initial scoping exercise we created a hypothetical study of anticoagulant versus antiplatelet in CE stroke. We used survival outcome data from historical stroke trials that included the use of anticoagulant and antiplatelet treatment in CE and non-CE stroke, to give proportional survival in each [9,10]. We then replaced a proportion of the CE patients with non-CE patients in each treatment arm. We used the Pearson chi-square test for proportion difference, first assuming perfect classification rates and then factoring in differing rates of misclassification. For a given rate of misclassification, we substituted that proportion of non-CE stroke patients into the CE treatment arm and vice versa (Fig. 1).
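The aggregate approach can be sketched as follows. This is a minimal illustration, not the authors' calculation: it uses the normal-approximation sample size formula for comparing two proportions, and every survival proportion in it is hypothetical. The function names `contaminated_rate` and `n_per_arm` are introduced here for illustration only.

```python
from math import ceil
from statistics import NormalDist

def contaminated_rate(rate_ce, rate_non_ce, misclass_rate):
    """Expected outcome proportion in an arm where a fraction of
    patients are misclassified non-CE cases."""
    return (1 - misclass_rate) * rate_ce + misclass_rate * rate_non_ce

def n_per_arm(p_treat, p_ctrl, alpha=0.05, power=0.80):
    """Per-arm sample size to detect p_treat vs p_ctrl (two-sided test,
    normal approximation, unpooled variance)."""
    if p_treat == p_ctrl:
        raise ValueError("no difference to detect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_treat * (1 - p_treat) + p_ctrl * (1 - p_ctrl)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treat - p_ctrl) ** 2)

# Hypothetical survival proportions (NOT from the cited trials):
# CE: anticoagulant 0.90, antiplatelet 0.80; non-CE: 0.88 on either drug.
sizes = {
    m: n_per_arm(contaminated_rate(0.90, 0.88, m),
                 contaminated_rate(0.80, 0.88, m))
    for m in (0.0, 0.05, 0.20)
}
for m, n in sizes.items():
    print(f"misclassification {m:.0%}: n per arm = {n}")
```

Because contamination pulls the two arms' outcome proportions towards each other, the required sample size rises as the misclassification rate increases.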

Effect of stroke classification on trial sample size: individual patient-level analyses
We then explored the effect of misclassification on a hypothetical trial involving patients with CE stroke using individual patient-level data. We used data from the Virtual International Stroke Trials Archive (VISTA), http://www.virtualtrialsarchives.org/vista/, as the base-case data to inform our models. VISTA is a not-for-profit repository for stroke trial data, containing study quality, anonymised, individual patient-level data on thousands of participants [11,12]. These data have been used to investigate novel hypotheses, including analyses of stroke assessment scale properties [13,14]. We tested a hypothetical misclassification scenario: a trial that assesses the efficacy of an oral anticoagulant versus antiplatelet in patients with CE stroke contaminated by patients with non-CE stroke (aetiological misclassification).
From VISTA, we selected populations of CE stroke treated with anticoagulant or antiplatelet agent, and populations of non-CE stroke treated with anticoagulant or antiplatelet. We assumed that patients with known AF and neither large nor small vessel disease were CE. We assumed that patients with no AF and proven large or small vessel disease aetiology of stroke were non-CE. We calculated an initial sample size for outcomes of death and stroke recurrence using aggregate VISTA data from CE-anticoagulant and CE-antiplatelet groups.
We then used bootstrapping simulations with random repetition sampling to create models to our specified sample size, n = 7000, using the RAND function in Matrix Laboratory (MATLAB) software. In the hypothetical 'treatment' arm, we randomly selected the relevant number of correctly classified patients from the CE cohort treated with anticoagulation and then randomly selected the corresponding number of incorrectly classified patients from the non-CE cohort treated with anticoagulant. We created a 'control' arm using the same process, sampling from CE treated with antiplatelet and contaminating with non-CE treated with antiplatelet. We summarised outcomes across 5000 permutations.
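The resampling procedure described above can be sketched in Python (the original analysis used MATLAB). This is an illustration under stated assumptions: the four patient pools are synthetic rather than VISTA data, the outcome is coded 1 for death and 0 for survival, and the test is a two-proportion z-test. The function names are hypothetical.

```python
import random
from statistics import NormalDist

def draw_arm(ce_pool, non_ce_pool, n, misclass_rate, rng):
    """Resample one trial arm with replacement, contaminated by non-CE patients."""
    n_wrong = round(n * misclass_rate)  # number of misclassified (non-CE) patients
    return rng.choices(ce_pool, k=n - n_wrong) + rng.choices(non_ce_pool, k=n_wrong)

def p_value(events_a, events_b, n):
    """Two-sided two-proportion z-test for equal arm sizes (equivalent to an
    uncorrected Pearson chi-square test on the 2x2 table)."""
    pooled = (events_a + events_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return 1.0
    z = abs(events_a - events_b) / n / se
    return 2 * (1 - NormalDist().cdf(z))

def estimated_power(pools, n, misclass_rate, n_boot=1000, alpha=0.05, seed=1):
    """Share of bootstrap replicates in which the arms differ at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        treat = draw_arm(pools["ce_ac"], pools["nonce_ac"], n, misclass_rate, rng)
        ctrl = draw_arm(pools["ce_ap"], pools["nonce_ap"], n, misclass_rate, rng)
        hits += p_value(sum(treat), sum(ctrl), n) < alpha
    return hits / n_boot

# Synthetic pools (1 = died, 0 = survived); sizes and rates are assumptions.
pools = {
    "ce_ac": [1] * 30 + [0] * 270,      # CE on anticoagulant, 10% deaths
    "ce_ap": [1] * 40 + [0] * 160,      # CE on antiplatelet, 20% deaths
    "nonce_ac": [1] * 48 + [0] * 352,   # non-CE on anticoagulant, 12% deaths
    "nonce_ap": [1] * 120 + [0] * 880,  # non-CE on antiplatelet, 12% deaths
}
power_clean = estimated_power(pools, n=200, misclass_rate=0.0)
power_contaminated = estimated_power(pools, n=200, misclass_rate=0.20)
print(power_clean, power_contaminated)
```

Repeating this over a grid of arm sizes and taking the smallest size that reaches the target power would give a simulation-based sample size for each misclassification rate.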

Results
We were able to describe reliability for classification scales in general and at the level of individual aetiology. For the different TOAST classification systems, study-level inter-observer reliability varied from 'fair' to 'very good' (Table 2). Across the eight studies where data were suitable for pooled analysis, overall kappa was moderate (κ = 0.53; 95% confidence interval [CI] 0.49-0.56). For the 'classic' version of TOAST, pooled reliability was also moderate (κ = 0.55; 95% CI 0.51-0.59) (Additional file 1: Figures S2 and S3).
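For readers unfamiliar with pooling, the sketch below shows one common way to combine study-level kappa values: fixed-effect inverse-variance weighting, with standard errors recovered from reported 95% confidence intervals. The pooling model actually used is not stated in this section, so the choice of a fixed-effect model is an assumption, and the study-level values are hypothetical.

```python
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Approximate standard error from a symmetric confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)

def pool_fixed_effect(estimates, std_errors, level=0.95):
    """Inverse-variance weighted mean with its confidence interval."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * k for w, k in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se

# Hypothetical study-level results: (kappa, CI lower, CI upper).
studies = [(0.42, 0.30, 0.54), (0.61, 0.52, 0.70), (0.55, 0.47, 0.63)]
kappas = [k for k, _, _ in studies]
errors = [se_from_ci(lo, hi) for _, lo, hi in studies]
pooled, ci_low, ci_high = pool_fixed_effect(kappas, errors)
print(f"pooled kappa = {pooled:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

More precise studies (narrower intervals) receive more weight, so the pooled value sits closer to them than a simple average would.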
For the different subtypes of CCS classifications, study-level inter-observer reliability ranged from 'good' to 'very good'. The inter-observer reliability of different subtypes of CCS suggested that 5 subtype CCS and 8 subtype CCS had very good overall reliability and 16 subtype CCS had good overall reliability (Table 2). Overall reliability for the 5 major CCS subtypes was good (κ = 0.81; 95% CI 0.80-0.83) (Table 2 and Additional file 1: Figures S4 and S5).
In the case of ASCO, inter-observer variability was described according to each potential phenotype and varied from perfect (κ = 1) for the 'C' (cardiac) phenotype to good (κ = 0.66) for the 'O' (other) phenotype (Table 2).
Based on these summary reliability measures, we estimated proportional misclassification extremes at 5% and 20%, representing the approximate misclassification that may be seen with the most favourable (seen with CCS and ASCO systems) and the least favourable (TOAST) reliability estimates.

Effect of stroke classification on trial sample size: aggregate analyses
In our initial scoping analyses, using aggregate data from historical trials of anticoagulant in CE and non-CE, we started with a zero misclassification, base-case scenario trial of n = 392 to detect an outcome difference in treatment effect in terms of survival (power = 0.8, alpha 0.05). With a misclassification of 5% the required sample size to demonstrate the same effect was n = 444 (13% increase). With a misclassification of 20% the required sample size to demonstrate the same effect was n = 663 (69% increase). Detailed analysis results are included in Additional file 1: Supplemental Results.

Effect of stroke classification on trial sample size: individual patient-level analyses
We obtained data for 2066 patients with acute ischaemic stroke from VISTA, of whom 514 had CE on baseline assessment (207 [40%] were on antiplatelet treatment, the remainder on anticoagulant) and 1545 had non-CE stroke (1171 [76%] were on antiplatelet treatment, the remainder on anticoagulant). The baseline characteristics of these patients are shown in Additional file 1: Table S3.
For an outcome of death at first follow-up, based on proportions seen in the aggregate data, the base-case estimate of required sample size was n = 198 for both arms of the trial to detect a between-group difference (power = 0.8, alpha 0.05). The required sample size to demonstrate a statistically significant treatment effect increased to n = 237 in each arm (20% increase) at 5% misclassification and n = 352 (78% increase) at 20% misclassification. For an outcome of stroke recurrence using these same data, the base-case estimated sample size was n = 502, increasing to 605 (21% increase) at 5% contamination and 973 (94% increase) at 20% contamination for each arm of the trial (Table 3).
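The percentage increases quoted in the Results follow directly from the reported sample sizes. The short check below recomputes them from the figures given in the text; only the helper name `percent_increase` is introduced here.

```python
def percent_increase(base, inflated):
    """Relative increase of `inflated` over `base`, in percent."""
    return 100 * (inflated - base) / base

# (base case, 5% misclassification, 20% misclassification), as reported in the text.
reported = {
    "aggregate survival": (392, 444, 663),
    "patient-level death": (198, 237, 352),
    "patient-level stroke recurrence": (502, 605, 973),
}
increases = {
    name: (round(percent_increase(base, n5)), round(percent_increase(base, n20)))
    for name, (base, n5, n20) in reported.items()
}
for name, (inc5, inc20) in increases.items():
    print(f"{name}: +{inc5}% at 5%, +{inc20}% at 20%")
# aggregate survival: +13% at 5%, +69% at 20%
# patient-level death: +20% at 5%, +78% at 20%
# patient-level stroke recurrence: +21% at 5%, +94% at 20%
```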
A 20% contamination of patients with non-CE stroke in a hypothetical trial of anticoagulant versus antiplatelet treatment in patients with CE stroke would underestimate the effect of anticoagulant treatment by at least 10% for the outcome of death and 15% for the outcome of recurrent stroke, compared to a trial without any contamination. (See Additional file 1: Supplemental Results.)
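The dilution of treatment effect can be illustrated with a simplified worked relation. This is a sketch under stated assumptions, not the model used in the simulation: $m$ is the misclassification proportion (assumed equal in both arms), $\Delta_{\mathrm{CE}}$ and $\Delta_{\mathrm{non}}$ are the true between-treatment differences in outcome proportion among CE and non-CE patients, and outcome variance is treated as unchanged by contamination.

```latex
\begin{align}
  \Delta_{\mathrm{obs}} &= (1 - m)\,\Delta_{\mathrm{CE}} + m\,\Delta_{\mathrm{non}} \\
  \text{if } \Delta_{\mathrm{non}} = 0:\quad
  \Delta_{\mathrm{obs}} &= (1 - m)\,\Delta_{\mathrm{CE}} \\
  n \propto \frac{1}{\Delta_{\mathrm{obs}}^{2}}
  \;\Rightarrow\;
  \frac{n_m}{n_0} &\approx \frac{1}{(1 - m)^{2}} \\
  m = 0.05:\ \frac{n_m}{n_0} \approx 1.11,
  \qquad
  m = 0.20:\ \frac{n_m}{n_0} &\approx 1.56
\end{align}
```

The inflation factors from the patient-level simulation are larger than these values, so this simplification should be read only as intuition for why contamination attenuates the observed effect and raises the required sample size.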

Discussion
Our analyses have confirmed that even advanced classification systems for stroke aetiology harbour residual inter-observer variability of at least 5% and potentially much greater. Based on this variability in classification, we have shown that the resulting misclassification contributes a sample size penalty of at least 20% and potential incorrect estimation of the treatment effect size by at least 10%. It seems plausible that stroke trials targeted at a particular aetiological subgroup may have been underpowered to demonstrate a treatment effect. To take a high-profile example, the New Approach riVaroxaban Inhibition of Factor Xa in a Global trial versus Aspirin to prevenT Embolism in Embolic Stroke of Undetermined Source (NAVIGATE-ESUS) study was terminated early because of futility and excess bleeding on rivaroxaban [4]. One can speculate whether this neutral result was at least partly due to the study being underpowered as a result of baseline misclassification.
Our results align with previous research looking at the effect of misclassification of treatment outcomes in stroke trials [1]. The modified Rankin Scale (mRS) is the most commonly used outcome measure in stroke research [29]. Typically, mRS assessment is based on a clinician's rating of a patient interview, and inter-observer variability is common [30]. Meta-analysis suggests that mRS assessments have an overall reliability of κ = 0.62 (weighted κ = 0.9) [30], but this may be less (κ = 0.25) in multicentre studies [31]. However, increasing the reliability of the mRS assessments by central adjudication (including across international centres) has been shown to significantly reduce the required trial sample size and to increase trial power [1]. Given the substantial and increasing per patient cost of randomisation into a clinical trial, funders, trialists and industry have been keen to limit the potential effect of misclassification on required sample size. Training in mRS assessment is now mandatory for many trials, and several contemporary international stroke trials are using a system of central expert adjudication of mRS assessments [32-35]. Comparable approaches could be employed to limit aetiological misclassification, with the anticipation of greater trial efficiency. There are many other aspects of trials within stroke and in other disease areas where misclassification could compromise power. More research around the properties of assessment tools could be useful to help us understand the potential impact on study results. Research in stroke suggests that poor reliability is not inherent and some assessment tools have greater reliability than others [36].

Strengths and limitations of the study
We present a novel analysis on an increasingly pertinent methodological issue in clinical trials. Our estimates of misclassification are based on a comprehensive review of the literature following best practice in systematic review and meta-analysis. Our modelling was based on individual patient-level data from completed clinical trials.
There are limitations to this analysis. Our misclassification modelling analysis makes a number of assumptions. For example, we assume that the historical event data are still relevant to contemporary patients with acute stroke. We also assume 'perfect' aetiological classification within the VISTA data that inform our modelling. Converting kappa values from our meta-analyses to rates of misclassification comes with certain caveats. We used an accepted 'rule of thumb' [8], but this conversion is imperfect. Based on these criteria, TOAST agreement could be anything from 30 to 80%. We opted for cautious estimates of misclassification (disagreement), although arguably our misclassification rates could have been much higher. In all our analyses, we assume that the misclassification described and modelled above affects the cardioembolic (CE) and the non-CE groups. Finally, it is possible that misclassification may affect one treatment arm more than the other. These limitations are all likely to lead to underestimation of any deleterious effects of misclassification on sample size, and so we believe our message remains valid.
Our analysis modelled outcomes of survival and stroke recurrence, as these are the endpoints most commonly described in anticoagulant trials. We acknowledge that other outcomes may also be relevant and subject to misclassification effects, for example functional recovery or non-fatal adverse events.
We use anticoagulation in the context of AF as a model, as these were the data available. It would be unwise to directly extrapolate these data to ESUS trials. The natural history, effect sizes and adverse event risks will differ between ESUS and the proven AF used in our models, but our analysis is designed to be illustrative of the potential statistical effect rather than a direct comment on ESUS trials.

Implications for future practice and research
The stroke classification systems we studied are imperfect, but defining an underlying aetiology for stroke may still be important for personalised stroke treatment and research. Our findings should not deter clinicians and trialists from trying to classify patients; rather, we believe that strategies to improve the reliability of aetiological classification are needed. Some may argue that the move towards precision in therapeutics is less relevant in stroke, as recurrence is not unique to the aetiology of the index event. In the secondary stroke prevention subgroups of the non-vitamin K antagonist anticoagulant trials, 50% of the patients had atherosclerosis in addition to AF, in the form of stable coronary artery disease, peripheral artery disease or plaques in the carotid arteries [37-39]. In addition, we know from the long-term electrocardiographic (ECG) monitoring trials that 10-15% of patients per year develop silent paroxysmal AF [40,41]. However, trials of aetiologically specific intervention continue, and it seems sensible to minimise and account for the effect of any aetiological misclassification. Perhaps future trials should have the aetiological classification done by independent physicians at study entry, halfway through the study and at the end of the study.
Previous work on mRS assessments showed that introduction of training, structured assessment and consensus review can reduce misclassification in outcome adjudication [1]. It seems plausible that the same may be true for aetiological classification. Further work to quantify misclassification in contemporary stroke practice is needed. Our work offers some rough estimates of misclassification effect that stroke trialists could factor into the design of trials and estimates of required sample size.

Conclusions
Aetiological classification systems are associated with inter-observer variability. The resulting misclassification of stroke aetiology may reduce trial power to adequately identify effective stroke prevention therapy, reduce the effect size and increase associated trial costs.

Additional file
Additional file 1: Table S1. Search strategy. Table S2. Reporting quality and risk of bias assessment of the studies. Table S3. Baseline characteristics of the VISTA cohort used to analyse the effect of stroke classification on trial sample size. Figure S1. PRISMA 2009 flow diagram: literature search strategy. Figure S2. Forest plot describing inter-observer reliability (κ) across studies with different versions of TOAST. CI confidence interval. Figure S3. Forest plot describing inter-observer reliability (κ) across the studies for classic version of TOAST. CI confidence interval. Figure S4. Forest plot comparing inter-observer reliability (κ) across the studies for CCS subtypes. CI confidence interval. Figure S5. Forest plot describing inter-observer reliability (κ) across studies for CCS 5 subtype.

Availability of data and materials
The VISTA datasets analysed during the current study are not publicly available as per VISTA's standard operating procedure, but they are available from the corresponding author on reasonable request, subject to approval by the VISTA Steering Committee.