Clustering in surgical trials - database of intracluster correlations

Background: Randomised trials evaluating surgical interventions are often designed and analysed as if the outcome of individual patients is independent of the surgeon providing the intervention. There is reason to expect that outcomes for patients treated by the same surgeon tend to be more similar than those under the care of another surgeon due to previous experience, individual practice, training, and infrastructure. Such a phenomenon is referred to as the clustering effect; it potentially impacts on the design and analysis adopted and thereby the required sample size. The aim of this work was to inform trial design by quantifying clustering effects (at both centre and surgeon level) for various outcomes using a database of surgical trials.
Methods: Intracluster correlation coefficients (ICCs) were calculated from a set of 10 multicentre surgical trials for a range of outcomes and different time points, for clustering at both the centre and surgeon level.
Results: ICCs were calculated for 198 outcomes across the 10 trials at both centre and surgeon cluster levels. The number of cases varied from 138 to 1370 across the trials. The median (range) average cluster size was 32 (9 to 51) and 6 (3 to 30) for centre and surgeon levels respectively. ICC estimates varied substantially between outcome types, though uncertainty around individual ICC estimates was substantial, as reflected in generally wide confidence intervals.
Conclusions: This database of surgical trials provides trialists with valuable information on how to design surgical trials. Our data suggest clustering of outcome is more of an issue than has been previously acknowledged. We anticipate that over time the addition of ICCs from further surgical trial datasets to our database will further inform the design of surgical trials.


Background
Patients under the care of the same surgeon will be influenced in a similar manner due to the surgeon's practice, skill and experience [1]. Outcomes for those treated by the same surgeon tend to be more similar than those under the care of another surgeon due to previous experience, individual practice, training, and infrastructure [2]. This phenomenon is referred to as the clustering effect. While the impact of clustering of outcome has been widely acknowledged for cluster randomised controlled trials (C-RCTs) for some time [1,3,4], its potential impact upon individually randomised controlled trials (RCTs) evaluating therapist-dependent interventions, such as surgical interventions, has only been highlighted more recently [1,5]. Models which allow for clustering have been used to analyse surgical trials, though this is not commonly done [2,6-8].
Clustering has implications for the required sample size of an RCT; the impact depends upon the design and analysis adopted. For example, where an RCT comparing two surgical interventions adopts an expertise-based trial design, in which each participating surgeon delivers only one of the two surgical interventions under evaluation, clustering is incorporated into the design at the surgeon level in a similar manner to a C-RCT [9]. Surgical versus medical trials (e.g. laparoscopic surgery versus medical management [10]) have naturally been conducted using an expertise-based design where the relevant health professionals only deliver one or the other of the interventions [11]. Such a design, other factors being equal, potentially leads to a relative loss of precision and an increase in the required sample size. In contrast, the adoption of a stratified within-surgeon design can lead to a reduction in the sample size [12]. A further trial design option is a hybrid of these two approaches, such as a surgeon preference trial, where each participating surgeon opts to deliver either one of the two interventions or both. A variety of statistical methods which allow for clustering are available, including both fixed and random effects approaches [13].
Recruitment of participants to an RCT across multiple centres (multicentre RCT) is commonly adopted to increase both generalisability and the rate of recruitment. Similar reasons, even if only implicitly recognised, lead to the participation of multiple surgeons within and across centres. In multicentre surgical trials, as in other therapist-dependent trials, clustering could in principle be addressed at the centre and/or surgeon (therapist) level. A design consideration is whether the randomisation and the analysis should account for clustering at the centre and/or surgeon level in a multicentre surgical RCT.
The statistical measure of the clustering between participants under the care of a surgeon or centre is known as the intracluster correlation coefficient [14], or ICC. The ICC can be defined as the proportion of the total variation in the participant outcome that can be attributed to the difference between clusters (e.g. surgeon) and is often represented by ρ. The magnitude of clustering could be influenced by a number of factors such as cluster type (e.g. centre), setting and type of outcome and the time since receiving the intervention [15].
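The definition above can be written compactly as a ratio of variances. The symbols for the between-cluster variance ($\sigma_b^2$) and the within-cluster variance ($\sigma_w^2$) are notation assumed here for illustration and do not appear in the original text.

```latex
\rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
```

Under this form, ρ is zero when there is no variation between clusters and approaches one when almost all of the variation in outcome lies between clusters.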
Where a clustering effect exists, this has direct implications for the sample size calculations and the statistical analysis that is required. Standard sample size calculations and analysis techniques assume that the outcome for individual participants will be independent and consequently they will incorrectly estimate (typically underestimate) the true sample size required to detect a pre-specified difference with the desired precision and power. Correspondingly, statistical analyses which ignore the presence of clustering will likely result in overly precise, and potentially misleading, results. Whereas the impact is typically of an inflation in the sample size in the case of C-RCTs, for individually randomised trials the required sample size may be reduced [12].
Trialists have little data upon which to assess the impact of clustering and appropriately modify trial design. Quantifying the clustering effect would aid the design of surgical trials [4]. There is, however, little information available on the likely magnitude of ICCs in surgical trials and it is very rare for surgical trials to use such estimates during the design stage though there is a growing awareness of the need to do so [7,9]. The aim of this work was to inform trial design by quantifying clustering effects (at both centre and surgeon level) for various outcomes using a database of surgical trials.

Methods
ICCs were calculated for outcomes from a set of 10 multicentre surgical trials for a range of outcomes and different time points (where applicable). Clustering was assessed at both the centre and surgeon level independently of each other. Trials recruited participants from centres across the UK and Ireland, Germany or Europe. Interventions under evaluation included general (abdominal, endocrine, pancreatic and upper gastrointestinal), ophthalmology and orthopaedic (hip and knee) surgical specialties. Of the 10 trials, five included centre and five included surgeon in the randomisation algorithm. One study [10], which evaluated a surgical versus medical comparison, had an expertise-based trial design. Trials varied in size from 138 to 1370 participants; the median (range) number of centres and surgeons was 19 (8 to 27) and 49 (16 to 191) respectively. Outcomes evaluated included perioperative (e.g. operation time), surgical (e.g. length of stay and recurrence of hernia), functional (e.g. visual function) and both overall (e.g. EQ-5D and SF-36) and disease-specific (e.g. Oxford knee score) measures. The length of follow-up available varied from short-term (six months or less) to long-term (five years).
The ANOVA method was used to estimate an outcome's ICC along with bootstrapped 95% confidence intervals (CI) [16,17]; this was done separately for each trial. Two ANOVA models were used for every outcome: one where centre and one where surgeon was the clustering factor. These analyses were carried out in Stata 11.1 using the bootstrap and loneway commands in combination [18]. The bootstrap process allowed for the clustered nature of the data and 1000 replications were sampled. Both Bias Corrected (BC) and Bias Corrected and Accelerated (BCA) 95% bootstrapped CIs were calculated for each outcome. If these bootstrapped CIs were not calculable, a CI based upon the percentile bootstrap method was used for the ICC. The operating surgeon was used to define the cluster if surgeon was not included in the randomisation algorithm. Post-intervention data from the surgical intervention arms were used to calculate the ICCs without adjustment for treatment. Clustering information (cluster size distribution and outcome prevalence/mean) was generated [19].
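The estimation procedure described above can be sketched in Python. This is an illustrative sketch only and not the authors' Stata code: it uses the standard one-way ANOVA estimator with truncation of negative estimates at zero, and, for simplicity, a percentile bootstrap that resamples whole clusters rather than the BC or BCA intervals reported in the paper. The function names `anova_icc` and `cluster_bootstrap_ci` are hypothetical.

```python
import math
import random
from collections import defaultdict


def anova_icc(values, clusters):
    """One-way ANOVA estimate of the ICC, truncated at zero.

    Returns NaN when the ICC is not estimable (fewer than two clusters,
    or no cluster with more than one observation).
    """
    groups = defaultdict(list)
    for y, c in zip(values, clusters):
        groups[c].append(float(y))
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    if k < 2 or n_total <= k:
        return float("nan")
    grand_mean = sum(sum(g) for g in groups.values()) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups.values():
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((y - mean_g) ** 2 for y in g)
    msb = ss_between / (k - 1)          # between-cluster mean square
    msw = ss_within / (n_total - k)     # within-cluster mean square
    # Adjusted average cluster size used by the ANOVA estimator.
    sum_sq = sum(len(g) ** 2 for g in groups.values())
    n0 = (n_total - sum_sq / n_total) / (k - 1)
    denom = msb + (n0 - 1) * msw
    if denom <= 0:
        return 0.0
    return max(0.0, (msb - msw) / denom)


def cluster_bootstrap_ci(values, clusters, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the ICC, resampling whole clusters."""
    groups = defaultdict(list)
    for y, c in zip(values, clusters):
        groups[c].append(y)
    cluster_data = list(groups.values())
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled = rng.choices(cluster_data, k=len(cluster_data))
        boot_values, boot_ids = [], []
        # Each drawn cluster gets a fresh label, so a cluster drawn
        # twice is treated as two separate clusters.
        for new_id, g in enumerate(resampled):
            boot_values.extend(g)
            boot_ids.extend([new_id] * len(g))
        est = anova_icc(boot_values, boot_ids)
        if not math.isnan(est):
            estimates.append(est)
    if not estimates:
        raise ValueError("ICC was not estimable in any bootstrap replicate")
    estimates.sort()
    last = len(estimates) - 1
    lower = estimates[min(last, int((alpha / 2) * len(estimates)))]
    upper = estimates[max(0, min(last, int((1 - alpha / 2) * len(estimates)) - 1))]
    return lower, upper
```

Resampling clusters rather than individual participants is one way of respecting the clustered structure of the data in the bootstrap; calling the two functions once with centre identifiers and once with surgeon identifiers mirrors the two separate models per outcome.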
The design effect (or variance inflation factor) is the value by which the standard sample size needs to be multiplied to account for the impact of clustering. For a continuous outcome, the design effect for a stratified within-surgeon design has been shown to be 1-ρ, reflecting a potential reduction in sample size over a standard analysis [12]. Under an expertise-based trial design, the formula 1+(average cluster size-1)*ρ can be used, reflecting the need to inflate the sample size to compensate for loss of information. To illustrate the possible impact of adopting an expertise-based trial design or a stratified (or minimised) within-surgeon design, data from the 10 trials were used to present plausible scenarios for two common outcomes: one surgical (operation time) and one patient-reported (EQ-5D at 12 months). In addition to the actual cluster sizes, an adjusted cluster size, calculated as $\sum_i n_i^2 / \sum_i n_i$ where $n_i$ is the number of observations in the $i$th cluster, was also used; this allows for the impact of variation in cluster size to be taken into account [20].
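The design effect formulas described above can be sketched as follows. The function names and the example cluster sizes and ICC value are hypothetical, chosen only to show the arithmetic; they are not taken from the trials in the database.

```python
def design_effect_expertise(avg_cluster_size, icc):
    """Design effect for an expertise-based design: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def design_effect_stratified(icc):
    """Design effect for a stratified within-surgeon design: 1 - rho."""
    return 1 - icc


def adjusted_cluster_size(cluster_sizes):
    """Cluster size adjusted for variation in size: sum(n_i^2) / sum(n_i)."""
    return sum(n * n for n in cluster_sizes) / sum(cluster_sizes)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    sizes = [2, 4, 6, 20]
    icc = 0.05
    mean_size = sum(sizes) / len(sizes)
    print("mean cluster size:", mean_size)
    print("adjusted cluster size:", adjusted_cluster_size(sizes))
    print("expertise-based (mean size):",
          design_effect_expertise(mean_size, icc))
    print("expertise-based (adjusted size):",
          design_effect_expertise(adjusted_cluster_size(sizes), icc))
    print("stratified within-surgeon:", design_effect_stratified(icc))
```

With these hypothetical inputs the mean cluster size is 8 and the adjusted cluster size is 14.25, so the expertise-based design effect rises from about 1.35 to about 1.66 once variation in cluster size is taken into account, while the stratified design effect is 0.95.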
Exploration of the relative contribution of the three levels (1. participant, 2. surgeon and 3. centre) to the overall variance was carried out using a three level model (xtmixed command in Stata) for the EQ-5D at 12 months. The three corresponding ICCs (Level 2 ICC, Level 3 ICC and Levels 2 and 3 ICC) for this model were calculated along with BCA 95% CIs.
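As a rough sketch of how three such ICCs could be derived from the variance components of a fitted three-level model, the function below takes the participant, surgeon and centre variances as inputs. The definitions used (each level's variance, or the sum of the surgeon and centre variances, divided by the total variance) are an assumption about what the three reported ICCs represent, and the function name and example values are hypothetical.

```python
def three_level_iccs(var_participant, var_surgeon, var_centre):
    """ICCs from the variance components of a three-level model.

    Assumed definitions (not confirmed by the source):
      level 2 ICC        = surgeon variance / total variance
      level 3 ICC        = centre variance / total variance
      levels 2 and 3 ICC = (surgeon + centre variance) / total variance
    """
    components = (var_participant, var_surgeon, var_centre)
    if any(v < 0 for v in components):
        raise ValueError("variance components must be non-negative")
    total = sum(components)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {
        "level2_surgeon": var_surgeon / total,
        "level3_centre": var_centre / total,
        "levels2_and_3": (var_surgeon + var_centre) / total,
    }


if __name__ == "__main__":
    # Hypothetical variance components for illustration only.
    print(three_level_iccs(var_participant=0.90,
                           var_surgeon=0.06,
                           var_centre=0.04))
```

The variance components themselves would come from the fitted multilevel model; this sketch covers only the final step of turning them into proportions.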

Results
Details of the 10 trials and information on the cluster sizes are reported in Table 1. The median (range) average cluster size was 32 (9 to 51) and 6 (3 to 30) for centre and surgeon levels respectively. Surgeon cluster size was smaller than centre size as expected for surgeons nested within centres.
ICCs were calculated for 198 different outcomes across the 10 trials at both centre and surgeon cluster levels. For 21 outcomes it was not possible to calculate bias corrected bootstrapped CIs at centre and/or surgeon level and a CI based upon the bootstrap percentile method was used instead. ICC estimates and corresponding CIs of 48 outcomes (selected based upon primary outcomes of the included trials and other commonly reported outcomes in the surgical literature) are given in Table 2. Full details are available online at http://www.abdn.ac.uk/hsru/research/research-tools/study-design.
ICC estimates varied substantially between outcome types, though uncertainty around individual ICC estimates was substantial; this is reflected in generally wide confidence intervals. A summary of the ICC estimates by outcome is given in Table 3. Follow-up may also impact upon the ICC estimate, as the largest values occurred when the outcome was measured closer in time to the intervention (Table 3). Most CIs were consistent with a small or no clustering effect. There was evidence of a substantial clustering effect for some outcomes (e.g. operation time and length of stay). For others, there appeared to be little or no clustering (e.g. EQ-5D). ICC estimates appeared to be generally similar for surgeon and centre level clustering.
Plausible impacts on sample size under an expertise-based design and a stratified within-surgeon design are shown in Table 4 for EQ-5D (12 months or longer) and operation time. For EQ-5D, adoption of a stratified within-surgeon design protected against loss of information, while the impact of an expertise-based trial design depended upon the anticipated cluster size: if the cluster size was small (e.g. less than 10), the inflation of the sample size was under 10%. However, for large cluster sizes, as occurred in some of the trials, substantial increases in the required sample size could be anticipated. For operation time, the large estimated ICC leads to large design effects even for very small average cluster sizes; large design effects were therefore plausible for an expertise-based trial design.
The results of the three level multilevel model are shown in Table 5. The variance at the surgeon and centre levels appeared to be similar; the contribution of the surgeon level was slightly higher, though there was a large amount of uncertainty regarding the relative apportioning of variance between these two levels. There was evidence of clustering when the variances of levels 2 and 3 were considered together.

Discussion
Our data on the clustering effect for multicentre trials of surgical interventions suggest it is more of an issue than has been previously acknowledged. Despite the uncertainty intrinsic to estimating the magnitude of the ICC, there was evidence of a clustering effect for a number of outcomes. As the outcomes with the highest ICC estimates (e.g. operation time and length of stay) are typically cost rather than clinical outcomes, clustering is likely to have the greatest impact on the economic evaluation. This database provides trialists with valuable information on how to design surgical trials. In particular it supports the wisdom of including either centre or surgeon (if a within-surgeon design is used) in the randomisation algorithm. The failure to analyse accordingly can result in a loss of precision [7,12].
The individual ICC estimates were suggestive of clustering for a number of outcomes. The ICC estimates for centre and surgeon level did not markedly differ as might be anticipated given that surgeons are typically nested within a centre. It is likely that the observed clustering by surgeon is driven by a number of factors and not just the surgeon per se. Furthermore where surgeon is used in the randomisation algorithm in practice this may function as a sub-centre (e.g. surgeons in the same surgical team) as opposed to reflecting an individual surgeon and hence can be in between centre and pure surgeon grouping. The latter is often more difficult to achieve than might be initially expected, particularly in a routine health care setting, as surgical trainees often undertake elements of the whole operation under the supervision of a senior surgeon or more senior surgeons work in a team environment.
The difficulties in estimating the uncertainty around an ICC estimate are well known [21]. We used the ANOVA method (along with bootstrapped confidence intervals) as it has been shown not to require any strict distributional assumptions and can be used for both continuous and binary outcomes [22]. Following other authors, we consider a negative ICC implausible; the ICC estimates were censored at zero [17,22]. Where the ICC estimate was close to zero, the reported ICC confidence interval limits may be slightly inflated as a consequence [15,17]. As surgical trial datasets do not tend to be large enough for precise estimation of the ICC, the utilization of routinely collected data, perhaps in conjunction with surgical trial datasets, could be considered. Formal meta-analysis of ICC estimates would in principle provide the optimal use of available data and achieve greater precision [23]. Furthermore, ICC estimates can be calculated with adjustment for other important factors (e.g. baseline values for quality of life measures), which is likely to reduce the ICC estimates. Our estimates are unadjusted and therefore may overestimate clustering where the statistical analysis adjusts for such factors. The exploratory three level analysis suggested that variance might be contributed from both the surgeon and centre levels, as might intuitively be expected. However, even for the largest dataset in the database the uncertainty around the estimates was substantial.
Expertise-based trials have been used and promoted as a preferable design to the standard within-surgeon (stratified) design. Purported benefits of this design include increased surgical participation and compliance with randomisation, and addressing the learning curve effect, along with desirability from a patient perspective. However, expertise-based designs have been criticized on a number of grounds [24], including methodological considerations and particularly the required sample size.
The data presented provide clarification on the potential impact, which would appear to be related to the outcome(s) of interest. An expertise-based design, perhaps contrary to intuition, seems a (statistically) suboptimal choice for a comparison of surgical interventions where surgical outcomes (e.g. operation time, short recovery) are of interest. Of the trials included in the database, four focussed upon surgical primary outcomes. A better option would be for surgeons with expertise in both surgical interventions to deliver both interventions. In contrast, an expertise-based trial seems a reasonable choice if long-term quality of life is the primary focus of the study, as small, perhaps even zero, clustering was plausible for such outcomes. A caveat may be appropriate where stratification by centre was undertaken (and analysed accordingly) despite surgeons only delivering one or the other of the interventions. The impact of such an approach is unclear, and empirical evaluation of statistical analysis options for an expertise-based trial is needed to evaluate the impact upon the required sample size.
Where the clustering is anticipated to be small, e.g. for a longer term quality of life outcome, the potential recruitment benefits (particularly of surgeons) of an expertise-based design could be seen as a reasonable offset to the loss of precision and the need to recruit slightly more participants. It has been suggested that the effect size under an expertise-based design might be larger, though there is little evidence to support such a premise at present. A hybrid [9] (or more specifically, surgeon preference) design, where surgeons are allowed to perform either one or both surgical interventions in a comparison of two surgical interventions, might therefore be the optimal design where the two surgical interventions substantially differ and the focus is on longer term quality of life outcomes.
Typically, statistical analyses of RCTs which allow for clustering across centres or therapists enable the underlying intercept level to vary between clusters but maintain a common treatment effect. Methods which allow the treatment effect to vary between clusters, in place of or in addition to the underlying level, have been proposed [25]. The relative impact of such options is unclear and further evaluation, specifically regarding the impact on sample size, is needed.
Differential clustering, or clustering for only one intervention, may be plausible. The method of analysis we undertook implicitly assumed a common ICC across the surgical interventions. For some of the settings represented by the surgical trials in our database (e.g. total knee arthroplasty with/without metal backed tibial component) a common ICC is very plausible, whereas for others (open versus laparoscopic hernia repair) this is perhaps less so. Due to the relatively small number of cases (as reflected in some CIs not being calculable) we chose to calculate only the common ICC across the interventions. This might be viewed as the most appropriate approach in the presence of any treatment effect.
There is a need for ICC estimates and data on cluster sizes to be routinely published [4,15]. This database of surgical ICCs provides information to guide trialists in the design of trials evaluating surgical interventions, as has been done for other areas [15]. Further research is needed into ICC estimates, both in their determinants and the optimal method of calculation (including consideration of meta-analysis). We anticipate that over time the addition of ICCs from further surgical trial datasets to our database will further inform the design of surgical trials; trialists are invited to submit surgical trial ICCs for inclusion in the database.

Conclusions
Sizeable clustering effects in multicentre trials of surgical interventions at both centre and surgeon levels were plausible for some outcomes. A stratified design (by either centre or surgeon) with corresponding analysis provides optimal benefit with regard to sample size and protects against a potentially large loss of precision for surgical outcomes. Further research is needed into surgical ICCs, into both their determinants and the optimal method of calculation.