 Research
 Open Access
 Published:
A comparative study of R functions for clustered data analysis
Trials volume 22, Article number: 959 (2021)
Abstract
Background
Clustered or correlated outcome data is common in medical research studies, such as the analysis of national or international disease registries, or clusterrandomized trials, where groups of trial participants, instead of each trial participant, are randomized to interventions. Withingroup correlation in studies with clustered data requires the use of specific statistical methods, such as generalized estimating equations and mixedeffects models, to account for this correlation and support unbiased statistical inference.
Methods
We compare different approaches to estimating generalized estimating equations and mixed effects models for a continuous outcome in R through a simulation study and a data example. The methods are implemented through four popular functions of the statistical software R, “geese”, “gls”, “lme”, and “lmer”. In the simulation study, we compare the mean squared error of estimating all the model parameters and compare the coverage proportion of the 95% confidence intervals. In the data analysis, we compare estimation of the intervention effect and the intraclass correlation.
Results
In the simulation study, the function “lme” takes the least computation time. There is no difference in the mean squared error of the four functions. The “lmer” function provides better coverage of the fixed effects when the number of clusters is small as 10. The function “gls” produces close to nominal scale confidence intervals of the intraclass correlation. In the data analysis and the “gls” function yields a positive estimate of the intraclass correlation while the “geese” function gives a negative estimate. Neither of the confidence intervals contains the value zero.
Conclusions
The “gls” function efficiently produces an estimate of the intraclass correlation with a confidence interval. When the withingroup correlation is as high as 0.5, the confidence interval is not always obtainable.
Introduction
Clustered data
Clustered data arise when the study population can be classified into different groups (referred to as clusters), and the measurements of subjects, in particular the response, within the same cluster are more alike than those in other clusters. For instance, in clusterrandomized trials, entire groups of participants such as classrooms, clinics, communities, or hospitals, rather than individuals, are randomly assigned to intervention arms [1, 2].
The key feature of clustered data is that the similarity (or homogeneity) of measurements within the same cluster induces a correlation. That is, measurements within a cluster are likely to be correlated, whereas those from separate clusters are regarded as independent. The intraclass correlation coefficient (ICC) measures this similarity of the responses within a cluster and can be defined as a function of the variance components in the model: variation between clusters and within clusters [3, 4]. Since the responses within a cluster do not contribute completely independent information, the “effective” sample size is less than the total number of subjects from all clusters [5–7].
Statistical methods
Classical statistical methods such as ordinary least squares regression assume that each individuals’ data is independent. The clustered data have a hierarchical structure where individuals are not likely independent within the same cluster (i.e., ICC > 0). Thus, methods taking the correlation into account, such as generalized estimating equation (GEE) and mixedeffects models, are well suited for the analysis of clustered data [8–10].
GEE models can be viewed as an extension of generalized linear models for correlated data where a withincluster correlation structure is specified [11]. Parameter estimates are then obtained as solutions of the estimating equations [12]. In mixedeffects models, the cluster effect is a random variable representing a random deviation for a given cluster from the overall fixed effects [13–15]. Maximum likelihood estimation is often used to obtain estimates of parameters via iterative algorithms such as the expectationmaximization (EM) algorithm and the NewtonRaphson algorithm [16–21]. As shown in [22], for normal outcomes, GEE reduces to the score equation of the maximum likelihood estimation only when there are no missing observations and the correlation is unstructured. A further comparison shows that GEE and mixedeffects models produce the same generalized least squares estimator of the fixed effects [23]. We will review the derivation of the fixed effects estimator in the next section.
Simulation studies have been conducted to compare the two methods for analyzing continuous outcomes with an emphasis on the fixed effects components. In the comparison of the estimation and the coverage probability of the confidence intervals, Park [22] found that the GEE estimation was more sensitive to missing observations. In the study [24], the authors compared the estimation and the nominal level of hypothesis testing and made several recommendations. For instance, if knowledge is available to specify the covariance structure correctly, the maximum likelihood estimation is slightly more efficient for balanced or near balanced data. When there is concern about the misspecification of the covariance structure, GEE is preferred when the number of clusters is larger than 20. For hypothesis testing, Kahan et al. [25] and Leyrat et al. [26] found that without an appropriate correction, both methods can lead to inflated type I error rates (finding a statistically significant treatment effect when it does not exist) when the number of clusters is smaller than 40. R, SAS, and Stata commands to correct the type I error rate are provided in [26].
R functions
In this study, we compared the performance of the GEE method and the linear mixedeffects model to analyze clustered data through the implementation of both popular and newer packages of the statistical software R [27]. Specifically, the “geese” function of the geepack package (1.3.2) fits a GEE model [28, 29]. The “gls” function of the nlme package (3.1.149) [30, 31] fits a linear model using generalized least squares where the errors are allowed to be correlated. Two frequently used functions for conducting linear mixedeffects model analysis are “lme” of the nlme package [30, 31] and “lmer” of the lme4 package [32] (1.1.25). Detailed implementation of these functions is provided in the “Implementation” subsection of the next section and also summarized in Table 1.
Methods
We compared the performance of the four functions via a simulation study and through a real data example. In the simulation study, we compared the computation time; the mean squared error (MSE) of estimating all the model parameters, including the ICC; and the coverage proportion of the 95% confidence intervals. Parameter estimates in the linear mixedeffects models are found by maximizing the likelihood of the data. In the following, we review the model setup followed by the simulation study and the example dataset.
Model review
Let n_{i} be the number of subjects in the ith cluster and let 1_{i} be a n_{i}×1 vector of one’s. I_{i} is a square identity matrix of dimension n_{i} and let J_{i} be a square matrix of one’s of dimension n_{i}. Let u_{i} be the random effect associated with the ith cluster. The linear mixedeffects modeling of the ith cluster’s response y_{i} is
The matrix X_{i} is the design matrix and β is the unknown fixed effects. The random vector ε_{i} represents the error and is independent of u_{i}. Independence is also assumed between u_{i} and u_{j}, and between ε_{i} and ε_{j} whenever i≠j. It follows that y_{i}∼N(X_{i}β,Σ_{i}), where \({\boldsymbol {\Sigma }}_{i}=\sigma ^{2}_{u}\mathbf {J}_{i}+\sigma _{\epsilon }^{2}\mathbf {I}_{i}\). Let elements of y_{i} be y_{ij}, j=1,…,n_{i}. We obtain \(\text {Var}(y_{ij})=\sigma ^{2}_{u}+\sigma _{\epsilon }^{2}\) and \(\text {Cov}(y_{ij},y_{ik})=\sigma ^{2}_{u}\) where j≠k. The intraclass correlation coefficient ρ is naturally defined by the variance components \(\sigma ^{2}_{u}\) and \(\sigma _{\epsilon }^{2}\) as
By definition, 0<ρ<1 since both \(\sigma ^{2}_{u}\) and \(\sigma _{\epsilon }^{2}\) are positive.
The above modeling uses the random effect u_{i} explicitly to explain within cluster correlation. From a marginal model perspective, one can instead start with the modeling y_{i}∼N(X_{i}β,Σ_{i}) directly with a special structure for Σ_{i}. Let ρ=Corr(y_{ij},y_{ik}) and σ^{2}=Var(y_{ij}), and sequently we get Σ_{i}=σ^{2}[I_{i}+ρ(J_{i}−I_{i})]. Using this marginal parametrization {σ^{2},ρ}, the matrix Σ_{i} is positive definite if − 1/(n_{i}−1)<ρ<1 [30, 33–35]. That is, ρ does not have to be positive as it was defined from a variance components perspective. At the same time, we also observe that when n_{i} is large, the boundary − 1/(n_{i}−1) can be very close to 0. Starting with this parameterization, we derive the corresponding relationship of (1) as
It is clear that \({\boldsymbol {\Sigma }}_{i}^{1}=\{\mathbf {I}_{i}\rho \mathbf {J}_{i}/[1+(n_{i}1)\rho ]\}/\sigma ^{2}(1\rho)\). Let \(\mathbf {T}_{i}(n_{i},\rho) \equiv \mathbf {T}_{i}(\rho)=\mathbf {I}_{i}\rho \mathbf {J}_{i}/[1+(n_{i}1)\rho ]=\sigma ^{2}(1\rho){\boldsymbol {\Sigma }}_{i}^{1}\). The estimate of β follows as \(\hat {{\boldsymbol {\beta }}}=\left \{ \sum _{i=1}^{n_c} \mathbf {X}_{i}^{\prime } {\boldsymbol {\Sigma }}_{i}^{1}\mathbf {X}_{i}\right \}^{1}\left \{\sum _{i=1}^{n_{c}}\mathbf {X}_{i}^{\prime } {\boldsymbol {\Sigma }}_{i}^{1}\mathbf {y}_{i}\right \}\) which reduces to
The variancecovariance matrix of the estimate has the form
We notice that if there is no within cluster correlation, i.e. ρ=0, then T_{i}(0)=I_{i} and \(\hat {{\boldsymbol {\beta }}}\) is simply the ordinary least squares model. In the extreme case of perfect correlation, i.e., ρ=1, then T_{i}(1)=I_{i}−J_{i}/n_{i} and we get
In this scenario, y_{ij}=y_{ik} and \(\mathbf {y}_{i}=\bar {y}_{i}\), so \(\hat {{\boldsymbol {\beta }}}=\mathbf {0}\).
Simulation
Our simulation setups are similar to those in the literature. In the simulation, Feng et al. [24] tried the number of clusters n_{c}=10, 20, and 50 with cluster sizes 10, 30, and 100, and ρ=0.1, 0.5. The simulation study [25] considered two scenarios of 5 patients per cluster with ρ=0.15, and of 100 patients per cluster with ρ=0.01. The number of clusters n_{c} varied from 6, 10, 20, …, 90, and 100. The simulation scenarios in [26] include ρ=0.001, 0.01, and 0.05, and n_{c}=4, 6, 8, 10, 20, 30, 40, and 200. The average cluster size ranges from 7 to 300. A review of published clusterrandomized trials by Kahan et al. [25] shows that the median number of clusters was 25 with interquartile range 15 to 44, 14% of the trials had fewer than 10 clusters, and 9% of the trials had more than 100 clusters. The cluster size had a median of 31 and an interquartile range 14 to 94.
In our study, we tried n_{c}=10, 30, 50, and 100. In each cluster, the number of subjects was simulated from a normal distribution after rounding with mean m=50 and 100, and standard deviation 5. The sample size calculation in [36] preassumed an ICC of 0.05. Some studies also found large ICC values such as 0.47 with 95% confidence interval [0.29,0.65] [6, Table I] and 0.60 [37, Table 4.4.2]. In our setup, we considered ρ=0.05, 0.1, and 0.5. We simulated two covariates independently from a Bernoulli distribution with a probability of success of 0.5, and from a standard Normal distribution. The associated regression coefficients are respectively β_{1}=−2, and β_{2}=1.5, and the intercept in the regression model is β_{0}=1. We simulated the outcome from a marginal model with the variance parameter σ^{2}=0.6. In each of the settings, we examined 2000 simulations.
Example dataset
We reanalyzed data from the clusterrandomized controlled trial in [36]. In this study, participants with hypertension from 15 clusters in rural India were recruited and randomized to the intervention or usual care in a 1:2 ratio. The study hypothesis was that a CHW (community health worker)led groupbased education and monitoring intervention would result in improved blood pressure control. Outcomes were assessed approximately two months after completion of the intervention.
One of the main outcomes in this trial was the change in diastolic blood pressure (DBP), defined as follows: the DBP at baseline minus the DBP at followup. Fixed effects in the analysis include the following variables: age, sex, diastolic blood pressure at baseline (mm Hg), education, use of antihypertensive medications, change in BMI (body mass index) defined as BMI at followup minus the BMI at baseline, number of serves of fruit per week at followup, and selfreported drinking alcohol at least once in the 30 days prior to followup. The education variable has four categories: no formal schooling, class 1 to 6, class 7 to 11, and class 12 or more.
We analyzed the data from the 1428 participants with no missing values. The histogram of the outcome shows a bell shaped pattern (Fig. 1). The intervention group has a larger proportion of a positive difference than the usual care group suggesting more DBP decline at the followup. The normal quantilequantile plot in Fig. 2 shows the normal distribution assumption of the outcome is plausible which provides a justification of the application of linear mixedeffects models.
Implementation
All the four R functions compute \(\hat {{\boldsymbol {\beta }}}\) and the corresponding confidence interval (CI), but they adopt different parameterizations for the variancecovariance matrix. The function “geese” uses {σ^{2},ρ} though a specification of an “exchangeable” correlation structure. The function “gls” uses a compound symmetry structure of the parameters {σ,ρ}. Both “lme” and “lmer” find estimates and CI’s of {σ_{u},σ_{ε}}. It is straightforward to obtain estimates of σ^{2}, or \(\sigma ^{2}_{u}\) and \(\sigma _{\epsilon }^{2}\) from their corresponding square root estimates. We can then find estimates of another parameterization using Eqs. (1) or (2). As the “geese” method does not fit in the framework of hierarchical modeling with random effects, it is not appropriate to find its estimates of \(\{\sigma ^{2}_{u},\sigma _{\epsilon }^{2}\}\) using Eq. (2) due to a possible negative \(\hat {\rho }\). Thus, we do not include it for the comparison of estimating {σ_{u},σ_{ε}}. Methods obtaining the CI’s of ρ have been discussed in [38, 39]. In the modelbased setup, the CI of ρ is readily available from the output of the “geese” or “gls” fitted object. Below we explain how to get the CI’s of the model parameters with the summary presented in Table 1.
From the “geese” output, we apply the estimate ± 1.96 standard deviation rule to obtain the CI’s. We apply two generic functions “confint” and “intervals” to “gls”, “lme”, or “lmer” fitted objects. The “confint” function assumes normality and has two options. The option method=“Wald” returns approximate CI’s of the fixed effects based on the estimated local curvature of the likelihood surface. The other option method=“profile” computes a likelihood profile and find the appropriate cutoffs based on the likelihood ratio test [32]. The “intervals” function calculates approximate confidence intervals for the parameters in the linear model using a normal approximation to the distribution of the maximum likelihood estimators. The estimators are assumed to have a normal distribution centered at the true parameter values and with covariance matrix equal to the negative inverse Hessian matrix of the loglikelihood evaluated at the estimated parameters [30, 31].
Results
Simulation
With detailed comparison results presented in the supplementary material, we summarize our findings in the following text. The “lme” approach took the least computation time, 1.4 h, followed by the “gls” approach, 2.41 h (Table S1). Sometimes the “gls” approach failed to construct the confidence intervals for the variancecovariance parameters when ρ=0.5 (Table S2). The number of failure increases with the number of clusters n_{c} and also the cluster size m. When n_{c}=100 and m=100, there were 528 failures out of 2000 simulations. Occasionally, the “lme” and “lmer” approaches also failed to construct the confidence intervals. The performance of the four approaches of estimating the model parameters is very similar with almost identical standard deviation and MSE (Tables S3S6).
Next we summarize the performance of the different functions on the coverage proportion of the model parameters. First, in general, the coverage proportions of the fixed effects are very similar among “gls”, “lme”, and “lmer” approaches (Tables S7S9). They are very close to the nominal 95% level except for β_{0} when n_{c}=10 (Table S7). In that case, confidence intervals obtained by specifying the “profile” option of the “confint” function to a “lmer” fitted object outperforms the others. Their coverage proportions are generally closer to the nominal 95% level than the “geese” approach when n_{c} is less than 100. Second, though the coverage proportions of {σ^{2},ρ} of “geese” or {σ,ρ} of “gls” are usually below the 95% nominal level, the “gls” method generally provided better coverage (Table S10). Third, both the “lme” and “lmer” produced coverage proportions about 95% for σ_{ε}, and for σ_{u} when n_{c}≠10. When n_{c}=10, we observed unstable performance of over coverage or under coverage (Table S11).
Example dataset
All the four methods produced similar results that suggested more DBP reduction in the intervention group, 2.161 mm Hg by “geese” and 2.252 by the other three methods (Table 2). The 95% confidence interval bounds of the intervention effect are slightly different. However, our conclusion is consistent with the finding in [36] where the analysis was conducted in Stata (Stata IC/11.2, StataCorp, College Station, TX, USA). DBP declined 2.1 mm Hg more in the intervention group with a 95% confidence interval of (0.6,3.6), and the estimation of ICC was 0.02. The “gls” method gives a positive \(\hat {\rho }\) and a confidence interval does not contain 0. The “geese” method produces a negative \(\hat {\rho }\) with negative confidence interval bounds.
Discussion and conclusion
Throughout our study, we compare the performance of the four R functions, “geese”, “gls”, “lme”, and “lmer”, of analyzing single level clustered data. The “exchangeable” correlation structure of the “geese” function and the compound symmetry structure of the function “gls” both provide a singlelevel cluster model. We note that the “lme” and the “lmer” function can model multilevel data, and the “lmer” function is capable of modeling crossed random effects. The lme4 package also includes generalized linear mixed model capability via the “glmer” function. It does not currently implement nlme’s features for modeling heteroscedasticity of residuals or offer the same flexibility for composing complex variancecovariance structures.
Our simulation study found that all four methods perform equally well for model parameters estimation. This result is consistent with the study in [24]. It was found that the MSEs are very similar except when the number of clusters is 10 where the linear mixedeffects model method has slightly smaller MSE than the GEE method using SAS PROC MIXED. We observe similar coverage proportions of the fixed effects among “gls”, “lme,” and “lmer” approaches. They are generally closer to the nominal 95% level than the “geese” approach when the number of clusters is less than 100.
The estimated ICC from the “geese” method can be negative, and the confidence interval of the ICC from the “gls” method provides better coverage. However, when the ICC is large as ρ=0.5, confidence intervals is not always obtainable from the “gls” method. In our comparison of the coverage of the variancecovariance parameters in the model, the “lme” and the “lmer” methods have similar performance while the former is considerably faster. The latter provides better coverage of the intercept in the model when the number of clusters is 10.
In the simulation settings, we examined that the “gls” function is preferable to analyze singlelevel clustered data. The limitations of our simulation study include the lack of the scenario of the very large number of clusters (e.g., 200 as in [26]) or the scenario of small ICC values (e.g., 0.01 as in [25, 26]). It may also be of interest to further compare the performance of the four functions for complex trials such as steppedwedge clusterrandomized trials. In a steppedwedge clusterrandomized trial, all clusters begin in the control phase and then are randomized to interventions at different time points [40–43]. Simulations have been conducted to investigate the effect of varying degrees of imbalance in cluster size on the power [44].
Availability of data and materials
The data analyzed in our example were originally studied in [36]. The data file and the data dictionary are publicly available from the figshare database (accession number https://doi.org/10.26180/5cc80de987113). The data were reanalyzed without collaboration with the original study authors.
Abbreviations
 BMI:

Body mass index
 CHW:

Community health worker
 CI:

Confidence interval
 DBP:

Diastolic blood pressure
 EM:

Expectationmaximization
 GEE:

Generalized estimating equation
 ICC:

Intraclass correlation coefficient
 MSE:

Mean squared error
References
 1
Murray DM, Varnell SP, Blitstein JL. Design and analysis of grouprandomized trials: a review of recent methodological developments. Am J Public Health. 2004; 94(3):423–32.
 2
Campbell M, Donner A, Klar N. Developments in cluster randomized trials and statistics in medicine. Stat Med. 2007; 26(1):2–19.
 3
Fisher RA. Statistical Methods for Research Workers, 5th edn. Edinburgh: Oliver and Boyd Ltd.; 1934.
 4
Eldridge SM, Ukoumunne OC, Carlin JB. The intracluster correlation coefficient in cluster randomized trials: a review of definitions. Int Stat Rev. 2009; 77(3):378–94.
 5
Baio G, Copas A, Ambler G, Hargreaves J, Beard E, Omar RZ. Sample size calculation for a stepped wedge trial. Trials. 2015; 16(354). https://pubmed.ncbi.nlm.nih.gov/26282553/.
 6
Campbell MK, Mollison J, Grimshaw JM. Cluster trials in implementation research: estimation of intracluster correlation coefficients and sample size. Stat Med. 2001; 20(3):391–9.
 7
Rutterford C, Copas A, Eldridge S. Methods for sample size determination in cluster randomized trials. Int J Epidemiol. 2015; 44(3):1051–67.
 8
Lee KJ, Thompson SG. The use of random effects models to allow for clustering in individually randomized trials. Clin Trials. 2005; 2(2):163–73.
 9
Barker D, McElduff P, D’Este C, Campbell M. Stepped wedge cluster randomised trials: a review of the statistical methodology used and available. BMC Med Res Methodol. 2016; 16(69). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4895892/.
 10
Turner EL, Prague M, Gallis JA, Li F, Murray DM. Review of recent methodological developments in grouprandomized trials: part 2analysis. Am J Public Health. 2017; 107(7):1078–86.
 11
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986; 73(1):13–22.
 12
Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986; 1:121–30.
 13
Laird NM, Ware JH. Randomeffects models for longitudinal data. Biometrics. 1982; 38:963–74.
 14
Ware JH. Linear models for the analysis of longitudinal studies. Am Stat. 1985; 39(2):95–101.
 15
Laird N, Lange N, Stram D. Maximum likelihood computations with repeated measures: application of the em algorithm. J Am Stat Assoc. 1987; 82(397):97–105.
 16
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977; 39:1–38.
 17
Wu JCF. On the convergence properties of the EM algorithm. Ann Stat. 1983; 11(1):95–103.
 18
Jennrich RI, Schluchter MD. Unbalanced repeatedmeasures models with structured covariance matrices. Biometrics. 1986; 42:805–20.
 19
Lindstrom M, Bates D. NewtonRaphson and EM algorithms for linear mixed effects models for repeated measures data. J Am Stat Assoc. 1988; 83:1014–22.
 20
Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika. 1993; 80:267–78.
 21
Liu CH, Rubin DB. The ECME algorithm  a simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994; 81(4):633–48.
 22
Park T. A comparison of the generalized estimating equation approach with the maximum likelihood approach for repeated measurements. Stat Med. 1993; 12(18):1723–32.
 23
Wu CT, Gumpertz ML, Boos DD. Comparison of GEE, MINQUE, ML, and REML estimating equations for normally distributed data. Am Stat. 2001; 55(2):125–30.
 24
Feng Z, McLerran D, Grizzle J. A comparison of statistical methods for clustered data analysis with gaussian error. Stat Med. 1996; 15(16):1793–806.
 25
Kahan BC, Forbes G, Ali Y, Jairath V, Bremner S, Harhay MO, Hooper R, Wright N, Eldridge SM, Leyrat C. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials. 2016; 17. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013635/.
 26
Leyrat C, Morgan KE, Leurent B, Kahan BC. Cluster randomized trials with a small number of clusters: which analyses should be used?. Int J Epidemiol. 2018; 47(1):321–31.
 27
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2020. R Foundation for Statistical Computing. https://www.Rproject.org/.
 28
Yan J, Fine JP. Estimating equations for association structures. Stat Med. 2004; 23:859–80.
 29
Halekoh U, Højsgaard S, Yan J. The R package geepack for generalized estimating equations. J Stat Softw. 2006; 15(2):1–11.
 30
Pinheiro J, Bates D. MixedEffects Models in S and SPLUS. New York: Springer; 2009.
 31
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team. nlme: Linear and Nonlinear Mixed Effects Models. 2020. R package version 3.1149. https://cran.rproject.org/web/packages/nlme/ChangeLog. Accessed Aug 2020.
 32
Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixedeffects models using lme4. J Stat Softw. 2015; 67(1):1–48. doi:10.18637/jss.v067.i01.
 33
Wang W. Identifiability of linear mixed effects models. Electron J Stat. 2013; 7:244–63.
 34
Wang W. Identifiability of covariance parameters in linear mixed effects models. Linear Algebra Appl. 2016; 506:603–13.
 35
Wang W. Checking identifiability of covariance parameters in linear mixed effects models. J Appl Stat. 2017; 44(11):1938–46.
 36
Gamage DG, Riddell MA, Joshi R, Thankappan KR, Chow CK, Oldenburg B, Evans RG, Mahal AS, Kalyanram K, Kartik K, Suresh O, Thomas N, Mini GK, Maulik PK, Srikanth VK, Arabshahi S, Varma RP, Guggilla RK, D’Esposito F, Sathish T, Alim M, Thrift AG. Effectiveness of a scalable groupbased education and monitoring program, delivered by health workers, to improve control of hypertension in rural india: A cluster randomised controlled trial. PLoS Med. 2020; 17:1–22.
 37
Pal N, Lim WK. On intraclass correlation coefficient estimation. Stat Pap. 2004; 45:369–92.
 38
Ukoumunne OC. A comparison of confidence interval methods for the intraclass correlation coefficient in cluster randomized trials. Stat Med. 2002; 21(24):3757–74.
 39
Demetrashvili N, Wit EC, van den Heuvel ER. Confidence intervals for intraclass correlation coefficients in variance components models. Stat Methods Med Res. 2016; 25(5):2359–76.
 40
Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials. 2007; 28(2):182–91.
 41
Beard E, Lewis JJ, Copas A, Davey C, Osrin D, Baio G, Thompson JA, Fielding KL, Omar RZ, Ononge S, et al.Stepped wedge randomised controlled trials: systematic review of studies published between 2010 and 2014. Trials. 2015; 16(353). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4538902/.
 42
Kristunas C, Morris T, Gray L. Unequal cluster sizes in steppedwedge cluster randomised trials: a systematic review. BMJ Open. 2017; 7(11). http://dx.doi.org/10.1136/bmjopen2017017151. https://bmjopen.bmj.com/content/7/11/e017151.full.pdf.
 43
Li F, Hughes JP, Hemming K, Taljaard M, Melnick ER, Heagerty PJ. Mixedeffects models for the design and analysis of stepped wedge cluster randomized trials: An overview. Stat Methods Med Res. 2020; 30(2):612–39. https://pubmed.ncbi.nlm.nih.gov/32631142/.
 44
Kristunas CA, Smith KL, Gray LJ. An imbalance in cluster sizes does not lead to notable loss of power in crosssectional, steppedwedge cluster randomised trials with a continuous outcome. Trials. 2017; 18(109). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5341460/.
Acknowledgements
The authors would like to acknowledge the availability of the data in the study of [36], which were made publicly accessible by the PLOS Medicine journal. PLOS journals require authors to make all data necessary to replicate their study’s findings publicly available without restriction at the time of publication (https://journals.plos.org/plosmedicine/s/dataavailability).
Funding
Michael O. Harhay was supported by grant R00 HL141678 from the National Heart, Lung, and Blood Institute (NHLBI) of the US National Institutes of Health (NIH). The funding body had no role in the study design, data collection and analysis, or decision to publish.
Author information
Affiliations
Contributions
Both authors made substantial contributions to the manuscript. WW conducted the data analysis, simulation studies, and drafted the manuscript. MOH reviewed and revised the article. Both authors approved the manuscript to be submitted.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
The supplementary material in a pdf format contains the tables of the simulation study results.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, W., Harhay, M.O. A comparative study of R functions for clustered data analysis. Trials 22, 959 (2021). https://doi.org/10.1186/s13063021059007
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13063021059007
Keywords
 Clustered data analysis
 Cluster randomized trials
 Generalized estimating equations
 Mixed effects models