Introduction to Clustered Data

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Recognize and describe different types of clustered (hierarchical) data structures in epidemiology
Explain why observations within clusters are correlated and how this affects standard statistical analyses
Calculate and interpret the intraclass correlation coefficient (ICC) and design effect (deff)
Understand the impact of clustering on standard errors and inference for both continuous and discrete outcomes
Describe key methods for dealing with clustering, including fixed effects, robust variance estimators, and survey methods
Evaluate the consequences of ignoring clustering in epidemiologic analyses

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Clustered data: Data in which observations are grouped within higher-level units (e.g., students within schools, patients within clinics, repeated measures within subjects). Observations within a cluster tend to be more similar than observations between clusters, violating the standard independence assumption of ordinary regression.

Hierarchical (multilevel) data: Clustered data with a clear nesting structure across two or more levels (e.g., level-1 = pupils nested within level-2 = classrooms nested within level-3 = schools).

Nested design: Lower-level units belong to one and only one higher-level unit (e.g., a particular pupil belongs to exactly one classroom). Most common in epidemiology.

Crossed design: Lower-level units are observed across multiple higher-level groupings simultaneously (e.g., the same students rated by multiple teachers; the same animals seen by multiple veterinarians).

Cluster (level-2 unit): The grouping unit within which level-1 observations are correlated. Examples: clinic, household, herd, school, neighborhood.

Exchangeability: An assumption that any two observations within the same cluster have the same correlation regardless of position. Underlies the simplest random-intercept and compound symmetry covariance models.

Fixed effects: Parameters representing the population-average effect of a covariate. Fixed effects are estimated as single values; their interpretation does not depend on a distribution of cluster-level deviations.

Random effects: Cluster-specific deviations modeled as draws from a (typically normal) distribution. They allow the model to borrow strength across clusters and to quantify between-cluster variability.

Ecological fallacy: Drawing conclusions about individuals from group-level associations. A reminder that level-1 and level-2 effects can differ in direction or magnitude.

Contextual effect: An effect attributable to the cluster (e.g., school climate, neighborhood deprivation) that is not captured by individual-level covariates.

Non-informative cluster size: An assumption that the size of a cluster is unrelated to the outcome distribution within it. When violated, standard mixed-model and GEE inferences can be biased.

Methods & Statistical Concepts

Intraclass correlation coefficient (ICC): The proportion of total outcome variance attributable to the cluster: ICC = σ²_between / (σ²_between + σ²_within). Measures how similar observations within the same cluster are. ICC = 0 means no clustering; ICC = 1 means observations within a cluster are identical.

Design effect (DEFF): DEFF = 1 + (m − 1) · ICC, where m is the average cluster size. The factor by which clustering inflates the variance of an estimate relative to a simple random sample of the same size. Standard errors are inflated by √DEFF, and the sample size must be multiplied by DEFF to recover the same precision.

Effective sample size: The total sample size divided by the design effect. Reflects the equivalent number of independent observations after accounting for within-cluster correlation.

Naive (independence) analysis: An analysis that ignores clustering and treats observations as independent. Typically produces standard errors that are too small for between-cluster effects (anti-conservative) and too large for within-cluster effects.

Random-intercept model: A mixed model that allows each cluster's mean (intercept) to deviate from the overall mean by a random amount drawn from a normal distribution. The simplest way to introduce clustering into a regression.

Robust (sandwich) variance estimator: A method for obtaining standard errors that are valid even if the assumed correlation structure is wrong. Often used with GEE or with cluster-robust adjustments to OLS or GLM fits.

Generalized estimating equations (GEE): A marginal (population-average) approach to clustered data that specifies a working correlation structure and uses sandwich variance estimators for inference. Robust to misspecification of the correlation but does not separate between- and within-cluster effects.

Survey methods (complex sampling): Estimation procedures that account for stratification, clustering, and weights from a complex sample design. Often used as an alternative to mixed models for descriptive estimates.

Simulation study: A computational experiment in which artificial data are generated under known conditions to evaluate how a method performs (e.g., type-I error, coverage of confidence intervals) when clustering is ignored or modeled.

Variance components: The separate variances at each level of a hierarchical model (e.g., between-cluster σ²_u and within-cluster σ²_e). Their sum gives the total outcome variance.

Cluster bootstrap: A bootstrap procedure that resamples whole clusters (rather than individual observations) to obtain valid standard errors and confidence intervals under clustering.

Key People

Nan Laird & James Ware: Co-authors of the foundational 1982 paper formalizing the linear mixed-effects model for longitudinal and clustered data. Their framework underpins modern multilevel modeling in epidemiology (Diez Roux, 2000).

Kung-Yee Liang & Scott Zeger: Johns Hopkins biostatisticians who introduced generalized estimating equations (GEE) in 1986, providing a flexible marginal approach to correlated data with robust variance estimation (see also Zeger & Liang, 1986).

Harvey Goldstein (1939–2020): British statistician who pioneered multilevel modeling in education and social science research. His textbook and the MLwiN software broadly disseminated the methods.

Stephen Raudenbush & Anthony Bryk: American social scientists whose textbook on hierarchical linear models (HLM) became a standard reference for multilevel analysis in education and the social sciences.


Section 2

Effects of Clustering on Statistical Analysis

⏱ Estimated time: 20 minutes

Section 2 of 4


The intraclass correlation, design effect, and effective sample size: measuring the cost of within-group similarity.

The core consequence

Why naive analyses are anti-conservative

When observations within clusters are positively correlated, a naive analysis overestimates the effective sample size. The result is confidence intervals that systematically undercover and an inflated Type I error rate.

Worked example

1,000 patients in 20 hospitals (50 per hospital); ICC = 0.05
Design effect = 1 + (50 − 1)(0.05) = 3.45
Effective sample size = 1,000 / 3.45 ≈ 290

A naive analysis treats this study as if it had 1,000 independent observations. It actually carries the information of roughly 290.

Intraclass correlation

The ICC: proportion of variance between clusters

Intraclass correlation coefficient (Eq 20.2)

\[ \text{ICC} = \color{#0B7B6B}{\rho} = \frac{\color{#C2410C}{\sigma^2_g}}{\color{#C2410C}{\sigma^2_g} + \color{#6D28D9}{\sigma^2}} \]

ρ (ICC): intraclass correlation; σ²_g: between-cluster variance; σ²: within-cluster variance

ICC = 0

No clustering effect. All variance is within clusters. Observations are effectively independent.

ICC = 1

All variance is between clusters. All observations within a cluster are identical. One cluster is one data point.

Design effect

Translating ICC into precision loss

Design effect (Eq 20.3)

\[ \color{#0B7B6B}{\text{deff}} = 1 + (\color{#C2410C}{\bar{m}} - 1)\,\color{#6D28D9}{\rho} \]

deff: design effect; m̄: average cluster size; ρ: intraclass correlation

Corrected variance (Eq 20.4)

\[ \color{#0B7B6B}{\widehat{\text{Var}}_{\text{corrected}}} = \color{#C2410C}{\text{deff}} \times \color{#6D28D9}{\widehat{\text{Var}}_{\text{naive}}} \]

Var̂_corrected: corrected variance; deff: design effect; Var̂_naive: variance assuming independence

When ICC = 0.05 and average cluster size = 20, the design effect is 1 + (19)(0.05) = 1.95. Every variance estimate from a naive analysis must be multiplied by 1.95 to recover valid inference.

R activity preview

ICC, deff, and cluster-robust standard errors

The R activity uses phaa_clinics.csv: 30 clinics, ~960 patients, outcome = systolic blood pressure (sbp).

Step 1

Estimate ICC from a one-way ANOVA on clinic membership.

Step 2

Confirm with a null random-intercept model via lme4::lmer(). The two values should agree closely.

Step 3

Compute deff, effective sample size, and compare naive versus cluster-robust standard errors using sandwich.

Note: the narration does not read R code aloud. Use the written activity to follow along in R.

Carry forward

Three quantities to carry into the next section

ICC: the proportion of total variance between clusters. Typical range in epidemiology: 0.01 to 0.20.
Design effect: deff = 1 + (m̄ − 1)ρ. Grows with both ICC and average cluster size.
Effective sample size: nominal n divided by deff. The real information your dataset carries once clustering is accounted for.

Introduction and Overview

From recognising clustering to quantifying its damage. An earlier section catalogued where clustered structures come from. This section answers the natural follow-up question: so what? When observations within a cluster are even mildly correlated, the precision implied by a naive analysis, one that pretends every observation is independent, is far too optimistic. Here we develop the formal vocabulary (the intraclass correlation, the design effect, the effective sample size) that makes the size of that distortion measurable. These quantities reappear in every method we cover in a later section and in later lessons, so it is worth slowing down on the intuition.

Learning Objectives

Explain why ignoring clustering produces standard errors that are too small and Type I error rates that are too large.
Compute the intraclass correlation coefficient (ICC) from between- and within-cluster variance components.
Calculate the design effect (deff) and effective sample size from the ICC and average cluster size.
Translate a given ICC and cluster size into a quantitative statement about precision loss.

Impact on Standard Errors

The most important consequence of clustering is its effect on standard errors. When observations within clusters are positively correlated (as is almost always the case), treating them as independent leads to standard errors that are too small. This in turn produces test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow, all of which inflate the Type I error rate.

Why correlated observations carry less information

Imagine interviewing five people from the same household about what they eat. Because they shop and cook together, the second through fifth answers mostly echo the first, so you learn far less than you would from five unrelated people. Observations within a cluster behave the same way: when they are correlated, each additional member adds only a fraction of the fresh information that a genuinely independent observation would. A naive analysis still counts every observation as fully informative, so it credits itself with more information than it actually has, which is why its standard errors come out too small.

How Ignoring Clustering Inflates Significance

Imagine a study of 1,000 patients in 20 hospitals (50 per hospital). If the ICC is 0.05, the design effect is 1 + (50 − 1)(0.05) = 3.45. The effective sample size is only 1,000/3.45 ≈ 290 rather than 1,000. A naive analysis treating all 1,000 observations as independent would dramatically overstate the precision of estimates.
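The arithmetic in this example can be checked in a few lines of R. This is a minimal sketch; the helper names design_effect() and effective_n() are hypothetical and are introduced here only for illustration.

```r
# Hypothetical helpers (not part of the lesson's R activity script).
design_effect <- function(m_bar, icc) {
  1 + (m_bar - 1) * icc            # deff = 1 + (average cluster size - 1) * ICC
}

effective_n <- function(n, m_bar, icc) {
  n / design_effect(m_bar, icc)    # nominal n divided by deff
}

# 1,000 patients in 20 hospitals (50 per hospital), ICC = 0.05
deff  <- design_effect(m_bar = 50, icc = 0.05)
n_eff <- effective_n(n = 1000, m_bar = 50, icc = 0.05)

round(deff, 2)   # 3.45
round(n_eff)     # 290
```

Changing m_bar or icc in the calls shows how quickly the effective sample size shrinks as either one grows.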

The Intraclass Correlation Coefficient (ICC)

For continuous outcomes, the intraclass correlation coefficient (ICC) measures the proportion of total variance that is attributable to between-cluster differences. Equivalently, it quantifies how similar observations within the same cluster are (Killip, Mahfoud, & Pearce, 2004).

Figure: Two scatter plots of systolic blood pressure across eight clinics. On the left, clustered data: points within each clinic sit close to a coloured cluster-mean line, so clinics differ noticeably. On the right, independent data with the same overall spread but no within-clinic similarity. When the intraclass correlation is high (left), observations cluster tightly around their cluster means, so each extra observation in a cluster adds little new information. With no clustering (right) the same number of observations is fully informative. This gap is what the design effect quantifies.

Intraclass correlation coefficient (Eq 20.2)

\[ \color{#0B7B6B}{\rho} = \frac{\color{#C2410C}{\sigma^2_g}}{\color{#C2410C}{\sigma^2_g} + \color{#6D28D9}{\sigma^2}} \]

The ICC is the between-cluster variance as a share of the total: between-cluster plus within-cluster variance.

Here, σ²_g is the between-cluster variance and σ² is the within-cluster variance. An ICC of 0 means no clustering effect (all variance is within clusters), while an ICC of 1 means all variance is between clusters.
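One way to see why ρ is called a correlation is to write the outcome as a random-intercept decomposition. This is a sketch under stated assumptions: the symbols μ, u_i, and e_ij are introduced here for illustration, with u_i and e_ij assumed independent.

\[ y_{ij} = \mu + u_i + e_{ij}, \qquad \operatorname{Var}(u_i) = \sigma^2_g, \qquad \operatorname{Var}(e_{ij}) = \sigma^2 \]

Two different members j ≠ k of the same cluster i share only the cluster term u_i, so their covariance is σ²_g, while each has total variance σ²_g + σ². Their correlation is therefore

\[ \operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\operatorname{Cov}(y_{ij}, y_{ik})}{\sqrt{\operatorname{Var}(y_{ij})\,\operatorname{Var}(y_{ik})}} = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2} = \rho \]

Under this model the ICC is both a variance share and the correlation between any two observations from the same cluster.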

R Activity: ICC and design effect from a clustered dataset

The dataset phaa_clinics.csv contains 30 primary-care clinics with 18-45 patients each (~960 patients total). The continuous outcome sbp is influenced by both patient-level covariates (age, smoker, bmi) and clinic-level covariates (clinic_size, clinic_urban). The full annotated script is in r-activities/HSCI_410_Lesson_9_Introduction_to_Clustered_Data.R.

library(lme4)         # lmer(): null random-intercept model
library(performance)  # icc(): ICC from a fitted mixed model
library(sandwich)     # vcovCL(): cluster-robust covariance matrix
library(lmtest)       # coeftest(): coefficient tests with a supplied covariance

clinics <- read.csv("phaa_clinics.csv", stringsAsFactors = FALSE)
clinics$clinic_id <- factor(clinics$clinic_id)

# 1. ICC from a one-way ANOVA
aov_fit <- aov(sbp ~ clinic_id, data = clinics)
ms      <- summary(aov_fit)[[1]][, "Mean Sq"]   # ms[1] = between clinics, ms[2] = within clinics
n_per   <- mean(table(clinics$clinic_id))       # average cluster size
icc_h   <- (ms[1] - ms[2]) / (ms[1] + (n_per - 1) * ms[2])
icc_h

# 2. Same ICC, conveniently, from a null mixed model
m_null <- lmer(sbp ~ 1 + (1 | clinic_id), data = clinics)
icc(m_null)

# 3. Design effect and effective sample size
deff <- 1 + (n_per - 1) * icc_h
deff
nrow(clinics) / deff                            # effective sample size

# 4. Cluster-robust SEs as a quick fix for naive OLS
naive <- lm(sbp ~ age + smoker + bmi + clinic_urban, data = clinics)
coeftest(naive, vcov. = vcovCL, cluster = ~ clinic_id)

Reading the output. An ICC of about 0.08 to 0.10 means roughly 8 to 10 percent of the total variation in SBP is between clinics. The design effect (here around 3.5 to 4 with an average cluster size of 32) tells you that clustering has inflated the variance to roughly three and a half to four times what you would get under simple random sampling (SRS). Cluster-robust SEs are the easiest fix when you can't refit the analysis as a mixed model.
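The activity script plugs the mean cluster size into the ANOVA estimator. When cluster sizes are unequal, a commonly used refinement replaces the mean with an adjusted size n0, which is never larger than the mean. The sketch below illustrates this on simulated data, not on phaa_clinics.csv; the variance components, the seed, and all object names are assumptions chosen to resemble the activity.

```r
set.seed(410)

# Simulated stand-in for the clinic data (all values are assumptions).
k        <- 30                                   # number of clinics
n_i      <- sample(18:45, k, replace = TRUE)     # unequal clinic sizes
sigma2_g <- 9                                    # between-clinic variance
sigma2   <- 91                                   # within-clinic variance (true ICC = 0.09)

clinic <- factor(rep(seq_len(k), times = n_i))
u      <- rnorm(k, mean = 0, sd = sqrt(sigma2_g))
sbp    <- 125 + u[as.integer(clinic)] + rnorm(sum(n_i), sd = sqrt(sigma2))

# One-way ANOVA mean squares: ms[1] = between clinics, ms[2] = within clinics
ms <- summary(aov(sbp ~ clinic))[[1]][, "Mean Sq"]

# Adjusted cluster size for unbalanced designs
N  <- sum(n_i)
n0 <- (N - sum(n_i^2) / N) / (k - 1)

icc_mean <- (ms[1] - ms[2]) / (ms[1] + (mean(n_i) - 1) * ms[2])  # mean-size version
icc_n0   <- (ms[1] - ms[2]) / (ms[1] + (n0 - 1) * ms[2])         # adjusted version

c(mean_size = mean(n_i), n0 = n0, icc_mean = icc_mean, icc_n0 = icc_n0)
```

With cluster sizes between 18 and 45 the two versions usually differ only slightly, which is why the simpler mean-size formula is often adequate.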

Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. Report your icc_h value (from the one-way ANOVA) and the ICC from icc(m_null). Are they the same (or very close)? In one sentence, what fraction of total variance in SBP lies between clinics?

Model answer: icc_h from the one-way ANOVA and icc(m_null) from the random-intercept model return nearly identical values, because both estimate the same quantity: the ratio of between-cluster variance to total variance. For this clinic dataset the ICC lands in the neighbourhood of 0.08 to 0.10, so roughly 8 to 10 percent of total variance in SBP lies between clinics and the remaining 90 percent or so is within-clinic (between individuals). This is a moderate ICC, and not unusual for a physiological measure in a multi-clinic cohort.

2. Compute and report the design effect deff and the effective sample size nrow(clinics) / deff. In one sentence, explain what the deff means in practical terms (e.g., "each observation carries the information of...").

Model answer: Design effect deff = 1 + (m̄ − 1) × ICC. With an average cluster size near 32 (about 960 patients across 30 clinics) and an ICC near 0.09, deff = 1 + 31 × 0.09 ≈ 3.8. Effective sample size = nrow(clinics) / deff ≈ 960 / 3.8 ≈ 250. Practical meaning: each clinic-clustered observation carries the information of only about 0.26 of an independent observation, so the roughly 960 rows are worth only about 250 genuinely independent ones. The design effect quantifies how much statistical power is ‘wasted’ because nearby observations within a clinic share information.

3. Compare the naive OLS SEs from summary(naive)$coef[, "Std. Error"] to the cluster-robust SEs from coeftest(naive, vcov. = vcovCL, cluster = ~ clinic_id). Which predictor sees the biggest SE inflation, and why does ignoring clustering produce SEs that are too small?

Model answer: Cluster-robust SEs are typically 1.5–2.5x larger than naive OLS SEs, with the biggest inflation on cluster-level predictors (e.g., clinic-level covariates like clinic_urban). Naive OLS SEs are too small because they assume observations are independent, but observations within a clinic are correlated, so each new observation provides less new information than OLS ‘thinks.’ The result is overly narrow CIs, inflated test statistics, and Type-I error rates well above 5%. Cluster-robust SEs (or, better, mixed-model SEs) restore valid inference.


The Design Effect (deff)

The design effect (also called the variance inflation factor in the clustering context) quantifies how much the variance of an estimate is inflated due to clustering, compared to what it would be under simple random sampling (Campbell, Elbourne, & Altman, 2004).

Design effect (Eq 20.3)

\[ \color{#0B7B6B}{\text{deff}} = 1 + (\color{#C2410C}{\bar{m}} - 1)\,\color{#6D28D9}{\rho} \]

The design effect grows with the average cluster size and the intraclass correlation: it is the factor by which clustering inflates the variance.

Where m̄ is the average cluster size and ρ is the ICC. The practical meaning: if ICC = 0.05 and cluster size = 20, then deff = 1 + (20 − 1)(0.05) = 1.95, meaning the effective sample size is roughly halved.
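A short derivation shows where the formula comes from. It is a sketch under stated assumptions: K clusters of equal size m (so n = Km), and the random-intercept decomposition y_ij = μ + u_i + e_ij with Var(u_i) = σ²_g and Var(e_ij) = σ². The symbols K, μ, u_i, and e_ij are introduced here for illustration. The variance of the overall mean is then

\[ \operatorname{Var}(\bar{y}) = \frac{\sigma^2_g}{K} + \frac{\sigma^2}{Km} = \frac{\sigma^2_g + \sigma^2}{Km}\,\bigl[\, m\rho + (1 - \rho) \,\bigr] = \frac{\sigma^2_g + \sigma^2}{n}\,\bigl[\, 1 + (m - 1)\rho \,\bigr] \]

The first factor is the variance the mean would have if all n observations were independent; the bracketed factor is the design effect. With unequal cluster sizes, replacing m by the average m̄ is an approximation.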

Corrected variance (Eq 20.4)

\[ \color{#0B7B6B}{\widehat{\text{Var}}_{\text{corrected}}} = \color{#C2410C}{\text{deff}} \times \color{#6D28D9}{\widehat{\text{Var}}_{\text{naive}}} \]

The correct variance is the design effect times the naive variance that wrongly assumed independence.

🏘 Interactive: ICC, Design Effect & Type I Error Inflation

Generate a clustered dataset (e.g., students within schools, patients within hospitals) under the null hypothesis: the treatment has no real effect. Run many simulated studies and watch how often a naive (clustering-ignoring) test wrongly declares significance. The Type I error inflation is the cost of pretending clustered data is independent.

Interactive panels: one simulated dataset (each colour is a cluster; within-cluster similarity grows with ICC) and the Type I error of the naive versus the cluster-aware test. Default controls: number of clusters K = 20, cluster size m = 25, true ICC (ρ) = 0.10, α = 0.05. Reported outputs: design effect (deff), effective n, naive Type I error rate, and cluster-aware rate.

Try this: set ICC = 0.05, K = 20, m = 50. Run the simulation. The naive Type I error climbs well above α, often 15–25%, because the test thinks it has 1,000 independent points but really has only ~290. The cluster-aware test stays at α.
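The same experiment can be sketched offline in base R. This is an illustrative simulation, not the code behind the interactive: it assumes treatment is allocated to whole clusters, uses a naive two-sample t-test on individuals, and uses a t-test on cluster means as the cluster-aware comparison. The exact naive rejection rate depends on these choices and on the number of replicates.

```r
set.seed(2024)

# One simulated study under the null (no true treatment effect).
simulate_once <- function(K = 20, m = 50, icc = 0.05, alpha = 0.05) {
  cluster <- rep(seq_len(K), each = m)
  treat_k <- sample(rep(0:1, length.out = K))       # whole clusters allocated (assumption)
  u       <- rnorm(K, sd = sqrt(icc))               # cluster effects, variance = ICC
  y       <- u[cluster] + rnorm(K * m, sd = sqrt(1 - icc))  # total variance = 1
  treat   <- treat_k[cluster]

  # Naive test: treats all K * m observations as independent
  p_naive <- t.test(y ~ treat)$p.value

  # Cluster-aware test: one summary value (the mean) per cluster
  ybar      <- as.numeric(tapply(y, cluster, mean))
  p_cluster <- t.test(ybar ~ treat_k)$p.value

  c(naive = p_naive < alpha, cluster = p_cluster < alpha)
}

res   <- replicate(500, simulate_once())   # 2 x 500 logical matrix
rates <- rowMeans(res)                     # empirical Type I error rates
rates
```

The naive rate should sit well above 0.05 while the cluster-means test stays close to it. Increasing the number of replicates tightens both estimates.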

Effects on Continuous Outcomes

For continuous outcomes, the ICC directly measures the proportion of total variance due to between-cluster differences. The design effect formula deff = 1 + (m̄ − 1)ρ applies straightforwardly. The corrected standard error is obtained by multiplying the naive SE by √deff.

Example: With 20 clusters of 50 subjects each (n = 1,000), ICC = 0.05, the deff = 3.45. A naive SE of 0.50 would become 0.50 × √3.45 = 0.93, nearly double the naive estimate.

Effects on Discrete Outcomes

With binary or other discrete outcomes, the effects of clustering are analogous but more complex. Clustering affects the standard error estimation and can also influence point estimates. The design effect concept extends to discrete outcomes, but the ICC for binary data is defined differently and its estimation is more involved.

For binary outcomes, the variance of a proportion under clustering is inflated by a factor analogous to the deff. The practical consequence is the same: ignoring clustering leads to underestimated SEs and inflated Type I error.
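As a rough illustration of that inflation for a proportion, the sketch below applies the continuous-outcome formula deff = 1 + (m̄ − 1)ρ directly to a binomial variance. All numbers are assumed values chosen for illustration, and treating ρ as an ICC on the proportion scale is itself an assumption.

```r
# Assumed illustrative values (not taken from a dataset in this lesson).
p_hat <- 0.20    # observed proportion
n     <- 1000    # total sample size
m_bar <- 50      # average cluster size
icc   <- 0.05    # ICC on the proportion scale (assumption)

deff <- 1 + (m_bar - 1) * icc

se_naive     <- sqrt(p_hat * (1 - p_hat) / n)   # binomial SE assuming independence
se_clustered <- se_naive * sqrt(deff)           # SE inflated by sqrt(deff)

z <- qnorm(0.975)
ci_naive     <- p_hat + c(-1, 1) * z * se_naive
ci_clustered <- p_hat + c(-1, 1) * z * se_clustered

round(rbind(naive = ci_naive, clustered = ci_clustered), 3)
```

The clustered interval is wider by the factor √deff, mirroring the continuous case.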

Analysis Approach	SE Estimate	95% CI Width	P-value
Naive (ignoring clustering)	0.50	1.96	0.001
Cluster-adjusted (deff = 3.45)	0.93	3.64	0.077

This table illustrates how accounting for clustering nearly doubles the standard error, widens the confidence interval, and can change a “significant” result to a non-significant one.
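The table entries can be reproduced approximately in base R. The point estimate is not given in the lesson, so this sketch backs it out from the naive SE and naive P-value under a normal approximation; that implied estimate is an assumption, and small differences from the table come from rounding.

```r
deff     <- 3.45
se_naive <- 0.50
p_naive  <- 0.001

# Implied estimate (assumption): the value whose naive z-test gives p = 0.001
z_naive  <- qnorm(1 - p_naive / 2)
estimate <- z_naive * se_naive

# Design-effect correction: SE is multiplied by sqrt(deff)
se_adj <- se_naive * sqrt(deff)
z_adj  <- estimate / se_adj
p_adj  <- 2 * pnorm(-abs(z_adj))

ci_width <- function(se) 2 * qnorm(0.975) * se   # full width of a 95% CI

data.frame(
  approach = c("Naive", "Cluster-adjusted"),
  se       = round(c(se_naive, se_adj), 2),
  ci_width = round(c(ci_width(se_naive), ci_width(se_adj)), 2),
  p_value  = round(c(p_naive, p_adj), 3)
)
```

Dividing the naive z-statistic by √deff is the same correction expressed on the test-statistic scale.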

Reflection

A study reports p = 0.03 for a treatment effect, but the data come from 10 hospitals with 50 patients each. If the ICC is 0.05, calculate the design effect and discuss whether the finding might still be significant after accounting for clustering.

Model answer: Design effect = 1 + (m − 1) × ICC = 1 + 49 × 0.05 = 3.45. Effective sample size = 500/3.45 ≈ 145. With 10 hospitals and effective n ≈ 145, the test statistic that gave p = 0.03 needs re-evaluation. The originally-reported p assumed independence; correcting for clustering inflates the SE by √3.45 ≈ 1.86. The z-statistic that gave p = 0.03 (z ≈ 2.17) shrinks to z ≈ 2.17/1.86 ≈ 1.17, yielding p ≈ 0.24, which is no longer significant. Lesson: a small ICC and modest cluster size can still produce design effects that overturn naive significance tests. Always report ICC, design effect, and cluster-corrected p-values alongside the naive ones.


Section 4

Methods for Dealing with Clustering

⏱ Estimated time: 20 minutes

Section 4 of 4


A roadmap from simple corrections to mixed models and generalized estimating equations.

First step

Detecting and quantifying clustering

Visual inspection

Plot outcomes by cluster. Pronounced between-cluster differences in means indicate clustering.

ICC estimation

Fit a null random-intercept model. The ICC from the variance components quantifies clustering strength.

Likelihood ratio test

Compare the null random-intercept model to the same model without the random intercept (fixed effects only). A significant test indicates that the random-effect term meaningfully improves fit.

Methods overview

Fixed effects and correction factors

Fixed effects / stratification

Include cluster indicators in the model. Eliminates all cluster-level confounding. Best with few clusters. Cannot estimate cluster-level predictor effects.

Design-effect correction

Multiply standard errors by √deff. A quick, transparent post-hoc fix. Requires an estimate of ICC and assumes constant cluster size and correlation.

Corrected standard error = naive SE × √deff. Corrected variance = naive variance × deff.

Robust variance and survey methods

Flexible approaches for complex data

Robust (sandwich) variance

Consistent standard errors without specifying the within-cluster correlation structure. Applied to clustered data in the GEE framework by Liang & Zeger (1986). Requires ≥20–30 clusters.

Survey methods

Design-based inference accounting for stratification, clustering, and selection weights. Use when the data come from a formal complex-survey design with known probabilities.

Sandwich variance estimator (schematic)

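\[ \widehat{\text{Var}}_{\text{sandwich}}(\hat{\beta}) = (X^\top X)^{-1} \left( \sum_{c=1}^{C} X_c^\top \hat{u}_c \hat{u}_c^\top X_c \right) (X^\top X)^{-1} \]

X: full design matrix; X_c: rows of X for cluster c; û_c: vector of residuals for cluster c; C: number of clusters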

Comparison

Choosing among the methods

Method	Handles confounding	Min. clusters	Key assumption
Fixed effects	All cluster-level	Few OK	None for cluster effects
deff correction	No	Any	Known ICC
Robust variance	No	≥20–30	None for correlation structure
Survey methods	Design-based	Varies	Known sampling design

Carry forward

Two distinctions that run through later lessons

Marginal vs. conditional

GEE targets population-averaged (marginal) effects. Mixed (random-effects) models target cluster-specific (conditional) effects. The choice depends on the scientific question.

Report ICC and deff

Always report the ICC and design effect alongside results from clustered data analyses. These numbers let readers calibrate the study's effective precision.
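The marginal versus conditional distinction matters numerically for non-linear links. The sketch below assumes a random-intercept logistic model with hypothetical parameter values and integrates over the random intercept to obtain the population-averaged probabilities. It illustrates the general attenuation of the marginal log odds ratio relative to the conditional one; it is not output from any model fitted in this lesson.

```r
# Hypothetical parameter values (assumptions for illustration only).
beta0 <- 0     # conditional intercept on the logit scale
beta1 <- 1     # conditional (cluster-specific) log odds ratio
sd_u  <- 2     # SD of the random intercept

# Population-averaged probability: average plogis(eta + u) over u ~ N(0, sd_u^2)
marginal_prob <- function(eta, sd_u) {
  integrate(function(u) plogis(eta + u) * dnorm(u, sd = sd_u),
            lower = -Inf, upper = Inf)$value
}

p0 <- marginal_prob(beta0, sd_u)           # unexposed
p1 <- marginal_prob(beta0 + beta1, sd_u)   # exposed

marginal_log_or <- qlogis(p1) - qlogis(p0)

c(conditional = beta1, marginal = marginal_log_or)
```

The marginal value is closer to zero than the conditional one, and the gap widens as sd_u grows. With sd_u = 0 the two coincide.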

Introduction and Overview

From the problem to the toolkit. Earlier sections established that clustering is common, that ignoring it inflates Type I error, and that simulations make the magnitude undeniable. This section turns to the menu of solutions: fixed effects, correction factors, robust (sandwich) variance estimation, design-based survey methods, generalised estimating equations (GEE) (Liang & Zeger, 1986; Zeger & Liang, 1986), and mixed (random-effects) models (Laird & Ware, 1982; Diez Roux, 2000). Each has different strengths, different assumptions, and a different way of “paying back” the variance that clustering takes away. Treat this section as a roadmap. We will return to mixed models for continuous outcomes, mixed models for discrete outcomes, and repeated-measures designs in later lessons.

Learning Objectives

Detect clustering using visual inspection, ICC estimation, and likelihood ratio tests for random-effects terms.
Compare fixed-effect, correction-factor, robust-variance, and survey-weighted approaches to clustered data.
Distinguish marginal (population-averaged) from conditional (cluster-specific) interpretations of regression coefficients.
Outline how generalised estimating equations (GEE) and mixed models address clustering and where each is preferred.
Choose an analytical strategy based on the number of clusters, the question, and whether cluster-level predictors are of interest.

Detecting Clustering

Before choosing a method for handling clustering, you must first detect and quantify it. Common approaches include visual inspection (e.g., plotting outcomes by cluster), ICC estimation (fitting a random-intercept model to estimate the between-cluster variance), and likelihood ratio tests (comparing models with and without cluster-level random effects).
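The ICC estimate and the likelihood ratio test can be sketched with lme4 on simulated data. The data-generating values and object names below are assumptions. Both models are fitted by maximum likelihood so their log-likelihoods are comparable, and the P-value is halved because the variance is tested on the boundary of its parameter space, a common convention rather than something prescribed in this lesson.

```r
library(lme4)   # lmer(), VarCorr()

set.seed(20)

# Simulated clinics (all values are assumptions): true ICC = 9 / (9 + 90.25)
K <- 30
m <- 30
clinic <- factor(rep(seq_len(K), each = m))
u      <- rnorm(K, sd = 3)
d <- data.frame(
  clinic = clinic,
  sbp    = 125 + u[as.integer(clinic)] + rnorm(K * m, sd = 9.5)
)

m0 <- lm(sbp ~ 1, data = d)                                   # no clustering
m1 <- lmer(sbp ~ 1 + (1 | clinic), data = d, REML = FALSE)    # random intercept, ML fit

# ICC from the estimated variance components
vc      <- as.data.frame(VarCorr(m1))
icc_hat <- vc$vcov[vc$grp == "clinic"] / sum(vc$vcov)

# Likelihood ratio test for the random intercept
lr_stat <- as.numeric(2 * (logLik(m1) - logLik(m0)))
p_lrt   <- 0.5 * pchisq(lr_stat, df = 1, lower.tail = FALSE)

c(icc = icc_hat, lr = lr_stat, p = p_lrt)
```

A boxplot of sbp by clinic, for example boxplot(sbp ~ clinic, data = d), covers the visual-inspection step on the same data.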

Methods for Handling Clustered Data

Fixed Effects & Stratification

Include cluster indicators as fixed effects in the model. This effectively stratifies the analysis by cluster, adjusting for all cluster-level confounders (both measured and unmeasured). However, this approach uses many degrees of freedom (one for each cluster minus one) and does not allow estimation of cluster-level predictor effects.

Best when: There are relatively few clusters, cluster-level confounding is the primary concern, and you do not need to estimate effects of cluster-level variables.
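The limitation on cluster-level predictors can be seen directly in base R. In this sketch on simulated data (variable names and parameter values are assumptions), the clinic indicators absorb everything that is constant within a clinic, so the coefficient for the clinic-level covariate cannot be estimated and lm() reports it as NA.

```r
set.seed(12)

# Simulated data (all values are assumptions for illustration).
K <- 8
m <- 25
clinic  <- factor(rep(seq_len(K), each = m))
urban_k <- rep(0:1, length.out = K)            # clinic-level covariate
urban   <- urban_k[as.integer(clinic)]
age     <- rnorm(K * m, mean = 50, sd = 10)    # patient-level covariate
sbp     <- 120 + 0.3 * (age - 50) + 4 * urban +
           rnorm(K, sd = 3)[as.integer(clinic)] + rnorm(K * m, sd = 9)

# Cluster indicators as fixed effects; urban is entered after them
fe_fit <- lm(sbp ~ age + clinic + urban)

coef(fe_fit)[c("age", "urban")]   # age is estimated; urban is NA (aliased)
```

The patient-level coefficient for age is still estimated, using only within-clinic variation.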

Correction Factor Methods

deff-based correction: Divide test statistics by √deff or multiply standard errors by √deff. This is a simple post-hoc adjustment that requires an estimate of the ICC and the average cluster size.

Overdispersion-based correction: A similar principle using a dispersion parameter estimated from the data. The Pearson or deviance goodness-of-fit statistic divided by its degrees of freedom provides a scale factor that can be applied to the variance-covariance matrix.

Best when: A quick adjustment is needed and a more sophisticated approach is not feasible.
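The overdispersion-based correction can be sketched with a cluster-level binomial GLM in base R. The simulated data and parameter values are assumptions. The scale factor is the Pearson statistic divided by its residual degrees of freedom, and scaling the naive standard errors by its square root matches what the quasibinomial family reports.

```r
set.seed(77)

# Simulated cluster-level counts (all values are assumptions).
K      <- 20                                    # clusters
m      <- 50                                    # subjects per cluster
treat  <- rep(0:1, each = K / 2)                # cluster-level exposure
p_k    <- plogis(-1 + rnorm(K, sd = 0.6))       # cluster-specific risks
events <- rbinom(K, size = m, prob = p_k)

# Binomial fit that ignores the extra between-cluster variation
fit_bin <- glm(cbind(events, m - events) ~ treat, family = binomial)

# Dispersion: Pearson chi-square divided by residual degrees of freedom
phi <- sum(residuals(fit_bin, type = "pearson")^2) / df.residual(fit_bin)

se_naive  <- sqrt(diag(vcov(fit_bin)))
se_scaled <- se_naive * sqrt(phi)               # variance-covariance scaled by phi

# The quasibinomial family applies the same scaling internally
fit_quasi <- glm(cbind(events, m - events) ~ treat, family = quasibinomial)
se_quasi  <- sqrt(diag(vcov(fit_quasi)))

rbind(naive = se_naive, scaled = se_scaled, quasibinomial = se_quasi)
```

A value of phi well above 1 signals more variation between clusters than the binomial model allows.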

Robust (Sandwich) Variance Estimator

The robust variance estimator does not assume a specific correlation structure within clusters. It provides valid standard errors even if the within-cluster correlation is misspecified (Liang & Zeger, 1986). This makes it very attractive for practical use.

However, it requires a moderate-to-large number of clusters (rule of thumb: at least 20–30). With too few clusters, the sandwich estimator can underestimate the true variance.

Best when: You have enough clusters and want valid inference without specifying the exact correlation structure.
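For an OLS fit, the cluster-robust sandwich can be computed by hand in base R: the "bread" is (XᵀX)⁻¹ and the "meat" sums, over clusters, the outer product of each cluster's score Xcᵀûc. This is a minimal sketch on simulated data with assumed parameter values, and it omits the small-sample corrections that packages such as sandwich apply by default, so its numbers will differ slightly from vcovCL().

```r
set.seed(31)

# Simulated data under the null for x (all values are assumptions).
K <- 30
m <- 20
cluster <- rep(seq_len(K), each = m)
x_k <- rep(0:1, length.out = K)                 # cluster-level predictor
x   <- x_k[cluster]
y   <- 1 + rnorm(K, sd = 1)[cluster] + rnorm(K * m, sd = 2)

fit <- lm(y ~ x)
X   <- model.matrix(fit)
res <- residuals(fit)

bread <- solve(crossprod(X))                    # (X'X)^{-1}

# Meat: sum over clusters of (X_c' u_c)(X_c' u_c)'
meat <- matrix(0, ncol(X), ncol(X))
for (g in unique(cluster)) {
  idx   <- cluster == g
  score <- crossprod(X[idx, , drop = FALSE], res[idx])
  meat  <- meat + tcrossprod(score)
}

V_robust <- bread %*% meat %*% bread

se_robust <- setNames(sqrt(diag(V_robust)), colnames(X))
se_naive  <- sqrt(diag(vcov(fit)))

rbind(naive = se_naive, cluster_robust = se_robust)
```

Because x varies only between clusters, its robust standard error should be clearly larger than the naive one.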

Survey Methods

Survey methods account for complex sampling designs including stratification, clustering, and unequal selection probabilities (weighting). They use design-based inference rather than model-based inference, which means the validity of the analysis depends on the sampling design rather than distributional assumptions.

Available in most statistical software: Stata’s svy commands, SAS PROC SURVEY procedures, R’s survey package.

Best when: The data come from a complex survey design with known selection probabilities.

Method	Handles Confounding	Min. Clusters	Assumptions	Software
Fixed Effects	All cluster-level	Few OK	None for cluster effects	All packages
deff Correction	No	Any	Known ICC	Manual calculation
Robust Variance	No	≥20–30	None for correlation	Stata, R, SAS
Survey Methods	Design-based	Varies	Known design	Stata svy, SAS PROC SURVEY, R survey


Practical Recommendations

Always check for clustering before finalising your analysis. Estimate the ICC, calculate the design effect, and choose a method appropriate to your study design and the number of clusters. When in doubt, use multiple methods and compare results. If the conclusions are consistent across approaches, you can be more confident in your findings.

Reflection

You are analyzing data from a multi-site clinical trial with 25 sites and approximately 40 patients per site. Which method(s) for handling clustering would you recommend, and why?

Model answer: For a 25-site trial with about 40 patients per site, a mixed (random-effects) model is the recommended approach. It accounts for between-site variability, allows site-level covariates (urbanicity, size, regional resources), and supports decomposing effects into between-site and within-site components. GEE is a valid alternative if the question concerns a population-averaged effect rather than a site-specific one. Cluster-robust SEs on OLS are a quick fix that works when the question is a simple mean difference, but they lose efficiency. Fixed site effects work well with fewer than about 10 sites; with 25 sites the loss of degrees of freedom is meaningful. Best practice for a multi-site trial: pre-specify site as a random effect in the primary analysis, with cluster-robust SEs as a sensitivity check. Report the ICC explicitly.


HSCI 410, Lesson 9
