HSCI 410 — Lesson 9

Introduction to Clustered Data

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Recognize and describe different types of clustered (hierarchical) data structures in epidemiology
  • Explain why observations within clusters are correlated and how this affects standard statistical analyses
  • Calculate and interpret the intraclass correlation coefficient (ICC) and design effect (deff)
  • Understand the impact of clustering on standard errors and inference for both continuous and discrete outcomes
  • Describe key methods for dealing with clustering, including fixed effects, robust variance estimators, and survey methods
  • Evaluate the consequences of ignoring clustering in epidemiologic analyses

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary — Key Terms, People & Concepts

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas
Clustered data: Data in which observations are grouped within higher-level units (e.g., students within schools, patients within clinics, repeated measures within subjects). Observations within a cluster tend to be more similar than observations between clusters, violating the standard independence assumption of ordinary regression.
Hierarchical (multilevel) data: Clustered data with a clear nesting structure across two or more levels (e.g., level-1 = pupils nested within level-2 = classrooms nested within level-3 = schools) (Wikipedia contributors, 2026).
Nested design: Lower-level units belong to one and only one higher-level unit (e.g., a particular pupil belongs to exactly one classroom). Most common in epidemiology.
Crossed design: Lower-level units are observed across multiple higher-level groupings simultaneously (e.g., the same students rated by multiple teachers; the same animals seen by multiple veterinarians).
Cluster (level-2 unit): The grouping unit within which level-1 observations are correlated. Examples: clinic, household, herd, school, neighborhood.
Exchangeability: An assumption that any two observations within the same cluster have the same correlation regardless of position. Underlies the simplest random-intercept and compound symmetry covariance models.
Fixed effects: Parameters representing the population-average effect of a covariate. Fixed effects are estimated as single values; their interpretation does not depend on a distribution of cluster-level deviations.
Random effects: Cluster-specific deviations modeled as draws from a (typically normal) distribution. They allow the model to borrow strength across clusters and to quantify between-cluster variability.
Ecological fallacy: Drawing conclusions about individuals from group-level associations. A reminder that level-1 and level-2 effects can differ in direction or magnitude.
Contextual effect: An effect attributable to the cluster (e.g., school climate, neighborhood deprivation) that is not captured by individual-level covariates.
Non-informative cluster size: An assumption that the size of a cluster is unrelated to the outcome distribution within it. When violated, standard mixed-model and GEE inferences can be biased.
Methods & Statistical Concepts
Intraclass correlation coefficient (ICC): The proportion of total outcome variance attributable to the cluster: ICC = σ²_between / (σ²_between + σ²_within). Measures how similar observations within the same cluster are. ICC = 0 means no clustering; ICC = 1 means observations within a cluster are identical.
Design effect (DEFF): DEFF = 1 + (m − 1) · ICC, where m is the average cluster size. The factor by which the variance of an estimate is inflated (and by which the sample size must be increased) under clustering, relative to a simple random sample of the same size. Standard errors are inflated by √DEFF.
Effective sample size: The total sample size divided by the design effect. Reflects the equivalent number of independent observations after accounting for within-cluster correlation.
Naive (independence) analysis: An analysis that ignores clustering and treats observations as independent. Typically produces standard errors that are too small for between-cluster effects (anti-conservative) and too large for within-cluster effects.
Random-intercept model: A mixed model that allows each cluster's mean (intercept) to deviate from the overall mean by a random amount drawn from a normal distribution. The simplest way to introduce clustering into a regression.
Robust (sandwich) variance estimator: A method for obtaining standard errors that are valid even if the assumed correlation structure is wrong. Often used with GEE or with cluster-robust adjustments to OLS or GLM fits.
Generalized estimating equations (GEE): A marginal (population-average) approach to clustered data that specifies a working correlation structure and uses sandwich variance estimators for inference. Robust to misspecification of the correlation but does not separate between- and within-cluster effects (Wikipedia contributors, 2026).
Survey methods (complex sampling): Estimation procedures that account for stratification, clustering, and weights from a complex sample design. Often used as an alternative to mixed models for descriptive estimates.
Simulation study: A computational experiment in which artificial data are generated under known conditions to evaluate how a method performs (e.g., type-I error, coverage of confidence intervals) when clustering is ignored or modeled.
Variance components: The separate variances at each level of a hierarchical model (e.g., between-cluster σ²_u and within-cluster σ²_e). Their sum gives the total outcome variance.
Cluster bootstrap: A bootstrap procedure that resamples whole clusters (rather than individual observations) to obtain valid standard errors and confidence intervals under clustering.
Key People
Nan Laird & James Ware: Co-authors of the foundational 1982 paper formalizing the linear mixed-effects model for longitudinal and clustered data. Their framework underpins modern multilevel modeling in epidemiology (Diez Roux, 2000).
Kung-Yee Liang & Scott Zeger: Johns Hopkins biostatisticians who introduced generalized estimating equations (GEE) in 1986, providing a flexible marginal approach to correlated data with robust variance estimation (see also Zeger & Liang, 1986).
Harvey Goldstein (1939–2020): British statistician who pioneered multilevel modeling in education and social science research. His textbook and the MLwiN software broadly disseminated the methods.
Stephen Raudenbush & Anthony Bryk: American social scientists whose textbook on hierarchical linear models (HLM) became a standard reference for multilevel analysis in education and the social sciences.
Section 1

Introduction & Types of Clustered Data

⏱ Estimated time: 15 minutes

Introduction and Overview

Where this lesson fits

Lessons 3–8 systematically built up the regression toolkit for different outcome types — continuous, binary, ordered, multi-category, count, and time-to-event — but every one of those models rested on a single shared assumption: observations are independent. Real epidemiologic data almost never satisfy that assumption. Patients are nested within clinics, students within classrooms, repeated measurements within people, and households within neighbourhoods. Lesson 9 opens the final arc of the course by taking that assumption head-on.

The four content sections build the case in stages. Section 1 defines clustered data and catalogues the common ways it arises in public health research. Section 2 shows analytically why clustering breaks standard inference — standard errors collapse, Type I error rates inflate, and conclusions drift. Section 3 uses simulation to make the consequences tangible: you will see what happens to confidence intervals and p-values when you ignore clustering at different intra-cluster correlations. Section 4 previews the family of methods designed to handle clustering — mixed effects, GEE, robust/sandwich variances, and design-based survey adjustments — setting up the deeper treatment in Lessons 10–12.

Learning Objectives

  • Define clustered (hierarchical) data and contrast it with the independent-observations assumption of standard regression.
  • Identify common sources of clustering in public-health research, including environmental, spatial, repeated-measures, and design-induced grouping.
  • Explain how within-cluster similarity arises from shared exposures, contexts, or stable subject-level traits.
  • Distinguish predictor clustering from outcome clustering and recognise both in study descriptions.

What Is Clustered Data?

In many epidemiologic studies, observations are not independent. Instead, they are grouped within higher-level units—these groups are called clusters. Clustered (or hierarchical) data arises whenever the study design or the natural structure of the population creates groupings such that observations within the same group tend to be more similar to each other than to observations in other groups (Galbraith, Daniel, & Vissel, 2010; Killip, Mahfoud, & Pearce, 2004).

Why Clustering Matters

Standard statistical methods assume observations are independent. When data are clustered, this assumption is violated: observations within the same cluster share common influences (e.g., the same hospital, the same household, the same geographic region). Ignoring clustering can lead to underestimated standard errors, inflated Type I error rates, and potentially biased point estimates (Donner & Klar, 2004).

Types of Clustered Data

Clustering arises from many different sources. Understanding the type of clustering present in your data is the first step toward choosing an appropriate analytical strategy.

Common environment: Animals in herds, patients in hospitals

In veterinary epidemiology, animals within the same herd share management practices, nutrition, housing, and disease exposure. Similarly, patients within the same hospital share institutional protocols, staffing levels, and local disease ecology. These shared exposures create within-cluster correlation—outcomes for subjects in the same cluster are more alike than for subjects in different clusters.

Spatial clustering: Geographic proximity and shared exposures

People living near each other are often exposed to similar environmental factors: air pollution, water quality, neighbourhood safety, socioeconomic deprivation, and access to healthcare services. This means that health outcomes for individuals in the same geographic area are correlated. Studies that sample from defined geographic areas (e.g., census tracts, postal codes) must account for this spatial clustering.

Repeated measures: Longitudinal and crossover designs

When the same subjects are measured multiple times (e.g., before and after treatment, or at regular intervals in a cohort study), each subject forms a cluster. The repeated observations within a subject are correlated because stable individual characteristics (genetics, baseline health, behaviour) influence all measurements. The correlation structure depends on the timing and spacing of measurements.

Hierarchical structures: Multi-level nesting

Many real-world data structures have multiple levels. For example, students are nested within classrooms, classrooms within schools, and schools within districts. Each level contributes its own source of variation. In epidemiology, patients may be nested within physician practices, practices within health regions, and regions within provinces. Split-plot designs from experimental settings also create hierarchical structures where treatments are applied at different levels of the hierarchy.

Cross-classified and split-plot designs

Cross-classified structures arise when subjects belong to multiple grouping factors that do not nest within each other. For example, students may be classified by both school and neighbourhood—students from the same neighbourhood may attend different schools, and students from the same school may live in different neighbourhoods. Split-plot designs, common in agricultural experiments, have some factors applied at the whole-plot (cluster) level and others at the subplot (individual) level.
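
The nested versus cross-classified distinction can be checked directly in a dataset. The sketch below is illustrative only: the toy data frame and the helper is_nested() are hypothetical names introduced here, not part of the course materials. It tests whether every lower-level unit maps to exactly one higher-level unit.

```r
# Hypothetical helper: TRUE if each 'lower' unit belongs to exactly one 'upper' unit
is_nested <- function(lower, upper) {
  n_upper_per_lower <- tapply(upper, lower, function(x) length(unique(x)))
  all(n_upper_per_lower == 1)
}

# Toy data (assumed for illustration): six students
d <- data.frame(
  student       = 1:6,
  classroom     = c("c1", "c1", "c2", "c2", "c3", "c3"),
  school        = c("A",  "A",  "A",  "A",  "B",  "B"),
  neighbourhood = c("n1", "n2", "n1", "n2", "n1", "n2")
)

nested_class  <- is_nested(d$classroom, d$school)       # classrooms sit inside one school
nested_nbhd   <- is_nested(d$neighbourhood, d$school)   # neighbourhoods span schools
nested_school <- is_nested(d$school, d$neighbourhood)   # schools span neighbourhoods

c(classroom_in_school = nested_class,
  neighbourhood_in_school = nested_nbhd,
  school_in_neighbourhood = nested_school)
```

Classrooms are nested in schools, while school and neighbourhood fail the check in both directions, which is the signature of a cross-classified structure.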

Sources of Variation & Predictor Clustering

In clustered data, the total variation in the outcome can be decomposed into between-cluster variation (differences between groups) and within-cluster variation (differences among individuals within the same group). The relative magnitude of these two sources of variation determines the strength of the clustering effect.

An important consideration is predictor clustering—when predictor variables also vary between clusters. If both the exposure and the outcome vary at the cluster level, group-level associations may differ from individual-level associations. This is the basis of the ecological fallacy: inferring individual-level relationships from aggregate (group-level) data.

Correlation Between Observations in the Same Cluster (Eq 20.1)
ρ = cov(Y_ij, Y_ik) / √(var(Y_ij) × var(Y_ik))
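
Eq 20.1 can be checked numerically. The sketch below assumes a simple random-intercept data-generating process with made-up variance values (1 between clusters, 3 within); under that assumption the correlation between two members of the same cluster should equal the between-cluster share of the total variance.

```r
# Minimal sketch with assumed values: two observations per cluster share one cluster effect
set.seed(410)
n_clusters <- 2000   # many clusters so the estimate is stable
sigma2_g   <- 1      # between-cluster variance (assumed)
sigma2_e   <- 3      # within-cluster variance (assumed)

u  <- rnorm(n_clusters, sd = sqrt(sigma2_g))        # shared cluster effect
y1 <- u + rnorm(n_clusters, sd = sqrt(sigma2_e))    # first member,  Y_ij
y2 <- u + rnorm(n_clusters, sd = sqrt(sigma2_e))    # second member, Y_ik

rho_true <- sigma2_g / (sigma2_g + sigma2_e)        # 0.25 by construction
rho_hat  <- cor(y1, y2)                             # cov / sqrt(var * var), as in Eq 20.1

c(true = rho_true, estimated = rho_hat)
```

The estimate should fall close to 0.25; the exact value depends on the seed.
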
Knowledge Check — Section 1

1. Which is an example of clustered data?

Clustering occurs when observations are grouped within higher-level units (like hospitals), creating potential correlation among observations within the same group.

2. What is cross-classified data?

Cross-classified structures occur when units belong to multiple grouping factors simultaneously (e.g., students classified by both school and neighbourhood), unlike nested/hierarchical structures.

3. Why does predictor clustering matter?

When predictors are clustered, the relationship observed at the group (ecological) level may not reflect the individual-level relationship—this is the ecological fallacy.

Reflection

Think of a research study in your field. What natural clustering structures might exist in the data? How might ignoring this clustering affect your conclusions?

Model answer: Pick a field-relevant study (e.g., a school-based mental health intervention). Natural clustering: students within classrooms, classrooms within schools, and schools within districts, giving multilevel clustering with an ICC at each level. Ignoring clustering has several consequences: (a) SEs are too small in student-level analyses, which inflates the false-positive rate; (b) treatment effects appear stronger than they are because variance is mis-attributed to individuals rather than schools; (c) sample-size calculations underestimate the n needed for adequate power; (d) interpretation is wrong, because you cannot make individual-level claims when the intervention varies only between schools. Downstream, meta-analyses and replication efforts find smaller effects than the original, giving a misleading impression of replication failure when the original analysis was simply overconfident.
Section 2

Effects of Clustering on Statistical Analysis

⏱ Estimated time: 20 minutes

Introduction and Overview

From recognising clustering to quantifying its damage. Section 1 catalogued where clustered structures come from. This section answers the natural follow-up question: so what? When observations within a cluster are even mildly correlated, the precision implied by a naive analysis — one that pretends every observation is independent — is far too optimistic. Here we develop the formal vocabulary (the intraclass correlation, the design effect, the effective sample size) that makes the size of that distortion measurable. These quantities reappear in every method we cover in Section 4 and in Lessons 10–12, so it is worth slowing down on the intuition.

Learning Objectives

  • Explain why ignoring clustering produces standard errors that are too small and Type I error rates that are too large.
  • Compute the intraclass correlation coefficient (ICC) from between- and within-cluster variance components.
  • Calculate the design effect (deff) and effective sample size from the ICC and average cluster size.
  • Translate a given ICC and cluster size into a quantitative statement about precision loss.

Impact on Standard Errors

The most important consequence of clustering is its effect on standard errors. When observations within clusters are positively correlated (as is almost always the case), treating them as independent leads to standard errors that are too small. This in turn produces test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow—all of which inflate the Type I error rate.

How Ignoring Clustering Inflates Significance

Imagine a study of 1,000 patients in 20 hospitals (50 per hospital). If the ICC is 0.05, the design effect is 1 + (50 − 1)(0.05) = 3.45. The effective sample size is only 1,000/3.45 ≈ 290 rather than 1,000. A naive analysis treating all 1,000 observations as independent would dramatically overstate the precision of estimates.
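
The arithmetic in this example is easy to reproduce. The two helper functions below are hypothetical names introduced for illustration; they simply encode deff = 1 + (m − 1) × ICC and effective n = total n / deff.

```r
# Hypothetical helpers for the design effect and effective sample size
design_effect <- function(m, icc) 1 + (m - 1) * icc
effective_n   <- function(n_total, m, icc) n_total / design_effect(m, icc)

# Values from the hospital example: 20 hospitals x 50 patients, ICC = 0.05
deff  <- design_effect(m = 50, icc = 0.05)
n_eff <- effective_n(n_total = 1000, m = 50, icc = 0.05)

round(c(deff = deff, n_eff = n_eff), 2)
```

Changing m or icc shows how quickly the effective sample size falls as either grows.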

The Intraclass Correlation Coefficient (ICC)

For continuous outcomes, the ICC (intraclass correlation coefficient) measures the proportion of total variance that is attributable to between-cluster differences. It quantifies the degree of similarity among observations within the same cluster (Killip, Mahfoud, & Pearce, 2004; Wikipedia contributors, 2026).

Intraclass Correlation Coefficient (Eq 20.2)
ρ = σ²_g / (σ²_g + σ²)

Here, σ²_g is the between-cluster variance and σ² is the within-cluster variance. An ICC of 0 means no clustering effect (all variance is within clusters), while an ICC of 1 means all variance is between clusters.

R Activity — ICC and design effect from a clustered dataset

The dataset phaa_clinics.csv contains 30 primary-care clinics with 18-45 patients each (~960 patients total). The continuous outcome sbp is influenced by both patient-level covariates (age, smoker, bmi) and clinic-level covariates (clinic_size, clinic_urban). The full annotated script is in r-activities/HSCI_410_Lesson_9_Introduction_to_Clustered_Data.R.

library(lme4)         # lmer()
library(performance)  # icc()
library(sandwich)     # vcovCL()
library(lmtest)       # coeftest()

clinics <- read.csv("phaa_clinics.csv", stringsAsFactors = FALSE)
clinics$clinic_id <- factor(clinics$clinic_id)

# 1. ICC from a one-way ANOVA: (MSB - MSW) / (MSB + (m - 1) * MSW)
aov_fit <- aov(sbp ~ clinic_id, data = clinics)
ms      <- summary(aov_fit)[[1]][, "Mean Sq"]   # ms[1] = between clinics, ms[2] = within
n_per   <- mean(table(clinics$clinic_id))       # average cluster size (approximate, clinics are unequal)
icc_h   <- (ms[1] - ms[2]) / (ms[1] + (n_per - 1) * ms[2])
icc_h

# 2. The same quantity, estimated from a null (intercept-only) mixed model
m_null <- lmer(sbp ~ 1 + (1 | clinic_id), data = clinics)
icc(m_null)

# 3. Design effect and effective sample size
deff <- 1 + (n_per - 1) * icc_h
deff
nrow(clinics) / deff

# 4. Cluster-robust SEs as a quick fix for naive OLS
naive <- lm(sbp ~ age + smoker + bmi + clinic_urban, data = clinics)
summary(naive)$coef[, "Std. Error"]                      # naive SEs, for comparison
coeftest(naive, vcov. = vcovCL, cluster = ~ clinic_id)   # cluster-robust SEs

Reading the output. An ICC of ~0.10-0.15 means about 10-15% of total variation in SBP is between clinics. With an average cluster size of 32, an ICC in that range gives a design effect between about 4.1 and 5.65 (1 + 31 × ICC), so cluster sampling inflates the variance to roughly four to six times what you'd get under SRS. Cluster-robust SEs are the easiest fix when you can't refit the analysis as a mixed model.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. Report your icc_h value (from the one-way ANOVA) and the ICC from icc(m_null). Are they the same (or very close)? In one sentence, what fraction of total variance in SBP lies between clinics?

Model answer: icc_h from the one-way ANOVA and icc(m_null) from the random-intercept model typically return nearly identical values, around 0.15–0.20. They agree because both estimate the same quantity: the ratio of between-cluster variance to total variance. Interpretation: about 17% of total variance in SBP lies between clinics; the remaining 83% is within-clinic (between individuals). This is a moderate ICC; it would not be unusual in a multi-clinic cohort.

2. Compute and report the design effect deff and the effective sample size nrow(clinics) / deff. In one sentence, explain what the deff means in practical terms (e.g., "each observation carries the information of...").

Model answer: Design effect deff = 1 + (m − 1) × ICC, where m is the average cluster size. For example, with m = 50 and ICC = 0.17, deff = 1 + 49 × 0.17 ≈ 9.3. Effective sample size = total n / deff, so 500 observations from 10 clinics of 50 each have effective n ≈ 500/9.3 ≈ 54. Practical meaning: each clinic-clustered observation carries the information of approximately 0.11 independent observations. The design effect quantifies how much statistical power is ‘wasted’ because observations within a clinic share information.

3. Compare the naive OLS SEs from summary(naive)$coef[, "Std. Error"] to the cluster-robust SEs from coeftest(naive, vcov. = vcovCL, cluster = ~ clinic_id). Which predictor sees the biggest SE inflation, and why does ignoring clustering produce SEs that are too small?

Model answer: Cluster-robust SEs are typically 1.5–2.5x larger than naive OLS SEs, with the biggest inflation on cluster-level predictors (e.g., clinic-level covariates like clinic_urban). Naive OLS SEs are too small because they assume observations are independent, but observations within a clinic are correlated, so each new observation provides less new information than OLS ‘thinks.’ The result is overly narrow CIs, inflated test statistics, and Type-I error rates well above 5%. Cluster-robust SEs (or, better, mixed-model SEs) restore valid inference.

The Design Effect (deff)

The design effect (also called the variance inflation factor in the clustering context) quantifies how much the variance of an estimate is inflated due to clustering, compared to what it would be under simple random sampling (Campbell, Elbourne, & Altman, 2004).

Design Effect (Eq 20.3)
deff = 1 + (m̄ − 1)ρ

Where m̄ is the average cluster size and ρ is the ICC. The practical meaning: if ICC = 0.05 and cluster size = 20, then deff = 1 + (20 − 1)(0.05) = 1.95, meaning the effective sample size is roughly halved.

Corrected Variance (Eq 20.4)
var_corrected = deff × var_naive

🏘 Interactive: ICC, Design Effect & Type I Error Inflation

Generate a clustered dataset (e.g., students within schools, patients within hospitals) under the null hypothesis: the treatment has no real effect. Run many simulated studies and watch how often a naive (clustering-ignoring) test wrongly declares significance. The Type I error inflation is the cost of pretending clustered data is independent.

[Interactive simulation. Panel 1: one simulated dataset, with each colour marking a cluster; within-cluster similarity grows with ICC. Panel 2: Type I error for the naive versus the cluster-aware test. Readouts: design effect (deff), effective n, naive Type I rate, cluster-aware rate.]
Try this: set ICC = 0.05, K = 20, m = 50. Run the simulation. The naive Type I error climbs well above α (often 15–25% or higher, depending on how the treatment is assigned), because the test thinks it has 1,000 independent points but really has only ~290. The cluster-aware test stays close to α.
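
The same experiment can be sketched offline in base R. This is not the code behind the interactive widget; it is a minimal simulation under stated assumptions: treatment is assigned to whole clusters (10 per arm), the null hypothesis is true, total variance is fixed at 1, and the cluster-aware analysis is a t-test on cluster means. The function one_study() is a hypothetical name.

```r
set.seed(9)
K      <- 20     # number of clusters
m      <- 50     # subjects per cluster
icc    <- 0.05   # intraclass correlation (total variance = 1)
n_sims <- 500    # simulated studies

one_study <- function() {
  cluster <- rep(seq_len(K), each = m)
  trt_k   <- sample(rep(c(0, 1), each = K / 2))         # arm assigned per cluster
  u       <- rnorm(K, sd = sqrt(icc))                   # cluster effects
  y       <- u[cluster] + rnorm(K * m, sd = sqrt(1 - icc))
  trt     <- trt_k[cluster]

  p_naive <- t.test(y ~ trt)$p.value                    # treats all rows as independent
  y_bar   <- as.numeric(tapply(y, cluster, mean))       # one summary per cluster
  p_aware <- t.test(y_bar ~ trt_k)$p.value              # K independent units
  c(naive = p_naive, aware = p_aware)
}

p_vals <- replicate(n_sims, one_study())    # 2 x n_sims matrix of p-values
type1  <- rowMeans(p_vals < 0.05)           # share of false positives per method
type1
```

Under these assumptions the naive rate should land far above 0.05 while the cluster-mean test stays near 0.05; exact values vary with the seed and the number of simulated studies.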

Effects on Continuous Outcomes

For continuous outcomes, the ICC directly measures the proportion of total variance due to between-cluster differences. The design effect formula deff = 1 + (m̄ − 1)ρ applies straightforwardly. The corrected standard error is obtained by multiplying the naive SE by √deff.

Example: With 20 clusters of 50 subjects each (n = 1,000), ICC = 0.05, the deff = 3.45. A naive SE of 0.50 would become 0.50 × √3.45 = 0.93—nearly double the naive estimate.

Effects on Discrete Outcomes

With binary or other discrete outcomes, the effects of clustering are analogous but more complex. Clustering affects not only the standard error estimation but can also influence point estimates. The design effect concept extends to discrete outcomes, but the ICC for binary data is defined differently and its estimation is more involved.

For binary outcomes, the variance of a proportion under clustering is inflated by a factor analogous to the deff. The practical consequence is the same: ignoring clustering leads to underestimated SEs and inflated Type I error.

Analysis Approach              | SE Estimate | 95% CI Width | P-value
Naive (ignoring clustering)    | 0.50        | 1.96         | 0.001
Cluster-adjusted (deff = 3.45) | 0.93        | 3.64         | 0.087

This table illustrates how accounting for clustering nearly doubles the standard error, widens the confidence interval, and can change a “significant” result to a non-significant one.
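
The adjustment behind these figures can be sketched in a few lines. The helper adjust_for_clustering() is a hypothetical name, the test is a large-sample z-test, and the point estimate of 1.59 is an assumed value chosen only so that the naive row roughly matches the table; the source does not report the estimate.

```r
# Hypothetical helper: rescale a naive SE by sqrt(deff) and recompute CI width and p-value
adjust_for_clustering <- function(estimate, se_naive, deff) {
  se <- c(naive = se_naive, adjusted = se_naive * sqrt(deff))
  z  <- estimate / se
  data.frame(
    se       = se,
    ci_width = 2 * qnorm(0.975) * se,   # full width of a 95% interval
    p_value  = 2 * pnorm(-abs(z))       # two-sided z-test
  )
}

# estimate = 1.59 is an assumption for illustration, not a value from the lesson
res <- adjust_for_clustering(estimate = 1.59, se_naive = 0.50, deff = 3.45)
round(res, 3)
```

The point estimate itself is unchanged by the correction; only the uncertainty around it grows.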

Knowledge Check — Section 2

1. If the ICC is 0.10 and average cluster size is 21, what is the design effect?

deff = 1 + (m̄ − 1)ρ = 1 + (21 − 1)(0.10) = 1 + 2.0 = 3.0. This means the effective sample size is only about one-third of the nominal sample size.

2. What happens to Type I error rates when clustering is ignored?

Ignoring clustering leads to artificially small standard errors, which makes test statistics too large and P-values too small, inflating the Type I error rate (false positives).

3. The ICC represents:

The ICC = σ²_g / (σ²_g + σ²) quantifies the proportion of total variance attributable to differences between clusters. Higher ICC means more clustering effect.

Reflection

A study reports p = 0.03 for a treatment effect, but the data come from 10 hospitals with 50 patients each. If the ICC is 0.05, calculate the design effect and discuss whether the finding might still be significant after accounting for clustering.

Model answer: Design effect = 1 + (m − 1) × ICC = 1 + 49 × 0.05 = 3.45. Effective sample size = 500/3.45 ≈ 145. The originally reported p assumed independence; correcting for clustering inflates the SE by √3.45 ≈ 1.86. The z-statistic that gave p = 0.03 (z ≈ 2.17) shrinks to z ≈ 2.17/1.86 ≈ 1.17, yielding p ≈ 0.24, which is no longer significant. Lesson: a small ICC and a modest cluster size can still produce design effects that overturn naive significance tests. Always report the ICC, the design effect, and cluster-corrected p-values, not just the naive ones.
Section 3

Simulation Studies & Impact of Clustering

⏱ Estimated time: 15 minutes

Introduction and Overview

From formulas to felt consequences. Section 2 gave us the mathematical machinery — ICCs, design effects, effective sample sizes — that explains why clustering inflates false-positive rates. This section translates those formulas into something visceral. Simulation studies generate data with a known clustering structure and then analyse it both correctly (accounting for clusters) and incorrectly (ignoring them), so we can measure the gap directly. The takeaway you should carry forward: even small ICCs can produce alarming Type I error inflation when cluster sizes are moderate or large — and that empirical fact is what motivates the methods previewed in Section 4.

Learning Objectives

  • Describe how simulation studies quantify the consequences of misspecified analyses under known data-generating processes.
  • Predict the direction and approximate magnitude of Type I error inflation given an ICC and cluster size.
  • Distinguish confounding by cluster from clustering of outcomes, and explain why each requires a different analytical response.
  • Use simulation results to justify the choice of a clustered-data analysis method on practical, not just theoretical, grounds.

Why Simulation Studies?

Simulation studies allow us to examine the practical consequences of ignoring clustering under controlled conditions. By generating data with known clustering structures and then analysing it both correctly and incorrectly, we can quantify the bias and Type I error inflation that results from ignoring the cluster structure (Galbraith, Daniel, & Vissel, 2010; Donner & Klar, 2004).

Warning: Ignoring Clustering Can Be Dangerous

Even small ICC values (e.g., 0.01–0.05) can lead to substantially inflated Type I error rates when cluster sizes are large. A study with ICC = 0.01 and 50 clusters of 50 subjects can have an actual Type I error rate of 10–15% instead of the nominal 5%. Researchers who ignore clustering risk reporting findings that appear statistically significant but are actually false positives.

Binary Outcome Simulations

Simulation studies with binary outcomes demonstrate that the consequences of ignoring clustering can be severe. Even when the ICC is small, the combination of within-cluster correlation and moderate-to-large cluster sizes can inflate the actual Type I error rate well beyond the nominal 5% level.

Scenario 1: Small ICC, large clusters (ICC = 0.01, n = 50 per cluster)

With ICC = 0.01 and 50 subjects per cluster, the design effect is deff = 1 + (50 − 1)(0.01) = 1.49. Although this seems modest, simulation studies show the actual Type I error rate can reach 10–15% depending on the number of clusters and the analysis method. The inflation occurs because the naive analysis assumes 50 independent observations per cluster when, in reality, the effective number is only about 34.

Scenario 2: Moderate ICC, moderate clusters (ICC = 0.05, n = 20 per cluster)

With ICC = 0.05 and 20 subjects per cluster, the design effect is deff = 1 + (20 − 1)(0.05) = 1.95. The effective sample size is nearly halved. Simulation studies show actual Type I error rates of 15–25% when the naive analysis is used. In other words, when there is truly no effect, the naive test wrongly declares significance in roughly one study out of every four to seven, rather than the nominal one in twenty.

Scenario 3: Large ICC, any cluster size (ICC = 0.10, n = 30 per cluster)

With ICC = 0.10 and 30 subjects per cluster, the design effect is deff = 1 + (30 − 1)(0.10) = 3.90. The effective sample size is reduced to about one-quarter of the nominal size. Simulation studies demonstrate Type I error rates of 30–40% or more when clustering is ignored. This level of inflation makes the naive analysis unreliable.
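
A rough analytic check on the three scenarios is possible without running a simulation. The sketch below rests on assumptions that are not stated in the lesson: a large-sample two-sided z-test, balanced clusters, and an exposure that varies only between clusters, so that the naive test statistic is too large by a factor of √deff. The function naive_type1() is a hypothetical name.

```r
# Approximate Type I error of a naive z-test when the true variance is deff times larger
naive_type1 <- function(m, icc, alpha = 0.05) {
  deff   <- 1 + (m - 1) * icc
  z_crit <- qnorm(1 - alpha / 2)        # critical value the naive test uses
  2 * pnorm(-z_crit / sqrt(deff))       # chance the correctly scaled statistic exceeds it
}

# ICC and cluster size for Scenarios 1 to 3
scenarios <- data.frame(icc = c(0.01, 0.05, 0.10), m = c(50, 20, 30))
scenarios$deff        <- 1 + (scenarios$m - 1) * scenarios$icc
scenarios$type1_naive <- naive_type1(scenarios$m, scenarios$icc)

round(scenarios, 3)
```

Under these assumptions the approximation gives roughly 11%, 16%, and 32%, in line with the ranges quoted for the three scenarios. Simulated rates will differ somewhat with the number of clusters, the outcome type, and the test used.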

Confounding by Cluster

Beyond distorting standard errors, clustering can also introduce confounding. If a cluster-level variable is associated with both the exposure and the outcome, it acts as a confounder. Failure to account for the clustering structure means this confounding is not addressed, which can lead to biased point estimates, not just incorrect standard errors.

Example: Confounding by Region

Suppose disease prevalence varies by region, and exposure to a risk factor also varies by region. If we analyse the data without accounting for region (the cluster), the association between exposure and disease will be confounded by regional differences. The estimated exposure effect may be biased upward or downward depending on the direction and magnitude of the confounding.

The magnitude of this bias depends on the correlation between the predictor and the cluster-level confounder. Stronger correlations produce greater bias.
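
A small simulation makes the regional example concrete. All numbers below are assumptions chosen for illustration: an unmeasured regional factor raises both exposure and outcome, and the true individual-level exposure effect is 0.5. Including region as a fixed effect (one of the approaches listed in the lesson objectives) removes the between-region confounding.

```r
set.seed(20)
n_regions <- 30
n_per     <- 40
n_total   <- n_regions * n_per
region    <- rep(seq_len(n_regions), each = n_per)

r <- rnorm(n_regions)                               # unmeasured regional factor
x <- 0.8 * r[region] + rnorm(n_total)               # exposure varies by region
y <- 0.5 * x + 1.0 * r[region] + rnorm(n_total)     # true exposure effect = 0.5

b_naive  <- unname(coef(lm(y ~ x))["x"])                    # ignores region
b_within <- unname(coef(lm(y ~ x + factor(region)))["x"])   # region as fixed effects

c(truth = 0.5, naive = b_naive, within_region = b_within)
```

The naive slope is biased upward here because the regional factor pushes exposure and outcome in the same direction; reversing the sign of one of those assumed coefficients would bias it downward instead.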

Knowledge Check — Section 3

1. In simulations with binary outcomes and moderate ICC, ignoring clustering:

Simulation studies consistently show that even moderate ICCs (e.g., 0.01–0.05) combined with moderate-to-large cluster sizes can inflate actual Type I error rates to 10–15% or higher when clustering is ignored.

2. How can clustering lead to confounding?

If a cluster-level variable (e.g., region) is associated with both the exposure and the outcome, it acts as a confounder. Ignoring the clustering structure means this confounding is not addressed.

3. The inflation of Type I error due to clustering depends on:

The design effect formula deff = 1 + (m̄ − 1)ρ shows that both the ICC (ρ) and the average cluster size (m̄) jointly determine how much clustering inflates variance and affects inference.

Reflection

Why might even a small ICC (e.g., 0.02) be problematic in a large cluster randomized trial with 100 participants per cluster? Calculate the design effect and discuss the implications.

Model answer: Design effect = 1 + (m − 1) × ICC = 1 + 99 × 0.02 = 2.98. With 100 participants per cluster, even a small ICC of 0.02 nearly triples the variance relative to an analysis that assumes independence, so the effective sample size is about one-third of the nominal size. Implications for a cluster RCT: the sample size needed to achieve 80% power is about three times what an independence-based calculation predicts. Many cluster trials are underpowered for this reason, because planners treat the nominal n as adequate. The lesson generalises: cluster size matters at least as much as ICC, and even a modest ICC produces a large design effect when clusters are large. The combination (ICC = 0.02, m = 100) is exactly the regime where naive analyses look fine but cluster-corrected analyses fail to reach significance.
Section 4

Methods for Dealing with Clustering

⏱ Estimated time: 20 minutes

Introduction and Overview

From the problem to the toolkit. Sections 1–3 established that clustering is common, that ignoring it inflates Type I error, and that simulations make the magnitude undeniable. Section 4 turns to the menu of solutions: fixed effects, correction factors, robust (sandwich) variance estimation, design-based survey methods, generalised estimating equations (GEE) (Liang & Zeger, 1986; Zeger & Liang, 1986), and mixed (random-effects) models (Laird & Ware, 1982; Diez Roux, 2000). Each has different strengths, different assumptions, and a different way of “paying back” the variance that clustering takes away. Treat this section as a roadmap — we will return to mixed models for continuous outcomes in Lesson 10, mixed models for discrete outcomes in Lesson 11, and repeated-measures designs in Lesson 12.

Learning Objectives

  • Detect clustering using visual inspection, ICC estimation, and likelihood ratio tests for random-effects terms.
  • Compare fixed-effect, correction-factor, robust-variance, and survey-weighted approaches to clustered data.
  • Distinguish marginal (population-averaged) from conditional (cluster-specific) interpretations of regression coefficients.
  • Outline how generalised estimating equations (GEE) and mixed models address clustering and where each is preferred.
  • Choose an analytical strategy based on the number of clusters, the question, and whether cluster-level predictors are of interest.

Detecting Clustering

Before choosing a method for handling clustering, you must first detect and quantify it. Common approaches include visual inspection (e.g., plotting outcomes by cluster), ICC estimation (fitting a random-intercept model to estimate the between-cluster variance), and likelihood ratio tests (comparing models with and without cluster-level random effects).
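As a rough illustration of ICC estimation, the sketch below uses the one-way ANOVA estimator, ICC = (MSB − MSW) / (MSB + (m − 1) MSW), for equal-sized clusters. This is a simpler stand-in for the random-intercept model mentioned above, not the same procedure. The function name and the toy data are assumptions.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equal-sized clusters.

    `clusters` is a list of lists, one inner list of outcomes per cluster.
    """
    k = len(clusters)
    m = len(clusters[0])
    if any(len(c) != m for c in clusters):
        raise ValueError("this sketch assumes equal cluster sizes")
    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    means = [sum(c) / m for c in clusters]
    # Between-cluster mean square (k - 1 degrees of freedom).
    msb = m * sum((mu - grand_mean) ** 2 for mu in means) / (k - 1)
    # Within-cluster mean square (k * (m - 1) degrees of freedom).
    msw = sum((y - mu) ** 2 for c, mu in zip(clusters, means) for y in c) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)


# Toy data (hypothetical): three clusters whose means differ strongly.
print(round(anova_icc([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 3))
```

The estimator can return a negative value when clusters are no more alike than chance would suggest, which is usually read as no meaningful clustering.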

Methods for Handling Clustered Data

Fixed Effects & Stratification

Include cluster indicators as fixed effects in the model. This effectively stratifies the analysis by cluster, adjusting for all cluster-level confounders (both measured and unmeasured). However, this approach uses many degrees of freedom (one for each cluster minus one) and does not allow estimation of cluster-level predictor effects.

Best when: There are relatively few clusters, cluster-level confounding is the primary concern, and you do not need to estimate effects of cluster-level variables.

Correction Factor Methods

deff-based correction: Divide z- or t-type test statistics by √deff or, equivalently, multiply standard errors by √deff. This is a simple post-hoc adjustment that requires an estimate of the ICC and the average cluster size.

Overdispersion-based correction: A similar principle using a dispersion parameter estimated from the data. The Pearson or deviance goodness-of-fit statistic divided by its degrees of freedom provides a scale factor that can be applied to the variance-covariance matrix.

Best when: A quick adjustment is needed and a more sophisticated approach is not feasible.
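The deff-based correction can be sketched in a few lines. The estimate (0.50) and naive standard error (0.20) are hypothetical numbers chosen for illustration, and the p-value uses a normal approximation.

```python
from statistics import NormalDist


def deff_adjust(estimate, naive_se, icc, mean_cluster_size):
    """Inflate a naive SE by sqrt(deff) and recompute the z statistic and p-value."""
    deff = 1.0 + (mean_cluster_size - 1.0) * icc
    adjusted_se = naive_se * deff ** 0.5
    z = estimate / adjusted_se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided, normal approximation
    return adjusted_se, z, p_value


# Hypothetical naive result: estimate 0.50 with SE 0.20 (naive z = 2.5).
adj_se, z, p = deff_adjust(0.50, 0.20, icc=0.10, mean_cluster_size=30)
print(f"adjusted SE = {adj_se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

In this example a result that looks clearly significant under independence (z = 2.5) is no longer significant at the 5% level once the standard error is inflated by √3.9.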

Robust (Sandwich) Variance Estimator

The robust variance estimator does not assume a specific correlation structure within clusters. It provides valid standard errors even if the within-cluster correlation is misspecified (Liang & Zeger, 1986). This makes it very attractive for practical use.

However, it requires a moderate-to-large number of clusters (rule of thumb: at least 20–30). With too few clusters, the sandwich estimator can underestimate the true variance.

Best when: You have enough clusters and want valid inference without specifying the exact correlation structure.
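To show the mechanics, the sketch below computes a cluster-robust standard error for the simplest possible case, an overall mean (an intercept-only model), with no small-sample correction. Residuals are summed within each cluster before squaring, which is what lets within-cluster correlation enter the variance. The function name and toy data are assumptions, and three clusters is far too few for real use.

```python
def naive_and_cluster_robust_se(clusters):
    """Return (naive_se, cluster_robust_se) for the overall mean.

    `clusters` is a list of lists of outcomes, one inner list per cluster.
    """
    y = [v for c in clusters for v in c]
    n = len(y)
    mean = sum(y) / n
    # Naive variance of the mean: treats all n observations as independent.
    naive_var = sum((v - mean) ** 2 for v in y) / (n - 1) / n
    # Sandwich "meat": squared sums of residuals within each cluster.
    meat = sum(sum(v - mean for v in c) ** 2 for c in clusters)
    robust_var = meat / n ** 2  # "bread" for an intercept-only model is 1/n
    return naive_var ** 0.5, robust_var ** 0.5


# Toy data (hypothetical): three strongly clustered groups.
naive_se, robust_se = naive_and_cluster_robust_se([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"naive SE = {naive_se:.3f}, cluster-robust SE = {robust_se:.3f}")
```

In practice the estimator is applied to full regression models through statistical software, usually with a small-sample adjustment.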

Survey Methods

Survey methods account for complex sampling designs including stratification, clustering, and unequal selection probabilities (weighting). They use design-based inference rather than model-based inference, which means the validity of the analysis depends on the sampling design rather than distributional assumptions.

Available in most statistical software: Stata’s svy commands, SAS PROC SURVEY procedures, R’s survey package.

Best when: The data come from a complex survey design with known selection probabilities.
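The weighting idea can be sketched with a weighted prevalence estimate in which each respondent's weight is the inverse of their selection probability. The function name and data are hypothetical. The sketch covers the point estimate only; design-based standard errors that also reflect stratification and clustering are what dedicated survey software provides.

```python
def weighted_prevalence(outcomes, selection_probs):
    """Weighted prevalence using inverse-probability-of-selection weights."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical sample: cases (1) were oversampled relative to non-cases (0).
outcomes = [1, 0, 0, 1]
selection_probs = [0.5, 0.1, 0.1, 0.5]

unweighted = sum(outcomes) / len(outcomes)
weighted = weighted_prevalence(outcomes, selection_probs)
print(f"unweighted = {unweighted:.2f}, weighted = {weighted:.3f}")
```

Here the unweighted prevalence of 0.50 overstates the weighted estimate of about 0.17, because the oversampled cases each represent fewer people in the target population.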

Method | Handles Confounding | Min. Clusters | Assumptions | Software
Fixed Effects | All cluster-level | Few OK | None for cluster effects | All packages
deff Correction | No | Any | Known ICC | Manual calculation
Robust Variance | No | ≥20–30 | None for correlation | Stata, R, SAS
Survey Methods | Design-based | Varies | Known design | Stata svy, SAS PROC SURVEY, R survey

Practical Recommendations

Always check for clustering before finalising your analysis. Estimate the ICC, calculate the design effect, and choose a method appropriate to your study design and the number of clusters. When in doubt, use multiple methods and compare results. If the conclusions are consistent across approaches, you can be more confident in your findings.

Knowledge Check — Section 4

1. The robust (sandwich) variance estimator:

The sandwich estimator is attractive because it produces consistent SEs regardless of the true correlation structure. However, it requires a sufficient number of clusters (typically ≥20–30) to perform well.

2. When would fixed effects for clusters be most appropriate?

Fixed effects for clusters are most useful when there are relatively few clusters and the goal is to control for cluster-level confounding. With many clusters, this approach uses too many degrees of freedom.

3. Survey methods for clustered data:

Survey methods use design-based inference to properly account for complex sampling features including stratification, clustering, and differential selection/weighting, and are available in most major statistical packages.

Reflection

You are analyzing data from a multi-site clinical trial with 25 sites and approximately 40 patients per site. Which method(s) for handling clustering would you recommend, and why?

Model answer: For a 25-site trial with about 40 patients per site, a mixed (random-effects) model is the recommended approach: it accounts for between-site variability, allows site-level covariates (urbanicity, size, regional resources), and supports decomposing effects into between-site and within-site components. GEE is a valid alternative if the question concerns the population-averaged effect rather than the within-site effect. Cluster-robust SEs on OLS are a quick fix that works when the question is a simple mean difference, but they lose efficiency. Fixed site effects work with fewer than 10 sites, but with 25 sites the loss of degrees of freedom is meaningful. Best practice for a multi-site trial: pre-specify site as a random effect in the primary analysis, with cluster-robust SEs as a sensitivity check. Report the ICC explicitly.
Final Assessment

Lesson 9 — Comprehensive Assessment

⏱ Estimated time: 25 minutes

Bringing It All Together

This lesson opened the final arc of HSCI 410 by retiring the independent-observations assumption that has carried us through Lessons 3–8. We started by cataloguing where clustering comes from in public-health data — common environments, geography, repeated measurements, study design — and showed why it is the rule rather than the exception. Sections 2 and 3 then quantified the damage: the intraclass correlation, the design effect, and effective sample size give a precise, computable handle on how much information naive analyses pretend to have but actually don't, and simulation results made the consequences visceral.

Section 4 sketched the toolkit: fixed effects, correction factors, robust variance estimators, design-based survey methods, GEE, and mixed models. Each pays back the precision clustering takes away in a different currency, and the right tool depends on the number of clusters, whether you care about cluster-level predictors, and whether your interpretation is marginal or conditional. Lessons 10 (mixed models for continuous outcomes), 11 (mixed models for discrete outcomes), and 12 (repeated measures) take each of these approaches deeper.

The final assessment asks you to recognise clustering on sight, compute the design effect for a sketched study, and choose a defensible analytic strategy with its trade-offs named.

Key Takeaways from Lesson 9

  • Clustered data is the norm in public-health research, not a special case — the independence assumption underlying standard regression rarely holds.
  • The intraclass correlation coefficient (ICC) measures how much of total variance is between clusters; the design effect deff = 1 + (m−1)ρ translates ICC and cluster size into precision loss.
  • Even small ICCs (< 0.05) can drive Type I error rates well above 5% when cluster sizes are moderate, as simulation studies make explicit.
  • Methods for handling clustering — fixed effects, robust variance, GEE, mixed models, survey weighting — differ in their assumptions, target estimands, and minimum number of clusters.
  • Choosing among them requires being clear about whether the question is marginal (population-averaged) or conditional (cluster-specific).
  • Detecting and quantifying clustering is a required step before fitting any inferential model on hierarchical data — this lesson sets up Lessons 10–12, which fit those models in detail.

Final Reflection

Reflecting on this lesson, how has your understanding of clustered data changed? Describe a specific analytical situation where accounting for clustering would be critical and explain which method you would choose to address it.

Model answer: My understanding has shifted from "clustering is a nuisance to correct after the fact" to "clustering is structural and must be designed for from the outset." A specific situation where it would be critical: a province-wide evaluation of a Family Health Team policy intervention where outcomes are measured at the patient level but the intervention is delivered at the team level. Each team has about 30 patients, and 50 teams are enrolled. The ICC for outcomes like patient satisfaction or HbA1c is typically 0.05–0.10, which gives a design effect of roughly 2.45 to 3.90, so the 1,500 patients carry the information of only about 385 to 610 independent observations. Without clustering correction, the analysis would conclude the policy is effective with high confidence; with proper modelling, the effective sample size is much smaller and the same evidence may be far less conclusive. Method choice: a linear mixed model with random intercepts for team as the primary analysis; estimate the ICC; report the design effect; run sensitivity analyses with GEE and fixed team effects. Pre-register the strategy.
Final Assessment — Lesson 9 (15 Questions)

1. Clustered data is characterized by:

Clustered data has a hierarchical structure where observations within the same group (cluster) tend to be more similar to each other than to observations in other groups.

2. Which is NOT a source of clustering?

Simple random sampling from a homogeneous population produces independent observations. The other options all involve natural grouping that creates within-group correlation.

3. The ecological fallacy occurs when:

The ecological fallacy is the error of assuming that relationships observed at the group (aggregate) level apply to individuals. Group-level and individual-level associations can differ substantially.

4. The ICC ranges from:

For most practical applications, the ICC ranges from 0 (no clustering effect) to 1 (all variance is between clusters). Technically it can be slightly negative but this is rare and usually indicates no meaningful clustering.

5. If σ²g = 4 and σ² = 16, the ICC is:

ICC = σ²g / (σ²g + σ²) = 4 / (4 + 16) = 4/20 = 0.20.

6. A design effect of 5 means the effective sample size is approximately:

The design effect inflates the variance by a factor of 5, which means the effective sample size is reduced to approximately 1/5 (one-fifth) of the nominal sample size.

7. Ignoring clustering in analysis primarily leads to:

The most common consequence of ignoring clustering is underestimated SEs, leading to test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow—inflating Type I error.

8. In a study with ICC = 0.05 and 40 subjects per cluster, the design effect is:

deff = 1 + (m̄ − 1)ρ = 1 + (40 − 1)(0.05) = 1 + 1.95 = 2.95.

9. Cross-classified data structures differ from nested structures because:

In cross-classified structures, units are simultaneously classified by two or more grouping factors that are not nested within each other (e.g., students classified by both school and neighbourhood).

10. Confounding by cluster occurs when:

Confounding by cluster happens when a characteristic of the cluster (e.g., hospital quality, regional socioeconomic status) is related to both the predictor and the outcome, creating a spurious or biased association.

11. The robust (sandwich) variance estimator requires:

The sandwich estimator’s consistency relies on having enough clusters for the between-cluster variance to be well estimated. With too few clusters, it can underestimate the true variance.

12. Fixed effects for clusters:

Fixed effects for clusters add indicator variables for each cluster, which controls for all measured and unmeasured cluster-level confounders but consumes degrees of freedom equal to the number of clusters minus one.

13. Which statement about simulation studies on clustering is TRUE?

Simulations show that even ICC values as small as 0.01–0.02, when combined with large cluster sizes, can inflate actual Type I error rates well beyond the nominal level, because deff depends on both ICC and cluster size.

14. Survey methods for analyzing clustered data:

Survey methods use design-based inference that explicitly accounts for the features of the sampling design (stratification, clustering, and differential selection probabilities/weights).

15. The deff-based correction for clustering involves:

The correction involves multiplying naive SEs by √deff (or equivalently, dividing test statistics by √deff) to account for the variance inflation due to clustering.

Lesson 9 Complete!

You have completed Lesson 9: Introduction to Clustered Data. You can now recognise the common sources of clustered structures in epidemiologic data, quantify their statistical impact using ICCs and design effects, articulate why simulation studies show such dramatic Type I error inflation when clustering is ignored, and orient yourself among the methods that exist to deal with it — from fixed effects and robust variance estimation through GEE and mixed models.

What’s next: Lesson 10 — Mixed Models for Continuous Data picks up the random-effects branch of the toolbox and develops it in depth. You will learn how random intercepts and random slopes partition variance across levels, how to fit and interpret these models in R, and how to choose between mixed and marginal approaches when both are reasonable. Lessons 11 and 12 then extend the framework to discrete outcomes (logistic and Poisson mixed models) and to repeated-measures designs, with Lesson 12 closing the course.