Introduction to Clustered Data
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Recognize and describe different types of clustered (hierarchical) data structures in epidemiology
- Explain why observations within clusters are correlated and how this affects standard statistical analyses
- Calculate and interpret the intraclass correlation coefficient (ICC) and design effect (deff)
- Understand the impact of clustering on standard errors and inference for both continuous and discrete outcomes
- Describe key methods for dealing with clustering, including fixed effects, robust variance estimators, and survey methods
- Evaluate the consequences of ignoring clustering in epidemiologic analyses
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Types of Clustered Data
What Is Clustered Data?
In many epidemiologic studies, observations are not independent. Instead, they are grouped within higher-level units—these groups are called clusters. Clustered (or hierarchical) data arises whenever the study design or the natural structure of the population creates groupings such that observations within the same group tend to be more similar to each other than to observations in other groups.
Standard statistical methods assume observations are independent. When data are clustered, this assumption is violated: observations within the same cluster share common influences (e.g., the same hospital, the same household, the same geographic region). Ignoring clustering can lead to underestimated standard errors, inflated Type I error rates, and potentially biased point estimates.
Types of Clustered Data
Clustering arises from many different sources. Understanding the type of clustering present in your data is the first step toward choosing an appropriate analytical strategy.
Herds, Hospitals, and Other Institutional Clusters
In veterinary epidemiology, animals within the same herd share management practices, nutrition, housing, and disease exposure. Similarly, patients within the same hospital share institutional protocols, staffing levels, and local disease ecology. These shared exposures create within-cluster correlation—outcomes for subjects in the same cluster are more alike than outcomes for subjects in different clusters.
Geographic (Spatial) Clustering
People living near each other are often exposed to similar environmental factors: air pollution, water quality, neighbourhood safety, socioeconomic deprivation, and access to healthcare services. As a result, health outcomes for individuals in the same geographic area are correlated. Studies that sample from defined geographic areas (e.g., census tracts, postal codes) must account for this spatial clustering.
Repeated Measures Within Subjects
When the same subjects are measured multiple times (e.g., before and after treatment, or at regular intervals in a cohort study), each subject forms a cluster. The repeated observations within a subject are correlated because stable individual characteristics (genetics, baseline health, behaviour) influence all measurements. The correlation structure depends on the timing and spacing of measurements.
Multilevel (Nested) Structures
Many real-world data structures have multiple levels. For example, students are nested within classrooms, classrooms within schools, and schools within districts. Each level contributes its own source of variation. In epidemiology, patients may be nested within physician practices, practices within health regions, and regions within provinces.
Cross-Classified and Split-Plot Structures
Cross-classified structures arise when subjects belong to multiple grouping factors that do not nest within each other. For example, students may be classified by both school and neighbourhood—students from the same neighbourhood may attend different schools, and students from the same school may live in different neighbourhoods. Split-plot designs, common in agricultural experiments, create a related hierarchical structure: some factors are applied at the whole-plot (cluster) level and others at the subplot (individual) level.
Sources of Variation & Predictor Clustering
In clustered data, the total variation in the outcome can be decomposed into between-cluster variation (differences between groups) and within-cluster variation (differences among individuals within the same group). The relative magnitude of these two sources of variation determines the strength of the clustering effect.
An important consideration is predictor clustering—when predictor variables also vary between clusters. If both the exposure and the outcome vary at the cluster level, group-level associations may differ from individual-level associations. This is the basis of the ecological fallacy: inferring individual-level relationships from aggregate (group-level) data.
Section 1 Knowledge Check
1. Which is an example of clustered data?
2. What is cross-classified data?
3. Why does predictor clustering matter?
Reflection
Think of a research study in your field. What natural clustering structures might exist in the data? How might ignoring this clustering affect your conclusions?
Effects of Clustering on Statistical Analysis
Impact on Standard Errors
The most important consequence of clustering is its effect on standard errors. When observations within clusters are positively correlated (as is almost always the case), treating them as independent leads to standard errors that are too small. This in turn produces test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow—all of which inflate the Type I error rate.
Imagine a study of 1,000 patients in 20 hospitals (50 per hospital). If the ICC is 0.05, the design effect is 1 + (50 − 1)(0.05) = 3.45. The effective sample size is only 1,000/3.45 ≈ 290 rather than 1,000. A naive analysis treating all 1,000 observations as independent would dramatically overstate the precision of estimates.
The Intraclass Correlation Coefficient (ICC)
For continuous outcomes, the ICC (intraclass correlation coefficient) measures the proportion of total variance that is attributable to between-cluster differences. It quantifies the degree of similarity among observations within the same cluster:
ICC (ρ) = σ²g / (σ²g + σ²)
Here, σ²g is the between-cluster variance and σ² is the within-cluster variance. An ICC of 0 means no clustering effect (all variance is within clusters), while an ICC of 1 means all variance is between clusters.
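These variance components can be estimated directly from data. The sketch below uses the classic one-way ANOVA (method-of-moments) estimator for a balanced design; Python and `numpy` are illustrative choices, and the function name and simulation settings are not from the original text.

```python
import numpy as np

def icc_anova(y):
    """ANOVA (method-of-moments) ICC estimate for a balanced design.
    y: array of shape (k, m) -- k clusters with m subjects each."""
    k, m = y.shape
    cluster_means = y.mean(axis=1)
    # Between- and within-cluster mean squares
    msb = m * np.sum((cluster_means - y.mean()) ** 2) / (k - 1)
    msw = np.sum((y - cluster_means[:, None]) ** 2) / (k * (m - 1))
    sg2 = max((msb - msw) / m, 0.0)  # between-cluster variance, truncated at 0
    sw2 = msw                        # within-cluster variance
    return sg2 / (sg2 + sw2)

# Simulate 20 clusters of 50 subjects with a true ICC of 0.05 (within-cluster SD = 1)
rng = np.random.default_rng(42)
rho, k, m = 0.05, 20, 50
sg = np.sqrt(rho / (1 - rho))  # between-cluster SD implied by the ICC
y = sg * rng.normal(size=(k, 1)) + rng.normal(size=(k, m))
print(round(icc_anova(y), 3))  # an estimate somewhere near 0.05
```

With unbalanced clusters the same idea applies but m is replaced by an adjusted average cluster size; mixed-model software estimates the components by maximum likelihood instead.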
The Design Effect (deff)
The design effect (also called the variance inflation factor in the clustering context) quantifies how much the variance of an estimate is inflated due to clustering, compared to what it would be under simple random sampling.
deff = 1 + (m̄ − 1)ρ
where m̄ is the average cluster size and ρ is the ICC. The practical meaning: if ICC = 0.05 and cluster size = 20, then deff = 1 + (20 − 1)(0.05) = 1.95, meaning the effective sample size is roughly halved.
Effects on Continuous Outcomes
For continuous outcomes, the ICC directly measures the proportion of total variance due to between-cluster differences. The design effect formula deff = 1 + (m̄ − 1)ρ applies straightforwardly. The corrected standard error is obtained by multiplying the naive SE by √deff.
Example: With 20 clusters of 50 subjects each (n = 1,000), ICC = 0.05, the deff = 3.45. A naive SE of 0.50 would become 0.50 × √3.45 = 0.93—nearly double the naive estimate.
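These calculations are simple enough to script. A minimal Python sketch of the worked example above (the function name is an illustrative choice):

```python
import math

def design_effect(icc, m_bar):
    """deff = 1 + (m_bar - 1) * icc, for average cluster size m_bar."""
    return 1 + (m_bar - 1) * icc

# Worked example from the text: 20 clusters of 50 subjects, ICC = 0.05
deff = design_effect(0.05, 50)       # 3.45
n_eff = 1000 / deff                  # effective sample size, about 290
se_adj = 0.50 * math.sqrt(deff)      # naive SE of 0.50 inflated by sqrt(deff)
print(round(deff, 2), round(n_eff), round(se_adj, 2))  # → 3.45 290 0.93
```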
Effects on Discrete Outcomes
With binary or other discrete outcomes, the effects of clustering are analogous but more complex. Clustering affects not only the standard error estimation but can also influence point estimates. The design effect concept extends to discrete outcomes, but the ICC for binary data is defined differently and its estimation is more involved.
For binary outcomes, the variance of a proportion under clustering is inflated by a factor analogous to the deff. The practical consequence is the same: ignoring clustering leads to underestimated SEs and inflated Type I error.
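A sketch of this approximation in Python, applying the deff formula above to the usual p(1 − p)/n variance of a proportion; the prevalence and design values are hypothetical:

```python
import math

def se_prop_clustered(p, n, icc, m_bar):
    """SE of a proportion, inflated by the design effect (a common approximation)."""
    deff = 1 + (m_bar - 1) * icc
    return math.sqrt(p * (1 - p) / n * deff)

# Hypothetical example: prevalence 0.30 among 1,000 subjects in clusters of 20, ICC = 0.05
naive = math.sqrt(0.30 * 0.70 / 1000)        # SE assuming independence
adjusted = se_prop_clustered(0.30, 1000, 0.05, 20)
print(round(naive, 4), round(adjusted, 4))   # adjusted SE is sqrt(1.95) times the naive SE
```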
| Analysis Approach | SE Estimate | 95% CI Width | P-value |
|---|---|---|---|
| Naive (ignoring clustering) | 0.50 | 1.96 | 0.001 |
| Cluster-adjusted (deff = 3.45) | 0.93 | 3.64 | 0.087 |
This table illustrates how accounting for clustering nearly doubles the standard error, widens the confidence interval, and can change a “significant” result to a non-significant one.
Section 2 Knowledge Check
1. If the ICC is 0.10 and average cluster size is 21, what is the design effect?
2. What happens to Type I error rates when clustering is ignored?
3. The ICC represents:
Reflection
A study reports p = 0.03 for a treatment effect, but the data come from 10 hospitals with 50 patients each. If the ICC is 0.05, calculate the design effect and discuss whether the finding might still be significant after accounting for clustering.
Simulation Studies & Impact of Clustering
Why Simulation Studies?
Simulation studies allow us to examine the practical consequences of ignoring clustering under controlled conditions. By generating data with known clustering structures and then analysing it both correctly and incorrectly, we can quantify the bias and Type I error inflation that results from ignoring the cluster structure.
Even moderate ICC values (e.g., 0.01–0.05) can lead to substantially inflated Type I error rates when cluster sizes are large. A study with ICC = 0.01 and 50 clusters of 50 subjects can have an actual Type I error rate of 10–15% instead of the nominal 5%. Researchers who ignore clustering risk reporting findings that appear statistically significant but are actually false positives.
Binary Outcome Simulations
Simulation studies with binary outcomes demonstrate that the consequences of ignoring clustering can be severe. Even when the ICC is small, the combination of within-cluster correlation and moderate-to-large cluster sizes can inflate the actual Type I error rate well beyond the nominal 5% level.
With ICC = 0.01 and 50 subjects per cluster, the design effect is deff = 1 + (50 − 1)(0.01) = 1.49. Although this seems modest, simulation studies show the actual Type I error rate can reach 10–15% depending on the number of clusters and the analysis method. The inflation occurs because the naive analysis treats the 50 observations in each cluster as independent when the effective number of independent observations per cluster is only about 50/1.49 ≈ 34.
With ICC = 0.05 and 20 subjects per cluster, the design effect is deff = 1 + (20 − 1)(0.05) = 1.95. The effective sample size is nearly halved. Simulation studies show actual Type I error rates of 15–25% when the naive analysis is used. This means that one in four or five “significant” findings may be false positives.
With ICC = 0.10 and 30 subjects per cluster, the design effect is deff = 1 + (30 − 1)(0.10) = 3.90. The effective sample size is reduced to about one-quarter of the nominal size. Simulation studies demonstrate Type I error rates exceeding 30–40% when clustering is ignored. This level of inflation makes the naive analysis essentially unreliable.
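A minimal simulation along these lines can be sketched in Python, assuming a normal random-intercept model, cluster-level treatment assignment, no true effect, and a naive z-test; all settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, k, m, reps = 0.05, 20, 20, 2000   # ICC, clusters, cluster size, replications
sg = np.sqrt(rho / (1 - rho))          # between-cluster SD (within-cluster SD = 1)

rejections = 0
for _ in range(reps):
    # Cluster random effects plus individual noise; NO true treatment effect
    y = sg * rng.normal(size=(k, 1)) + rng.normal(size=(k, m))
    treat, ctrl = y[: k // 2].ravel(), y[k // 2 :].ravel()
    # Naive z-test treating all k * m observations as independent
    se = np.sqrt(treat.var(ddof=1) / treat.size + ctrl.var(ddof=1) / ctrl.size)
    z = (treat.mean() - ctrl.mean()) / se
    rejections += abs(z) > 1.96

print(f"empirical Type I error: {rejections / reps:.3f}")
```

With these settings the empirical rejection rate lands well above the nominal 0.05, consistent with the deff of 1.95 implied by ICC = 0.05 and clusters of 20.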
Confounding by Cluster
Beyond inflating standard errors, clustering can also introduce confounding. If a cluster-level variable is associated with both the exposure and the outcome, it acts as a confounder. Failure to account for the clustering structure means this confounding is not addressed, which can lead to biased point estimates—not just incorrect standard errors.
Suppose disease prevalence varies by region, and exposure to a risk factor also varies by region. If we analyse the data without accounting for region (the cluster), the association between exposure and disease will be confounded by regional differences. The estimated exposure effect may be biased upward or downward depending on the direction and magnitude of the confounding.
The magnitude of this bias depends on the correlation between the predictor and the cluster-level confounder. Stronger correlations produce greater bias.
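The mechanism can be illustrated with a small simulation: a region-level effect drives both exposure and outcome while the true individual-level effect is zero, so a naive pooled slope is biased while a within-region (demeaned) slope is not. This Python sketch assumes a simple linear setup; names and settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
k, m = 20, 50                              # regions and subjects per region
u = rng.normal(size=k)                     # region-level effect on the outcome
# Exposure tends to be higher in regions with larger u -> confounding by region
x = rng.normal(loc=u[:, None], scale=1.0, size=(k, m))
y = u[:, None] + rng.normal(size=(k, m))   # true individual-level effect of x is ZERO

def ols_slope(x, y):
    """OLS slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())

naive = ols_slope(x.ravel(), y.ravel())    # pooled analysis: confounded by region
within = ols_slope((x - x.mean(axis=1, keepdims=True)).ravel(),
                   (y - y.mean(axis=1, keepdims=True)).ravel())  # region differences removed
print(round(naive, 3), round(within, 3))
```

The pooled slope is pulled away from zero by the shared regional component, while the within-region slope stays near the true value of zero; the demeaning here is equivalent to including region fixed effects.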
Section 3 Knowledge Check
1. In simulations with binary outcomes and moderate ICC, ignoring clustering:
2. How can clustering lead to confounding?
3. The inflation of Type I error due to clustering depends on:
Reflection
Why might even a small ICC (e.g., 0.02) be problematic in a large cluster randomized trial with 100 participants per cluster? Calculate the design effect and discuss the implications.
Methods for Dealing with Clustering
Detecting Clustering
Before choosing a method for handling clustering, you must first detect and quantify it. Common approaches include visual inspection (e.g., plotting outcomes by cluster), ICC estimation (fitting a random-intercept model to estimate the between-cluster variance), and likelihood ratio tests (comparing models with and without cluster-level random effects).
Methods for Handling Clustered Data
Fixed Effects & Stratification
Include cluster indicators as fixed effects in the model. This effectively stratifies the analysis by cluster, adjusting for all cluster-level confounders (both measured and unmeasured). However, this approach uses many degrees of freedom (the number of clusters minus one) and does not allow estimation of cluster-level predictor effects.
Best when: There are relatively few clusters, cluster-level confounding is the primary concern, and you do not need to estimate effects of cluster-level variables.
Correction Factor Methods
deff-based correction: Divide test statistics by √deff or multiply standard errors by √deff. This is a simple post-hoc adjustment that requires an estimate of the ICC and the average cluster size.
Overdispersion-based correction: A similar principle using a dispersion parameter estimated from the data. The Pearson or deviance goodness-of-fit statistic divided by its degrees of freedom provides a scale factor that can be applied to the variance-covariance matrix.
Best when: A quick adjustment is needed and a more sophisticated approach is not feasible.
Robust (Sandwich) Variance Estimator
The robust variance estimator does not assume a specific correlation structure within clusters. It provides valid standard errors even if the within-cluster correlation is misspecified. This makes it very attractive for practical use.
However, it requires a moderate-to-large number of clusters (rule of thumb: at least 20–30). With too few clusters, the sandwich estimator can underestimate the true variance.
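For intuition, a CR0-style cluster-robust SE for a simple overall mean can be sketched as follows; this is an illustrative Python sketch for a balanced design, not a production implementation:

```python
import numpy as np

def cluster_robust_se_mean(y):
    """CR0-style cluster-robust SE of an overall mean.
    y: array of shape (k, m) -- k clusters of size m (balanced for simplicity).
    Residuals are summed within clusters, so within-cluster correlation of
    any form is absorbed into the variance estimate."""
    n = y.size
    resid = y - y.mean()
    cluster_sums = resid.sum(axis=1)   # one residual total per cluster
    return float(np.sqrt((cluster_sums ** 2).sum()) / n)

# Simulated clustered data: 50 clusters of 20 subjects, ICC = 0.05
rng = np.random.default_rng(3)
rho, k, m = 0.05, 50, 20
sg = np.sqrt(rho / (1 - rho))
y = sg * rng.normal(size=(k, 1)) + rng.normal(size=(k, m))

naive_se = float(y.std(ddof=1) / np.sqrt(y.size))   # assumes independence
robust_se = cluster_robust_se_mean(y)
print(round(naive_se, 4), round(robust_se, 4))
```

With ICC = 0.05 and clusters of 20 the deff is 1.95, so the robust SE typically comes out around √1.95 ≈ 1.4 times the naive SE. In practice you would use a packaged implementation (e.g., Stata's vce(cluster) option or cluster-robust options in R or SAS) rather than hand-rolling this.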
Best when: You have enough clusters and want valid inference without specifying the exact correlation structure.
Survey Methods
Survey methods account for complex sampling designs including stratification, clustering, and unequal selection probabilities (weighting). They use design-based inference rather than model-based inference, which means the validity of the analysis depends on the sampling design rather than distributional assumptions.
Available in most statistical software: Stata’s svy commands, SAS PROC SURVEY procedures, R’s survey package.
Best when: The data come from a complex survey design with known selection probabilities.
| Method | Handles Confounding | Min. Clusters | Assumptions | Software |
|---|---|---|---|---|
| Fixed Effects | All cluster-level | Few OK | None for cluster effects | All packages |
| deff Correction | No | Any | Known ICC | Manual calculation |
| Robust Variance | No | ≥20–30 | None for correlation | Stata, R, SAS |
| Survey Methods | Design-based | Varies | Known design | Stata svy, SAS PROC SURVEY, R survey |
Always check for clustering before finalising your analysis. Estimate the ICC, calculate the design effect, and choose a method appropriate to your study design and the number of clusters. When in doubt, use multiple methods and compare results. If the conclusions are consistent across approaches, you can be more confident in your findings.
Section 4 Knowledge Check
1. The robust (sandwich) variance estimator:
2. When would fixed effects for clusters be most appropriate?
3. Survey methods for clustered data:
Reflection
You are analyzing data from a multi-site clinical trial with 25 sites and approximately 40 patients per site. Which method(s) for handling clustering would you recommend, and why?
Lesson 8 — Comprehensive Assessment
This final assessment covers all material from this lesson. You must answer all 15 questions correctly (100%) and complete the final reflection to finish the lesson.
Final Reflection
Reflecting on this lesson, how has your understanding of clustered data changed? Describe a specific analytical situation where accounting for clustering would be critical and explain which method you would choose to address it.
Final Assessment (15 Questions)
1. Clustered data is characterized by:
2. Which is NOT a source of clustering?
3. The ecological fallacy occurs when:
4. The ICC ranges from:
5. If σ²g = 4 and σ² = 16, the ICC is:
6. A design effect of 5 means the effective sample size is approximately:
7. Ignoring clustering in analysis primarily leads to:
8. In a study with ICC = 0.05 and 40 subjects per cluster, the design effect is:
9. Cross-classified data structures differ from nested structures because:
10. Confounding by cluster occurs when:
11. The robust (sandwich) variance estimator requires:
12. Fixed effects for clusters:
13. Which statement about simulation studies on clustering is TRUE?
14. Survey methods for analyzing clustered data:
15. The deff-based correction for clustering involves: