HSCI 410 — Lesson 8

Introduction to Clustered Data

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Recognize and describe different types of clustered (hierarchical) data structures in epidemiology
  • Explain why observations within clusters are correlated and how this affects standard statistical analyses
  • Calculate and interpret the intraclass correlation coefficient (ICC) and design effect (deff)
  • Understand the impact of clustering on standard errors and inference for both continuous and discrete outcomes
  • Describe key methods for dealing with clustering, including fixed effects, robust variance estimators, and survey methods
  • Evaluate the consequences of ignoring clustering in epidemiologic analyses

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Section 1

Introduction & Types of Clustered Data

⏱ Estimated time: 15 minutes

What Is Clustered Data?

In many epidemiologic studies, observations are not independent. Instead, they are grouped within higher-level units—these groups are called clusters. Clustered (or hierarchical) data arise whenever the study design or the natural structure of the population creates groupings such that observations within the same group tend to be more similar to each other than to observations in other groups.

Why Clustering Matters

Standard statistical methods assume observations are independent. When data are clustered, this assumption is violated: observations within the same cluster share common influences (e.g., the same hospital, the same household, the same geographic region). Ignoring clustering can lead to underestimated standard errors, inflated Type I error rates, and potentially biased point estimates.

Types of Clustered Data

Clustering arises from many different sources. Understanding the type of clustering present in your data is the first step toward choosing an appropriate analytical strategy.

Common environment: Animals in herds, patients in hospitals

In veterinary epidemiology, animals within the same herd share management practices, nutrition, housing, and disease exposure. Similarly, patients within the same hospital share institutional protocols, staffing levels, and local disease ecology. These shared exposures create within-cluster correlation—outcomes for subjects in the same cluster are more alike than for subjects in different clusters.

Spatial clustering: Geographic proximity and shared exposures

People living near each other are often exposed to similar environmental factors: air pollution, water quality, neighbourhood safety, socioeconomic deprivation, and access to healthcare services. This means that health outcomes for individuals in the same geographic area are correlated. Studies that sample from defined geographic areas (e.g., census tracts, postal codes) must account for this spatial clustering.

Repeated measures: Longitudinal and crossover designs

When the same subjects are measured multiple times (e.g., before and after treatment, or at regular intervals in a cohort study), each subject forms a cluster. The repeated observations within a subject are correlated because stable individual characteristics (genetics, baseline health, behaviour) influence all measurements. The correlation structure depends on the timing and spacing of measurements.

Hierarchical structures: Multi-level nesting

Many real-world data structures have multiple levels. For example, students are nested within classrooms, classrooms within schools, and schools within districts. Each level contributes its own source of variation. In epidemiology, patients may be nested within physician practices, practices within health regions, and regions within provinces. Split-plot designs from experimental settings also create hierarchical structures where treatments are applied at different levels of the hierarchy.

Cross-classified and split-plot designs

Cross-classified structures arise when subjects belong to multiple grouping factors that do not nest within each other. For example, students may be classified by both school and neighbourhood—students from the same neighbourhood may attend different schools, and students from the same school may live in different neighbourhoods. Split-plot designs, common in agricultural experiments, have some factors applied at the whole-plot (cluster) level and others at the subplot (individual) level.

Sources of Variation & Predictor Clustering

In clustered data, the total variation in the outcome can be decomposed into between-cluster variation (differences between groups) and within-cluster variation (differences among individuals within the same group). The relative magnitude of these two sources of variation determines the strength of the clustering effect.

An important consideration is predictor clustering—when predictor variables also vary between clusters. If both the exposure and the outcome vary at the cluster level, group-level associations may differ from individual-level associations. This is the basis of the ecological fallacy: inferring individual-level relationships from aggregate (group-level) data.

Correlation Between Observations in the Same Cluster (Eq 20.1)
ρ = cov(Yij, Yik) / √(var(Yij) × var(Yik))

Section 1 Knowledge Check

1. Which is an example of clustered data?

Clustering occurs when observations are grouped within higher-level units (like hospitals), creating potential correlation among observations within the same group.

2. What is cross-classified data?

Cross-classified structures occur when units belong to multiple grouping factors simultaneously (e.g., students classified by both school and neighbourhood), unlike nested/hierarchical structures.

3. Why does predictor clustering matter?

When predictors are clustered, the relationship observed at the group (ecological) level may not reflect the individual-level relationship—this is the ecological fallacy.

Reflection

Think of a research study in your field. What natural clustering structures might exist in the data? How might ignoring this clustering affect your conclusions?

Section 2

Effects of Clustering on Statistical Analysis

⏱ Estimated time: 20 minutes

Impact on Standard Errors

The most important consequence of clustering is its effect on standard errors. When observations within clusters are positively correlated (as is almost always the case), treating them as independent leads to standard errors that are too small. This in turn produces test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow—all of which inflate the Type I error rate.

How Ignoring Clustering Inflates Significance

Imagine a study of 1,000 patients in 20 hospitals (50 per hospital). If the ICC is 0.05, the design effect is 1 + (50 − 1)(0.05) = 3.45. The effective sample size is only 1,000/3.45 ≈ 290 rather than 1,000. A naive analysis treating all 1,000 observations as independent would dramatically overstate the precision of estimates.

The Intraclass Correlation Coefficient (ICC)

For continuous outcomes, the ICC (intraclass correlation coefficient) measures the proportion of total variance that is attributable to between-cluster differences. It quantifies the degree of similarity among observations within the same cluster.

Intraclass Correlation Coefficient (Eq 20.2)
ρ = σ²g / (σ²g + σ²)

Here, σ²g is the between-cluster variance and σ² is the within-cluster variance. An ICC of 0 means no clustering effect (all variance is within clusters), while an ICC of 1 means all variance is between clusters.

The Design Effect (deff)

The design effect (also called the variance inflation factor in the clustering context) quantifies how much the variance of an estimate is inflated due to clustering, compared to what it would be under simple random sampling.

Design Effect (Eq 20.3)
deff = 1 + (m̄ − 1)ρ

Where m̄ is the average cluster size and ρ is the ICC. The practical meaning: if ICC = 0.05 and cluster size = 20, then deff = 1 + (20 − 1)(0.05) = 1.95, meaning the effective sample size is roughly halved.

Corrected Variance (Eq 20.4)
var_corrected = deff × var_naive
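The three formulas above (Eq 20.2–20.4) translate directly into a few lines of code. A minimal Python sketch follows; the function names are illustrative, not from the text:

```python
import math

def icc(var_between: float, var_within: float) -> float:
    """Intraclass correlation (Eq 20.2): share of total variance between clusters."""
    return var_between / (var_between + var_within)

def design_effect(mean_cluster_size: float, rho: float) -> float:
    """Design effect (Eq 20.3): variance inflation due to clustering."""
    return 1 + (mean_cluster_size - 1) * rho

def corrected_se(naive_se: float, deff: float) -> float:
    """Corrected SE (from Eq 20.4): naive SE scaled by the square root of deff."""
    return naive_se * math.sqrt(deff)

# Worked example from the text: ICC = 0.05 and cluster size 20
d = design_effect(20, 0.05)        # 1 + 19 * 0.05 = 1.95
print(round(d, 2))                 # → 1.95
print(round(corrected_se(0.50, d), 2))
```

Note that Eq 20.4 corrects the variance, so the standard error is corrected by √deff, not by deff itself.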

Effects on Continuous Outcomes

For continuous outcomes, the ICC directly measures the proportion of total variance due to between-cluster differences. The design effect formula deff = 1 + (m̄ − 1)ρ applies straightforwardly. The corrected standard error is obtained by multiplying the naive SE by √deff.

Example: With 20 clusters of 50 subjects each (n = 1,000), ICC = 0.05, the deff = 3.45. A naive SE of 0.50 would become 0.50 × √3.45 = 0.93—nearly double the naive estimate.

Effects on Discrete Outcomes

With binary or other discrete outcomes, the effects of clustering are analogous but more complex. Clustering affects not only the standard error estimation but can also influence point estimates. The design effect concept extends to discrete outcomes, but the ICC for binary data is defined differently and its estimation is more involved.

For binary outcomes, the variance of a proportion under clustering is inflated by a factor analogous to the deff. The practical consequence is the same: ignoring clustering leads to underestimated SEs and inflated Type I error.

Analysis Approach              | SE Estimate | 95% CI Width | P-value
Naive (ignoring clustering)    | 0.50        | 1.96         | 0.001
Cluster-adjusted (deff = 3.45) | 0.93        | 3.64         | 0.087

This table illustrates how accounting for clustering nearly doubles the standard error, widens the confidence interval, and can change a “significant” result to a non-significant one.
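The deff-based adjustment in the table can be sketched as follows: shrink the naive z statistic by √deff (equivalent to inflating the SE) and recompute the P-value. This is a hypothetical illustration using a normal approximation, so the exact figures depend on rounding and the test used:

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided P-value for a z statistic under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

def deff_adjusted_z(z_naive: float, deff: float) -> float:
    """Divide a naive z statistic by sqrt(deff), i.e. multiply its SE by sqrt(deff)."""
    return z_naive / math.sqrt(deff)

# A naive z of about 3.29 gives p ≈ 0.001; re-evaluate with deff = 3.45
z_adj = deff_adjusted_z(3.29, 3.45)
print(round(two_sided_p(3.29), 3), round(two_sided_p(z_adj), 3))
```

The adjusted P-value lands well above 0.05, reproducing the qualitative lesson of the table: a "significant" naive result can become non-significant once clustering is acknowledged.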

Section 2 Knowledge Check

1. If the ICC is 0.10 and average cluster size is 21, what is the design effect?

deff = 1 + (m̄ − 1)ρ = 1 + (21 − 1)(0.10) = 1 + 2.0 = 3.0. This means the effective sample size is only about one-third of the nominal sample size.

2. What happens to Type I error rates when clustering is ignored?

Ignoring clustering leads to artificially small standard errors, which makes test statistics too large and P-values too small, inflating the Type I error rate (false positives).

3. The ICC represents:

The ICC = σ²g / (σ²g + σ²) quantifies the proportion of total variance attributable to differences between clusters. Higher ICC means more clustering effect.

Reflection

A study reports p = 0.03 for a treatment effect, but the data come from 10 hospitals with 50 patients each. If the ICC is 0.05, calculate the design effect and discuss whether the finding might still be significant after accounting for clustering.

Section 3

Simulation Studies & Impact of Clustering

⏱ Estimated time: 15 minutes

Why Simulation Studies?

Simulation studies allow us to examine the practical consequences of ignoring clustering under controlled conditions. By generating data with known clustering structures and then analysing it both correctly and incorrectly, we can quantify the bias and Type I error inflation that results from ignoring the cluster structure.

Warning: Ignoring Clustering Can Be Dangerous

Even moderate ICC values (e.g., 0.01–0.05) can lead to substantially inflated Type I error rates when cluster sizes are large. A study with ICC = 0.01 and 50 clusters of 50 subjects can have an actual Type I error rate of 10–15% instead of the nominal 5%. Researchers who ignore clustering risk reporting findings that appear statistically significant but are actually false positives.

Binary Outcome Simulations

Simulation studies with binary outcomes demonstrate that the consequences of ignoring clustering can be severe. Even when the ICC is small, the combination of within-cluster correlation and moderate-to-large cluster sizes can inflate the actual Type I error rate well beyond the nominal 5% level.

Scenario 1: Small ICC, large clusters (ICC = 0.01, n = 50 per cluster)

With ICC = 0.01 and 50 subjects per cluster, the design effect is deff = 1 + (50 − 1)(0.01) = 1.49. Although this seems modest, simulation studies show the actual Type I error rate can reach 10–15% depending on the number of clusters and the analysis method. The inflation occurs because the naive analysis assumes 50 independent observations per cluster when, in reality, the effective number is only about 34.

Scenario 2: Moderate ICC, moderate clusters (ICC = 0.05, n = 20 per cluster)

With ICC = 0.05 and 20 subjects per cluster, the design effect is deff = 1 + (20 − 1)(0.05) = 1.95. The effective sample size is nearly halved. Simulation studies show actual Type I error rates of 15–25% when the naive analysis is used. This means that one in four or five “significant” findings may be false positives.

Scenario 3: Large ICC, any cluster size (ICC = 0.10, n = 30 per cluster)

With ICC = 0.10 and 30 subjects per cluster, the design effect is deff = 1 + (30 − 1)(0.10) = 3.90. The effective sample size is reduced to about one-quarter of the nominal size. Simulation studies demonstrate Type I error rates exceeding 30–40% when clustering is ignored. This level of inflation makes the naive analysis essentially unreliable.
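A simulation of this kind is easy to run yourself. The sketch below uses a continuous outcome for simplicity (the scenarios above concern binary outcomes, but the qualitative inflation is the same): generate null data with a known ICC, apply a naive two-sample z-test that ignores clustering, and count false rejections. All parameter values are illustrative:

```python
import numpy as np

def simulate_type1(n_sims=2000, n_clusters_per_arm=20, m=20, icc=0.05, seed=42):
    """Fraction of null simulations rejected at alpha = 0.05 by a naive
    two-sample z-test that ignores clustering."""
    rng = np.random.default_rng(seed)
    sg, sw = np.sqrt(icc), np.sqrt(1 - icc)   # between/within SDs; total variance = 1
    n = n_clusters_per_arm * m
    rejections = 0
    for _ in range(n_sims):
        # No true treatment effect: both arms drawn from the same model
        arms = []
        for _arm in range(2):
            u = rng.normal(0, sg, n_clusters_per_arm)        # cluster effects
            y = np.repeat(u, m) + rng.normal(0, sw, n)       # individual outcomes
            arms.append(y)
        a, b = arms
        se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)  # naive SE
        if abs(a.mean() - b.mean()) / se > 1.96:
            rejections += 1
    return rejections / n_sims

rate = simulate_type1()
print(f"Naive Type I error rate: {rate:.3f}  (nominal: 0.050)")
```

With ICC = 0.05 and 20 subjects per cluster (deff ≈ 1.95), the observed rejection rate runs roughly three times the nominal 5%, consistent with Scenario 2.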

Confounding by Cluster

Beyond inflating standard errors, clustering can also introduce confounding. If a cluster-level variable is associated with both the exposure and the outcome, it acts as a confounder. Failure to account for the clustering structure means this confounding is not addressed, which can lead to biased point estimates—not just incorrect standard errors.

Example: Confounding by Region

Suppose disease prevalence varies by region, and exposure to a risk factor also varies by region. If we analyse the data without accounting for region (the cluster), the association between exposure and disease will be confounded by regional differences. The estimated exposure effect may be biased upward or downward depending on the direction and magnitude of the confounding.

The magnitude of this bias depends on the correlation between the predictor and the cluster-level confounder. Stronger correlations produce greater bias.

Section 3 Knowledge Check

1. In simulations with binary outcomes and moderate ICC, ignoring clustering:

Simulation studies consistently show that even moderate ICCs (e.g., 0.01–0.05) combined with moderate-to-large cluster sizes can inflate actual Type I error rates to 10–15% or higher when clustering is ignored.

2. How can clustering lead to confounding?

If a cluster-level variable (e.g., region) is associated with both the exposure and the outcome, it acts as a confounder. Ignoring the clustering structure means this confounding is not addressed.

3. The inflation of Type I error due to clustering depends on:

The design effect formula deff = 1 + (m̄ − 1)ρ shows that both the ICC (ρ) and the average cluster size (m̄) jointly determine how much clustering inflates variance and affects inference.

Reflection

Why might even a small ICC (e.g., 0.02) be problematic in a large cluster randomized trial with 100 participants per cluster? Calculate the design effect and discuss the implications.

Section 4

Methods for Dealing with Clustering

⏱ Estimated time: 20 minutes

Detecting Clustering

Before choosing a method for handling clustering, you must first detect and quantify it. Common approaches include visual inspection (e.g., plotting outcomes by cluster), ICC estimation (fitting a random-intercept model to estimate the between-cluster variance), and likelihood ratio tests (comparing models with and without cluster-level random effects).
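The ICC-estimation step can be done with a one-way ANOVA method-of-moments estimator. A minimal sketch, assuming balanced clusters (real analyses would typically fit a random-intercept model instead):

```python
import numpy as np

def anova_icc(y: np.ndarray, cluster: np.ndarray) -> float:
    """Method-of-moments ICC estimate from a one-way ANOVA
    (balanced clusters assumed for simplicity)."""
    groups = [y[cluster == c] for c in np.unique(cluster)]
    k = len(groups)
    m = len(groups[0])                       # common cluster size
    grand = y.mean()
    msb = m * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

# Quick check on simulated data with a known ICC of 0.2
rng = np.random.default_rng(1)
k, m, icc_true = 200, 10, 0.2
u = np.repeat(rng.normal(0, np.sqrt(icc_true), k), m)
y = u + rng.normal(0, np.sqrt(1 - icc_true), k * m)
cluster = np.repeat(np.arange(k), m)
print(round(anova_icc(y, cluster), 2))
```

With many clusters the estimate lands close to the true value of 0.2; with few clusters, expect considerable sampling variability in the estimated ICC.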

Methods for Handling Clustered Data

Fixed Effects & Stratification

Include cluster indicators as fixed effects in the model. This effectively stratifies the analysis by cluster, adjusting for all cluster-level confounders (both measured and unmeasured). However, this approach consumes many degrees of freedom (the number of clusters minus one) and does not allow estimation of cluster-level predictor effects.

Best when: There are relatively few clusters, cluster-level confounding is the primary concern, and you do not need to estimate effects of cluster-level variables.

Correction Factor Methods

deff-based correction: Divide test statistics by √deff or multiply standard errors by √deff. This is a simple post-hoc adjustment that requires an estimate of the ICC and the average cluster size.

Overdispersion-based correction: A similar principle using a dispersion parameter estimated from the data. The Pearson or deviance goodness-of-fit statistic divided by its degrees of freedom provides a scale factor that can be applied to the variance-covariance matrix.

Best when: A quick adjustment is needed and a more sophisticated approach is not feasible.

Robust (Sandwich) Variance Estimator

The robust variance estimator does not assume a specific correlation structure within clusters. It provides valid standard errors even if the within-cluster correlation is misspecified. This makes it very attractive for practical use.

However, it requires a moderate-to-large number of clusters (rule of thumb: at least 20–30). With too few clusters, the sandwich estimator can underestimate the true variance.

Best when: You have enough clusters and want valid inference without specifying the exact correlation structure.
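The sandwich idea can be shown for the simplest possible estimator, a sample mean: sum the residuals within each cluster before squaring, so whatever correlation exists inside a cluster is absorbed into the variance estimate. A minimal sketch (a real analysis would use a regression implementation such as Stata's vce(cluster), R's sandwich package, or similar):

```python
import numpy as np

def cluster_robust_se_of_mean(y: np.ndarray, cluster: np.ndarray) -> float:
    """Cluster-robust (sandwich) SE for a sample mean: residuals are summed
    within clusters before squaring, allowing arbitrary within-cluster correlation."""
    resid = y - y.mean()
    n = len(y)
    cluster_sums = np.array([resid[cluster == c].sum() for c in np.unique(cluster)])
    return np.sqrt((cluster_sums ** 2).sum()) / n

def naive_se_of_mean(y: np.ndarray) -> float:
    return y.std(ddof=1) / np.sqrt(len(y))

# Strongly clustered toy data: the robust SE should far exceed the naive SE
rng = np.random.default_rng(0)
k, m = 30, 25
u = np.repeat(rng.normal(0, 1.0, k), m)          # large cluster effects (ICC = 0.5)
y = u + rng.normal(0, 1.0, k * m)
cluster = np.repeat(np.arange(k), m)
print(round(naive_se_of_mean(y), 3), round(cluster_robust_se_of_mean(y, cluster), 3))
```

With an ICC of 0.5 and 25 subjects per cluster, the robust SE is several times the naive one, mirroring the √deff inflation discussed in Section 2.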

Survey Methods

Survey methods account for complex sampling designs including stratification, clustering, and unequal selection probabilities (weighting). They use design-based inference rather than model-based inference, which means the validity of the analysis depends on the sampling design rather than distributional assumptions.

Available in most statistical software: Stata’s svy commands, SAS PROC SURVEY procedures, R’s survey package.

Best when: The data come from a complex survey design with known selection probabilities.

Method          | Handles Confounding | Min. Clusters | Assumptions              | Software
Fixed Effects   | All cluster-level   | Few OK        | None for cluster effects | All packages
deff Correction | No                  | Any           | Known ICC                | Manual calculation
Robust Variance | No                  | ≥20–30        | None for correlation     | Stata, R, SAS
Survey Methods  | Design-based        | Varies        | Known design             | Stata svy, SAS PROC SURVEY, R survey
Practical Recommendations

Always check for clustering before finalising your analysis. Estimate the ICC, calculate the design effect, and choose a method appropriate to your study design and the number of clusters. When in doubt, use multiple methods and compare results. If the conclusions are consistent across approaches, you can be more confident in your findings.

Section 4 Knowledge Check

1. The robust (sandwich) variance estimator:

The sandwich estimator is attractive because it produces consistent SEs regardless of the true correlation structure. However, it requires a sufficient number of clusters (typically ≥20–30) to perform well.

2. When would fixed effects for clusters be most appropriate?

Fixed effects for clusters are most useful when there are relatively few clusters and the goal is to control for cluster-level confounding. With many clusters, this approach uses too many degrees of freedom.

3. Survey methods for clustered data:

Survey methods use design-based inference to properly account for complex sampling features including stratification, clustering, and differential selection/weighting, and are available in most major statistical packages.

Reflection

You are analyzing data from a multi-site clinical trial with 25 sites and approximately 40 patients per site. Which method(s) for handling clustering would you recommend, and why?

Final Assessment

Lesson 8 — Comprehensive Assessment

⏱ Estimated time: 25 minutes

This final assessment covers all material from this lesson. You must answer all 15 questions correctly (100%) and complete the final reflection to finish the lesson.

Final Reflection

Reflecting on this lesson, how has your understanding of clustered data changed? Describe a specific analytical situation where accounting for clustering would be critical and explain which method you would choose to address it.


Final Assessment (15 Questions)

1. Clustered data is characterized by:

Clustered data has a hierarchical structure where observations within the same group (cluster) tend to be more similar to each other than to observations in other groups.

2. Which is NOT a source of clustering?

Simple random sampling from a homogeneous population produces independent observations. The other options all involve natural grouping that creates within-group correlation.

3. The ecological fallacy occurs when:

The ecological fallacy is the error of assuming that relationships observed at the group (aggregate) level apply to individuals. Group-level and individual-level associations can differ substantially.

4. The ICC ranges from:

For most practical applications, the ICC ranges from 0 (no clustering effect) to 1 (all variance is between clusters). Technically it can be slightly negative but this is rare and usually indicates no meaningful clustering.

5. If σ²g = 4 and σ² = 16, the ICC is:

ICC = σ²g / (σ²g + σ²) = 4 / (4 + 16) = 4/20 = 0.20.

6. A design effect of 5 means the effective sample size is approximately:

The design effect inflates the variance by a factor of 5, which means the effective sample size is reduced to approximately 1/5 (one-fifth) of the nominal sample size.

7. Ignoring clustering in analysis primarily leads to:

The most common consequence of ignoring clustering is underestimated SEs, leading to test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow—inflating Type I error.

8. In a study with ICC = 0.05 and 40 subjects per cluster, the design effect is:

deff = 1 + (m̄ − 1)ρ = 1 + (40 − 1)(0.05) = 1 + 1.95 = 2.95.

9. Cross-classified data structures differ from nested structures because:

In cross-classified structures, units are simultaneously classified by two or more grouping factors that are not nested within each other (e.g., students classified by both school and neighbourhood).

10. Confounding by cluster occurs when:

Confounding by cluster happens when a characteristic of the cluster (e.g., hospital quality, regional socioeconomic status) is related to both the predictor and the outcome, creating a spurious or biased association.

11. The robust (sandwich) variance estimator requires:

The sandwich estimator’s consistency relies on having enough clusters for the between-cluster variance to be well estimated. With too few clusters, it can underestimate the true variance.

12. Fixed effects for clusters:

Fixed effects for clusters add indicator variables for each cluster, which controls for all measured and unmeasured cluster-level confounders but consumes degrees of freedom equal to the number of clusters minus one.

13. Which statement about simulation studies on clustering is TRUE?

Simulations show that even ICC values as small as 0.01–0.02, when combined with large cluster sizes, can inflate actual Type I error rates well beyond the nominal level, because deff depends on both ICC and cluster size.

14. Survey methods for analyzing clustered data:

Survey methods use design-based inference that explicitly accounts for the features of the sampling design (stratification, clustering, and differential selection probabilities/weights).

15. The deff-based correction for clustering involves:

The correction involves multiplying naive SEs by √deff (or equivalently, dividing test statistics by √deff) to account for the variance inflation due to clustering.

Lesson 8 Complete!

Congratulations! You have successfully completed the Introduction to Clustered Data module.