Introduction to Clustered Data
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Recognize and describe different types of clustered (hierarchical) data structures in epidemiology
- Explain why observations within clusters are correlated and how this affects standard statistical analyses
- Calculate and interpret the intraclass correlation coefficient (ICC) and design effect (deff)
- Understand the impact of clustering on standard errors and inference for both continuous and discrete outcomes
- Describe key methods for dealing with clustering, including fixed effects, robust variance estimators, and survey methods
- Evaluate the consequences of ignoring clustering in epidemiologic analyses
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Types of Clustered Data
What Is Clustered Data?
In many epidemiologic studies, observations are not independent. Instead, they are grouped within higher-level units—these groups are called clusters. Clustered (or hierarchical) data arises whenever the study design or the natural structure of the population creates groupings such that observations within the same group tend to be more similar to each other than to observations in other groups.
Standard statistical methods assume observations are independent. When data are clustered, this assumption is violated: observations within the same cluster share common influences (e.g., the same hospital, the same household, the same geographic region). Ignoring clustering can lead to underestimated standard errors, inflated Type I error rates, and potentially biased point estimates.
Types of Clustered Data
Clustering arises from many different sources. Understanding the type of clustering present in your data is the first step toward choosing an appropriate analytical strategy.
Herds, Hospitals, and Other Institutional Clusters
In veterinary epidemiology, animals within the same herd share management practices, nutrition, housing, and disease exposure. Similarly, patients within the same hospital share institutional protocols, staffing levels, and local disease ecology. These shared exposures create within-cluster correlation—outcomes for subjects in the same cluster are more alike than outcomes for subjects in different clusters.
Geographic (Spatial) Clustering
People living near each other are often exposed to similar environmental factors: air pollution, water quality, neighbourhood safety, socioeconomic deprivation, and access to healthcare services. As a result, health outcomes for individuals in the same geographic area are correlated. Studies that sample from defined geographic areas (e.g., census tracts, postal codes) must account for this spatial clustering.
Repeated Measures Within Subjects
When the same subjects are measured multiple times (e.g., before and after treatment, or at regular intervals in a cohort study), each subject forms a cluster. The repeated observations within a subject are correlated because stable individual characteristics (genetics, baseline health, behaviour) influence all measurements. The correlation structure depends on the timing and spacing of measurements.
Multilevel (Nested) Structures
Many real-world data structures have multiple levels. For example, students are nested within classrooms, classrooms within schools, and schools within districts. Each level contributes its own source of variation. In epidemiology, patients may be nested within physician practices, practices within health regions, and regions within provinces.
Cross-Classified and Split-Plot Structures
Cross-classified structures arise when subjects belong to multiple grouping factors that do not nest within each other. For example, students may be classified by both school and neighbourhood—students from the same neighbourhood may attend different schools, and students from the same school may live in different neighbourhoods. Split-plot designs, common in agricultural experiments, create a related hierarchical structure: some factors are applied at the whole-plot (cluster) level and others at the subplot (individual) level.
Sources of Variation & Predictor Clustering
In clustered data, the total variation in the outcome can be decomposed into between-cluster variation (differences between groups) and within-cluster variation (differences among individuals within the same group). The relative magnitude of these two sources of variation determines the strength of the clustering effect.
An important consideration is predictor clustering—when predictor variables also vary between clusters. If both the exposure and the outcome vary at the cluster level, group-level associations may differ from individual-level associations. This is the basis of the ecological fallacy: inferring individual-level relationships from aggregate (group-level) data.
Section 1 Knowledge Check
1. Which is an example of clustered data?
2. What is cross-classified data?
3. Why does predictor clustering matter?
Reflection
Think of a research study in your field. What natural clustering structures might exist in the data? How might ignoring this clustering affect your conclusions?
Effects of Clustering on Statistical Analysis
Impact on Standard Errors
The most important consequence of clustering is its effect on standard errors. When observations within clusters are positively correlated (as is almost always the case), treating them as independent leads to standard errors that are too small. This in turn produces test statistics that are too large, P-values that are too small, and confidence intervals that are too narrow—all of which inflate the Type I error rate.
Imagine a study of 1,000 patients in 20 hospitals (50 per hospital). If the ICC is 0.05, the design effect is 1 + (50 − 1)(0.05) = 3.45. The effective sample size is only 1,000/3.45 ≈ 290 rather than 1,000. A naive analysis treating all 1,000 observations as independent would dramatically overstate the precision of estimates.
The Intraclass Correlation Coefficient (ICC)
For continuous outcomes, the ICC (intraclass correlation coefficient) measures the proportion of total variance that is attributable to between-cluster differences. It quantifies the degree of similarity among observations within the same cluster:
ICC (ρ) = σ²g / (σ²g + σ²)
Here, σ²g is the between-cluster variance and σ² is the within-cluster variance. An ICC of 0 means no clustering effect (all variance is within clusters), while an ICC of 1 means all variance is between clusters.
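These variance components can be estimated directly from data. The sketch below uses the classic one-way ANOVA (method-of-moments) estimator for a balanced design; Python and `numpy` are illustrative choices, and the function name and simulation settings are not from the original text.

```python
import numpy as np

def icc_anova(y):
    """ANOVA (method-of-moments) ICC estimate for a balanced design.
    y: array of shape (k, m) -- k clusters with m subjects each."""
    k, m = y.shape
    cluster_means = y.mean(axis=1)
    # Between- and within-cluster mean squares
    msb = m * np.sum((cluster_means - y.mean()) ** 2) / (k - 1)
    msw = np.sum((y - cluster_means[:, None]) ** 2) / (k * (m - 1))
    sg2 = max((msb - msw) / m, 0.0)  # between-cluster variance, truncated at 0
    sw2 = msw                        # within-cluster variance
    return sg2 / (sg2 + sw2)

# Simulate 20 clusters of 50 subjects with a true ICC of 0.05 (within-cluster SD = 1)
rng = np.random.default_rng(42)
rho, k, m = 0.05, 20, 50
sg = np.sqrt(rho / (1 - rho))  # between-cluster SD implied by the ICC
y = sg * rng.normal(size=(k, 1)) + rng.normal(size=(k, m))
print(round(icc_anova(y), 3))  # an estimate somewhere near 0.05
```

With unbalanced clusters the same idea applies but m is replaced by an adjusted average cluster size; mixed-model software estimates the components by maximum likelihood instead.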
The Design Effect (deff)
The design effect (also called the variance inflation factor in the clustering context) quantifies how much the variance of an estimate is inflated due to clustering, compared to what it would be under simple random sampling.
deff = 1 + (m̄ − 1)ρ
where m̄ is the average cluster size and ρ is the ICC. The practical meaning: if ICC = 0.05 and cluster size = 20, then deff = 1 + (20 − 1)(0.05) = 1.95, meaning the effective sample size is roughly halved.
Effects on Continuous Outcomes
For continuous outcomes, the ICC directly measures the proportion of total variance due to between-cluster differences. The design effect formula deff = 1 + (m̄ − 1)ρ applies straightforwardly. The corrected standard error is obtained by multiplying the naive SE by √deff.
Example: With 20 clusters of 50 subjects each (n = 1,000), ICC = 0.05, the deff = 3.45. A naive SE of 0.50 would become 0.50 × √3.45 = 0.93—nearly double the naive estimate.
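These calculations are simple enough to script. A minimal Python sketch of the worked example above (the function name is an illustrative choice):

```python
import math

def design_effect(icc, m_bar):
    """deff = 1 + (m_bar - 1) * icc, for average cluster size m_bar."""
    return 1 + (m_bar - 1) * icc

# Worked example from the text: 20 clusters of 50 subjects, ICC = 0.05
deff = design_effect(0.05, 50)       # 3.45
n_eff = 1000 / deff                  # effective sample size, about 290
se_adj = 0.50 * math.sqrt(deff)      # naive SE of 0.50 inflated by sqrt(deff)
print(round(deff, 2), round(n_eff), round(se_adj, 2))  # → 3.45 290 0.93
```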
Effects on Discrete Outcomes
With binary or other discrete outcomes, the effects of clustering are analogous but more complex. Clustering affects not only the standard error estimation but can also influence point estimates. The design effect concept extends to discrete outcomes, but the ICC for binary data is defined differently and its estimation is more involved.
For binary outcomes, the variance of a proportion under clustering is inflated by a factor analogous to the deff. The practical consequence is the same: ignoring clustering leads to underestimated SEs and inflated Type I error.
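A sketch of this approximation in Python, applying the deff formula above to the usual p(1 − p)/n variance of a proportion; the prevalence and design values are hypothetical:

```python
import math

def se_prop_clustered(p, n, icc, m_bar):
    """SE of a proportion, inflated by the design effect (a common approximation)."""
    deff = 1 + (m_bar - 1) * icc
    return math.sqrt(p * (1 - p) / n * deff)

# Hypothetical example: prevalence 0.30 among 1,000 subjects in clusters of 20, ICC = 0.05
naive = math.sqrt(0.30 * 0.70 / 1000)        # SE assuming independence
adjusted = se_prop_clustered(0.30, 1000, 0.05, 20)
print(round(naive, 4), round(adjusted, 4))   # adjusted SE is sqrt(1.95) times the naive SE
```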
| Analysis Approach | SE Estimate | 95% CI Width | P-value |
|---|---|---|---|
| Naive (ignoring clustering) | 0.50 | 1.96 | 0.001 |
| Cluster-adjusted (deff = 3.45) | 0.93 | 3.64 | 0.087 |
This table illustrates how accounting for clustering nearly doubles the standard error, widens the confidence interval, and can change a “significant” result to a non-significant one.
Section 2 Knowledge Check
1. If the ICC is 0.10 and average cluster size is 21, what is the design effect?
2. What happens to Type I error rates when clustering is ignored?
3. The ICC represents:
Reflection
A study reports p = 0.03 for a treatment effect, but the data come from 10 hospitals with 50 patients each. If the ICC is 0.05, calculate the design effect and discuss whether the finding might still be significant after accounting for clustering.
Simulation Studies & Impact of Clustering
Why Simulation Studies?
Simulation studies allow us to examine the practical consequences of ignoring clustering under controlled conditions. By generating data with known clustering structures and then analysing it both correctly and incorrectly, we can quantify the bias and Type I error inflation that results from ignoring the cluster structure.
Even moderate ICC values (e.g., 0.01–0.05) can lead to substantially inflated Type I error rates when cluster sizes are large. A study with ICC = 0.01 and 50 clusters of 50 subjects can have an actual Type I error rate of 10–15% instead of the nominal 5%. Researchers who ignore clustering risk reporting findings that appear statistically significant but are actually false positives.
Binary Outcome Simulations
Simulation studies with binary outcomes demonstrate that the consequences of ignoring clustering can be severe. Even when the ICC is small, the combination of within-cluster correlation and moderate-to-large cluster sizes can inflate the actual Type I error rate well beyond the nominal 5% level.
With ICC = 0.01 and 50 subjects per cluster, the design effect is deff = 1 + (50 − 1)(0.01) = 1.49. Although this seems modest, simulation studies show the actual Type I error rate can reach 10–15% depending on the number of clusters and the analysis method. The inflation occurs because the naive analysis treats the 50 observations in each cluster as independent when the effective number of independent observations per cluster is only about 50/1.49 ≈ 34.
With ICC = 0.05 and 20 subjects per cluster, the design effect is deff = 1 + (20 − 1)(0.05) = 1.95. The effective sample size is nearly halved. Simulation studies show actual Type I error rates of 15–25% when the naive analysis is used. This means that one in four or five “significant” findings may be false positives.
With ICC = 0.10 and 30 subjects per cluster, the design effect is deff = 1 + (30 − 1)(0.10) = 3.90. The effective sample size is reduced to about one-quarter of the nominal size. Simulation studies demonstrate Type I error rates exceeding 30–40% when clustering is ignored. This level of inflation makes the naive analysis essentially unreliable.
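A minimal simulation along these lines can be sketched in Python, assuming a normal random-intercept model, cluster-level treatment assignment, no true effect, and a naive z-test; all settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, k, m, reps = 0.05, 20, 20, 2000   # ICC, clusters, cluster size, replications
sg = np.sqrt(rho / (1 - rho))          # between-cluster SD (within-cluster SD = 1)

rejections = 0
for _ in range(reps):
    # Cluster random effects plus individual noise; NO true treatment effect
    y = sg * rng.normal(size=(k, 1)) + rng.normal(size=(k, m))
    treat, ctrl = y[: k // 2].ravel(), y[k // 2 :].ravel()
    # Naive z-test treating all k * m observations as independent
    se = np.sqrt(treat.var(ddof=1) / treat.size + ctrl.var(ddof=1) / ctrl.size)
    z = (treat.mean() - ctrl.mean()) / se
    rejections += abs(z) > 1.96

print(f"empirical Type I error: {rejections / reps:.3f}")
```

With these settings the empirical rejection rate lands well above the nominal 0.05, consistent with the deff of 1.95 implied by ICC = 0.05 and clusters of 20.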
Confounding by Cluster
Beyond inflating standard errors, clustering can also introduce confounding. If a cluster-level variable is associated with both the exposure and the outcome, it acts as a confounder. Failure to account for the clustering structure means this confounding is not addressed, which can lead to biased point estimates—not just incorrect standard errors.
Suppose disease prevalence varies by region, and exposure to a risk factor also varies by region. If we analyse the data without accounting for region (the cluster), the association between exposure and disease will be confounded by regional differences. The estimated exposure effect may be biased upward or downward depending on the direction and magnitude of the confounding.
The magnitude of this bias depends on the correlation between the predictor and the cluster-level confounder. Stronger correlations produce greater bias.
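The mechanism can be illustrated with a small simulation: a region-level effect drives both exposure and outcome while the true individual-level effect is zero, so a naive pooled slope is biased while a within-region (demeaned) slope is not. This Python sketch assumes a simple linear setup; names and settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
k, m = 20, 50                              # regions and subjects per region
u = rng.normal(size=k)                     # region-level effect on the outcome
# Exposure tends to be higher in regions with larger u -> confounding by region
x = rng.normal(loc=u[:, None], scale=1.0, size=(k, m))
y = u[:, None] + rng.normal(size=(k, m))   # true individual-level effect of x is ZERO

def ols_slope(x, y):
    """OLS slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())

naive = ols_slope(x.ravel(), y.ravel())    # pooled analysis: confounded by region
within = ols_slope((x - x.mean(axis=1, keepdims=True)).ravel(),
                   (y - y.mean(axis=1, keepdims=True)).ravel())  # region differences removed
print(round(naive, 3), round(within, 3))
```

The pooled slope is pulled away from zero by the shared regional component, while the within-region slope stays near the true value of zero; the demeaning here is equivalent to including region fixed effects.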
Section 3 Knowledge Check
1. In simulations with binary outcomes and moderate ICC, ignoring clustering:
2. How can clustering lead to confounding?
3. The inflation of Type I error due to clustering depends on:
Reflection
Why might even a small ICC (e.g., 0.02) be problematic in a large cluster randomized trial with 100 participants per cluster? Calculate the design effect and discuss the implications.
Methods for Dealing with Clustering
Detecting Clustering
Before choosing a method for handling clustering, you must first detect and quantify it. Common approaches include visual inspection (e.g., plotting outcomes by cluster), ICC estimation (fitting a random-intercept model to estimate the between-cluster variance), and likelihood ratio tests (comparing models with and without cluster-level random effects).
Methods for Handling Clustered Data
Fixed Effects & Stratification
Include cluster indicators as fixed effects in the model. This effectively stratifies the analysis by cluster, adjusting for all cluster-level confounders (both measured and unmeasured). However, this approach uses many degrees of freedom (the number of clusters minus one) and does not allow estimation of cluster-level predictor effects.
Best when: There are relatively few clusters, cluster-level confounding is the primary concern, and you do not need to estimate effects of cluster-level variables.
Correction Factor Methods
deff-based correction: Divide test statistics by √deff or multiply standard errors by √deff. This is a simple post-hoc adjustment that requires an estimate of the ICC and the average cluster size.
Overdispersion-based correction: A similar principle using a dispersion parameter estimated from the data. The Pearson or deviance goodness-of-fit statistic divided by its degrees of freedom provides a scale factor that can be applied to the variance-covariance matrix.
Best when: A quick adjustment is needed and a more sophisticated approach is not feasible.
Robust (Sandwich) Variance Estimator
The robust variance estimator does not assume a specific correlation structure within clusters. It provides valid standard errors even if the within-cluster correlation is misspecified. This makes it very attractive for practical use.
However, it requires a moderate-to-large number of clusters (rule of thumb: at least 20–30). With too few clusters, the sandwich estimator can underestimate the true variance.
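For intuition, a CR0-style cluster-robust SE for a simple overall mean can be sketched as follows; this is an illustrative Python sketch for a balanced design, not a production implementation:

```python
import numpy as np

def cluster_robust_se_mean(y):
    """CR0-style cluster-robust SE of an overall mean.
    y: array of shape (k, m) -- k clusters of size m (balanced for simplicity).
    Residuals are summed within clusters, so within-cluster correlation of
    any form is absorbed into the variance estimate."""
    n = y.size
    resid = y - y.mean()
    cluster_sums = resid.sum(axis=1)   # one residual total per cluster
    return float(np.sqrt((cluster_sums ** 2).sum()) / n)

# Simulated clustered data: 50 clusters of 20 subjects, ICC = 0.05
rng = np.random.default_rng(3)
rho, k, m = 0.05, 50, 20
sg = np.sqrt(rho / (1 - rho))
y = sg * rng.normal(size=(k, 1)) + rng.normal(size=(k, m))

naive_se = float(y.std(ddof=1) / np.sqrt(y.size))   # assumes independence
robust_se = cluster_robust_se_mean(y)
print(round(naive_se, 4), round(robust_se, 4))
```

With ICC = 0.05 and clusters of 20 the deff is 1.95, so the robust SE typically comes out around √1.95 ≈ 1.4 times the naive SE. In practice you would use a packaged implementation (e.g., Stata's vce(cluster) option or cluster-robust options in R or SAS) rather than hand-rolling this.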
Best when: You have enough clusters and want valid inference without specifying the exact correlation structure.
Survey Methods
Survey methods account for complex sampling designs including stratification, clustering, and unequal selection probabilities (weighting). They use design-based inference rather than model-based inference, which means the validity of the analysis depends on the sampling design rather than distributional assumptions.
Available in most statistical software: Stata’s svy commands, SAS PROC SURVEY procedures, R’s survey package.
Best when: The data come from a complex survey design with known selection probabilities.
| Method | Handles Confounding | Min. Clusters | Assumptions | Software |
|---|---|---|---|---|
| Fixed Effects | All cluster-level | Few OK | None for cluster effects | All packages |
| deff Correction | No | Any | Known ICC | Manual calculation |
| Robust Variance | No | ≥20–30 | None for correlation | Stata, R, SAS |
| Survey Methods | Design-based | Varies | Known design | Stata svy, SAS PROC SURVEY, R survey |
Always check for clustering before finalising your analysis. Estimate the ICC, calculate the design effect, and choose a method appropriate to your study design and the number of clusters. When in doubt, use multiple methods and compare results. If the conclusions are consistent across approaches, you can be more confident in your findings.
Section 4 Knowledge Check
1. The robust (sandwich) variance estimator:
2. When would fixed effects for clusters be most appropriate?
3. Survey methods for clustered data:
Reflection
You are analyzing data from a multi-site clinical trial with 25 sites and approximately 40 patients per site. Which method(s) for handling clustering would you recommend, and why?
Lesson 8 — Comprehensive Assessment
This final assessment covers all material from this lesson. You must answer all 15 questions correctly (100%) and complete the final reflection to finish the lesson.
Final Reflection
Reflecting on this lesson, how has your understanding of clustered data changed? Describe a specific analytical situation where accounting for clustering would be critical and explain which method you would choose to address it.
Final Assessment (15 Questions)
1. Clustered data is characterized by:
2. Which is NOT a source of clustering?
3. The ecological fallacy occurs when:
4. The ICC ranges from:
5. If σ²g = 4 and σ² = 16, the ICC is:
6. A design effect of 5 means the effective sample size is approximately:
7. Ignoring clustering in analysis primarily leads to:
8. In a study with ICC = 0.05 and 40 subjects per cluster, the design effect is:
9. Cross-classified data structures differ from nested structures because:
10. Confounding by cluster occurs when:
11. The robust (sandwich) variance estimator requires:
12. Fixed effects for clusters:
13. Which statement about simulation studies on clustering is TRUE?
14. Survey methods for analyzing clustered data:
15. The deff-based correction for clustering involves: