Ecological and Group-Level Studies
Evaluating Epidemiological Research
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- List the 3 major categories of variable used in ecologic models and describe their attributes
- Describe the constructs of a linear model at the individual and group levels and constraints on estimating incidence rate ratios at the group level
- Describe how within-group misclassification, group-level confounding, and group-level interaction can affect causal inferences
- Describe the basis of the ecologic and atomistic fallacies
- Identify scenarios where ecologic studies are less likely to produce cross-level inferential errors
- Describe how to integrate individual-level studies with ecologic studies to prevent cross-level inferential errors
Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Rationale for Group-Level Studies
Introduction and Overview
Lessons 3–5 worked with one unit of analysis: the individual person. Cross-sectional, case-control, and cohort designs all sample people, measure exposures and outcomes on people, and make inferences about people. Lesson 6 changes the unit of analysis to the group (counties, schools, nations) and immediately introduces a problem the previous designs did not have: even when group-level associations are strong and well measured, you cannot, in general, conclude anything about individuals. The phrase you will spend most of this lesson learning to recognise is the ecological fallacy, demonstrated by Robinson (1950) and named and elaborated by Selvin (1958). Across the four content sections we move from the rationale for these designs (Section 1), to the kinds of variables they use (Section 2), to the inferential traps they create (Section 3), to the analytic strategies that mitigate those traps and the related-but-distinct world of group-level studies that do not commit the fallacy at all (Section 4).
Learning Objectives
- Define ecologic studies and distinguish exploratory, analytic, and partial-ecologic variants (Morgenstern, 1995).
- Explain why the unit of analysis matters and how it shapes the inferences a design can support.
- Articulate four legitimate reasons to run an ecologic study (measurement constraint, exposure homogeneity, group-level interest, simplicity) and the cost each carries.
- Read group-level examples (arsenic in groundwater, bladder cancer across U.S. states) with the right inferential caveats from the start.
14.1 What Are Ecologic Studies?
Ecologic studies measure exposure, outcome, and confounders at the group level (e.g., townships, counties, nations), yet the researcher wants to make inferences about individuals. The groups serve as cluster samples of the population.
Ecologic studies can be exploratory (no direct exposure measurement, looking for associations to guide future research) or analytic (exposure factor is measured and included in the analysis). Some studies are partial ecologic—combining some individual-level variables with group-level variables, which introduces unique inferential challenges.
Key Limitation
The primary limitation of ecologic studies is that we do not know the joint distribution of risk factors and disease within groups. This ignorance about within-group associations creates the potential for severe bias when inferring to the individual level.
14.1.1 Examples of Ecologic Studies
Two short examples make the design concrete before we discuss why anyone would ever use it.
County-level data on cancer incidence and arsenic levels in groundwater were examined. After adjusting for confounders, no significant relationship was found between county-level arsenic exposure and cancer incidence. This illustrates how group-level analyses may fail to detect individual-level associations.
Bladder cancer mortality rates across US states were examined in relation to state-level predictors: smoking prevalence, health insurance coverage, UV index, and water supply type. The ecological analysis identified associations that may or may not reflect individual-level causal mechanisms.
14.1.2 Rationale for Ecologic Studies
Given the inferential limit just stated, why does anyone run an ecologic study at all? Four reasons recur, each unpacked below. The first three explain when ecologic data are the only data you have; the fourth is a warning that simplicity has a real cost. Susser (1994) provides the foundational defence of ecologic analysis as both an outlook and a method in modern epidemiology.
Despite their limitations, ecologic studies are sometimes the only practical approach:
- Measurement constraint: Individual-level measurement of some exposures is impractical or impossible. For example, measuring historical pollution levels or dietary intake for an entire population is expensive. Group-level aggregates (e.g., county-level average pollutant concentration, regional disease prevalence) can serve as proxies.
- Exposure homogeneity: In some situations, exposure is relatively homogeneous within groups. For instance, all residents of a region receive water from the same supply, all schoolchildren in a district receive the same curriculum-based intervention, or all patients in a clinic receive the same standard of care.
- Group-level interest: Sometimes the research question is fundamentally about group-level phenomena: Do communities with water fluoridation have lower dental caries rates? Do nations with higher vaccination coverage have lower measles incidence? The group itself is the unit of scientific interest.
- Simplicity: Ecologic analysis is often simpler and faster than acquiring and analyzing individual-level data across many groups. However, this simplicity may hide serious methodological problems and inferential errors.
Reflection
Think of a public health issue in your community. How might you design an ecologic study to examine it? What would be your unit of analysis (e.g., neighbourhood, city, province)? What group-level variables would you measure?
1. What distinguishes an ecologic study from other observational study designs?
2. What is a "partial ecologic study"?
3. Which of the following is NOT a rationale for conducting ecologic studies?
Types of Ecologic Variables & The Linear Model
Introduction and Overview
Section 1 motivated the design and named its central liability. To use it carefully, we need a vocabulary for the variables it operates on, a model that connects them, and a clear sense of what that model can and cannot say. This section provides all three.
Learning Objectives
- Distinguish aggregate, environmental, and global ecologic variables and explain why the distinction matters for cross-level inference.
- Read and interpret the ecologic linear model Yj = β0 + β1X1j + εj and the group-level incidence rate ratio it implies.
- Explain why estimating IRG requires extrapolation beyond observed exposure ranges and what that means for inference.
- Identify modelling pitfalls: correlation vs regression, standardised outcomes, and cross-level interaction differences.
14.2 Categories of Ecologic Variables
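Three major categories of variables can be used in ecologic models, each with different attributes and interpretations:
- Aggregate variables: summaries of measurements made on the individuals within each group (e.g., the proportion of smokers or the median income).
- Environmental variables: physical characteristics of the place where group members live or work (e.g., air pollution level). Each has an individual-level analogue, the person's own exposure, but that analogue is usually not measured.
- Global variables: attributes of the group itself that have no individual-level analogue (e.g., population density or the existence of a law).
The categories differ in whether the variable has any individual-level analogue at all, a distinction that becomes important when we reach the ecologic fallacy in Section 3.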
With the variable types in hand, the standard analytic move is a linear regression at the group level. The model below is the workhorse; the equations that follow it describe what the regression coefficients mean and where their interpretation gets uncomfortable.
14.2.1 The Linear Model in Ecologic Studies
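Ecologic studies often use linear regression to model the relationship between group-level exposure and group-level outcome:
Yj = β0 + β1X1j + β2X2j + εj
where Yj is the outcome rate for group j, X1j is the proportion exposed, X2j is a group-level confounder, and εj is the error term. The group-level incidence rate ratio (IRG) compares the predicted rate in a fully exposed group (X1 = 1) with the predicted rate in a fully unexposed group (X1 = 0), holding the confounder at a fixed value x2:
IRG = (β0 + β1 + β2x2) / (β0 + β2x2)
With no confounder in the model, this reduces to IRG = 1 + β1/β0.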
A major limitation of this approach is that IRG requires extrapolation to groups with 0% and 100% exposure, which may extend far beyond the range of observed data. Additionally, different group sizes may require weighted regression for valid inference.
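The calculation described above can be sketched in a few lines of R. This is a minimal illustration on simulated data, not an analysis from the textbook: the group sizes, exposure range, and rates are invented, and the names `eco`, `fit`, and `irg` are hypothetical. It fits a regression weighted by group size and forms the rate ratio from the predicted rates at 0% and 100% exposure.

```r
set.seed(1)
n_groups <- 30
eco <- data.frame(
  n  = round(runif(n_groups, 500, 5000)),  # group sizes (assumed)
  x1 = runif(n_groups, 0.10, 0.40)         # proportion exposed (assumed range)
)

# Assumed truth: rate 0.02 in the unexposed and 0.06 in the exposed
eco$expected_rate <- 0.02 + 0.04 * eco$x1
eco$cases <- rbinom(n_groups, size = eco$n, prob = eco$expected_rate)
eco$y <- eco$cases / eco$n

# Weight by group size so that larger groups carry more information
fit <- lm(y ~ x1, data = eco, weights = n)
b <- coef(fit)

# Predicted rates at X1 = 0 and X1 = 1; both lie outside the observed range
rate_unexposed <- unname(b["(Intercept)"])
rate_exposed   <- unname(b["(Intercept)"] + b["x1"])
irg <- rate_exposed / rate_unexposed

range(eco$x1)  # observed exposure range, for comparison with 0 and 1
irg
```

Comparing `range(eco$x1)` with the two prediction points makes the extrapolation visible: no simulated group is close to 0% or 100% exposed.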
14.2.2 Modelling Issues
Several issues arise when modelling ecologic data:
- Correlation vs. regression: About 33% of ecologic studies use correlation coefficients instead of regression coefficients. Regression coefficients estimate the incidence rate difference, which correlation does not provide directly.
- Standardized outcomes: Some studies use standardized mortality ratios (SMRs) rather than crude rates, which may introduce additional complexity.
- Interaction terms: The form of interaction at the group level may differ from the individual level when using linear models at group level and logit models at individual level.
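The first point in the list above can be illustrated with a small constructed example. The sketch below assumes invented exposure proportions and an invented rate difference of 0.04; the two outcome vectors share exactly the same regression slope but differ in scatter, so their correlation coefficients differ even though the estimated rate difference does not.

```r
set.seed(2)
x <- seq(0.1, 0.5, length.out = 25)  # group exposure proportions (assumed)
true_slope <- 0.04                   # assumed incidence rate difference

# Residualise the noise on x so the fitted slope is unaffected by it
noise <- rnorm(25)
noise <- resid(lm(noise ~ x))

y_tight <- 0.02 + true_slope * x + 0.001 * noise  # little scatter
y_loose <- 0.02 + true_slope * x + 0.010 * noise  # ten times the scatter

slope_tight <- unname(coef(lm(y_tight ~ x))["x"])
slope_loose <- unname(coef(lm(y_loose ~ x))["x"])

c(slope_tight, slope_loose)            # same rate difference in both
c(cor(x, y_tight), cor(x, y_loose))    # correlation depends on the scatter
```

The correlation coefficient reflects how tightly the points hug the line, which is a separate question from how steep the line is.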
Reflection
Consider the three types of ecologic variables. For a study on the relationship between income inequality and mental health outcomes across Canadian provinces, classify each: (a) provincial median income, (b) provincial mental health policy score, (c) average winter temperature.
1. Which type of ecologic variable has NO analogue at the individual level?
2. In an ecologic linear regression model Yj = β0 + β1X1j + εj, what does the group-level incidence rate ratio IRG estimate?
3. Why is using correlation coefficients rather than regression coefficients problematic in ecologic studies?
Inferential Errors & Sources of Ecologic Bias
Introduction and Overview
Sections 1 and 2 set up the design and the model. This section is about the trap. It introduces two complementary errors — the ecologic fallacy and the atomistic fallacy — then catalogues the three structural reasons ecologic estimates can mislead. Two interactive simulators let you build the failure modes yourself rather than just reading about them.
Learning Objectives
- Define the ecologic fallacy (Robinson, 1950) and explain how a strong group-level association can mislead about individual-level effects.
- Define the atomistic fallacy and identify population-level emergent properties (e.g., herd immunity) that have no individual analogue.
- Identify the three structural sources of ecologic bias: within-group exposure misclassification, group-level confounding, and effect modification by group.
- Predict the direction and magnitude of bias under each source and use that prediction to read published ecologic studies critically.
14.3 The Ecologic Fallacy
The ecologic fallacy is the error of assuming that a group-level association applies to individuals. A finding at the group level (e.g., exposure associated with a threefold increase in disease risk) does not necessarily hold for individuals. The problem was formally demonstrated by Robinson (1950).
Interactive visualization (6 scenes), Simpson's paradox: across countries, wine consumption and life expectancy show a strong positive trend; within France, the individual-level pattern reverses. Aggregate data answer aggregate questions.
The group-level bias typically exaggerates the association away from the null, but can occasionally reverse the direction of association.
14.3.1 The Atomistic Fallacy
The atomistic fallacy is the opposite error: assuming individual-level findings apply at the group level (Schwartz, 1994; Diez Roux, 1998). Populations have emergent properties not found in individuals. A classic example is herd immunity—a population-level phenomenon with no individual-level counterpart.
Key Distinction
The ecologic fallacy occurs when group-level findings are incorrectly applied to individuals. The atomistic fallacy occurs when individual-level findings are incorrectly applied to groups.
Hands-on: Ecological Fallacy Explorer
What you'll do: use the simulator below to set a within-group slope (how X relates to Y for individuals inside the same group) and a between-group slope (how group means relate), then watch the ecological regression line that an analyst would actually report. What to take away: when the within-group and between-group slopes have opposite signs, the ecological regression points the wrong way for individuals — and you have built the canonical failure mode by hand. Try the “Robinson 1950” preset first; it reproduces the original demonstration that gave the fallacy its name.
📊 Interactive: Ecological Fallacy Explorer
Each colored dot is a person, nested inside a group (e.g., a country or neighborhood). Adjust the within-group slope (how X & Y relate inside a group) and the between-group slope (how group means relate). When the two slopes have opposite signs, the ecological regression lies about the individual reality.
Individual-level data
Each dot = a person, colored by group. Black line = individual-level regression slope.
Group-level (ecological) data
Each large dot = a group's mean. Red line = ecological regression — what an ecological study reports.
The simulator shows that the fallacy is real and structural — the same dataset can support opposite conclusions at the two levels. The next subsection names the three specific mechanisms by which the structural problem becomes a quantitative bias in real ecologic estimates.
14.4 Three Sources of Ecologic Bias
Each mechanism below corresponds to a different way the within-group slope and the between-group slope can come apart. The first is about exposure measurement, the second is about confounders that look different at the two levels, and the third is about the mathematical mismatch between linear and logit models. The canonical treatments of these mechanisms are Greenland & Robins (1994) and Morgenstern (1995).
14.4.1 Within-Group Misclassification (Bias)
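Non-differential exposure misclassification at the individual level biases group-level estimates AWAY from the null, the opposite of its usual effect in individual-level studies. For a binary exposure classified with the same sensitivity and specificity in every group, the expected ecologic estimate is:
IRG = 1 + (IR − 1) / [(Se + Sp − 1) − (1 − Sp)(IR − 1)]
where IR is the true individual-level incidence rate ratio, Se is sensitivity, and Sp is specificity. With perfect classification (Se = Sp = 1) the expression reduces to IRG = IR. The school CRD study (Example 29.4) demonstrated how misclassification at the individual level inflates group-level estimates.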
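A numerical check can make the direction of this bias concrete. The sketch below is not Example 29.4; it assumes an invented baseline rate, a true IR of 3, and Se = Sp = 0.9 applied identically in every group. Observed exposure prevalence is computed from the true prevalence, the group rates are regressed on the observed prevalence, and the result is compared with the closed form that follows from these assumptions.

```r
ir_true <- 3      # assumed true individual-level rate ratio
i0      <- 0.02   # assumed rate in the truly unexposed
se      <- 0.9    # assumed sensitivity of exposure classification
sp      <- 0.9    # assumed specificity of exposure classification

p_true <- seq(0.1, 0.5, by = 0.05)             # true proportion exposed per group
rate   <- i0 * (1 + (ir_true - 1) * p_true)    # group rates under a linear model

# Observed prevalence mixes true positives and false positives
p_obs <- se * p_true + (1 - sp) * (1 - p_true)

# Ecologic regression on the misclassified exposure
b <- coef(lm(rate ~ p_obs))
irg_obs <- unname(1 + b["p_obs"] / b["(Intercept)"])

# Closed form implied by the same assumptions
irg_formula <- 1 + (ir_true - 1) / ((se + sp - 1) - (1 - sp) * (ir_true - 1))

c(true = ir_true, ecologic = irg_obs, formula = irg_formula)
```

Misclassification compresses the observed exposure range, which steepens the fitted slope and lowers the extrapolated intercept; both push the ecologic ratio above the true value.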
14.4.2 Group-Level Confounding
Group-level confounding arises from differential distribution of individual-level risk factors across groups. Critically, even factors that are NOT confounders at the individual level can cause confounding at the group level.
Controlling for extraneous risk factors in ecologic analysis generally only removes part of the bias. Example 29.5 showed confounding that produces biased IRG even when there is no confounding at the individual level.
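The mechanism can be sketched with constructed numbers. This is not Example 29.5: the rates, prevalences, and the names `p_e`, `p_z`, and `c_z` are invented. Within every group the second risk factor Z is assumed independent of exposure E, so it does not confound the individual-level comparison, but its prevalence rises with exposure prevalence across groups.

```r
i0  <- 0.02   # assumed baseline rate
a   <- 0.02   # assumed rate difference due to exposure E
c_z <- 0.04   # assumed rate difference due to risk factor Z

p_e    <- seq(0.1, 0.5, by = 0.05)             # proportion exposed, 9 groups
wobble <- rep(c(-0.02, 0, 0.02), times = 3)    # keeps p_z from being collinear
p_z    <- 0.1 + 0.5 * p_e + wobble             # Z prevalence tracks E prevalence

# Additive rates with E and Z independent inside each group
rate <- i0 + a * p_e + c_z * p_z

crude    <- coef(lm(rate ~ p_e))
adjusted <- coef(lm(rate ~ p_e + p_z))

c(true = a, crude = unname(crude["p_e"]), adjusted = unname(adjusted["p_e"]))
```

In this deliberately simple additive setup, adding the prevalence of Z to the model recovers the true rate difference. Real data rarely behave this cleanly, which is why adjustment generally removes only part of the bias.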
14.4.3 Effect Modification (Interaction) by Group
When the rate difference at the individual level varies across groups, non-linearity is introduced: the linear model at group level assumes additivity, but the logit model at individual level is inherently non-linear.
Example 29.6 is striking: effect modification by group completely reversed the direction of association. The true individual-level IR was 5.0, but the ecologic IRG was 0.67, making a harmful exposure appear protective at the group level.
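A reversal of this kind can be reproduced with three invented groups. The numbers below are not those of Example 29.6; they assume an individual-level rate ratio of 5 in every group, with the unexposed rate (and therefore the rate difference) falling as exposure prevalence rises.

```r
p_exposed      <- c(0.2, 0.4, 0.6)        # proportion exposed in each group
rate_unexposed <- c(0.020, 0.014, 0.010)  # assumed baseline rates, falling with exposure
ir_within      <- 5                       # assumed rate ratio inside every group
rate_exposed   <- ir_within * rate_unexposed

# The rate difference varies across groups even though the ratio is constant
rate_exposed - rate_unexposed

# Overall rate in each group, which is all an ecologic study observes
group_rate <- (1 - p_exposed) * rate_unexposed + p_exposed * rate_exposed

b   <- coef(lm(group_rate ~ p_exposed))
irg <- unname(1 + b["p_exposed"] / b["(Intercept)"])
irg   # below 1 although the exposure is harmful in every group
```

The ecologic slope is negative because the groups with more exposure also have lower baseline rates, so the linear group-level model reads a harmful exposure as protective.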
14.4.4 When Cross-Level Bias Is Less Likely
Conditions Minimizing Ecologic Bias
Cross-level (ecologic) bias will NOT occur if:
- The incidence rate difference within groups is uniform across groups, AND
- There is no correlation between group-level exposure and the rate of the outcome in the unexposed
Ecologic bias is LESS likely when:
- There is a large observed range of exposure across groups
- There is small within-group variance of exposure (homogeneous groups)
- Exposure is a strong risk factor varying in prevalence across groups
- Distribution of extraneous risk factors is similar among groups (little group-level confounding)
Including positive and negative health controls can further strengthen ecologic evidence.
Hands-on: MAUP Sandbox
What you'll do: the second simulator below holds the underlying individual-level data fixed and changes only how zone boundaries are drawn around the same people. What to take away: the area-level correlation can change dramatically — even flip sign — without anything about the people changing. This is the modifiable areal unit problem (Fotheringham & Wong, 1991), and it is the reason ecologic results are sensitive to choices that look administrative rather than scientific. Try each zoning preset; same individuals, different ecological “findings.”
🧭 Interactive: MAUP Sandbox — Same People, Different Zones
A 12×12 grid of people, each with an exposure value (X) and outcome (Y). The individual correlation is fixed — but the area-level correlation depends entirely on how you draw the boundaries. Try each zoning scheme: same people, very different ecological "findings." That is the modifiable areal unit problem (Fotheringham & Wong, 1991).
Individual data (with zoning overlay)
Each tile = a person, shaded by their X value. Black borders show the chosen zones.
Zone-level (ecological) scatter
Each dot = one zone's mean X vs. mean Y. Red line = ecological regression.
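The same idea can be sketched in base R with a deliberately artificial grid. The construction below is an assumption chosen to make the effect exact and is not the data behind the sandbox: X increases along both rows and columns, while Y increases along rows and decreases along columns. The individual-level data never change; only the zoning does.

```r
grid <- expand.grid(row = 1:12, col = 1:12)  # 144 people on a 12 x 12 grid
grid$x <- grid$row + grid$col                # artificial exposure surface
grid$y <- grid$row - grid$col                # artificial outcome surface

# Correlation between zone means for a given assignment of people to zones
zone_cor <- function(zone) {
  zx <- as.vector(tapply(grid$x, zone, mean))
  zy <- as.vector(tapply(grid$y, zone, mean))
  cor(zx, zy)
}

individual <- cor(grid$x, grid$y)            # person-level correlation
by_rows    <- zone_cor(grid$row)             # 12 horizontal strips
by_cols    <- zone_cor(grid$col)             # 12 vertical strips
by_blocks  <- zone_cor(interaction(ceiling(grid$row / 3),
                                   ceiling(grid$col / 3)))  # 16 square blocks

round(c(individual = individual, rows = by_rows,
        cols = by_cols, blocks = by_blocks), 3)
```

Row zones average away the column pattern and column zones average away the row pattern, so the two schemes report opposite area-level correlations from identical individuals.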
What you'll do: simulate three groups whose means rise together but whose individuals within each group are uncorrelated. Compute the within-group correlations, the pooled (overall) correlation, and the group-level (ecologic) correlation, then visualise all three on one scatterplot.
What to take away: a strong group-level correlation can coexist with near-zero individual-level association — that gap is the ecological fallacy in numbers.
set.seed(230)
# Three groups; means line up positively, individuals are flat within group.
group <- rep(c("A", "B", "C"), each = 50)
x <- c(rnorm(50, 2), rnorm(50, 5), rnorm(50, 8))
y <- c(rnorm(50, 3), rnorm(50, 6), rnorm(50, 9))
# Pooled (individual-level) correlation -- dominated by group means
cor(x, y)
# Within each group (truth at the person level)
tapply(seq_along(x), group, function(i) cor(x[i], y[i]))
# Group-level (ecologic) correlation: nearly perfect by construction
gmean <- aggregate(cbind(x, y), list(group = group), mean)
cor(gmean$x, gmean$y)
# Stretch: visualise the discrepancy
plot(x, y, col = factor(group), pch = 19,
     xlab = "X", ylab = "Y",
     main = "Ecological fallacy: groups separate, individuals flat")
points(gmean$x, gmean$y, pch = 8, cex = 3, col = "black")
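Reading the three numbers. The within-group correlations are close to zero, differing from it only by sampling noise. The group-mean correlation is essentially 1. The pooled correlation lies in between but is dominated by between-group variation: the group means sit 3 units apart while the within-group standard deviation is 1, so roughly six-sevenths of the variance in each variable lies between groups. Concluding from the near-perfect ecologic correlation that there is an individual-level relationship would be the ecological fallacy.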
Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console and plot before answering.
1. Compare the three within-group correlations (close to 0) with the group-level correlation (~1.0). Which of these two numbers describes the relationship for an individual person, and which describes the relationship between groups?
2. Look at the scatterplot. The three coloured clusters appear flat (no slope) within themselves but rise diagonally as a whole. Describe in your own words why the pooled correlation (expected to be roughly 0.86 for this design) is so much closer to the ecologic correlation than to the within-group correlations.
3. Imagine a researcher only had access to the three group-mean stars (no individual dots) — e.g., country-level averages of two variables. State the conclusion they would draw and, using the simulation, explain precisely how that conclusion would be wrong at the individual level.
Reflection
A researcher finds that countries with higher per-capita chocolate consumption have more Nobel Prize winners. They conclude that eating chocolate makes individuals smarter. Identify the inferential error being made and explain why this conclusion is problematic. What confounders might explain the group-level association?
1. What is the ecologic fallacy?
2. How does non-differential exposure misclassification at the individual level affect ecologic study estimates?
3. In Example 29.6, effect modification by group caused the ecologic IRG to be 0.67 when the true individual-level IR was 5.0. What does this demonstrate?
Reducing Bias & Non-Ecologic Group Studies
Introduction and Overview
Section 3 catalogued the failure modes. This section is the constructive response. The first half names analytic strategies that can pull part of the individual-level signal out of group-level data; the second half makes a distinction the textbook treatment often glosses over — that some studies use group-level data without committing the ecologic fallacy at all, because their inferences also stay at the group level.
Learning Objectives
- Describe analytic strategies for reducing ecologic bias, including multilevel modelling (Diez Roux, 1998) and the two-phase design (Wakefield & Haneuse, 2008).
- Distinguish ecologic from non-ecologic group-level studies based on the level at which inferences are drawn, not the level at which data are collected.
- Apply Rose’s distinction between “the etiology of a case” and “the etiology of incidence” to research questions you encounter.
- Use the Dufault & Klar (2011) reporting-quality findings as a reading checklist for any ecologic study you appraise.
14.5 Minimizing Ecologic Bias
Ecologic bias is less of a problem when certain conditions are met (see Section 3 summary). Additionally, researchers can employ specific analytical strategies:
14.5.1 Analysing Ecologic Data
- Multilevel modelling (MLM): Combines individual-level and group-level data to distinguish individual-level effects from contextual (group-level) effects (Diez Roux, 1998). It also lets the analyst check model assumptions and examine random effects.
- Two-phase design (Wakefield & Haneuse, 2008): Links individual-level data with ecologic data using outcome-dependent sampling within groups, reducing the need for complete individual-level information.
- Prior information: Prior knowledge about within-area probabilities and contextual effects is important when drawing inferences from ecologic data.
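The core idea behind the first strategy, separating the individual-level effect from the contextual effect, can be sketched with base R alone. The example below is a simplified stand-in and not a full multilevel model: it uses group-mean centring in an ordinary regression rather than random effects, and the data, slopes, and variable names are all invented for illustration.

```r
set.seed(3)
n_groups <- 20
n_per    <- 50
g <- rep(seq_len(n_groups), each = n_per)          # group membership

group_mean_x <- rnorm(n_groups, mean = 5, sd = 2)  # assumed group exposure levels
x <- group_mean_x[g] + rnorm(n_groups * n_per)     # individual exposure

# Assumed truth: within-group slope -0.5, between-group slope +1
y <- 1.0 * group_mean_x[g] - 0.5 * (x - group_mean_x[g]) +
  rnorm(n_groups * n_per, sd = 0.5)

x_between <- ave(x, g)         # observed group mean of x, repeated per person
x_within  <- x - x_between     # each person's deviation from their group mean

fit <- lm(y ~ x_within + x_between)
coef(fit)                      # two slopes with opposite signs

# What an ecologic analysis of the same data would report
eco <- aggregate(cbind(x, y), list(g = g), mean)
coef(lm(y ~ x, data = eco))
```

With equal group sizes the ecologic slope matches the `x_between` coefficient, while the `x_within` coefficient is only available because individual-level data were used. A full multilevel analysis would add random effects for the groups, typically with a mixed-model package.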
The strategies above try to repair ecologic data so it can speak about individuals. The next subsection makes the alternative move: keep the inference at the group level and acknowledge that groups are sometimes the right unit of scientific interest in their own right.
14.6 Non-Ecologic Group-Level Studies
Not all studies using group-level data are ecologic studies. A critical distinction:
The Key Difference
When variables are measured at the group level AND inferences remain at the group level, the study is NOT ecologic. The group itself is the scale of interest, and the study examines how group-level characteristics (population density, policies, social environments) affect group-level outcomes.
Examples of non-ecologic group-level studies include:
- Health promotion programs targeting communities, with outcomes measured at the community level
- Vaccination campaigns evaluated by population-level coverage and population-level incidence
- Organizational interventions in clinics or hospitals, with organization-level outcomes
14.6.1 The Question of Inference Level
Rose (2001) distinguished two key epidemiological questions:
- "What is the etiology of a case?" This is an individual-level question, seeking to understand why a particular person became ill.
- "What is the etiology of incidence?" This is a population-level question, seeking to understand why populations have different disease rates.
Both questions are important; the appropriate level of analysis depends on the research question. The atomistic fallacy arises when researchers reduce all phenomena to individual-level explanations, ignoring emergent group properties.
14.6.2 Quality of Current Ecologic Research
Dufault & Klar (2011) reviewed the reporting quality of ecologic studies and found concerning patterns:
- Only 18% explicitly justified their choice of ecologic units
- 97% of outcomes were aggregate in nature
- 54% relied on fewer than 100 group-level observations
- Only 42% adequately justified why an ecologic design was necessary
- Most studies did not sufficiently inform readers about possible ecologic bias
Reflection
Consider a city that wants to evaluate whether its new bicycle-sharing program has reduced cardiovascular disease rates. Would an ecologic design or individual-level design be more appropriate? What are the trade-offs? How might you combine both approaches using multilevel modelling?
1. Which of the following conditions makes ecologic bias LESS likely?
2. What is multilevel modelling (MLM) in the context of ecologic studies?
3. When is a group-level study NOT considered an ecologic study?
Final Assessment
Bringing It All Together
This lesson moved from the rationale for ecologic designs through the variables and models they use, to the structural traps that make them treacherous, and finally to the analytic and conceptual responses to those traps. The arc was deliberate: every step was preparation for being able to read a published group-level study without overclaiming or underclaiming what its data actually support.
The single most important idea to carry forward is the one Robinson demonstrated in 1950: a finding that holds at the group level need not hold at the individual level, and the reverse error (the atomistic fallacy) is equally damaging in the other direction. The three structural sources of ecologic bias (within-group misclassification, group-level confounding, effect modification by group) explain why the cross-level move can fail. Multilevel modelling and the two-phase design explain how to bring some of that signal back without abandoning the group-level data we already have. Rose’s distinction between the etiology of a case and the etiology of incidence reminds us that the right unit of analysis depends on the question being asked, not on which is more familiar from earlier lessons.
Lesson 7 takes the next step: how the constructs we measure (exposure, outcome, confounder) are defined and operationalised in the first place — the conceptualisation step that determines whether any of the cross-level inference machinery in this lesson can do the work we want it to.
The companion R script r-activities/HSCI_230_Lesson_6_Ecological_and_Group_Level_Studies.R simulates three groups whose group means line up almost perfectly while individuals within each group are essentially uncorrelated. You will compute the pooled correlation, the within-group correlations, and the group-mean (ecologic) correlation, and watch the three numbers diverge — a worked demonstration of the cross-level inference trap that defines this lesson.
set.seed(230)
# Three groups; means line up positively, individuals are flat within group.
group <- rep(c("A", "B", "C"), each = 50)
x <- c(rnorm(50, 2), rnorm(50, 5), rnorm(50, 8))
y <- c(rnorm(50, 3), rnorm(50, 6), rnorm(50, 9))
# Individual-level correlation (overall - mostly driven by group means)
cor(x, y)
# Individual-level correlation WITHIN each group (truth at the person level)
tapply(seq_along(x), group, function(i) cor(x[i], y[i]))
# Group-level (ecologic) correlation: nearly perfect by construction
gmean <- aggregate(cbind(x, y), list(group = group), mean)
cor(gmean$x, gmean$y)
Reflection
Reflecting on this lesson, describe a scenario from public health or your field of interest where an ecologic study design would be the most practical and informative approach. What safeguards would you implement to minimize the risk of the ecologic fallacy?
Key Takeaways from Lesson 6
- Ecologic studies measure exposure and outcome at the group level; partial ecologic studies mix individual- and group-level variables, which introduces its own inferential challenges.
- The ecologic fallacy (Robinson, 1950) is the error of assuming a group-level association applies to individuals; the atomistic fallacy (Schwartz, 1994) is the symmetric error in the other direction.
- Three structural sources of ecologic bias — within-group exposure misclassification, group-level confounding, and effect modification by group — explain why the cross-level move so often fails.
- The ecologic linear model estimates a group-level rate ratio that requires extrapolation to 0% and 100% exposure, well beyond most observed data.
- Multilevel modelling and the Wakefield–Haneuse two-phase design are the principal analytic strategies for recovering individual-level signal from group-level data.
- Not every group-level study is ecologic: when both data and inferences stay at the group level, the cross-level fallacies do not apply — Rose’s “etiology of incidence” is a legitimate and important question in its own right.
1. Ecologic studies differ from other observational designs primarily because:
2. A "partial ecologic study" is one where:
3. Which is an example of a global variable?
4. Aggregate variables in ecologic studies are:
5. In the ecologic linear model, the group-level incidence rate ratio IRG requires:
6. The ecologic fallacy refers to:
7. The atomistic fallacy is:
8. Non-differential exposure misclassification in ecologic studies biases estimates:
9. Group-level confounding in ecologic studies:
10. In Example 29.6 from the textbook, effect modification by group caused:
11. Ecologic bias is LESS likely when:
12. Multilevel modelling (MLM) helps address ecologic bias by:
13. A study measuring the effect of a city's water fluoridation policy on community-level dental health (with inferences remaining at the community level) is:
14. According to Dufault and Klar (2011), what proportion of ecologic studies adequately justified the choice of ecologic design?
15. Rose (2001) distinguished between two key epidemiological questions. Which pair correctly represents them?