# Lesson 6 — Ecological & Group-Level Studies (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,319 words • ~28.8 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 6, Ecological and Group-Level Studies. And I'll say up front, this is one of the most conceptually important lessons in this series, even though the methods themselves can sometimes feel like a side trip from the main road.

**Sarah:** Okay, let me set the stage. The last few lessons all worked at the level of the individual person. Cross-sectional, case-control, and cohort. All three sample people, measure exposures and outcomes on people, and try to make claims about people.

**Kiffer:** Right. And this lesson changes the unit of analysis to the group. Counties. Schools. Nations. Neighborhoods. Provinces. Census tracts. The thing you're sampling is no longer a person. It's a place, or an institution, or some other collective.

**Sarah:** And immediately you get a problem the previous designs did not have. Even when group-level associations are strong and well-measured, you cannot, in general, conclude anything about the individuals inside those groups.

**Kiffer:** And the central trap that the entire lesson is organized around has a name. The ecological fallacy. The phrase you'll spend most of this lesson learning to recognize.

**Sarah:** Can we ground this with a quick example before we go further? Because the abstract version sounds harmless until you see it in action.

**Kiffer:** Sure. Imagine you find that countries with higher per-capita chocolate consumption have more Nobel Prize winners per capita. The data, at the country level, shows a clean positive correlation. Switzerland, Sweden, Belgium. Lots of chocolate. Lots of Nobel laureates. And actually this is a real correlation that was published in the New England Journal of Medicine in 2012.

**Sarah:** Right. And the question is, does that mean if I personally eat more chocolate, I become more likely to win a Nobel Prize?

**Kiffer:** No. Obviously not. The countries with high chocolate consumption tend to be wealthy European countries with strong research infrastructure, lots of universities, deep traditions of scientific funding. Those same countries also produce lots of Nobel Prize winners. Chocolate is not causing Nobels. They're both effects of being a wealthy European country with a long academic tradition.

**Sarah:** And the danger is that if you took the country-level association and applied it to individuals, eat more chocolate, become a Nobel laureate, you would be making a category error. The chocolate-Nobel correlation is real at the country level. It just does not transfer down to individuals.

**Kiffer:** That's the ecological fallacy in a nutshell. And the lesson walks you through why it happens, when it happens, and what analytic tools can mitigate it.

**Sarah:** Okay, four sections. Section 1, the rationale for these designs and why anyone uses them despite the fallacy. Section 2, the variables they use and the linear model. Section 3, the inferential traps, including the fallacy itself. And Section 4, the mitigations and the related-but-distinct world of group-level studies that don't commit the fallacy at all.

**Kiffer:** Let's start with Section 1. What is an ecologic study, and why would anyone do one?

**Sarah:** An ecologic study is a study where exposure, outcome, and confounders are all measured at the group level. Counties. Townships. Nations. School districts. And then the researcher wants to make inferences about individuals. The groups serve as cluster samples of the population.

**Kiffer:** And the lesson distinguishes a few subtypes that are worth knowing. Exploratory ecologic studies don't measure exposure directly. They look for associations to guide future research. Analytic ecologic studies do measure the exposure factor and include it in the analysis. And partial ecologic studies combine some individual-level variables with group-level variables, which introduces its own unique inferential challenges.

**Sarah:** And the primary limitation, which the lesson states upfront, is that we don't know the joint distribution of risk factors and disease within groups. We see the totals. We see, for example, that 30 percent of a county is exposed and that the cancer rate in that county is 50 per 100,000. But we don't see who within that county has both the exposure and the disease.

**Kiffer:** And that ignorance about within-group associations is exactly what creates the potential for severe bias when we try to infer from group level to individual level. We're missing the most important piece of the puzzle. The thing that matters for individuals.
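To make the missing joint distribution concrete, here is a minimal Python sketch. It assumes a hypothetical county of 100,000 residents (the population size is an assumption; the 30 percent exposed and the rate of 50 per 100,000 are the figures quoted above) and shows that the same group totals are compatible with very different individual-level rate ratios.

```python
# Hypothetical county: the population size is an assumption chosen so the
# quoted figures (30% exposed, 50 cases per 100,000) become whole numbers.
POPULATION = 100_000
EXPOSED = 30_000
UNEXPOSED = POPULATION - EXPOSED
TOTAL_CASES = 50


def individual_irr(cases_in_exposed):
    """Individual-level rate ratio implied by one possible split of the cases."""
    cases_in_unexposed = TOTAL_CASES - cases_in_exposed
    rate_exposed = cases_in_exposed / EXPOSED
    rate_unexposed = cases_in_unexposed / UNEXPOSED
    if rate_unexposed == 0:
        return float("inf")  # every case is exposed
    return rate_exposed / rate_unexposed


# Every split below produces the same county totals,
# yet the within-county rate ratio differs each time.
for cases_in_exposed in (0, 15, 45, 50):
    print(cases_in_exposed, individual_irr(cases_in_exposed))
```

The ecologic data only record the margins, so nothing in them distinguishes a split with no association (15 exposed cases) from one with a very strong association (45 exposed cases).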

**Sarah:** Two short opening examples make the design concrete. The first is Example 29.1 from the textbook. County-level data on cancer incidence and arsenic levels in groundwater in Idaho. After adjusting for confounders, no significant relationship was found between county-level arsenic exposure and cancer incidence.

**Kiffer:** Quick context on arsenic, since this matters. Arsenic is a naturally occurring metalloid that leaches into groundwater from certain rock formations. Idaho, like many parts of the American West, has natural geological deposits that put arsenic into well water. And arsenic is a known human carcinogen. It has been linked, at the individual level, to bladder cancer and skin cancer in particular.

**Sarah:** So you'd think a county-level analysis would pick up the signal. Higher arsenic counties would have more cancer.

**Kiffer:** And it didn't. Which illustrates one of the recurring patterns in this lesson. Group-level analyses can fail to detect individual-level associations that really do exist. The aggregation washes out the signal. Or it picks up something else entirely.

**Sarah:** The second example is Example 29.2. Bladder cancer mortality rates across US states, regressed on state-level smoking prevalence, health insurance coverage, UV index, and water supply type. The analysis identified some associations. But the question, always, is whether they reflect true individual-level mechanisms or just structural correlations between states.

**Kiffer:** Now the question students always ask. Given the inferential limit we just stated, why does anyone do ecologic studies at all?

**Sarah:** Four reasons. First, measurement constraints at the individual level. Some exposures are impractical or impossible to measure for everyone. Historical pollution levels going back decades. Region-wide policy changes. The cumulative dietary intake of an entire population. Group-level aggregates serve as proxies. You cannot reconstruct what every person in a county was breathing in 1985, but you can sometimes find county-level air quality records.

**Kiffer:** Second, exposure homogeneity within groups. Sometimes within a group, exposure is essentially uniform. All residents of a region drink from the same water supply. All schoolchildren in a district get the same curriculum-based intervention. All patients in a clinic receive the same standard of care. In those situations, the group is the natural unit of exposure variation. Individuals within a group don't differ on the exposure, so there's nothing to be gained by sampling at the individual level.

**Sarah:** Third, interest in group-level effects. Sometimes the question itself is about the group, not the individual. Do communities with water fluoridation have lower dental caries rates? Do nations with higher vaccination coverage have lower measles incidence? Do cities with bicycle infrastructure have lower cardiovascular disease rates at the population level? The group is the unit of scientific interest in its own right.

**Kiffer:** And fourth, the lesson flags this as a warning rather than a virtue. Simplicity. Ecologic analysis is often faster and cheaper than acquiring individual-level data across many groups. But the lesson is sharp about this. That simplicity hides serious methodological problems. Saving time by aggregating data can cost you the ability to draw individual-level conclusions at all.

**Sarah:** So the first three reasons are about when group-level data are the only data available or the right data for the question. The fourth is a caution that simplicity is not free.

**Kiffer:** Right. Okay, Section 2. The variables and the linear model.

**Sarah:** Three categories of variables in ecologic models. Aggregate, environmental, and global. And the lesson is careful to distinguish them because they differ in whether the variable has any individual-level analogue at all.

**Kiffer:** Let's walk through them one at a time. Aggregate variables. These are summaries of individual-level measurements within a group. Mean body mass index of residents. Proportion of smokers. Median household income. Disease rate. Each one is built by taking individual data and averaging or proportioning it up to the group level.

**Sarah:** So aggregate variables have a clear individual analogue. Each person in the group has their own body mass index. Each person is or isn't a smoker. The group-level variable is just the summary of those individual values.

**Kiffer:** Environmental variables. These are physical measurements of the place itself. Air pollution levels. Average temperature. Water quality. Ambient noise. These exist in the environment and are experienced by everyone in the group. But they are properties of the place rather than summaries of individuals.

**Sarah:** And there's an interesting quasi-individual analogue here. You could in principle measure each person's personal air pollution exposure, with a wearable monitor, but the standard ecologic measurement is the place-level reading. Air quality at the central monitoring station, applied to everyone who lives in the catchment area.

**Kiffer:** Right. And then global variables. These are characteristics of the group that have no individual-level counterpart at all. Population density. Laws and regulations. Organizational policies. The presence or absence of a public health agency. The minimum legal drinking age. These are intrinsically group-level. There's no meaningful sense in which an individual person has a population density.

**Sarah:** And that distinction matters when we get to the ecological fallacy. Aggregate variables tempt you most strongly toward making individual-level claims, because the variable is just a summary of individuals. Global variables don't have that temptation, because there's no corresponding individual measurement to translate to.

**Kiffer:** Now the linear model. The standard analytic move in ecologic studies is a linear regression at the group level.

**Sarah:** Let me try to describe the model in plain words, because the lesson presents it with symbols, but for podcast purposes we want it spoken. The outcome rate in each group, say cancer rate, is modeled as a baseline value plus a slope coefficient times the proportion of the group that's exposed, plus another slope coefficient times a confounder, plus an error term. The intercept is the predicted outcome rate when the proportion exposed is zero. The slope tells you how much the outcome rate changes as the exposure proportion in the group goes up.

**Kiffer:** And from this model, the textbook gives Equation 29.1, which says the group-level incidence rate ratio equals one plus the slope coefficient divided by the intercept. In everyday terms, what that ratio is doing is comparing the predicted outcome rate in a hypothetical fully exposed group, where 100 percent of the people are exposed, to the predicted rate in a hypothetical fully unexposed group, where zero percent are exposed.

**Sarah:** And the catch is right there. Most observed groups don't span that range. You don't usually have counties where literally zero percent of people smoke. You don't have counties where literally one hundred percent of people smoke. The ratio depends on extrapolating beyond the range of your actual data.

**Kiffer:** Which is a methodological caution the lesson really wants you to internalize. The group-level incidence rate ratio you compute is a model-based extrapolation. It is not directly observed in any group.
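The spoken description of the model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the textbook's worked example: the four counties and their rates are made up, there is no confounder term, and `ols_line` is a hypothetical helper written for this sketch.

```python
def ols_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up counties: proportion exposed and outcome rate per 100,000.
# The rates were constructed as exactly 40 + 60 * proportion.
prop_exposed = [0.10, 0.20, 0.30, 0.40]
rate = [46.0, 52.0, 58.0, 64.0]

intercept, slope = ols_line(prop_exposed, rate)

# Group-level incidence rate ratio: one plus slope divided by intercept.
ecologic_irr = 1 + slope / intercept

# The same ratio written as "fully exposed" versus "fully unexposed" groups.
predicted_all_exposed = intercept + slope * 1.0
predicted_none_exposed = intercept + slope * 0.0

print(intercept, slope, ecologic_irr)
print(predicted_all_exposed / predicted_none_exposed)
```

The observed proportions run only from 0.10 to 0.40, so both ends of the comparison, zero and one hundred percent exposed, sit outside the data. That is the extrapolation the hosts are describing.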

**Sarah:** And there are a few more modeling pitfalls worth flagging. The first is correlation versus regression. About a third of published ecologic studies use simple correlation coefficients instead of regression coefficients. And the lesson says this is even worse than the regression-based approach.

**Kiffer:** Why is it worse? Because regression coefficients give you an estimate of the incidence rate difference, the actual change in outcome rate per unit change in exposure. A correlation coefficient only tells you how tightly the group-level points track a line. You can know that high-exposure groups tend to have high outcome rates, but you don't know by how much. So you cannot translate a correlation into an estimated rate ratio in any clean way.

**Sarah:** Second pitfall. Standardized rates. Some studies use standardized mortality ratios or standardized incidence ratios instead of crude rates. Standardization adjusts for differences in the age structure of populations, which is often necessary, but it introduces additional analytic complexity. The standardized rate is itself a model-based quantity, and the model has its own assumptions.

**Kiffer:** And third, interaction terms. The form of an interaction at the group level can differ from the form at the individual level. Group-level analyses typically use linear models, where effects add. Individual-level analyses typically use logit models, where effects multiply on the odds scale. So an interaction term in a group-level linear model and an interaction term in an individual-level logistic regression are not estimating the same quantity. They can disagree even when the underlying biology is the same.

**Sarah:** Which sets up Section 3 nicely. Because the trap we keep flagging gets formalized here.

**Kiffer:** Section 3. The trap. The fallacy itself.

**Sarah:** The ecological fallacy. The error of assuming that a group-level association applies to individuals. A finding at the group level, exposure associated with three times the disease risk, does not necessarily mean the same is true for individuals.

**Kiffer:** And the classic formal demonstration comes from William S. Robinson in 1950. Robinson was an American sociologist at Columbia University. His paper, Ecological Correlations and the Behavior of Individuals, was published in the American Sociological Review.

**Sarah:** And what did he do? Walk us through it, because this is the original demonstration.

**Kiffer:** He used 1930 US census data on race and literacy. At the state level, states with higher proportions of Black residents had higher illiteracy rates. The naive ecological inference, the one Robinson was warning against, would be to say that being Black causes illiteracy.

**Sarah:** Which is obviously wrong as a causal claim, but the data did show that pattern.

**Kiffer:** Right. And when Robinson then looked at individual-level data within states, the within-group association was much weaker than the between-group association. Most of the state-level pattern was driven by the fact that Black residents at the time were concentrated in southern states with worse education systems for everyone, regardless of race. The state-level correlation reflected geography and segregation and underfunded schools, not an individual-level relationship.

**Sarah:** So the within-state slope and the between-state slope had completely different magnitudes.

**Kiffer:** Different magnitudes, and in the most dramatic cases the slopes can have different signs entirely. And what Robinson showed is structural. The same dataset can support opposite conclusions at different levels of analysis. The ecological regression can point the wrong way for individuals.

**Sarah:** And the lesson notes that group-level bias typically exaggerates the association away from the null. Sometimes it can reverse the direction entirely, but more often it just inflates whatever is there.

**Kiffer:** Now the lesson also names a second, opposite error. The atomistic fallacy. Assuming that individual-level findings apply at the group level.

**Sarah:** And the classic example is herd immunity. Walk us through that.

**Kiffer:** Herd immunity is a population-level phenomenon. When a sufficient fraction of a population is immune to an infectious disease, either through vaccination or prior infection, the disease cannot spread efficiently. Even unvaccinated people are protected, because the chains of transmission break down before they reach those individuals.

**Sarah:** And the key point is that this protection only exists at the population level. There is no individual-level concept that maps onto herd immunity. You can't say, this individual has herd immunity. The property is emergent. It exists for populations, not for the individuals inside them.

**Kiffer:** So if you took an individual-level finding, that being unvaccinated raises your individual risk of measles, and naively scaled it up, you might predict that a population of unvaccinated people would have a measles rate proportional to their individual risks. But that ignores the network structure of transmission. At the population level, vaccination coverage interacts with population density and contact patterns in ways that don't show up at the individual level at all.

**Sarah:** And populations have other emergent properties. Hospital capacity. Health system organization. Cultural norms. Policy environments. These are not just averages of individual properties. They are different things.

**Kiffer:** So the key distinction. The ecological fallacy is when group-level findings get incorrectly applied to individuals. The atomistic fallacy is when individual-level findings get incorrectly applied to groups. They're mirror errors.

**Sarah:** Now the lesson catalogs three specific mechanisms by which the ecological fallacy turns from a structural worry into actual quantitative bias. Three sources of ecologic bias.

**Kiffer:** Source one. Within-group misclassification. And here's a really counterintuitive fact. Non-differential misclassification at the individual level, the kind that biases individual-level studies toward the null, biases group-level estimates away from the null. Opposite direction.

**Sarah:** Okay, walk us through why.

**Kiffer:** At the group level, what matters is the proportion of people in each group classified as exposed. If your exposure measurement has imperfect sensitivity and specificity, the proportions you observe are compressed toward the middle. Groups with truly high exposure look slightly less exposed than they really are. Groups with truly low exposure look slightly more exposed. So the apparent contrast between groups gets shrunk.

**Sarah:** And then the group-level slope, which is fitted to that compressed contrast, comes out steeper to compensate. Because you're attributing the same outcome difference to a smaller exposure difference.

**Kiffer:** Exactly. The textbook gives a formula, Equation 29.3, which expresses the group-level incidence rate ratio in terms of sensitivity, specificity, and the true individual-level rate ratio. In words, the bias is a function of how badly the exposure is measured at the individual level. And it goes the wrong way relative to what your intuition from individual-level studies would suggest.
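The compression argument can be checked numerically. The sketch below is not the textbook's Equation 29.3; it is a small Python illustration under assumed values (sensitivity and specificity of 0.9, made-up groups whose true rates are exactly 40 + 60 times the true proportion exposed). It shows the fitted slope and the ecologic rate ratio moving away from the null once exposure is misclassified.

```python
def ols_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def observed_proportion(true_prop, sensitivity, specificity):
    """Proportion classified as exposed: true positives plus false positives."""
    return sensitivity * true_prop + (1 - specificity) * (1 - true_prop)


# Assumed values for illustration only.
SENSITIVITY = 0.9
SPECIFICITY = 0.9
true_prop = [0.1, 0.3, 0.5, 0.7]
rate = [40.0 + 60.0 * p for p in true_prop]  # outcome rates are measured correctly

obs_prop = [observed_proportion(p, SENSITIVITY, SPECIFICITY) for p in true_prop]

true_intercept, true_slope = ols_line(true_prop, rate)
obs_intercept, obs_slope = ols_line(obs_prop, rate)

irr_true = 1 + true_slope / true_intercept
irr_obs = 1 + obs_slope / obs_intercept

print("observed proportions:", obs_prop)  # squeezed toward the middle
print("slope:", true_slope, "->", obs_slope)
print("ecologic IRR:", irr_true, "->", irr_obs)
```

With these numbers the observed proportions span 0.18 to 0.66 instead of 0.1 to 0.7, the slope rises from 60 to about 75, and the ecologic rate ratio rises from 2.5 to about 3.3.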

**Sarah:** Source two. Group-level confounding. Differential distribution of individual-level risk factors across groups.

**Kiffer:** And here's another counterintuitive point. Factors that are not confounders at the individual level can still cause confounding at the group level. The textbook works through this in Example 29.5. You can have a setting where an individual-level analysis correctly finds no confounding by some variable, and yet the group-level analysis is severely biased by that same variable.

**Sarah:** Wait, how is that possible?

**Kiffer:** Because at the group level, you're working with means and proportions. If groups differ in their average level of some risk factor, and that risk factor also affects the outcome, then the between-group association picks up an effect that, at the individual level, was uncorrelated with exposure. The aggregation creates a correlation that didn't exist for individuals.

**Sarah:** And the lesson notes that controlling for extraneous risk factors in ecologic analysis only removes part of the bias. You can't fully fix it by adjustment, because the structural problem is in the aggregation itself, not in any one variable.
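A small Python sketch can show how a factor that is not a confounder inside any group still biases the group-level slope. This is not the textbook's Example 29.5. All numbers are hypothetical: an additive individual-level model with a true exposure effect of 20 per 100,000, a second risk factor worth 60 per 100,000, and groups where the prevalence of that second factor happens to track exposure prevalence. The sketch does not attempt to show the partial effect of adjustment.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical additive individual-level model, rates per 100,000:
# rate = BASELINE + EXPOSURE_EFFECT * exposed + FACTOR_EFFECT * has_factor
BASELINE = 40.0
EXPOSURE_EFFECT = 20.0
FACTOR_EFFECT = 60.0

exposure_prev = [0.1, 0.3, 0.5, 0.7]  # proportion exposed in each group
factor_prev = [0.1, 0.3, 0.5, 0.7]    # proportion with the other risk factor

# Inside each group the factor is assumed independent of exposure, so exposed
# and unexposed people carry it equally often and the crude within-group rate
# difference equals the true exposure effect.
within_group_rd = []
for q in factor_prev:
    rate_exposed = BASELINE + EXPOSURE_EFFECT + FACTOR_EFFECT * q
    rate_unexposed = BASELINE + FACTOR_EFFECT * q
    within_group_rd.append(rate_exposed - rate_unexposed)

# Group-level rates average the individual model over each group's composition.
group_rate = [
    BASELINE + EXPOSURE_EFFECT * p + FACTOR_EFFECT * q
    for p, q in zip(exposure_prev, factor_prev)
]
ecologic_slope = ols_slope(exposure_prev, group_rate)

print("within-group rate differences:", within_group_rd)
print("ecologic slope:", ecologic_slope)
```

Every within-group rate difference is 20, yet the ecologic slope comes out near 80, because the between-group analysis absorbs the effect of the second factor.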

**Kiffer:** Source three. Effect modification by group. Sometimes called interaction by group. When the rate difference at the individual level varies across groups, you get a mismatch between the additive linear model used at the group level and the multiplicative logit model used at the individual level.

**Sarah:** And the lesson points to Example 29.6, which is the most striking case in the textbook. Effect modification by group completely reversed the direction of association. The true individual-level incidence rate ratio was 5.0, meaning the exposure caused a fivefold increase in disease at the individual level. And the ecologic incidence rate ratio was 0.67. Less than one. Making a harmful exposure look protective at the group level.

**Kiffer:** That is the worst-case version of the fallacy. The group-level analysis tells you the exposure prevents disease. The individual-level reality is that the exposure causes a fivefold increase in disease. Same data. Two opposite stories, depending on the unit of analysis.
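The reversal can be reproduced with three made-up groups. The Python sketch below does not use the numbers from Example 29.6. It assumes a constant unexposed rate of 10 per 100,000 and an exposure effect that shrinks as exposure prevalence rises, which is one hypothetical form of effect modification by group.

```python
def ols_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical groups, rates per 100,000.
UNEXPOSED_RATE = 10.0
prop_exposed = [0.2, 0.4, 0.6]
rate_difference = [80.0, 30.0, 10.0]  # exposure effect varies by group

# Individual-level rate ratio inside each group: exposed rate over unexposed rate.
individual_irr = [(UNEXPOSED_RATE + rd) / UNEXPOSED_RATE for rd in rate_difference]

# Overall group rate is the unexposed rate plus the effect among the exposed share.
group_rate = [
    UNEXPOSED_RATE + rd * p for rd, p in zip(rate_difference, prop_exposed)
]

intercept, slope = ols_line(prop_exposed, group_rate)
ecologic_irr = 1 + slope / intercept

print("individual IRR by group:", individual_irr)  # all above 1
print("ecologic slope:", slope)                    # negative
print("ecologic IRR:", ecologic_irr)               # below 1
```

Exposure is harmful inside every group, with rate ratios of 9, 4, and 2, but the group-level line slopes downward and the ecologic rate ratio lands near 0.2.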

**Sarah:** Then the lesson tells you when ecologic bias is less likely. Conditions that minimize the problem. A large observed range of exposure across groups, so there's actually variation to detect. Small within-group variance of exposure, so groups are homogeneous internally. Strong risk factors that vary substantially across groups. Similar distribution of extraneous risk factors across groups, so there's little group-level confounding. And including positive and negative health controls to test the structural assumptions.

**Kiffer:** And the lesson is also clear that cross-level bias will not occur if two specific conditions hold. The incidence rate difference within groups is uniform across groups, meaning the exposure has the same effect on individuals everywhere. And there is no correlation between group-level exposure and the rate of the outcome in the unexposed.

**Sarah:** Both of which are strong assumptions. They essentially require no effect modification by group and no group-level confounding by anything that affects baseline risk. So in practice, those conditions are rarely fully met.

**Kiffer:** And the lesson includes two interactive simulators that I really recommend playing with. They're worth the time.

**Sarah:** The first is the Ecological Fallacy Explorer. You set a within-group slope, which describes how exposure relates to outcome for individuals inside the same group. And you set a between-group slope, which describes how the group means relate. When the two slopes have opposite signs, you can watch the ecological regression line literally point the wrong way for individuals.

**Kiffer:** And there's a Robinson 1950 preset that reproduces the original demonstration. Try that one first. Then, for the most dramatic version, set the within-group slope to plus one and the between-group slope to minus one, and the ecological regression confidently announces a negative association in a dataset where every within-group relationship is positive.

**Sarah:** Building the failure mode by hand is the fastest way to feel why the fallacy is structural rather than a fluke. You're not seeing a one-off coincidence. You're seeing a logical consequence of how aggregation works.
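For listeners without access to the simulator, the same failure mode can be built in a few lines of Python. This is a sketch with made-up data, not the simulator's code: four groups of three people, constructed so the within-group slope is plus one and the between-group slope is minus one.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


group_centres_x = [0.0, 4.0, 8.0, 12.0]
offsets = [-1.0, 0.0, 1.0]  # three people per group

groups = []
for centre_x in group_centres_x:
    centre_y = 12.0 - centre_x            # group means fall on a line of slope -1
    xs = [centre_x + d for d in offsets]
    ys = [centre_y + d for d in offsets]  # inside a group, y rises with x (slope +1)
    groups.append((xs, ys))

# Ecologic analysis: regress group mean outcome on group mean exposure.
mean_x = [sum(xs) / len(xs) for xs, _ in groups]
mean_y = [sum(ys) / len(ys) for _, ys in groups]
ecologic_slope = ols_slope(mean_x, mean_y)

# Within-group analysis: centre each person on their own group's means.
dev_x, dev_y = [], []
for xs, ys in groups:
    gx = sum(xs) / len(xs)
    gy = sum(ys) / len(ys)
    dev_x += [x - gx for x in xs]
    dev_y += [y - gy for y in ys]
within_slope = ols_slope(dev_x, dev_y)

print("ecologic slope:", ecologic_slope)  # -1.0
print("within slope:", within_slope)      # 1.0
```

Nothing here is random, so the opposite signs are a direct consequence of how the data were aggregated rather than a sampling accident.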

**Kiffer:** And the second simulator is the MAUP Sandbox. MAUP stands for the modifiable areal unit problem. We should spell that out because it'll come up again. The modifiable areal unit problem is the observation that when you analyze spatial data, the result depends on how you draw the area boundaries.

**Sarah:** And the simulator demonstrates this beautifully. It holds the underlying individual-level data fixed. The same 144 simulated people, with their exposure and outcome values fixed. And then it lets you change only how zone boundaries are drawn around those people.

**Kiffer:** Same individuals. Different zoning. Different ecological findings. The area-level correlation can change dramatically. It can even flip sign. Without anything about the people changing.
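A rough Python analogue of that experiment is below. It is not the simulator's code. It assumes 144 hypothetical people on a 12 by 12 grid, with exposure and outcome values built from grid position so the result is easy to verify, and two zonings that cut the same grid into vertical or horizontal strips.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either variable has no variation."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / sqrt(saa * sbb)


def zone_means(values, zones):
    """Average the individual values inside each zone, ordered by zone label."""
    totals, counts = {}, {}
    for value, zone in zip(values, zones):
        totals[zone] = totals.get(zone, 0.0) + value
        counts[zone] = counts.get(zone, 0) + 1
    return [totals[z] / counts[z] for z in sorted(totals)]


# 144 fixed, made-up individuals on a 12 x 12 grid.
people = [(row, col) for row in range(12) for col in range(12)]
exposure = [float(col + row) for row, col in people]
outcome = [float(col - row) for row, col in people]

# Two ways to draw three zones around the same people.
vertical_strips = [col // 4 for row, col in people]
horizontal_strips = [row // 4 for row, col in people]

individual_r = pearson(exposure, outcome)
vertical_r = pearson(zone_means(exposure, vertical_strips),
                     zone_means(outcome, vertical_strips))
horizontal_r = pearson(zone_means(exposure, horizontal_strips),
                       zone_means(outcome, horizontal_strips))

print("individual-level r:", individual_r)
print("vertical zoning r:", vertical_r)
print("horizontal zoning r:", horizontal_r)
```

In this constructed case the individual-level correlation is zero, one zoning gives an area-level correlation of plus one, and the other gives minus one, even though no person's values change.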

**Sarah:** Which is a crucial point. Ecologic results are sensitive to choices that look administrative rather than scientific. Where you draw the boundaries determines what you find. A cancer cluster study might show significant clustering at the postal code level but not at the health region level. Both results are real. Both are partly artifacts of where someone decided to draw the lines.

**Kiffer:** And the choice of areal unit is often made by people who weren't thinking about epidemiology at all. Census tracts are drawn for census purposes. Postal codes are drawn for mail delivery. Health authorities are drawn for administrative convenience. The epidemiologist inherits those boundaries and has to live with whatever statistical artifacts they introduce.

**Sarah:** Section 4. The constructive response.

**Kiffer:** First, the analytic strategies for minimizing ecological bias. Multilevel modeling, also called hierarchical modeling. The idea is that you simultaneously model variation at the individual level and variation at the group level. By incorporating both, you can distinguish individual-level effects from contextual group-level effects.

**Sarah:** And we should pause on the term contextual effects, because it's important. A contextual effect is the effect of a group-level characteristic on individual outcomes, over and above the individual's own characteristics. So even after controlling for an individual's own income, the average income of their neighborhood might still affect their health. That residual neighborhood effect is contextual.

**Kiffer:** And to detect contextual effects you need data at both levels. Individual incomes and neighborhood mean incomes. Multilevel modeling estimates them jointly. You also get an estimate of the intracluster correlation coefficient, abbreviated ICC, which tells you how much of the total variation in the outcome lies between groups versus within groups.
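As a rough illustration of what the ICC measures, the Python sketch below uses the classic one-way ANOVA estimator for balanced groups. That is a simpler stand-in, not the estimate a fitted multilevel model would report, and the neighborhood values are made up.

```python
def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC for equally sized groups."""
    n_groups = len(groups)
    size = len(groups[0])
    group_means = [sum(g) / size for g in groups]
    grand_mean = sum(group_means) / n_groups

    # Mean square between groups and mean square within groups.
    ms_between = size * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum(
        (value - m) ** 2 for g, m in zip(groups, group_means) for value in g
    ) / (n_groups * (size - 1))

    return (ms_between - ms_within) / (ms_between + (size - 1) * ms_within)


# Hypothetical outcome values for two residents in each of three neighborhoods.
neighborhoods = [[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]
icc = icc_oneway(neighborhoods)
print(icc)  # most of the variation lies between neighborhoods

# If residents of a neighborhood are identical, all variation is between groups.
identical_within = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(icc_oneway(identical_within))
```

An ICC near one means people in the same group resemble each other far more than people in different groups. An ICC near zero means group membership explains little of the variation.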

**Sarah:** Multilevel modeling is its own substantial methodological topic. We'll see it again later in this series. For now, the key point is that when you have data at multiple levels, modeling them at multiple levels is the appropriate response. Aggregating up to the group level loses information. Disaggregating down to the individual level pretends you have information you don't.
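The contextual effect described above can be separated by hand in a toy case. The Python sketch below is not a multilevel model. It assumes noise-free, made-up data in which health depends on a person's own income (coefficient 2) and on the neighborhood's mean income (coefficient 3), then recovers the contextual effect as the between-group slope minus the within-group slope.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical incomes for three residents in each of three neighborhoods.
incomes = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Assumed data-generating rule (no noise):
# health = 50 + 2 * own income + 3 * neighborhood mean income
health = []
for group in incomes:
    neighborhood_mean = sum(group) / len(group)
    health.append([50.0 + 2.0 * x + 3.0 * neighborhood_mean for x in group])

# Within-group slope: each person centred on their own neighborhood's means.
dev_x, dev_y = [], []
for xs, ys in zip(incomes, health):
    gx = sum(xs) / len(xs)
    gy = sum(ys) / len(ys)
    dev_x += [x - gx for x in xs]
    dev_y += [y - gy for y in ys]
within_slope = ols_slope(dev_x, dev_y)

# Between-group slope: neighborhood mean health on neighborhood mean income.
mean_x = [sum(xs) / len(xs) for xs in incomes]
mean_y = [sum(ys) / len(ys) for ys in health]
between_slope = ols_slope(mean_x, mean_y)

contextual_effect = between_slope - within_slope

print("within (individual) slope:", within_slope)  # 2.0
print("between (ecologic) slope:", between_slope)  # 5.0
print("contextual effect:", contextual_effect)     # 3.0
```

A purely ecologic analysis would see only the between-group slope of 5 and could not say how much of it belongs to individuals and how much to the neighborhood.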

**Kiffer:** Second strategy. The two-phase design from Wakefield and Haneuse in 2008.

**Sarah:** Quick context. Jon Wakefield is a British biostatistician at the University of Washington. Sebastien Haneuse is now at the Harvard T. H. Chan School of Public Health. In 2008 they published a paper in Statistical Methods in Medical Research formalizing a particular hybrid design for ecologic data.

**Kiffer:** And the idea is to combine ecologic data with individual-level data using outcome-dependent sampling within groups. Phase one collects ecologic data on all groups. Phase two collects individual-level data, but only on a sample of individuals, with the sampling probability depending on outcome status.

**Sarah:** So you don't need complete individual-level data for every group. Just enough to anchor the ecologic associations and pin down the within-group joint distribution that pure ecologic data leaves invisible.

**Kiffer:** And third, the use of prior information. If you have information from previous individual-level studies about the within-area joint probabilities and the contextual effects, you can incorporate that into your current ecologic analysis. Often through Bayesian methods, where the prior information becomes a prior distribution and the ecologic data updates it.

**Sarah:** Then the lesson makes a really important conceptual move. Not all studies that use group-level data are ecologic studies.

**Kiffer:** And this distinction often gets glossed over in textbook treatments. So pay attention to it. The distinguishing question is, where are the inferences directed? If you measure variables at the group level and your inferences are also at the group level, you are not committing the ecological fallacy. The fallacy only happens when you try to translate group-level findings into individual-level claims.

**Sarah:** Examples of legitimate non-ecologic group-level studies. Health promotion programs targeting whole communities, evaluated by community-level outcomes. Vaccination campaigns evaluated by population-level coverage and population-level incidence. Organizational interventions in clinics or hospitals, with outcomes measured at the organization level.

**Kiffer:** And those are perfectly legitimate scientific studies. They use group-level data. They draw group-level conclusions. They never make claims about individuals. So the ecological fallacy does not arise. The data and the inference are both at the same level.

**Sarah:** And the lesson invokes Geoffrey Rose here. Rose was a British epidemiologist at the London School of Hygiene and Tropical Medicine. He died in 1993, but his 1985 paper Sick Individuals and Sick Populations, reprinted in 2001, is one of the most influential pieces of conceptual work in modern epidemiology.

**Kiffer:** And Rose distinguished two key epidemiologic questions. First, what is the etiology of a case? Why did this individual person get sick? That's an individual-level question. And second, what is the etiology of incidence? Why does this population have these rates? Why does this population have higher rates than that one? That's a population-level question.

**Sarah:** And Rose's point is that these are different questions. They have different answers. They require different kinds of analysis. The factors that distinguish the people who got sick from those who didn't, within a population, are not the same as the factors that distinguish high-incidence populations from low-incidence ones.

**Kiffer:** His textbook example is salt and blood pressure. Within a single population, individual variation in salt intake doesn't strongly predict who develops hypertension. But between populations, average salt intake is a strong predictor of population hypertension prevalence. Same exposure. Different question. Different answer.

**Sarah:** And the implication is that the etiology of incidence is often about population-level factors. Cultural norms. Policy environments. Food supply. Built environment. Things that don't show up clearly when you only look at individuals.

**Kiffer:** Both questions are important. The atomistic fallacy arises when researchers reduce all phenomena to individual-level explanations and ignore emergent group properties. Public health, in particular, often has population-level questions as its real target.

**Sarah:** And the lesson includes a sobering note from Dufault and Klar in 2011. They reviewed the reporting quality of published ecologic studies.

**Kiffer:** Quick context. Brendan Dufault and Neil Klar were Canadian biostatisticians. Klar was at the University of Western Ontario. They published a methodological review of ecologic study reporting practices and found, frankly, concerning patterns.

**Sarah:** Yeah, what did they find?

**Kiffer:** Only 18 percent of published ecologic studies explicitly justified their choice of ecologic units. Why census tracts and not health authorities? Why countries and not provinces? The choice was usually unstated. Which is exactly the choice the modifiable areal unit problem says matters most.

**Sarah:** 97 percent of outcomes were aggregate in nature. Built up from individual-level measurements. Which is exactly the variable type where group-level summaries most easily get read as individual-level findings.

**Kiffer:** 54 percent relied on fewer than 100 group-level observations. That's a small sample for the kind of multivariable models often fit. With small numbers of groups, you have very limited power to detect anything, and the regression coefficients are unstable.
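To make the instability point concrete, here is a rough Python simulation for the show notes. Everything in it is an assumption chosen for illustration: a single-predictor linear model rather than the multivariable models mentioned above, made-up coefficients and noise level, and the hypothetical helper name `slope_spread`. It simply refits the same group-level regression many times and compares how much the estimated slope moves around with 20 groups versus 200.

```python
import numpy as np

def slope_spread(n_groups, n_reps=500, seed=0):
    """Standard deviation of the fitted slope across repeated simulated studies."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_reps)
    for i in range(n_reps):
        # Hypothetical groups: proportion exposed between 10% and 60%
        x = rng.uniform(0.1, 0.6, n_groups)
        # Assumed true model: rate = 0.010 + 0.020 * x, plus group-level noise
        y = 0.010 + 0.020 * x + rng.normal(0.0, 0.005, n_groups)
        slopes[i] = np.polyfit(x, y, 1)[0]  # polyfit returns [slope, intercept]
    return slopes.std()

spread_20 = slope_spread(20)
spread_200 = slope_spread(200)
print(f"slope SD with 20 groups:  {spread_20:.4f}")
print(f"slope SD with 200 groups: {spread_200:.4f}")
```

Under these assumptions the slope estimates scatter much more widely with 20 groups than with 200, which is the sense in which coefficients from small ecologic samples are unstable.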

**Sarah:** Only 42 percent adequately justified why an ecologic design was necessary. Most studies did not sufficiently inform readers about possible ecologic bias. So the limitations we just spent this whole episode describing were largely absent from the published reports.

**Kiffer:** Which is a sobering picture. The methodological literature on ecologic studies is good. The published practice often is not.

**Sarah:** Okay. Pulling the takeaways together.

**Kiffer:** Let me list them. There are seven main ones I'd want a student to leave with.

**Sarah:** Yeah, go for it.

**Kiffer:** First takeaway. Ecologic studies measure variables at the group level but typically want to infer about individuals. That cross-level inference is the central methodological challenge of the design. Everything else in the lesson is about either justifying when you can do that, or warning you about when you can't.

**Sarah:** Second. Three variable types. Aggregate variables, which are summaries of individuals. Environmental variables, which are properties of the place. And global variables, which have no individual analogue at all. The temptation toward the ecological fallacy is strongest with aggregate variables, because the individual-level interpretation is right there waiting.

**Kiffer:** Third. The standard analytic move is a linear regression at the group level, with each group's rate as the outcome and its proportion exposed as the predictor. The group-level incidence rate ratio equals one plus the ratio of the slope coefficient to the intercept. And the catch is that this ratio is a model-based extrapolation to groups with zero percent and 100 percent exposure, which usually lie outside your observed data.
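A minimal Python sketch of that regression-based rate ratio, for readers following along in the show notes. The five groups, their exposure proportions, and the coefficients 0.010 and 0.020 are made-up illustration values, not data from the lesson, and the function name `ecologic_rate_ratio` is hypothetical.

```python
import numpy as np

def ecologic_rate_ratio(prop_exposed, rates):
    """Fit rate = b0 + b1 * prop_exposed and return (b0, b1, 1 + b1 / b0)."""
    b1, b0 = np.polyfit(prop_exposed, rates, 1)  # highest-degree coefficient first
    # Predicted rate at 100% exposed is b0 + b1; at 0% exposed it is b0.
    return b0, b1, 1.0 + b1 / b0

# Hypothetical groups: proportion exposed ranges only from 20% to 60%
x = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
# Rates generated exactly from an assumed line with b0 = 0.010 and b1 = 0.020
y = 0.010 + 0.020 * x

b0, b1, rr = ecologic_rate_ratio(x, y)
print(f"intercept={b0:.3f}, slope={b1:.3f}, rate ratio={rr:.2f}")
```

Note that the observed exposure proportions run only from 20 to 60 percent, so the resulting ratio rests entirely on extending the fitted line to 0 and 100 percent.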

**Sarah:** Fourth. The ecological fallacy and the atomistic fallacy are mirror errors. The ecological fallacy applies group findings to individuals. The atomistic fallacy applies individual findings to groups. Robinson formally demonstrated the first one in 1950 with race and literacy data. Herd immunity is the canonical example of why the second one matters.
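The ecological fallacy described above can be built by hand in a few lines. This is a toy simulation under stated assumptions, not Robinson's data: five hypothetical groups whose average exposure and average outcome rise together, while inside every group the individual relationship is deliberately constructed to run the other way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group = 5, 200

group_ids = np.repeat(np.arange(n_groups), n_per_group)
group_level = group_ids.astype(float)  # group g has mean exposure g and mean outcome g

# Individual deviation from the group's mean exposure
dev = rng.normal(0.0, 1.0, size=group_ids.size)
x = group_level + dev
# Assumption: within a group, higher exposure goes with a LOWER outcome
y = group_level - dev + rng.normal(0.0, 0.3, size=group_ids.size)

# Ecologic (group-level) correlation: one point per group
x_means = np.array([x[group_ids == g].mean() for g in range(n_groups)])
y_means = np.array([y[group_ids == g].mean() for g in range(n_groups)])
ecologic_r = np.corrcoef(x_means, y_means)[0, 1]

# Within-group (individual-level) correlation: remove each group's mean first
x_centered = x - x_means[group_ids]
y_centered = y - y_means[group_ids]
within_r = np.corrcoef(x_centered, y_centered)[0, 1]

print(f"group-level correlation:  {ecologic_r:+.2f}")
print(f"within-group correlation: {within_r:+.2f}")
```

With this construction the group-level correlation is strongly positive while the within-group correlation is strongly negative, so reading the group-level result as a statement about individuals would get the sign wrong.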

**Kiffer:** Fifth. Three sources of ecologic bias. Within-group misclassification of exposure, which, when nondifferential, biases ecologic estimates away from the null, the opposite of the usual toward-the-null bias in individual-level studies. Group-level confounding, which can exist even when individual-level confounding does not. And effect modification by group, which can flip the direction of the association entirely. Example 29.6 shows a true individual-level rate ratio of 5.0 turning into a group-level ratio of 0.67.
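The away-from-the-null direction of the misclassification bias can be checked with a small Python sketch. The assumptions are loud ones: the same sensitivity and specificity (0.9 each, made-up values) apply in every group, the true group-level model is exactly linear with made-up coefficients, and there is no confounding or effect modification. Under those assumptions, misclassification compresses the range of observed exposure proportions, which steepens the fitted slope and lowers the intercept.

```python
import numpy as np

# Assumed true model: rate = 0.010 + 0.020 * (true proportion exposed), so RR = 3.0
b0_true, b1_true = 0.010, 0.020
sens, spec = 0.9, 0.9  # hypothetical, identical in every group

x_true = np.linspace(0.1, 0.7, 7)   # true proportion exposed in seven groups
rates = b0_true + b1_true * x_true  # group rates, no noise

# Observed proportion classified as exposed:
# true positives among the exposed plus false positives among the unexposed
x_obs = sens * x_true + (1 - spec) * (1 - x_true)

b1_obs, b0_obs = np.polyfit(x_obs, rates, 1)
rr_true = 1 + b1_true / b0_true
rr_obs = 1 + b1_obs / b0_obs
print(f"true rate ratio:     {rr_true:.2f}")
print(f"observed rate ratio: {rr_obs:.2f}")
```

In this setup the rate ratio estimated from the misclassified exposure proportions is larger than the true one, which is the opposite of what nondifferential misclassification typically does in an individual-level analysis.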

**Sarah:** Sixth. The modifiable areal unit problem. The same individual data, reorganized into different zone boundaries, can produce wildly different ecological correlations. Where you draw the lines matters. And the boundaries are usually drawn for administrative reasons that have nothing to do with epidemiology.
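A rough stand-in for what a zoning experiment looks like in code, written as a Python sketch. It is not the lesson's MAUP Sandbox, and it only illustrates the scale side of the problem: the same simulated individuals, laid out along a line, are grouped into small zones and then into large zones. The data-generating model, zone sizes, and the helper name `zone_correlation` are all assumptions for illustration.

```python
import numpy as np

def zone_correlation(x, y, zone_size):
    """Correlation of zone means when consecutive individuals share a zone."""
    n_zones = x.size // zone_size
    x_means = x[: n_zones * zone_size].reshape(n_zones, zone_size).mean(axis=1)
    y_means = y[: n_zones * zone_size].reshape(n_zones, zone_size).mean(axis=1)
    return np.corrcoef(x_means, y_means)[0, 1]

rng = np.random.default_rng(1)
n = 1200
position = np.arange(n) / n  # each individual's location along a line

# Assumption: exposure and outcome share a weak spatial trend, plus independent noise
x = position + rng.normal(0.0, 1.0, n)
y = position + rng.normal(0.0, 1.0, n)

r_individual = np.corrcoef(x, y)[0, 1]
r_fine = zone_correlation(x, y, zone_size=5)      # 240 small zones
r_coarse = zone_correlation(x, y, zone_size=200)  # 6 large zones

print(f"individual-level r: {r_individual:.2f}")
print(f"small-zone r:       {r_fine:.2f}")
print(f"large-zone r:       {r_coarse:.2f}")
```

Nothing about the individuals changes between the two runs. Only the zone size does, and averaging over larger zones washes out the individual noise and leaves the shared spatial trend, so the ecological correlation rises sharply.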

**Kiffer:** And seventh. Some studies use group-level data without committing the ecological fallacy at all, because their inferences also stay at the group level. Geoffrey Rose's distinction is the key. The etiology of a case is an individual question. The etiology of incidence is a population question. Both are legitimate. They require different designs and different inferences.

**Sarah:** And one more practical recommendation. Definitely play with the two simulators. The Ecological Fallacy Explorer and the MAUP Sandbox. Building the failure modes by hand is the fastest way to internalize that the ecological fallacy is structural, not a one-off oddity. You will not forget it once you have built it yourself.

**Kiffer:** And one more thing worth saying. Dufault and Klar's findings should sit with you when you read published ecologic studies. Most do not justify their choice of areal unit. Most do not adequately discuss the bias we just spent the lesson cataloguing. So you, as a critical reader, have to do that work yourself. Ask the questions the paper did not ask.

**Sarah:** Next up is Lesson 7. Conceptualization, Measurement, and Causal Specification. We move from the cross-cutting issues that show up across every study design into the deeper questions about what we're actually measuring and how we're modeling causation.

**Kiffer:** Take care, everyone.

**Sarah:** See you there.