# Lesson 11 — Confounding & Statistical Inference (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~6340 words • ~34 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 11, Confounding and Statistical Inference. This is the lesson that closes the bias inventory of the course.

**Sarah:** So we have spent the last several lessons walking through bias category by category. Selection bias in Lesson 8. Information bias in Lesson 9. Design specific and temporal biases in Lesson 10. And today is the third leg of what is sometimes called the canonical bias triad.

**Kiffer:** Right. Confounding. And then once we are done with confounding, we step back and look at a different family of problems. Things that go wrong in the statistical model itself, even after every bias has been carefully addressed. Things like model misspecification, multicollinearity, Simpson's paradox, and missing data. Those issues can turn an unbiased estimate into a wrong conclusion.

**Sarah:** So the lesson has two main content sections. Section 1 is confounding. Section 2 is statistical inference and model issues. And then a short closing section that is really a bridge into Lesson 12, the integrated appraisal.

**Kiffer:** Let's start at the beginning, with the textbook definition. A variable is a confounder of the relationship between an exposure and an outcome if it satisfies three conditions. First, it is associated with the exposure. Second, it is an independent risk factor for the outcome. And third, it is not on the causal pathway between the exposure and the outcome.

**Sarah:** Let me slow that down because the third condition is the one students miss. The first two are intuitive. The confounder has to be related to both the exposure and the outcome. Otherwise it is not connected to the picture at all.

**Kiffer:** Right. The third condition is what distinguishes a confounder from a mediator. A mediator sits on the causal pathway. The exposure causes the mediator, and the mediator causes the outcome. A confounder, by contrast, sits off to the side. It is a common cause or an alternative explanation, not an intermediate step.

**Sarah:** And that distinction matters operationally, because if you mistakenly adjust for a mediator thinking it is a confounder, you actually introduce bias rather than removing it. You partial out the very mechanism you should be measuring.

**Kiffer:** Exactly. Mediators carry the causal effect. Confounders represent alternative explanations. If you adjust for a confounder, you remove a competing story and clarify the true effect. If you adjust for a mediator, you erase part of the true effect itself.
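
*Companion sketch.* The difference between adjusting for a mediator and leaving it alone can be checked with a small simulation. This is a minimal sketch under invented assumptions: a linear world in which the exposure raises a mediator by 0.8 units, the mediator raises the outcome by 0.5, and the exposure also has a direct effect of 0.3. None of these numbers come from the lesson, and `ols_slopes` is a hypothetical helper.

```python
import numpy as np

# Hypothetical linear world: exposure -> mediator -> outcome, plus a direct path.
rng = np.random.default_rng(0)
n = 20_000
exposure = rng.normal(size=n)
mediator = 0.8 * exposure + rng.normal(size=n)                  # exposure causes the mediator
outcome = 0.5 * mediator + 0.3 * exposure + rng.normal(size=n)  # both feed the outcome

TRUE_DIRECT = 0.3
TRUE_TOTAL = TRUE_DIRECT + 0.8 * 0.5   # direct + indirect = 0.7


def ols_slopes(y, *columns):
    """Least-squares slopes of y on the given columns (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


unadjusted = ols_slopes(outcome, exposure)[0]           # targets the total effect
adjusted = ols_slopes(outcome, exposure, mediator)[0]   # only the direct part survives

print(f"mediator left out: {unadjusted:.2f} (true total effect {TRUE_TOTAL:.2f})")
print(f"mediator adjusted: {adjusted:.2f} (true direct effect {TRUE_DIRECT:.2f})")
```

Under these assumptions the adjusted coefficient lands near the direct effect only, so the part of the effect that travels through the mediator has been partialled out.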

**Sarah:** Okay. The lesson then dives into what is probably the most consequential confounding case study in modern epidemiology. Hormone replacement therapy and cardiovascular disease in postmenopausal women. Walk me through this one, because it is famous for a reason.

**Kiffer:** Sure. Hormone replacement therapy is a class of medications containing estrogen, sometimes combined with progestin, given to women after menopause to relieve symptoms like hot flashes. Through the nineteen eighties and nineties, observational studies consistently suggested that postmenopausal women who took it had cardiovascular disease risk that looked thirty to fifty percent lower than women who did not. Those studies included the Nurses' Health Study, a long running cohort of American nurses begun at Harvard in nineteen seventy six.

**Sarah:** And those observational findings were not minor. They actually shaped clinical practice. Doctors were routinely recommending hormone replacement therapy partly on the strength of those cardiovascular benefits.

**Kiffer:** Right. By the late nineteen nineties, hormone replacement therapy was one of the most commonly prescribed medications in the United States, partly because of the apparent heart protection. Then the Women's Health Initiative changed everything.

**Sarah:** The Women's Health Initiative. What was that?

**Kiffer:** The Women's Health Initiative was a very large randomized controlled trial launched in nineteen ninety one, funded by the United States National Institutes of Health. Tens of thousands of postmenopausal women were randomized either to receive combined estrogen and progestin therapy or to receive a placebo. And in two thousand and two, the trial was halted early, because the results were going in the wrong direction.

**Sarah:** Wait. Halted early. That is rare.

**Kiffer:** It is. The data safety monitoring board determined that continuing the trial would harm the women in the treatment arm. The combined therapy was actually increasing the risk of coronary heart disease, which is the disease where the arteries that supply the heart muscle become narrowed by plaque, with a hazard ratio of one point two nine.

**Sarah:** And just to define hazard ratio for a beginning student. A hazard ratio is the ratio of the rate at which an event happens in one group versus another. A hazard ratio of one means equal rates. Above one means the treatment group has higher rates of the bad outcome. Below one means lower.

**Kiffer:** Right. So one point two nine means roughly twenty nine percent more coronary heart disease in the women on hormone replacement compared to the placebo group. The trial also found increased rates of stroke, of pulmonary embolism, which is a blood clot in the lungs, and of breast cancer.

**Sarah:** So the observational studies said hormone replacement therapy lowered cardiovascular disease risk by thirty to fifty percent. The randomized trial said hormone replacement therapy actually raised coronary heart disease risk by about thirty percent. Those are not just different findings. Those are opposite findings. How does that happen?

**Kiffer:** Confounding by socioeconomic status and health behaviors. The women who chose to take hormone replacement therapy in the observational studies were systematically different from the women who did not. They were wealthier on average. Better educated. Leaner. More physically active. More likely to engage with preventive healthcare. They visited doctors more often. They got mammograms more often. They had access to better food.

**Sarah:** And every single one of those characteristics is independently associated with lower cardiovascular disease risk.

**Kiffer:** Exactly. So when you compare cardiovascular disease rates between the women who chose hormone replacement therapy and the women who did not, you are not really comparing the effect of the therapy. You are comparing wealthier, healthier, more proactive women to less wealthy, less healthy, less proactive women. The therapy looks protective because the women who took it would have had lower rates anyway, even without the medication.

**Sarah:** And the lesson actually walks through a stratification example to make this concrete. Let me see if I can talk through it.

**Kiffer:** Please.

**Sarah:** Imagine a simulated dataset of five thousand postmenopausal women. We build it so that the true effect of hormone replacement therapy on cardiovascular disease is exactly zero. No effect at all. But we let socioeconomic status drive both the choice to use hormone replacement therapy and the cardiovascular disease risk. High socioeconomic status women are much more likely to use the therapy. And high socioeconomic status women have lower cardiovascular disease rates regardless of the therapy.

**Kiffer:** Right. And when you compute the crude, unadjusted odds ratio, you get something like zero point four two. Which looks like the therapy cuts cardiovascular risk by more than half.

**Sarah:** But then when you stratify, when you compute the odds ratio separately within each level of socioeconomic status, you get values right around one in both strata. The truth. No effect.

**Kiffer:** And when you adjust for socioeconomic status in a regression model, the odds ratio also returns to about one. Stratification and adjustment both recover the truth that was hidden by the crude comparison. That is the mechanics of confounding in miniature.
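
*Companion sketch.* The crude versus stratified comparison can be reproduced with expected cell counts instead of random draws. The lesson's simulation parameters are not given in this episode, so every probability below is an assumption, chosen only so that socioeconomic status drives both therapy use and risk while the therapy itself does nothing.

```python
# Expected cell counts for 5,000 women. Every probability here is an assumption.
N = 5000
P_HIGH_SES = 0.5
P_HRT = {"high": 0.8, "low": 0.2}     # SES drives who uses the therapy
P_CVD = {"high": 0.05, "low": 0.20}   # SES drives risk; the therapy has no effect


def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed with/without outcome; c, d = unexposed."""
    return (a * d) / (b * c)


tables = {}
for ses, share in (("high", P_HIGH_SES), ("low", 1 - P_HIGH_SES)):
    n_ses = N * share
    users = n_ses * P_HRT[ses]
    nonusers = n_ses * (1 - P_HRT[ses])
    tables[ses] = (
        users * P_CVD[ses], users * (1 - P_CVD[ses]),
        nonusers * P_CVD[ses], nonusers * (1 - P_CVD[ses]),
    )

# Collapsing over SES gives the crude table.
crude_or = odds_ratio(*[sum(cells) for cells in zip(*tables.values())])
stratum_or = {ses: odds_ratio(*cells) for ses, cells in tables.items()}

# Mantel-Haenszel pooled odds ratio across the two strata.
mh_num = sum(a * d / (a + b + c + d) for a, b, c, d in tables.values())
mh_den = sum(b * c / (a + b + c + d) for a, b, c, d in tables.values())
mh_or = mh_num / mh_den

print(f"crude OR: {crude_or:.2f}")
print({ses: round(value, 2) for ses, value in stratum_or.items()})
print(f"Mantel-Haenszel OR: {mh_or:.2f}")
```

With these assumed inputs the crude odds ratio comes out at about 0.42, while both stratum-specific odds ratios and the pooled one sit at 1.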

**Sarah:** And the broader lesson is that even very large, well conducted observational studies can produce confidently wrong answers when there is unmeasured or poorly addressed confounding. It took the randomized design of the Women's Health Initiative, which balanced both measured and unmeasured confounders by design, to reveal the true direction of effect.

**Kiffer:** Okay. The hormone replacement story is the textbook case of confounding by lifestyle and socioeconomic status. The next form of confounding the lesson covers is even more pervasive in clinical research. Confounding by indication.

**Sarah:** Walk me through that one.

**Kiffer:** Confounding by indication arises in pharmacoepidemiology, which is the study of how drugs affect health outcomes in real populations. The basic problem is this. The reason a treatment gets prescribed, the indication, is often itself a strong risk factor for the outcome you are studying.

**Sarah:** Give me a concrete example.

**Kiffer:** Suppose you do a study comparing patients who are prescribed a strong opioid pain medication to patients who are not, and you measure pain outcomes a month later. The patients prescribed the opioid have worse pain on follow up. The naive reading is that the opioid is making pain worse.

**Sarah:** But of course the patients prescribed strong opioids were the ones whose baseline pain was already much more severe. The doctors did not flip a coin. They prescribed the medication to the sicker patients.

**Kiffer:** Right. So sicker patients get the treatment. Sicker patients have worse outcomes. The treatment looks harmful even if it is genuinely helping. The same logic applies to almost every drug study built on administrative data. Statins look harmful in naive analyses because they go to people with high cholesterol and existing cardiovascular disease. Antibiotics look harmful because they go to people with serious infections.

**Sarah:** And the mirror image of this is confounding by contraindication.

**Kiffer:** Yes. Confounding by contraindication is when patients with a particular characteristic are deliberately not given a treatment because of safety concerns. Pregnant women are often excluded from receiving certain medications. Patients with kidney disease do not get drugs that are processed by the kidneys. So the people who do not receive the medication look systematically different from those who do, in ways that affect the outcome.

**Sarah:** So both confounding by indication and confounding by contraindication are basically about non random treatment allocation in routine clinical practice. The reason someone is on a medication is tangled up with their underlying risk.

**Kiffer:** And the design solutions are things like new user designs, where you study only patients who are starting a medication for the first time, active comparator designs, where you compare two drugs that are prescribed for the same indication, and propensity score methods, which we touched on back in Lesson 3 when we introduced observational designs. None of these are perfect, but together they make the comparison more credible.

**Sarah:** Okay. So far we have classical confounding by lifestyle, and confounding by indication. The lesson then introduces a much harder problem. Time varying confounding.

**Kiffer:** Time varying confounding is what happens when the confounder evolves over time and is itself affected by prior treatment. So treatment at one time point changes the confounder, and the changed confounder then drives treatment at the next time point. There is a feedback loop.

**Sarah:** And the canonical example for this is HIV care. Walk me through it.

**Kiffer:** Sure. HIV stands for human immunodeficiency virus. The virus attacks a particular kind of immune cell called the CD four T cell. So the CD four cell count, which is just the concentration of those cells in the blood, is the main marker of how much immune damage the virus has done. Treatment for HIV is called antiretroviral therapy, a class of drugs that suppress viral replication.

**Sarah:** And the way clinicians use the CD four count is they look at it to decide whether to start or change the antiretroviral therapy. A low count means the immune system is in trouble. So you escalate treatment.

**Kiffer:** Right. So the CD four count at one time predicts treatment at the next time. That makes it a confounder of the treatment outcome relationship. People with low CD four counts are sicker, more likely to be treated, and more likely to die. If you do not adjust for CD four count, you have classical confounding.

**Sarah:** But here is the catch. The antiretroviral therapy itself raises the CD four count. The treatment changes the very confounder you are trying to adjust for. So the CD four count is also a mediator. It is on the causal pathway from treatment to survival.

**Kiffer:** And that is where standard regression breaks. If you put CD four count into a Cox proportional hazards model alongside antiretroviral therapy, you block part of the treatment effect. The treatment helped people partly by raising their CD four counts. If you adjust for CD four count, you remove that part of the benefit. That is over adjustment bias. And because CD four count is itself affected by earlier treatment, conditioning on it can also open a collider path whenever it shares unmeasured causes with survival.

**Sarah:** And if you do not adjust for CD four count, you have classical confounding. So you are stuck. There is no way to do it right with a standard model.

**Kiffer:** Exactly. Hernan, Brumback, and Robins, in a paper from two thousand, published a method that solves this. James Robins is an epidemiologist and biostatistician at Harvard who has spent his career on causal inference. Miguel Hernan is his colleague, also at Harvard. Babette Brumback was their collaborator. Their method is called the marginal structural model, and the technique they use to fit it is called inverse probability of treatment weighting.

**Sarah:** Okay, walk me through what those mean. Marginal structural model first.

**Kiffer:** Marginal structural model. The word marginal here means we are estimating the effect averaged across the population, not the effect conditional on the confounder. Structural means we are modeling the causal structure of the world. The idea is that instead of conditioning on the time varying confounder, we reweight the data to construct a pseudo population in which treatment assignment is independent of the confounder.

**Sarah:** And inverse probability of treatment weighting is the mechanic for building that pseudo population.

**Kiffer:** Right. The procedure has two steps. First, you estimate the probability that each person received the treatment they actually received, given their observed history. That is called the propensity. Second, you weight each observation by one divided by that probability. People who got the treatment they were unlikely to get receive a lot of weight. People who got the treatment everyone like them got receive less weight. The result is a weighted population in which treatment is no longer driven by the confounder.

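**Sarah:** And in that pseudo population, you can fit a simple regression, and the coefficient on treatment is a causal estimate of the population average effect, provided there is no unmeasured confounding and the treatment model behind the weights is correctly specified.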
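
*Companion sketch.* The two weighting steps can be illustrated at a single time point. This is a simplification, not the time varying marginal structural model of Hernan, Brumback, and Robins, where the weights are built from treatment probabilities at every visit. The cell counts are hypothetical: one binary confounder, a therapy with no true effect, and a propensity estimated directly from the counts.

```python
# (confounder level, on therapy) -> (people, events); hypothetical expected counts.
cells = {
    ("high", True): (2000, 100),
    ("high", False): (500, 25),
    ("low", True): (500, 100),
    ("low", False): (2000, 400),
}


def propensity(level):
    """Step 1: P(therapy | confounder level), estimated from the cell counts."""
    treated = cells[(level, True)][0]
    return treated / (treated + cells[(level, False)][0])


def crude_risk(on_therapy):
    """Unweighted outcome risk in one treatment group."""
    rows = [counts for (_, treated), counts in cells.items() if treated == on_therapy]
    return sum(events for _, events in rows) / sum(people for people, _ in rows)


def weighted_risk(on_therapy):
    """Step 2: outcome risk in the inverse-probability-weighted pseudo-population."""
    people = events = 0.0
    for (level, treated), (n, n_events) in cells.items():
        if treated != on_therapy:
            continue
        p_received = propensity(level) if treated else 1 - propensity(level)
        weight = 1 / p_received          # unlikely treatment choices count for more
        people += weight * n
        events += weight * n_events
    return events / people


crude_rr = crude_risk(True) / crude_risk(False)
weighted_rr = weighted_risk(True) / weighted_risk(False)

print(f"crude risk ratio:    {crude_rr:.2f}")
print(f"weighted risk ratio: {weighted_rr:.2f}")
```

In this toy table the crude risk ratio makes the therapy look protective, and the weighted risk ratio returns to 1, because each treatment group in the pseudo-population has the same confounder mix.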

**Kiffer:** What Hernan, Brumback, and Robins showed in two thousand was that standard Cox regression on real HIV data systematically underestimated the survival benefit of antiretroviral therapy. The marginal structural model recovered substantial benefits that the conventional analysis had missed.

**Sarah:** That is a striking result. The standard tool gave the wrong answer. The right answer required a method that explicitly models the feedback structure.

**Kiffer:** And it generalizes. Marginal structural models are now standard practice anywhere you have time varying treatment with feedback. Cancer chemotherapy adjustments. Long term blood pressure management. Adaptive treatment in psychiatric care. Whenever a clinical decision rule ties the next treatment to a confounder that earlier treatment has already changed, you need this kind of method.

**Sarah:** Okay. Let me catch my breath, because we have just covered a lot. Classical confounding, confounding by indication, time varying confounding, marginal structural models. And we still have the most theoretically charged part of Section 1 ahead of us.

**Kiffer:** Right. The last piece of Section 1 asks a deeper question. So far we have treated confounding as a technical problem with technical solutions. Identify the confounder. Adjust for it. Move on. But what if the variable you are calling a confounder is not really a variable at all? What if it is a structural process that produces both the exposure and the outcome over a lifetime?

**Sarah:** And the most important example of this is race.

**Kiffer:** Yes. The standard practice in much of epidemiology is to put race or ethnicity into the regression model as a covariate alongside age and sex. Adjust for it. Report the residual coefficient. But David Williams, Jourdyn Lawrence, and Brigette Davis, in a paper from two thousand and nineteen, and Nancy Krieger, a social epidemiologist at Harvard, in a paper from two thousand and fourteen, argue that this framing fundamentally misrepresents what race indexes.

**Sarah:** So race is not a confounder. Racism is an exposure.

**Kiffer:** That is exactly the slogan. Race itself does not have a biological mechanism that causes hypertension or low birthweight or coronavirus disease two thousand nineteen mortality. What does the causing is racism, experienced over a lifecourse. Residential segregation, where neighborhoods of color get less investment, less green space, more pollution. Job market discrimination. Biased policing. Differential treatment in clinical encounters. Chronic vigilance, the daily cognitive load of anticipating discrimination. All of those become biologically embodied over decades.

**Sarah:** So when researchers control for race in their model, what are they actually doing?

**Kiffer:** They are obscuring the structural process they should be measuring. And it gets worse. If you also adjust for income, education, neighborhood, and insurance, those variables are themselves downstream consequences of structural racism. Adjusting for them is what Tyler VanderWeele and Whitney Robinson, in two thousand and fourteen, and Enrique Schisterman, Stephen Cole, and Robert Platt, in two thousand and nine, called over adjustment bias.

**Sarah:** Walk me through over adjustment bias, because the term is doing real work here.

**Kiffer:** Sure. If income is a mediator on the path from racism to a health outcome, then adjusting for income partials out part of the causal effect of racism. The variable racism, or its proxy race, is in the model. The downstream mediators are also in the model. The estimate you get for race after adjusting for income, education, and insurance is the effect of race that does not flow through any of those things. Which, if those things carry most of the effect, may be close to nothing. The variable is in the model. The explanation has been removed.

**Sarah:** And then the lesson introduces intersectionality, which adds another layer to this.

**Kiffer:** Right. Intersectionality is a concept introduced by Kimberle Crenshaw, who is a legal scholar at the University of California Los Angeles School of Law, in a paper from nineteen eighty nine. Crenshaw was writing about anti discrimination law. She observed that a Black woman's experience of discrimination is not the sum of being Black plus being a woman. The intersection produces qualitatively distinct exposures that neither single category captures.

**Sarah:** And Greta Bauer, in a paper from two thousand and fourteen, showed why this is theoretically devastating for standard regression.

**Kiffer:** Yes. Standard regression assumes additivity. The coefficient on race is interpreted as the effect of race holding gender and class constant. But holding gender constant while estimating a racial effect imagines a population in which racial categorization is detached from gendered experience. That counterfactual does not exist in the world.

**Sarah:** So if you want to study inequality, you cannot just put the variables side by side and let regression do its thing. You need explicit interaction terms, stratified analyses, or specialized methods.

**Kiffer:** And one of those specialized methods is from Clare Evans and colleagues, in a paper from two thousand and eighteen. They developed an approach called multilevel analysis of individual heterogeneity and discriminatory accuracy. The acronym is MAIHDA, but I will just say the full name, the multilevel analysis of individual heterogeneity and discriminatory accuracy method.

**Sarah:** And what does that method actually do?

**Kiffer:** It treats every intersectional combination, for example, Black, low income, women, urban, as its own stratum. It estimates the rate of the outcome within each stratum. And it asks how much of the total variance in the outcome is explained by intersection rather than by single axis variables. It captures the fact that the experience at the intersection is not just the sum of the parts.

**Sarah:** Let me work through the lesson's example, because I think it brings everything together. Maternal mortality.

**Kiffer:** Right. In the United States, Black women die from pregnancy related causes at roughly three times the rate of White women. That is from a paper by Emily Petersen and colleagues at the Centers for Disease Control in two thousand and nineteen.

**Sarah:** And a conventional analysis would fit a model of maternal mortality with race, age, education, income, insurance, and parity as covariates. And report a residual race coefficient that is smaller than the crude difference. The headline becomes something like, most of the disparity is explained by socioeconomic factors.

**Kiffer:** And through the lens of fundamental cause theory, which Bruce Link and Jo Phelan introduced in nineteen ninety five and elaborated with Parisa Tehranifar in two thousand and ten, that is the wrong reading. Education, income, and insurance are mechanisms through which structural racism produces the disparity. Adjusting for them does not explain the disparity away. It just reroutes it.

**Sarah:** And fundamental cause theory predicts something interesting. Even when one mechanism is closed off, for example smoking is reduced, the disparity reappears through whatever the next available mechanism is. Obesity. Opioid overdoses.

**Kiffer:** Because the cause is upstream of any specific mechanism. The cause is the structural position. So mechanism by mechanism adjustment cannot make it go away.

**Sarah:** And the practical takeaway for someone reading a paper. When you see an effect estimate adjusted for race, income, and education, ask what theory of how those variables relate to the exposure and outcome the authors are operating under. Are they confounders to be partialled out? Or are they mediators that carry the causal effect of structural conditions?

**Kiffer:** And a defensible study makes those theoretical commitments explicit. It is honest about the difference between the effect of an exposure holding other variables constant, which is a statistical operation, and what would happen if we changed the exposure, which is a causal claim about the world.

**Sarah:** Okay. That closes Section 1. Let me try to summarize. Confounding distorts associations when a third variable is associated with both exposure and outcome and is not on the causal pathway. The hormone replacement story is the classic case. Confounding by indication is pervasive in pharmacoepidemiology. Time varying confounding requires marginal structural models. And treating structural variables like race as ordinary confounders embeds theoretical claims that may obscure exactly the processes we should be studying.

**Kiffer:** Now Section 2. Statistical inference and model issues. Even after every bias has been addressed, the analysis itself can produce wrong conclusions. There are seven issues the lesson walks through, and I want to take them in order.

**Sarah:** Start with model misspecification.

**Kiffer:** A statistical model is misspecified when the assumed functional form does not match the true relationship between variables. The most common version is assuming linearity when the truth is nonlinear.

**Sarah:** And the lesson uses alcohol and mortality as the example.

**Kiffer:** Right. The relationship between alcohol consumption and all cause mortality is often described as J shaped. Light to moderate drinkers appear to have lower mortality than both abstainers and heavy drinkers. If you fit a linear model to that curve, you can get a uniformly protective slope, depending on where most of your data sit, even though the truth is harm at both extremes.

**Sarah:** And it is worth noting that more recent analyses, like Tim Stockwell and colleagues in two thousand and sixteen, have shown that even the apparent protective effect of moderate drinking largely disappears once you correct for what is called sick quitter bias, where former drinkers who stopped because they were ill get classified as abstainers.

**Kiffer:** So model specification is not just a statistical nicety. It can reverse a study's conclusions. The fix is to use polynomial terms, splines, or categorical exposure variables to let the data show the shape rather than imposing one.
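
*Companion sketch.* How a straight line misreads a J shaped curve depends on where the data sit, and that can be shown in a few lines. The `risk` function is a hypothetical curve with its minimum at 1.5 drinks per day, not an estimate from any study, and the two samples are noise free grids chosen for illustration.

```python
import numpy as np


def risk(drinks):
    """Hypothetical J-shaped relative mortality, lowest at 1.5 drinks per day."""
    return 1.0 + 0.1 * (drinks - 1.5) ** 2


light_sample = np.linspace(0, 2, 50)   # cohort made up of light drinkers only
full_sample = np.linspace(0, 6, 50)    # cohort that also reaches heavy drinkers

# Degree-1 fits: np.polyfit returns the slope first.
slope_light = np.polyfit(light_sample, risk(light_sample), 1)[0]
slope_full = np.polyfit(full_sample, risk(full_sample), 1)[0]

# A quadratic term lets the data show where the curve turns.
a, b, _ = np.polyfit(full_sample, risk(full_sample), 2)
turning_point = -b / (2 * a)

print(f"linear slope, light drinkers only: {slope_light:+.2f}")
print(f"linear slope, full range:          {slope_full:+.2f}")
print(f"quadratic turning point:           {turning_point:.2f} drinks/day")
```

The linear slope is negative in the light drinking sample and positive over the full range, even though the underlying curve is identical. Only the quadratic fit locates the turning point.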

**Sarah:** Second issue. Multicollinearity.

**Kiffer:** Multicollinearity is when two or more predictor variables in a regression model are highly correlated with each other. It does not bias the coefficient estimates. But it dramatically inflates their standard errors. So the coefficients become unstable and hard to interpret.

**Sarah:** And the example the lesson uses is air pollution.

**Kiffer:** Right. Air quality studies often try to estimate the health effects of multiple pollutants simultaneously. Particulate matter two point five, which is fine particles with diameter less than two and a half micrometers. Ozone. Nitrogen dioxide. Sulfur dioxide. All of those tend to come from the same sources. Traffic. Industrial combustion.

**Sarah:** So they are highly correlated with each other. And when you put them all in the same regression, the model has trouble assigning credit. The coefficients can flip sign between samples. They lose statistical significance. Even though we know all of them affect health.
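
*Companion sketch.* One common way to quantify how much correlated predictors inflate uncertainty is the variance inflation factor, which for standardized predictors is the diagonal of the inverse correlation matrix. The pollutant correlations below are invented for illustration, not taken from any air quality dataset.

```python
import numpy as np


def variance_inflation_factors(corr):
    """VIF for each predictor: the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.asarray(corr, dtype=float)))


pollutants = ["PM2.5", "NO2", "SO2"]
# Hypothetical pairwise correlations among three co-emitted pollutants.
corr = [
    [1.00, 0.90, 0.80],
    [0.90, 1.00, 0.85],
    [0.80, 0.85, 1.00],
]

vifs = variance_inflation_factors(corr)
for name, vif in zip(pollutants, vifs):
    # The standard error grows by the square root of the VIF.
    print(f"{name}: VIF = {vif:.1f}, standard error x {np.sqrt(vif):.1f}")
```

A factor of 1 means no inflation. With these assumed correlations each pollutant's coefficient variance is several times what it would be if the pollutants were uncorrelated.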

**Kiffer:** There are a few solutions. You can do principal component analysis, which combines correlated variables into a smaller number of composite indices. You can do variable selection, choosing one pollutant at a time. Or you can analyze each pollutant separately with mutual adjustment in sensitivity analyses. None is perfect, but each is more honest than just throwing everything in and reporting unstable coefficients.

**Sarah:** Third issue. Type one and type two errors, and statistical power.

**Kiffer:** Quick definitions. A type one error is concluding an effect exists when it actually does not. A false positive. A type two error is failing to detect an effect that actually does exist. A false negative. Statistical power is one minus the type two error rate. The probability that the study correctly detects a real effect.

**Sarah:** And these errors trade off against each other.

**Kiffer:** Right. For a fixed sample size, if you make it harder to declare significance, you reduce false positives but increase false negatives. The standard convention in epidemiology is to set the type one error rate, called alpha, at five percent. And to aim for power of at least eighty percent.

**Sarah:** And the lesson highlights a particular problem in rare disease research. The winner's curse.

**Kiffer:** Yes. Suppose you are studying a disease that affects one in a hundred thousand people. Even with a large sample, you may have only fifty cases. With that sample size, your power to detect even a moderate effect might be only twenty to thirty percent. Which means seventy to eighty percent of true effects will be missed.

**Sarah:** And here is the paradox. The studies that do reach statistical significance in low power settings tend to overestimate the true effect.

**Kiffer:** Because to cross the significance threshold with a small sample, the random sampling fluctuation has to push the estimate further away from the null than the truth. So conditional on being significant, the effect estimate is biased upward. That is the winner's curse.
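
*Companion sketch.* The size of the winner's curse can be worked out without simulation if the estimate is treated as normally distributed around the true effect. The true effect and standard error below are assumptions picked to give power in the twenty to thirty percent range mentioned above. They do not come from any rare disease study.

```python
from statistics import NormalDist

Z = NormalDist()       # standard normal distribution
TRUE_EFFECT = 1.0      # hypothetical true effect, e.g. a log odds ratio
STD_ERROR = 0.8        # hypothetical standard error from a small study
CRITICAL = Z.inv_cdf(0.975) * STD_ERROR   # |estimate| needed for two-sided p < 0.05

# Standardized distance from the true effect to each significance cutoff.
upper = (CRITICAL - TRUE_EFFECT) / STD_ERROR
lower = (-CRITICAL - TRUE_EFFECT) / STD_ERROR

# Power: probability that the estimate falls in either significant tail.
power = (1 - Z.cdf(upper)) + Z.cdf(lower)

# Partial expectations of the estimate over the two significant tails.
upper_tail = TRUE_EFFECT * (1 - Z.cdf(upper)) + STD_ERROR * Z.pdf(upper)
lower_tail = TRUE_EFFECT * Z.cdf(lower) - STD_ERROR * Z.pdf(lower)
mean_if_significant = (upper_tail + lower_tail) / power

exaggeration = mean_if_significant / TRUE_EFFECT

print(f"power: {power:.2f}")
print(f"average significant estimate / true effect: {exaggeration:.2f}")
```

Under these assumptions, the estimates that clear the significance bar average roughly twice the true effect, which is the overestimation described above.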

**Sarah:** And it has implications for how we interpret published findings in rare diseases. The published effect sizes, on average, are larger than the true effect sizes. So replication studies will routinely find smaller effects, and that is not a failure. That is regression to the truth.

**Kiffer:** Fourth issue. Simpson's paradox. Edward Simpson described it in a paper from nineteen fifty one, although the phenomenon was noted earlier.

**Sarah:** And Simpson's paradox is when a trend that appears in aggregated data reverses when the data are stratified by a confounding variable. Walk me through the standard example.

**Kiffer:** Imagine a new treatment, call it Treatment X, tested at two hospitals. Hospital A treats mostly mild cases. Hospital B treats mostly severe cases. Across all patients, Treatment X has a sixty percent success rate, while the standard treatment has seventy percent. So Treatment X looks worse overall.

**Sarah:** But then you stratify by severity.

**Kiffer:** Among mild cases, Treatment X succeeds in fifty out of sixty patients, so eighty three percent. The standard treatment succeeds in fifty five out of seventy, so seventy nine percent. Treatment X is better among mild cases. Among severe cases, Treatment X succeeds in ten out of forty, so twenty five percent. The standard treatment succeeds in fifteen out of thirty, so fifty percent.

**Sarah:** Wait, so within each stratum, Treatment X is sometimes better, sometimes worse, and the aggregate flips the answer relative to one of the strata. Walk me through what is happening.

**Kiffer:** What is happening is that Treatment X was given to severe cases more often than the standard treatment was, and severe cases have much worse outcomes regardless of treatment. Forty of the hundred Treatment X patients were severe, against thirty of the hundred on the standard treatment. The mix differs by treatment. So when you aggregate, the overall comparison mixes the difference in severity in with the treatment effect.

**Sarah:** And the only way to see the truth is to stratify.

**Kiffer:** Right. Within strata, you can see the actual treatment effect. Aggregated, you see the confounding by severity. Simpson's paradox is really just a vivid version of confounding showing up at the population level.
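
*Companion sketch.* The arithmetic in the two hospital example can be checked directly from the counts quoted above. The direct standardization at the end is an addition for illustration: it applies each arm's stratum rates to the combined case mix, which is one standard way to compare arms on an equal severity footing.

```python
# (successes, patients) as quoted in the example.
counts = {
    "Treatment X": {"mild": (50, 60), "severe": (10, 40)},
    "Standard": {"mild": (55, 70), "severe": (15, 30)},
}


def stratum_rate(arm, severity):
    successes, patients = counts[arm][severity]
    return successes / patients


def pooled_rate(arm):
    successes = sum(s for s, _ in counts[arm].values())
    patients = sum(n for _, n in counts[arm].values())
    return successes / patients


def severe_share(arm):
    return counts[arm]["severe"][1] / sum(n for _, n in counts[arm].values())


# Combined case mix across both arms, used as the standard population.
combined = {sev: sum(counts[arm][sev][1] for arm in counts) for sev in ("mild", "severe")}


def standardized_rate(arm):
    """Success rate the arm would show if it had the combined severity mix."""
    total = sum(combined.values())
    return sum(stratum_rate(arm, sev) * combined[sev] for sev in combined) / total


for arm in counts:
    print(
        f"{arm}: mild {stratum_rate(arm, 'mild'):.0%}, "
        f"severe {stratum_rate(arm, 'severe'):.0%}, "
        f"pooled {pooled_rate(arm):.0%}, "
        f"share severe {severe_share(arm):.0%}, "
        f"standardized {standardized_rate(arm):.1%}"
    )
```

With these counts the pooled gap of ten percentage points narrows once both arms are given the same severity mix, but it does not vanish, because Treatment X does worse among severe cases.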

**Sarah:** Fifth issue. Ecological and atomistic fallacies. We covered these in Lesson 6 with the group level studies, but the lesson recalls them here.

**Kiffer:** Quickly. The ecological fallacy is when you take a relationship that holds at the group level and assume it holds at the individual level. States with higher income inequality have higher mortality. That does not mean that any individual living in an unequal state is at higher risk just because of their personal exposure to inequality. The pattern at the state level may be driven by contextual factors like underfunded public services.

**Sarah:** And the atomistic fallacy is the mirror image. You take an individual level relationship and assume it scales to the group level. Income predicts health for individuals. Therefore raising every individual's income will proportionally improve population health.

**Kiffer:** But population health depends on the distribution of income, on social cohesion, on infrastructure, things that are not captured by individual income. So scaling up the individual relationship ignores the contextual factors that matter at the group level.

**Sarah:** And related to the ecological fallacy is the modifiable areal unit problem.

**Kiffer:** Right. The modifiable areal unit problem is what happens specifically in spatial analysis when you aggregate individual data into geographic units. Census tracts. Postal codes. Counties. Health regions. The choice of unit size and boundary definitions can change your statistical results.

**Sarah:** Walk me through an example.

**Kiffer:** Suppose you are studying cancer incidence near an industrial facility. If you aggregate at the postal code level, you may find a significant cluster. If you aggregate at the health region level, the cluster disappears because it is averaged out across a larger area. If you aggregate at an even smaller level, you might get unstable estimates because each unit has very few cases.

**Sarah:** And none of these results is simply wrong. Each one is partly an artifact of the chosen boundaries. Which means spatial epidemiological conclusions depend partly on arbitrary geographic decisions.

**Kiffer:** And the way to handle it is to do the analysis at multiple spatial scales and report whether the conclusions are sensitive to the choice. If the same finding appears at every scale, it is robust. If it only appears at one specific scale, it may be a boundary artifact.
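
A toy version of that multi-scale check, sketched in Python. Everything here is a made-up assumption for illustration: a one-dimensional strip of 16 small areas with 1,000 residents each, a background of 10 cases per area, and a cluster of 30 cases in two neighbouring areas. The function name `block_rate_ratios` is hypothetical. Real spatial analyses use two-dimensional units and formal cluster statistics.

```python
def block_rate_ratios(cases, pops, size, offset=0):
    """Aggregate consecutive small areas into blocks and return each block's
    rate divided by the overall rate. A nonzero offset shifts every boundary
    and leaves a shorter first block."""
    overall = sum(cases) / sum(pops)
    starts = list(range(offset, len(cases), size))
    if offset:
        starts = [0] + starts
    ratios = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(cases)
        block_rate = sum(cases[start:end]) / sum(pops[start:end])
        ratios.append(block_rate / overall)
    return ratios

# Hypothetical strip of 16 small areas; areas 3 and 4 sit next to the facility.
pops = [1000] * 16
cases = [10] * 16
cases[3] = cases[4] = 30

# Same data, different unit sizes and boundary placements.
for label, size, offset in [
    ("small areas", 1, 0),
    ("blocks of 4", 4, 0),
    ("blocks of 4, shifted", 4, 2),
    ("one region", 16, 0),
]:
    highest = max(block_rate_ratios(cases, pops, size, offset))
    print(f"{label:<22} highest rate ratio = {highest:.2f}")
```

With these invented counts, the highest rate ratio is 2.40 at the smallest scale, 1.20 when the block boundary happens to split the cluster, 1.60 when the boundary is shifted so the cluster falls inside one block, and 1.00 when everything is pooled into one region. The underlying cases never change. Only the boundaries do.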

**Sarah:** Last issue in Section 2. Missing data.

**Kiffer:** Missing data are everywhere in epidemiology. Participants skip questionnaire items. Lab samples get lost. People drop out of follow up. The validity of any analysis depends critically on the mechanism that generated the missingness. Donald Rubin, who we met back in Lesson 3 with the observational designs, defined three categories.

**Sarah:** First. Missing completely at random.

**Kiffer:** Missing completely at random means the probability of being missing is unrelated to both observed and unobserved variables. For example, a lab sample gets accidentally dropped on the floor. The breakage has nothing to do with the participant's characteristics. Under missing completely at random, a complete case analysis, where you just analyze the people with complete data, is unbiased. You lose statistical power but you do not introduce bias.

**Sarah:** Second. Missing at random.

**Kiffer:** Missing at random is a more permissive condition. Missingness depends on observed variables but not on the missing values themselves. For example, younger participants are more likely to skip a depression questionnaire. But among people of the same age, the probability of skipping does not depend on how depressed they are.

**Sarah:** And under missing at random, complete case analysis can be biased, because younger people are systematically underrepresented among completers. But methods like multiple imputation, where you fill in plausible values based on the observed relationships, and maximum likelihood methods can produce valid estimates.

**Kiffer:** Third. Missing not at random.

**Sarah:** Missing not at random means missingness depends on the unobserved values themselves. People with severe depression are less likely to complete the depression questionnaire because of their depression. The thing that determines whether you respond is the very thing you are trying to measure.

**Kiffer:** And here, no standard analytic method can fully correct the bias. You cannot impute based on observed values because the observed values do not fully predict the missingness. The best you can do is sensitivity analyses, where you make different assumptions about the missingness mechanism and see how much your conclusions change.

**Sarah:** And the practical takeaway is that complete case analysis is only reliably unbiased under missing completely at random. As soon as missingness depends on observed or unobserved variables, you need more sophisticated methods, and you need to be transparent about the assumptions you are making.
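
The three mechanisms can be compared in a small simulation. This is a rough Python sketch under loudly stated assumptions: the cohort is simulated, younger people are assumed to score higher on the depression scale (12 versus 8 on average), and the missingness probabilities are invented. The correction shown is simple age-stratified reweighting, a much cruder relative of the multiple imputation and maximum likelihood methods mentioned above, chosen only because it fits in a few lines.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is reproducible
N = 100_000

# Hypothetical cohort of (is_young, depression_score) pairs.
# Assumption for illustration: younger people score higher on average.
cohort = []
for _ in range(N):
    young = rng.random() < 0.5
    cohort.append((young, rng.gauss(12.0 if young else 8.0, 3.0)))

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def p_missing(mechanism, young, score):
    if mechanism == "MCAR":   # unrelated to anything
        return 0.3
    if mechanism == "MAR":    # depends only on observed age
        return 0.5 if young else 0.1
    if mechanism == "MNAR":   # depends on the unobserved score itself
        return 0.5 if score > 10.0 else 0.1
    raise ValueError(mechanism)

def complete_cases(mechanism):
    return [(y, s) for y, s in cohort if rng.random() >= p_missing(mechanism, y, s)]

def complete_case_mean(observed):
    return avg(s for _, s in observed)

def age_reweighted_mean(observed):
    # Age is fully observed, so weight each age group's observed mean
    # by that group's share of the whole cohort.
    share_young = avg(1.0 if y else 0.0 for y, _ in cohort)
    mean_young = avg(s for y, s in observed if y)
    mean_old = avg(s for y, s in observed if not y)
    return share_young * mean_young + (1 - share_young) * mean_old

true_mean = avg(s for _, s in cohort)
results = {}
print(f"true mean score: {true_mean:.2f}")
for mechanism in ("MCAR", "MAR", "MNAR"):
    observed = complete_cases(mechanism)
    results[mechanism] = (complete_case_mean(observed), age_reweighted_mean(observed))
    cc, rw = results[mechanism]
    print(f"{mechanism}: complete case = {cc:.2f}, age reweighted = {rw:.2f}")
```

Under these assumptions the complete case mean stays close to the truth only in the missing completely at random scenario. In the missing at random scenario it drifts low, and reweighting by age pulls it back. In the missing not at random scenario both estimates stay too low, because nothing observed captures why the scores are missing.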

**Kiffer:** Okay. Let me try to bring everything together. The takeaways for this lesson, organized into a few clusters.

**Sarah:** Please.

**Kiffer:** First, the formal definition of confounding. A confounder is associated with the exposure, is an independent risk factor for the outcome, and is not on the causal pathway. Mediators are different from confounders. If you want the total effect, adjusting for a mediator introduces bias rather than removing it.

**Sarah:** Second, the hormone replacement therapy and Women's Health Initiative case is the canonical demonstration that observational studies can produce confidently wrong answers when confounding by lifestyle and socioeconomic status is unaddressed. It took randomization to reveal the true direction of effect.

**Kiffer:** Third, confounding by indication is pervasive in pharmacoepidemiology. Sicker patients receive more aggressive treatments. Naive comparisons of treated and untreated patients systematically overestimate harm or underestimate benefit. The mirror image, confounding by contraindication, is its complement.

**Sarah:** Fourth, time varying confounding requires specialized methods. When a confounder is also a mediator and is affected by prior treatment, standard regression introduces collider bias if you adjust and confounding bias if you do not. Marginal structural models, fitted with inverse probability of treatment weighting, are the standard solution. The Hernán, Brumback, and Robins paper from two thousand established this approach using HIV data.

**Kiffer:** Fifth, structural variables like race, gender, and socioeconomic status are not ordinary confounders. They index lifelong exposure to racism, patriarchy, and economic deprivation. Treating them as variables to be partialled out, especially after adjusting for downstream mediators like income and education, can constitute over adjustment bias. Williams, Lawrence, and Davis in two thousand and nineteen, Krieger in two thousand and fourteen, and VanderWeele and Robinson in two thousand and fourteen are key references.

**Sarah:** Sixth, intersectionality, from Crenshaw in nineteen eighty nine, points out that the experience of being at the intersection of multiple categories is not the additive sum of the parts. Bauer in two thousand and fourteen showed that standard regression's additivity assumption is theoretically inadequate for studying inequality. The multilevel analysis of individual heterogeneity and discriminatory accuracy method, from Evans and colleagues in two thousand and eighteen, is one quantitative response.

**Kiffer:** Seventh, the maternal mortality example illustrates this concretely. Black women in the United States die from pregnancy related causes at roughly three times the rate of White women. Adjusting for income, education, and insurance does not explain this disparity away. It reroutes it. Phelan, Link, and Tehranifar's fundamental cause theory predicts exactly this pattern.

**Sarah:** Eighth, on the statistical inference side. Model misspecification, like fitting a linear model to a J shaped relationship, can reverse a study's conclusions. The alcohol mortality literature is the canonical example.

**Kiffer:** Ninth, multicollinearity inflates standard errors and produces unstable coefficients. The air pollutant studies are the textbook case. Solutions include principal component analysis, variable selection, and mutual adjustment in sensitivity analyses.

**Sarah:** Tenth, low statistical power produces both type two errors and the winner's curse. Significant findings from underpowered studies tend to overestimate effect sizes. Replication will tend to produce smaller effects. That is regression to the truth, not a failure of the original or the replication.

**Kiffer:** Eleventh, Simpson's paradox is confounding visible at the level of the aggregate. Stratification reveals it. Aggregating without considering subgroup structure can hide or even reverse the truth.

**Sarah:** Twelfth, the ecological fallacy and the atomistic fallacy are about moving between levels. Group level patterns do not necessarily hold at the individual level, and individual level patterns do not necessarily scale up. The modifiable areal unit problem is a special case in spatial analysis.

**Kiffer:** Thirteenth, missing data mechanisms determine which methods are valid. Missing completely at random allows complete case analysis without bias. Missing at random requires multiple imputation or maximum likelihood methods. Missing not at random cannot be fully corrected. Sensitivity analyses are essential.

**Sarah:** And the meta takeaway. Critically evaluating an epidemiological study is not just about checking the results. It is about checking the assumptions, model choices, and analytic decisions that produced the results. Lesson 12 will pull all of this together into an integrated appraisal exercise.

**Kiffer:** And one practical note for someone reading a paper after this lesson. When you see an effect estimate adjusted for a list of variables, ask three things. What theory of causation are the authors operating under? Are the variables they adjusted for confounders or mediators? And what would happen to the estimate if they made different choices?

**Sarah:** And often, the most informative thing in a paper is not the headline estimate. It is the sensitivity analyses. How much do conclusions change when you fit a different model? When you handle missing data differently? When you stratify rather than adjust? A study that reports robust conclusions across many specifications is worth more than a study that reports one number with confidence.

**Kiffer:** And a defensible study is honest about the difference between a statistical operation and a causal claim about the world. Adjusting for variables in a regression is a mathematical procedure. Saying that a treatment causes an outcome is a much stronger claim that depends on what those variables represent in the world, what theory of causation you have committed to, and whether your design has actually identified the effect you say it has.

**Sarah:** Next up is Lesson 12, the Integrated Appraisal of Epidemiological Research. The capstone of this material. It pulls together the historical and ethical framing from Lesson 1, the systematic review work from Lesson 2, the four observational designs from Lessons 3 to 6, the measurement and causal specification work from Lesson 7, and the full bias inventory of Lessons 8 to 11. By the end, you can read any epidemiological paper systematically rather than impressionistically.

**Kiffer:** Take care, everyone.