# Lesson 1 — Introduction & Causal Concepts (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5480 words • ~33 min audio*

---

**Sarah:** Welcome to Office Hours, the companion podcast. I'm Sarah.

**Kiffer:** And I'm Kiffer. New chapter, but a familiar voice on the other side of the microphone. If you came through this material with us, welcome back. The very short version is that the earlier material was about reading epidemiological evidence critically. Recognizing study designs. Spotting biases.

**Sarah:** And the next part picks up the design and conduct side. Instead of asking, did this team do a good job? we start asking, how would I do this myself? Design valid studies. Calculate measures of disease frequency and association. Work through screening tests. Conduct surveillance and outbreak investigations.

**Kiffer:** The bias inventory you built earlier becomes the design checklist you carry through this course. Today is Lesson 1, Introduction and Causal Concepts. The conceptual foundation for the entire course.

**Sarah:** Four sections. A brief history of causal thinking. The logic of scientific inference. Models of causation. And the counterfactual framework that underpins modern causal inference.

**Kiffer:** I want to flag that this lesson is denser than most of this material. By design. We are loading the conceptual scaffolding now so that everything that follows has a place to hang. If something doesn't click immediately, that's fine. We come back to all of it.

**Sarah:** Okay. Let's start with Section 1, where the lesson asks the foundational question, what is epidemiology?

**Kiffer:** Epidemiology is the study of the patterns, causes, and effects of health and disease in populations. The key word is populations. We're not diagnosing one patient. We're looking across groups, asking what produces the differences in who gets sick and who doesn't.

**Sarah:** And the work breaks down into identifying exposures, things like demographic factors, infectious agents, nutritional factors, toxins, lifestyles, and evaluating their associations with health outcomes.

**Kiffer:** And the lesson emphasizes that epidemiology is a field-based discipline. The associations we find are part of a complex web of relationships involving organisms and all aspects of their environment. We don't get to study causation in a clean petri dish.

**Sarah:** Then the lesson walks through eight historical milestones. And before we click through them, you flagged a recurring tension.

**Kiffer:** Yeah. Two perspectives running through the entire history. One oriented toward biology and the mechanism of causation. Microbe, gene, biomarker. The other oriented toward populations and their interactions with the environment. Climate, sanitation, social conditions. Both are essential. The pendulum swings, but neither side ever wins outright.

**Sarah:** Milestone one. Hippocrates, around 400 BC.

**Kiffer:** Hippocrates was an ancient Greek physician on the island of Kos. His treatise Airs, Waters, and Places shifted disease explanations from supernatural causes to environmental ones. Climate, water quality, geography, season. The first naturalistic theory of disease. So already in 400 BC the population-environment perspective is on the table.

**Sarah:** Milestone two. Miasma theory, dominant from roughly 1750 to 1885.

**Kiffer:** The idea that disease spread through bad air, particularly from rotting organic matter and crowded slums. Wrong about the mechanism, but extraordinarily productive. Miasma theory motivated huge sanitation reforms across Europe. Drainage of swamps. Clean water infrastructure. Sewer systems. Those interventions genuinely reduced disease, even though the proposed causal story was incorrect.

**Sarah:** Milestone three. John Snow, mid-1800s.

**Kiffer:** John Snow was a London physician working primarily as an anesthesiologist. In 1854 he traced a cholera outbreak in the Soho neighborhood of London to a contaminated water pump on Broad Street. He mapped the cases. He compared water sources across households. He persuaded the local authorities to remove the pump handle, and the outbreak abated.

**Sarah:** And this is decades before the bacterium that causes cholera was identified.

**Kiffer:** Roughly thirty years before Robert Koch isolated Vibrio cholerae in 1883. The big takeaway is that disease can be prevented even without knowing the exact causal organism. Identify the source. Remove it.

**Sarah:** Milestone four. Germ theory, late 1800s.

**Kiffer:** The work of Louis Pasteur in France and Robert Koch in Germany. They identified specific microorganisms as the causes of specific infectious diseases. Anthrax. Tuberculosis. Cholera. The high-water mark of the biological-mechanism perspective. One disease, one bug. The pendulum swings hard toward microbiology.

**Sarah:** Milestone five. Goldberger and pellagra, early 1900s.

**Kiffer:** Joseph Goldberger was a physician with the United States Public Health Service. Between 1914 and 1929 he worked on pellagra, a disease devastating poor populations in the American South. Skin lesions, dementia, eventually death. The conventional wisdom at the time was that pellagra was infectious.

**Sarah:** Because germ theory was the dominant frame.

**Kiffer:** Everyone reached for the microbe. Goldberger's careful studies showed pellagra was caused by a nutritional deficiency. We now know it's niacin deficiency. Diets in poor southern households were heavy in corn and lacking in meat, dairy, and fresh vegetables. The pendulum swings back toward population-environment thinking. Don't assume the mechanism before checking.

**Sarah:** Milestone six. Framingham and beyond, mid-1900s.

**Kiffer:** The Framingham Heart Study, started in 1948 in Framingham, Massachusetts, a small town outside Boston, and still ongoing today. This is the study that established the multifactorial nature of cardiovascular disease. High blood pressure, cholesterol, smoking, obesity, diabetes. None of them necessary on its own. None sufficient on its own. Multiple causes acting in combination.

**Sarah:** Milestone seven. Agent-host-environment, 1970s.

**Kiffer:** A triad model. The agent, like a microorganism. The host, the person at risk. The environment, the context that brings them together. Useful especially for infectious disease ecology.

**Sarah:** Milestone eight. One Health, 21st century.

**Kiffer:** One Health explicitly integrates human, animal, and environmental health. Emerging diseases like Severe Acute Respiratory Syndrome, Ebola, and COVID-19 cross species boundaries. Antimicrobial resistance crosses agricultural and clinical settings. Climate change affects all three. You cannot understand modern disease patterns without thinking across all three domains.

**Sarah:** So the takeaway from Section 1 is that causal thinking has gotten more multifactorial and more interconnected. From single causes to webs.

**Kiffer:** And modern epidemiology accepts two things. There are multiple causes for almost every outcome. And a single cause can have multiple effects.

**Sarah:** Okay. Moving on to Section 2, on scientific inference and key research components. This is the logic part of the lesson.

**Kiffer:** Right. How do epidemiologists actually reason from observed data to claims about cause and effect? Two classical forms of reasoning, plus a third that has become unavoidable in modern practice. Inductive, deductive, Bayesian.

**Sarah:** Walk me through the first one.

**Kiffer:** Inductive reasoning is what most people think of when they think of science. You observe specific instances and draw a broader conclusion. You watch a thousand smokers and notice they have more lung cancer than non-smokers and you generalize, smoking is associated with lung cancer.

**Sarah:** And the historical anchor for inductive reasoning is Francis Bacon, in 1620.

**Kiffer:** Bacon, the English philosopher, presented inductive reasoning systematically in his Novum Organum in 1620. The classic textbook example is Edward Jenner. Jenner was an English country physician who in the late 1790s noticed that milkmaids who had contracted cowpox seemed not to get smallpox. Repeated observation led to a generalization, which led to the development of the smallpox vaccine.

**Sarah:** And then John Stuart Mill, in 1843.

**Kiffer:** John Stuart Mill, the English philosopher, formalized rules for inductive inference in his System of Logic. His canons of induction helped shape our concepts of necessary and sufficient causes.

**Sarah:** But induction has a famous problem.

**Kiffer:** It does. The Scottish philosopher David Hume, writing in the 1700s, pointed out that there is no logical force to inductive reasoning. We cannot perceive a causal connection. We can only perceive a series of events. Repeated observation is consistent with causation, but it does not prove causation.

**Sarah:** The classic example is, the sun has risen every day, therefore the sun will rise tomorrow. There's no logical guarantee.

**Kiffer:** Right. The fact that something has happened many times does not, as a matter of logic, force it to happen the next time. Induction never gives you certainty. Only confidence that grows with repetition.

**Sarah:** Deductive reasoning is the response to that limit.

**Kiffer:** Deductive reasoning works in the opposite direction. You start with a general hypothesis and test specific predictions against observations. The historical anchor is Karl Popper, the Austrian-British philosopher who articulated refutationism in the mid-twentieth century.

**Sarah:** Refutationism. Can you define that for us?

**Kiffer:** Popper argued that scientists should not collect data to prove a hypothesis. They should attempt to disprove it. Only by disproving hypotheses can we make scientific progress. A hypothesis that survives serious refutation gains credibility, but is never proven.

**Sarah:** And that's why statistical testing is structured the way it is. You set up a null hypothesis of no association, and try to refute it. Failure to refute the null doesn't prove the null. It just means you couldn't reject it.

**Kiffer:** Exactly. The logic of the p-value is Popperian. The benefit Popper offers in practice is that it forces narrow, specific, testable hypotheses. You don't go fishing. You name your prediction and try to break it.

**Sarah:** Third reasoning style. Bayesian thinking.

**Kiffer:** Thomas Bayes was an English clergyman and mathematician whose famous theorem on conditional probability was published in 1764, after his death. The deep point Bayes made for science is that all inference depends on the validity of our premises. The information we already have before we make new observations shapes how we interpret those observations.

**Sarah:** So if I read a study claiming that drinking coffee causes you to grow a third arm, my prior probability is essentially zero. I need overwhelming evidence to update. If I read a study claiming smoking is associated with lung cancer, my prior is already very high based on decades of work.

**Kiffer:** Bayesian analysis formalizes that. It uses prior probability distributions, updates them with new data, and produces posteriors that combine prior knowledge with new evidence. Single studies don't sit in isolation.
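For readers following along in the transcript, here is a minimal sketch of that updating step for a yes-or-no hypothesis. The function name `posterior_probability` and every number are hypothetical, chosen only to show how the same evidence moves a skeptical prior and a confident prior to very different posteriors.

```python
def posterior_probability(prior, p_data_if_true, p_data_if_false):
    """Bayes' theorem for a binary hypothesis H given observed data D."""
    numerator = prior * p_data_if_true
    evidence = numerator + (1.0 - prior) * p_data_if_false
    return numerator / evidence


# Hypothetical study result that is 20 times more likely if H is true.
skeptical = posterior_probability(prior=0.001, p_data_if_true=0.8, p_data_if_false=0.04)
confident = posterior_probability(prior=0.9, p_data_if_true=0.8, p_data_if_false=0.04)

print(f"near-zero prior -> posterior {skeptical:.3f}")
print(f"high prior      -> posterior {confident:.3f}")
```

Under these assumed inputs, the near-zero prior stays small after the update, which is the coffee-and-third-arm intuition in numerical form.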

**Sarah:** And then Thomas Kuhn enters.

**Kiffer:** Thomas Kuhn was an American philosopher and historian of science. His 1962 book The Structure of Scientific Revolutions made the point that although a single observation can in principle disprove a hypothesis, in practice the observation might be anomalous. Scientific communities therefore rely on consensus. Theories don't fall to the first contradicting datum. They fall when the weight of evidence and a better alternative tip the community into what Kuhn called a paradigm shift.

**Sarah:** Okay. The lesson then introduces Figure 1.1, the key components of an epidemiologic study.

**Kiffer:** Seven pieces. Source population, the larger group you ultimately want to make claims about. Sampling, the process by which you select people, with selection bias as the threat. Study group, the people you actually have. Exposure variables, what you measure as the predictor. Outcome, what you measure as the response. Extraneous variables, shorthand for confounders and information bias. And finally analysis and causal inferences.

**Sarah:** And every box on that diagram corresponds to a topic that gets a full lesson later in the course.

**Kiffer:** Surveillance and outbreak investigation come up in Lesson 2. Sampling is Lesson 3. Exposure measurement and questionnaires move into Lesson 4. Outcomes by type, continuous, dichotomous, count, time-to-event, run through measures of disease frequency in Lesson 5. Confounding shows up explicitly in Lesson 12. Figure 1.1 is genuinely a roadmap for this material.

**Sarah:** Okay, then the section turns to directed acyclic graphs.

**Kiffer:** A directed acyclic graph, abbreviated DAG, is the formal version of Figure 1.1. The working tool that modern epidemiologists use to write down their assumptions before they touch the data, and to read off mechanically what those assumptions imply for analysis.

**Sarah:** Define each piece. Directed. Acyclic. Graph.

**Kiffer:** A graph is a set of nodes connected by lines. The nodes here are variables. Directed means each line is an arrow with a head and a tail, pointing from cause to effect. Acyclic means no cycles. A variable cannot cause itself, even by going around a long loop. So no feedback loops in a DAG.

**Sarah:** And every arrow is a claim.

**Kiffer:** Every arrow is a claim that A directly affects B, controlling for the rest of the graph. And just as importantly, the absence of an arrow asserts there is no direct causal effect.

**Sarah:** Three building blocks. Chain. Fork. Collider.

**Kiffer:** A chain is X arrow M arrow Y. The exposure causes the mediator, which causes the outcome. A fork is a single common cause C with arrows to both X and Y. That common cause is what we call a confounder. And a collider is the opposite. Two arrows pointing into the same variable. The shared descendant is the collider.

**Sarah:** Why do we care about the distinctions?

**Kiffer:** Because they tell you what to do in your analysis. In a chain, the mediator is on the causal pathway. If you adjust for it, you remove part of the effect. In a fork, the confounder distorts the X-Y association. You must adjust for it. In a collider, conditioning on the collider opens up a spurious association between X and Y that did not exist before. Colliders are variables you must not adjust for.

**Sarah:** Could you give an everyday example of a collider?

**Kiffer:** Suppose attractiveness and intelligence are independent in the general population. Both increase your chances of getting cast in Hollywood films. So in a sample of Hollywood actors, the only way in is to be high on at least one. Among working actors, you'll find a negative correlation between attractiveness and intelligence, even though the two have no causal relationship. The selection into the sample, conditioning on the collider, manufactures the association.
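The Hollywood example can be checked with a small simulation. This is a sketch under stated assumptions, not the lesson's own analysis: the two traits are drawn as independent standard normal scores, and the casting rule (the sum of the two scores must exceed 1.5) is a hypothetical stand-in for "high on at least one."

```python
import random

random.seed(0)


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Independent traits in the general population (hypothetical scores).
people = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20000)]

# Conditioning on the collider: keep only those who get cast.
actors = [(attractive, smart) for attractive, smart in people if attractive + smart > 1.5]

r_population = correlation(*zip(*people))
r_actors = correlation(*zip(*actors))
print(f"population r = {r_population:.2f}, actors r = {r_actors:.2f}")
```

The population correlation should sit near zero while the correlation among the selected actors turns clearly negative, even though neither trait affects the other in the simulated data.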

**Sarah:** So drawing a DAG forces you to commit to structure before you analyze.

**Kiffer:** Yes. The lesson lays out a recipe. Name the exposure and outcome. List every other variable that could plausibly affect either one, including unmeasured variables. Draw an arrow from each variable to every variable it directly causes. Be ruthless about direct. Check for cycles. Then read off the implications.
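The "check for cycles" step in that recipe is mechanical enough to sketch. The graph below is a hypothetical example (confounder C, exposure X, mediator M, outcome Y), stored as a dictionary mapping each variable to the variables it directly causes; `has_cycle` is an illustrative helper, not something the lesson provides.

```python
def has_cycle(graph):
    """Return True if the directed graph (node -> list of direct effects) has a cycle."""
    unvisited, in_progress, done = 0, 1, 2
    nodes = set(graph) | {child for children in graph.values() for child in children}
    state = {node: unvisited for node in nodes}

    def visit(node):
        state[node] = in_progress
        for child in graph.get(node, []):
            if state[child] == in_progress:
                return True  # we came back to a node still on the current path
            if state[child] == unvisited and visit(child):
                return True
        state[node] = done
        return False

    return any(state[node] == unvisited and visit(node) for node in nodes)


# Hypothetical DAG: C is a fork (confounder), X -> M -> Y is a chain.
dag = {"C": ["X", "Y"], "X": ["M"], "M": ["Y"], "Y": []}
# A feedback loop is not allowed in a DAG.
feedback = {"X": ["Y"], "Y": ["X"]}

print(has_cycle(dag), has_cycle(feedback))
```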

**Sarah:** And what does reading off the implications get you?

**Kiffer:** Four things. First, the DAG identifies confounders. Anything on a so-called back-door path between exposure and outcome, that is, any path connecting the two that starts with an arrow pointing into the exposure, is a candidate for adjustment. Second, the DAG flags variables you must not adjust for. Mediators, if you want the total effect, and colliders, period. Third, the DAG makes your assumptions criticisable. A reviewer can disagree with an arrow and have a real argument. Fourth, the DAG supports a transparent estimand. The total effect of X on Y, adjusting for this back-door set. Defensible.

**Sarah:** Now mediation analysis. The lesson says DAGs and mediation analysis are related but not the same.

**Kiffer:** Right. The DAG tells you that a variable lies on the pathway from exposure to outcome. Mediation analysis puts numbers on how much of the exposure's effect runs through the mediator versus around it. The total effect equals the direct effect, the part of X arrow Y that does not pass through the mediator, plus the indirect effect, the part that does.

**Sarah:** And the historical workhorse is Baron and Kenny.

**Kiffer:** Reuben Baron and David Kenny, two American psychologists, published a landmark paper in 1986 laying out a causal-steps procedure for mediation. Three regressions, four conditions.

**Sarah:** Walk us through it.

**Kiffer:** Step one. Regress the outcome Y on the exposure X. Call the slope c. It must be statistically significant. That's the total X-to-Y effect. Step two. Regress the mediator M on X. Call that slope a. It must be significant. If X doesn't move M, M cannot be a mediator. Step three. Regress Y on both X and M. The coefficient on M, call it b, must be significant. Step four. Compare c to c-prime. The X coefficient in step three, c-prime, is the direct effect after accounting for M. If c-prime is much smaller than c, the gap, equivalently a times b, is the indirect effect. If c-prime is essentially zero, the mediation is complete. Otherwise it is partial.
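The three regressions can be sketched on simulated data. Everything here is an assumption for illustration: the data are generated with a true a of 0.5, b of 0.7, and direct effect of 0.3, and the `ols` helper is a hypothetical convenience wrapper around NumPy least squares. The significance tests in the four conditions are left out; the sketch only shows the coefficients and the identity that c minus c-prime equals a times b in linear models.

```python
import numpy as np


def ols(y, *predictors):
    """Least-squares coefficients [intercept, slope_1, ...] of y on the predictors."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients


rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)            # X moves M (simulated a = 0.5)
y = 0.3 * x + 0.7 * m + rng.normal(size=n)  # simulated direct effect 0.3, b = 0.7

c = ols(y, x)[1]              # step one: total effect of X on Y
a = ols(m, x)[1]              # step two: effect of X on M
_, c_prime, b = ols(y, x, m)  # step three: direct effect and M coefficient

indirect = a * b              # step four: the gap between c and c_prime
print(f"c={c:.3f}  c'={c_prime:.3f}  a*b={indirect:.3f}")
```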

**Sarah:** Nice and concrete. But you said Baron and Kenny is limited.

**Kiffer:** Yes. The original procedure assumes linear models. No interaction between exposure and mediator. No unmeasured confounding of the mediator-outcome relationship. Strong assumptions. Modern alternatives have addressed the limits.

**Sarah:** Such as what, for example?

**Kiffer:** Two big developments. First, bootstrap confidence intervals for the indirect effect a times b. Associated with Kristopher Preacher and Andrew Hayes. Bootstrapping replaces the older Sobel z-test, which made unrealistic distributional assumptions. Second, counterfactual or causal mediation analysis. Associated with Kosuke Imai, the political scientist, Judea Pearl, the computer scientist who built much of modern causal inference, and Tyler VanderWeele, the Harvard biostatistician. Counterfactual mediation defines the natural direct effect and natural indirect effect rigorously. Handles binary outcomes and exposure-mediator interactions.
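A percentile bootstrap for the indirect effect can be sketched in a few lines. This is a simplified illustration on simulated data with hypothetical effect sizes, not the published Preacher and Hayes procedure in full: it resamples subjects with replacement, recomputes a times b each time, and takes the middle 95 percent of the resampled values.

```python
import numpy as np


def indirect_effect(x, m, y):
    """a * b from two least-squares fits: M on X, then Y on X and M."""
    ones = np.ones(len(x))
    a = np.linalg.lstsq(np.column_stack([ones, x]), m, rcond=None)[0][1]
    b = np.linalg.lstsq(np.column_stack([ones, x, m]), y, rcond=None)[0][2]
    return a * b


rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)            # hypothetical a = 0.5
y = 0.3 * x + 0.7 * m + rng.normal(size=n)  # hypothetical b = 0.7

estimates = []
for _ in range(1000):
    rows = rng.integers(0, n, size=n)  # resample subjects with replacement
    estimates.append(indirect_effect(x[rows], m[rows], y[rows]))

ci_low, ci_high = np.percentile(estimates, [2.5, 97.5])
print(f"indirect effect 95% bootstrap CI: ({ci_low:.3f}, {ci_high:.3f})")
```

An interval that excludes zero is read as evidence of an indirect effect, without assuming that the product a times b is normally distributed.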

**Sarah:** And the punchline. DAGs and mediation analysis answer different questions.

**Kiffer:** A DAG is a qualitative tool. It encodes structure. Mediation analysis is quantitative. It estimates the size of direct and indirect effects, given a structure. Every credible mediation analysis sits on top of a DAG.

**Sarah:** That brings us to Section 3, which lays out the models of causation.

**Kiffer:** The lesson opens by defining cause in epidemiologic terms. A cause is any factor that produces a change in the severity or frequency of an outcome. Some causes operate at the biological level within individuals. A specific microorganism. A genetic variant. Others operate at the group or population level. Lifestyle. Nutrition. Weather. Income.

**Sarah:** And there's a pragmatic emphasis on manipulable causes.

**Kiffer:** Epidemiologists prefer to identify causal factors that can be manipulated, because manipulable causes are the levers for prevention. But non-manipulable factors, like genetic predisposition, can still be crucial for understanding disease patterns.

**Sarah:** First model. The component-cause model.

**Kiffer:** Three definitions, and they stack. A necessary cause is one without which the disease cannot occur. The factor must always be present if the disease occurs. The classic example is Mycobacterium tuberculosis, the bacterium that causes tuberculosis. You cannot develop tuberculosis without that bacterium.

**Sarah:** And the second of the three definitions.

**Kiffer:** A sufficient cause is a set of conditions that, when present together, will invariably produce the disease. In practice, very few single exposures are sufficient on their own. Different groupings of factors combine.

**Sarah:** And the third definition.

**Kiffer:** A component cause is one of the factors that, in combination with others, constitutes a sufficient cause. The factors might be present at the same time, or they might follow one another in a temporal chain.

**Sarah:** And the worked example is childhood respiratory disease.

**Kiffer:** Four risk factors. The bacterium Streptococcus pneumoniae. A virus, respiratory syncytial virus, abbreviated RSV. Environmental stressors, like cold weather. And another bacterium, Mycoplasma pneumoniae. Different combinations form sufficient causes.

**Sarah:** Walk me through the table.

**Kiffer:** Four sufficient causes in the example. One is Streptococcus pneumoniae plus environmental stressors. Two is RSV plus environmental stressors. Three is RSV plus Mycoplasma pneumoniae. Four is Streptococcus pneumoniae plus environmental stressors plus Mycoplasma pneumoniae. Each combination, when present, will produce the disease.

**Sarah:** So no single factor is necessary.

**Kiffer:** None appears in every sufficient cause. Streptococcus pneumoniae is in two of four. RSV is in two of four. The disease can occur without any one specific factor, as long as some sufficient combination is present.
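The table logic can be written down directly with sets. The four sufficient causes are the ones read out above; the names `SUFFICIENT_CAUSES` and `develops_disease` are hypothetical labels for this sketch. A necessary cause would have to appear in every sufficient cause, so it is the intersection of all four sets.

```python
SUFFICIENT_CAUSES = [
    {"S. pneumoniae", "stressors"},
    {"RSV", "stressors"},
    {"RSV", "M. pneumoniae"},
    {"S. pneumoniae", "stressors", "M. pneumoniae"},
]


def develops_disease(factors_present):
    """Disease occurs when every component of at least one sufficient cause is present."""
    present = set(factors_present)
    return any(cause <= present for cause in SUFFICIENT_CAUSES)


# A necessary cause must be a member of every sufficient cause.
necessary_causes = set.intersection(*SUFFICIENT_CAUSES)

print(necessary_causes)                               # empty: no factor is necessary
print(develops_disease({"RSV", "M. pneumoniae"}))     # a complete sufficient cause
print(develops_disease({"S. pneumoniae"}))            # a component cause on its own
```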

**Sarah:** And then the lesson uses this scaffolding for an important point about causal complements and strength of association.

**Kiffer:** This is the worked example I want listeners to pay attention to. Imagine two populations. Population A has a respiratory syncytial virus prevalence of thirty percent. Population B has an RSV prevalence of seventy percent. Same disease. Same biology. Same component-cause structure. Now measure the risk ratio for Streptococcus pneumoniae in each.

**Sarah:** And the punchline of the comparison.

**Kiffer:** In Population A, the risk ratio for Streptococcus pneumoniae is 4.83. In Population B, it drops to 2.93. Same causal mechanism. The only thing that changed is how common the co-factor RSV is.

**Sarah:** Why does a more common co-factor compress the risk ratio?

**Kiffer:** The risk ratio is the risk in the exposed divided by the risk in the unexposed. When RSV is rare, the unexposed group, the people without Streptococcus pneumoniae, has very little disease, because without Streptococcus pneumoniae the only sufficient causes left are the ones that include RSV. So the denominator is small, the ratio big. When RSV is common, even people without Streptococcus pneumoniae are getting the disease through those other sufficient causes. The denominator goes up. The ratio shrinks.
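The compression of the risk ratio can be reproduced qualitatively with a small enumeration. This sketch assumes the factors occur independently and uses hypothetical prevalences for environmental stressors (0.4) and Mycoplasma pneumoniae (0.2). The transcript does not give the lesson's values for those, so the output will not match 4.83 and 2.93; only the direction of the change is the point.

```python
from itertools import product


def risk_ratio_for_sp(p_rsv, p_stress=0.4, p_myco=0.2):
    """Risk ratio for S. pneumoniae, assuming the other factors occur independently."""
    risk = {True: 0.0, False: 0.0}  # keyed by S. pneumoniae status
    for rsv, stress, myco in product([True, False], repeat=3):
        weight = (
            (p_rsv if rsv else 1 - p_rsv)
            * (p_stress if stress else 1 - p_stress)
            * (p_myco if myco else 1 - p_myco)
        )
        for sp in (True, False):
            # Sufficient causes from the lesson; the fourth (SP + stressors +
            # M. pneumoniae) is already covered by the first (SP + stressors).
            diseased = (sp and stress) or (rsv and stress) or (rsv and myco)
            if diseased:
                risk[sp] += weight
    return risk[True] / risk[False]


rr_rsv_rare = risk_ratio_for_sp(p_rsv=0.3)
rr_rsv_common = risk_ratio_for_sp(p_rsv=0.7)
print(f"RSV 30%: RR = {rr_rsv_rare:.2f}   RSV 70%: RR = {rr_rsv_common:.2f}")
```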

**Sarah:** So strength of association is not a fixed property of a cause.

**Kiffer:** It is population-specific. The same cause can look strong in one population and weak in another, not because the biology has changed, but because the prevalence of its causal complements is different. One of the most important and least intuitive points in modern causal epidemiology.

**Sarah:** Second model. The causal-web model.

**Kiffer:** The causal web maps how multiple factors combine through interconnected direct and indirect chains. A direct, sometimes called proximal, cause has no known intervening variable between it and the disease. The exposure is adjacent to the outcome. Specific microorganisms. Toxins. An indirect cause is one whose effects are mediated through one or more intervening variables.

**Sarah:** And the example the lesson uses.

**Kiffer:** Environmental stressors like cold weather may make a child more susceptible to Streptococcus pneumoniae, RSV, and Mycoplasma pneumoniae. Stressors are not directly causing the disease. They are operating upstream, through the mediating organisms. Removing stressors could still reduce disease, even though the mechanism is indirect.

**Sarah:** And that connects to a really important practical point.

**Kiffer:** In disease control, direct causes are not necessarily more valuable than indirect ones. Many of our largest public-health victories work through indirect causes. Sanitation. Housing. Vaccination programs that change population-level transmission. The web reminds you that you can intervene anywhere along it.

**Sarah:** And then the population attributable fraction.

**Kiffer:** The population attributable fraction, abbreviated AFp, is the proportion of disease in the population that is attributable to a given exposure. It answers the planner's question. What fraction of disease in this population would disappear if we eliminated this exposure?

**Sarah:** And the historical anchor is Levin in 1953.

**Kiffer:** Morton Levin, an American epidemiologist, gave us the standard formula in 1953. In plain words, the population attributable fraction equals the prevalence of the exposure times the relative risk minus one, all divided by the same numerator plus one. Or in everyday terms, it's the share of cases that you would lose if the exposure went away, taking into account both how strong the effect is and how widespread the exposure is.
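Levin's formula as described in words translates into a short function. The function name and the example inputs are hypothetical; the formula itself is the one stated above, exposure prevalence times relative risk minus one, divided by that same quantity plus one.

```python
def population_attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's formula: AFp = Pe * (RR - 1) / (Pe * (RR - 1) + 1)."""
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (excess + 1.0)


# Hypothetical inputs: the same relative risk, two exposure prevalences.
common_exposure = population_attributable_fraction(exposure_prevalence=0.5, relative_risk=3.0)
rare_exposure = population_attributable_fraction(exposure_prevalence=0.1, relative_risk=3.0)

print(f"exposure in 50% of people: AFp = {common_exposure:.2f}")
print(f"exposure in 10% of people: AFp = {rare_exposure:.2f}")
```

With the relative risk held fixed, the fraction falls as the exposure becomes rarer, which is why planners need both the strength of the effect and the prevalence of the exposure.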

**Sarah:** And there's a counterintuitive consequence. The fractions for all the risk factors of a single disease can sum to more than one hundred percent.

**Kiffer:** Not a math error. The multicausal reality. Component causes appear in more than one sufficient cause. So when you ask what fraction of disease would disappear if Streptococcus pneumoniae were eliminated, you might get fifty percent. What fraction would disappear if environmental stressors were eliminated, you might get sixty percent. Add them, you exceed one hundred percent because the same cases are being claimed by both.

**Sarah:** And then the prevention paradox.

**Kiffer:** Suppose a vaccine has an attributable fraction of fifty percent and the underlying disease prevalence is six percent. Universal vaccination would cut prevalence from six to three percent. Big public-health win. But ninety-four percent of the vaccinated population would not have gotten the disease anyway. Most people don't perceive a personal benefit. And half of the people who would have gotten sick still get sick. The average person sees less benefit than the population data shows.

**Sarah:** That's the paradox the British epidemiologist Geoffrey Rose articulated, right?

**Kiffer:** Right. A preventive measure that brings large benefits to the population brings small benefits to each individual participant. The tension between population-level public-health logic and individual-level decision-making. Both perspectives are correct. They're measuring different things.

**Sarah:** Section 4. The counterfactual concept. The modern logic of causal inference.

**Kiffer:** And the framework that pulls all the analytic methods you'll meet later in the series into a single coherent picture. The potential outcomes framework, also called the counterfactual model, sometimes called the Neyman-Rubin causal model, after the Polish statistician Jerzy Neyman, who in the 1920s formulated the original version, and the American statistician Donald Rubin, who in the 1970s extended and popularized it.

**Sarah:** And the central question is deceptively simple.

**Kiffer:** What would have happened to this same person if they had not been exposed? For any individual, the framework imagines two potential outcomes. The outcome we would observe if this person were exposed. And the outcome we would observe if this same person were not exposed. The individual treatment effect is the difference between those two.

**Sarah:** Walk me through the thought experiment.

**Kiffer:** You want to know if a vaccine protects against a disease. You observe a vaccinated person who develops the disease. If you could rewind time and observe the same person, in the same period, without vaccination, and they did not develop the disease, you'd conclude the vaccine actually caused the disease in that individual. Conversely, if they still got the disease without the vaccine, the vaccine was not the cause.

**Sarah:** But that counterfactual person doesn't exist.

**Kiffer:** You can never observe the same person under two different exposure levels at the same time. That impossibility is the heart of the framework. In 1986 the American statistician Paul Holland called it the fundamental problem of causal inference. Individual causal effects are unobservable. Period. No clever measurement, no improved technology, no more careful design solves it at the level of a single person.

**Sarah:** So the question itself has to shift.

**Kiffer:** We stop asking, what was the effect for this person? and start asking, what is the average effect across a group of similar people? The average treatment effect, abbreviated ATE, is the average outcome if everyone in the group were exposed minus the average outcome if no one were exposed.

**Sarah:** In the lesson's epidemiologic notation, the probability of disease if all members were exposed minus the probability of disease if none were exposed.

**Kiffer:** If those two quantities differ, we infer a causal effect at the population level, even though we can't pinpoint which specific individuals were affected. That shift from individual to group is the conceptual move that powers modern causal research.

**Sarah:** And randomization is the gold-standard tool for estimating that group effect.

**Kiffer:** When you randomly assign subjects to exposed and unexposed groups, you create what's called exchangeability. The condition where the disease frequency in each group would not change if you swapped their exposure status. The two groups are similar enough on every relevant dimension, measured and unmeasured, that the only systematic difference is whether they happened to receive the exposure.

**Sarah:** And that's the inferential power of a randomized controlled trial, abbreviated RCT. It balances known and unknown confounders by design.

**Kiffer:** On average, with a large enough sample, yes. So the difference in average outcomes between the groups is your best estimate of the average treatment effect.
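A simulation makes the argument concrete, because in simulated data both potential outcomes can be generated even though only one would ever be observed. Everything below is hypothetical: a "frail" trait that raises baseline risk, an exposure that adds ten percentage points of risk for everyone, and a coin-flip assignment.

```python
import random

random.seed(3)
n = 20000

# Each simulated person carries BOTH potential outcomes (y1 if exposed, y0 if not).
population = []
for _ in range(n):
    frail = random.random() < 0.3
    risk_unexposed = 0.30 if frail else 0.10
    risk_exposed = risk_unexposed + 0.10  # hypothetical causal effect
    y0 = random.random() < risk_unexposed
    y1 = random.random() < risk_exposed
    population.append((y1, y0))

# Known only because this is a simulation.
true_ate = sum(y1 - y0 for y1, y0 in population) / n

# Randomize: each person reveals just one of their two potential outcomes.
exposed, unexposed = [], []
for y1, y0 in population:
    if random.random() < 0.5:
        exposed.append(y1)
    else:
        unexposed.append(y0)

estimated_ate = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
print(f"true ATE = {true_ate:.3f}, randomized estimate = {estimated_ate:.3f}")
```

Frailty is never measured or adjusted for in the estimate, yet the simple difference in group means should land close to the true average effect, because random assignment balances frailty across the two groups on average.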

**Sarah:** Then the lesson turns to confounding, which wrecks exchangeability.

**Kiffer:** A confounder is a variable that is associated with both the exposure and the outcome and can distort the observed association. The lesson walks through a stratification example with twenty subjects.

**Sarah:** Talk us through what the numbers actually show.

**Kiffer:** The raw analysis shows the probability of disease given exposure is seven out of thirteen, about fifty-four percent. The probability of disease without exposure is three out of seven, about forty-three percent. So at first glance the exposure looks like it raises risk. But there is a third variable C, a pre-existing health condition, that predicts both who gets the exposure and who gets the disease.

**Sarah:** And when you stratify by C.

**Kiffer:** Among C-positive subjects, the risk is six out of nine, about sixty-seven percent, in the exposed and two out of three, also about sixty-seven percent, in the unexposed. Same risk. Among C-negative subjects, the risk is one out of four, twenty-five percent, in the exposed and one out of four, also twenty-five percent, in the unexposed. Same risk again.

**Sarah:** So within each level of the confounder, the exposure has no effect.

**Kiffer:** None. The apparent association in the pooled data was entirely manufactured by confounding. C-positive people happened to also be more likely to be exposed and more likely to have disease. The crude risk difference vanishes the moment you condition on C. That's why controlling for confounders is essential.
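The twenty-subject example is small enough to verify by hand or in a few lines. The counts below are the ones read out in the lesson; the `cells` layout and the `risk` helper are hypothetical names for this sketch. Exact fractions are used so the stratum-specific risks can be compared without rounding.

```python
from fractions import Fraction

# (C status, exposure) -> (cases, subjects), as read out in the lesson.
cells = {
    ("C+", "exposed"): (6, 9),
    ("C+", "unexposed"): (2, 3),
    ("C-", "exposed"): (1, 4),
    ("C-", "unexposed"): (1, 4),
}


def risk(exposure, stratum=None):
    """Risk of disease for one exposure group, pooled or within a stratum of C."""
    selected = [
        counts for (c, e), counts in cells.items()
        if e == exposure and stratum in (None, c)
    ]
    cases = sum(k for k, _ in selected)
    subjects = sum(size for _, size in selected)
    return Fraction(cases, subjects)


crude_difference = risk("exposed") - risk("unexposed")
print("crude:", risk("exposed"), "vs", risk("unexposed"), "difference", crude_difference)
for stratum in ("C+", "C-"):
    print(stratum, risk("exposed", stratum), "vs", risk("unexposed", stratum))
```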

**Sarah:** And in observational studies, where we cannot randomize, we deploy a toolkit to approximate exchangeability.

**Kiffer:** The classical strategies are restriction, where you limit the study to one level of the confounder; matching, where you pair exposed and unexposed people on the confounder; stratification, which is what we just did; and multivariable statistical models, where you adjust statistically. All of these aim to simulate the exchangeability that randomization would have given you for free.

**Sarah:** And the lesson previews four modern tools, all strategies for approximating the missing potential outcome.

**Kiffer:** First, propensity score matching. A propensity score is the estimated probability of being exposed, given a set of measured characteristics. Propensity score matching pairs each exposed person with one or more unexposed people who had a similar probability of being exposed. The matched unexposed group becomes the counterfactual stand-in. The most direct attempt to manufacture exchangeability from observational data.
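
As a rough illustration, propensity score matching can be run on the twenty-subject stratification example from earlier in the lesson. Several simplifications here are assumptions rather than the lesson's method: with one binary covariate the score is estimated as the share exposed in each stratum (applied work would usually fit a model such as logistic regression on many covariates), each exposed subject is matched to every unexposed subject within a hypothetical caliper of 0.05, and the quantity estimated is the effect among the exposed.

```python
def expand(c, e, diseased, total):
    """Turn a cell count into individual (c, exposed, diseased) rows."""
    return [(c, e, 1)] * diseased + [(c, e, 0)] * (total - diseased)


# The twenty subjects from the stratification example.
subjects = (expand(1, 1, 6, 9) + expand(1, 0, 2, 3)
            + expand(0, 1, 1, 4) + expand(0, 0, 1, 4))


def propensity(c):
    """Estimated P(exposed | C = c): share of exposed subjects in the stratum."""
    stratum = [e for (ci, e, _) in subjects if ci == c]
    return sum(stratum) / len(stratum)


scored = [(propensity(c), e, y) for (c, e, y) in subjects]
exposed = [(p, y) for (p, e, y) in scored if e == 1]
unexposed = [(p, y) for (p, e, y) in scored if e == 0]


def matched_outcome(p, caliper=0.05):
    """Mean outcome among unexposed subjects whose score lies within the caliper."""
    matches = [y for (q, y) in unexposed if abs(q - p) <= caliper]
    return sum(matches) / len(matches)


# Each exposed subject is compared with its matched counterfactual stand-in.
att = sum(y - matched_outcome(p) for (p, y) in exposed) / len(exposed)
print("propensity scores:", propensity(1), propensity(0))
print(f"matched effect among the exposed: {att:.3f}")
```

The scores come out at 0.75 for C-positive and 0.5 for C-negative subjects, so matching on the score is the same as matching on C here, and the matched estimate is zero up to floating-point rounding. With many covariates the score does real work by collapsing them into a single number to match on.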

**Sarah:** Second tool, regression.

**Kiffer:** Regression models, linear, logistic, Poisson, Cox, all the variants, estimate the average difference in outcome associated with exposure while statistically holding other variables constant. Among people who look the same on measured confounders, what is the average outcome difference between exposed and unexposed? When the model is correctly specified and confounders are well measured, the regression coefficient on the exposure is an estimate of the average treatment effect. The workhorse of this material.
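
To see "holding other variables constant" in action, here is a minimal sketch that fits two linear models to the twenty subjects from the stratification example: one with exposure only, one with exposure and C. A linear probability model solved through the normal equations is an assumption made to keep the example dependency-free; it is not the lesson's prescribed model, and in practice a statistics library and often a logistic model would be used. The helper names `solve` and `ols` are hypothetical.

```python
def expand(c, e, diseased, total):
    """Turn a cell count into individual (c, exposed, diseased) rows."""
    return [(c, e, 1)] * diseased + [(c, e, 0)] * (total - diseased)


rows = (expand(1, 1, 6, 9) + expand(1, 0, 2, 3)
        + expand(0, 1, 1, 4) + expand(0, 0, 1, 4))


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design, outcome):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, outcome)) for i in range(k)]
    return solve(xtx, xty)


y = [d for (_, _, d) in rows]
crude = ols([[1, e] for (_, e, _) in rows], y)          # intercept, exposure
adjusted = ols([[1, e, c] for (c, e, _) in rows], y)    # intercept, exposure, C

print(f"crude exposure coefficient:    {crude[1]:.4f}")
print(f"adjusted exposure coefficient: {adjusted[1]:.4f}")
print(f"coefficient on C:              {adjusted[2]:.4f}")
```

The crude coefficient reproduces the pooled risk difference of about 0.11. Once C enters the model, the exposure coefficient falls to zero up to floating-point rounding, and the difference in risk is carried by C instead.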

**Sarah:** Third tool, difference-in-differences.

**Kiffer:** Difference-in-differences, abbreviated DiD, is used when an exposure is rolled out to one group but not another. Instead of comparing exposed and unexposed groups directly, it compares the change over time in the exposed group to the change over time in the unexposed. The unexposed group's change is the counterfactual for what would have happened in the exposed group absent the intervention. It subtracts out stable group differences and shared time trends, isolating the effect of the exposure under the assumption of parallel trends.
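
The subtraction itself is simple enough to write out. The numbers below are hypothetical incidence rates per 1,000, invented purely for illustration, and the function name `diff_in_diff` is likewise an assumption.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the exposed group minus change in the unexposed group."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre   # stands in for the counterfactual trend
    return treated_change - control_change


# Hypothetical incidence per 1,000 before and after a rollout.
treated_pre, treated_post = 30.0, 22.0
control_pre, control_post = 24.0, 21.0

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)

# Under parallel trends, the exposed group would have followed the control change.
counterfactual_post = treated_pre + (control_post - control_pre)

print("difference-in-differences estimate:", effect)
print("implied counterfactual for exposed group:", counterfactual_post)
```

A naive before-and-after comparison in the exposed group would credit the intervention with the full drop of 8 per 1,000. The comparison group fell by 3 on its own, so only 5 is attributed to the exposure, and only if the parallel-trends assumption is believable.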

**Sarah:** And the fourth tool, mediation analysis.

**Kiffer:** Mediation decomposes the total effect into a direct effect and one or more indirect effects through mediators. In counterfactual terms, what would the outcome be if exposure were changed but the mediator held at its unexposed value, versus if both were changed? You estimate not just whether an exposure matters, but through what pathways.
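
The counterfactual contrast just described can be sketched with a toy structural model. Everything here is hypothetical: the linear form, the absence of exposure-mediator interaction, and the coefficient values `A`, `B`, and `C_PRIME` are chosen only so the arithmetic is easy to follow.

```python
# Hypothetical linear structural model with no exposure-mediator interaction.
A = 0.5        # effect of exposure on the mediator
B = 0.4        # effect of the mediator on the outcome
C_PRIME = 0.3  # effect of exposure on the outcome not passing through the mediator


def mediator(e):
    """Mediator value the subject would have under exposure level e."""
    return A * e


def outcome(e, m):
    """Outcome under exposure level e with the mediator set to m."""
    return C_PRIME * e + B * m


# Both exposure and mediator change.
total = outcome(1, mediator(1)) - outcome(0, mediator(0))
# Exposure changes, mediator held at its unexposed value.
direct = outcome(1, mediator(0)) - outcome(0, mediator(0))
# Exposure fixed at exposed, mediator moves from unexposed to exposed value.
indirect = outcome(1, mediator(1)) - outcome(1, mediator(0))

print(f"total effect:    {total:.2f}")
print(f"direct effect:   {direct:.2f}")
print(f"indirect effect: {indirect:.2f}")
```

In this simple setting the indirect effect equals the product `A * B` and the two pieces add up to the total. With interactions or nonlinear models that tidy additivity can break down, which is part of what motivates the counterfactual definitions.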

**Sarah:** And the common thread that ties them together.

**Kiffer:** Each tool is a different strategy for the same problem. How do we build a credible counterfactual when we cannot randomize? Keep that lens through the rest of the course. The methods look very different on the surface. Underneath, they are all answering the same question.

**Sarah:** Okay. Let's pull the big takeaways together.

**Kiffer:** First, epidemiology has always lived in the tension between biology-mechanism and population-environment perspectives. Eight historical milestones, two perspectives, both essential. Modern epidemiology accepts multiple causes for almost every outcome and multiple effects for almost every cause.

**Sarah:** Second, scientific inference is a mix of induction, deduction in the Popperian style, and Bayesian reasoning where prior knowledge shapes interpretation. Single observations rarely overturn theories. Communities rely on consensus and Kuhnian paradigm shifts.

**Kiffer:** Third, the structure of any epidemiologic study can be read off Figure 1.1. Source population, sampling, study group, exposure, outcome, extraneous variables, analysis. The modern formal version is a directed acyclic graph. Three building blocks: chain, fork, collider. Four jobs: identify confounders, flag what not to adjust for, make assumptions criticisable, support a transparent estimand.

**Sarah:** Fourth, mediation analysis is the quantitative complement to a DAG. Baron and Kenny in 1986 give us the causal-steps procedure. Modern alternatives, bootstrap intervals from Preacher and Hayes, and counterfactual mediation from Imai, Pearl, and VanderWeele, relax the strong assumptions of the original.

**Kiffer:** Fifth, the component-cause model with necessary, sufficient, and component causes. The childhood respiratory disease example shows how four risk factors combine into different sufficient causes. And the worked risk-ratio example, 4.83 in Population A versus 2.93 in Population B, shows that strength of association is population-specific. Same biology, different co-factor prevalence, different observed effect.

**Sarah:** Sixth, the causal web distinguishes direct from indirect causes. Many of our biggest public-health wins are interventions on indirect causes. The population attributable fraction from Levin in 1953 gives planners a number, and its odd properties, summing above one hundred percent, and the prevention paradox, follow directly from the multicausal structure.

**Kiffer:** Seventh, the counterfactual framework is the modern logic of causal inference. Two potential outcomes. Holland in 1986 called it the fundamental problem of causal inference. Individual treatment effects are unobservable. We move to the average treatment effect at the group level. Randomization gives exchangeability. Confounding wrecks it.

**Sarah:** And eighth, the modern toolkit, propensity score matching, regression, difference-in-differences, and mediation analysis, all share a common logic. Each is a strategy for approximating the missing potential outcome when we cannot randomize.

**Kiffer:** That's the lens you carry into Lesson 2 next week, Surveillance and Outbreak Investigation. We take this causal foundation and apply it to the front-line work of detecting disease and chasing down outbreaks in real time.

**Sarah:** Thanks, Kiffer. And thanks for being here, everyone. We'll see you next time.

**Kiffer:** See you next time.
