Introduction & Causal Concepts
Fundamental Epidemiological Concepts and Approaches
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Trace the history of causal thinking in epidemiology
- Understand component-cause and causal-web models
- Describe the potential-outcomes (counterfactual) framework for estimating causal effects
- Explain why individual effects are unobservable and how average treatment effects fill the gap
- Recognize how counterfactual logic underpins propensity score matching, regression, difference-in-differences, and mediation analysis
- Explain how observational studies and experiments seek causal evidence
- Distinguish inductive and deductive reasoning in science
- Identify the key components of epidemiologic research
- Read and build a directed acyclic graph (DAG), and recognise the role of chains, forks, and colliders
- Distinguish DAG-based causal reasoning from quantitative mediation analysis (Baron & Kenny)
- Apply causal criteria to evaluate associations
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
What Is Epidemiology?
⏱ Estimated reading time: 10 minutes
Introduction and Overview
Welcome to HSCI 341. If you've come from HSCI 230, you spent that course learning to read epidemiological research critically — to evaluate study designs, identify biases, and decide what evidence to trust. HSCI 341 picks up where that left off and asks you to do the work yourself: design valid studies, calculate measures of disease frequency and association, work through screening tests, and conduct surveillance and outbreak investigations. The bias inventory you built in 230 becomes the design checklist you'll use here. Lesson 1 sets up the conceptual foundation for the entire course — what epidemiology is, how scientific inference works (Susser & Susser, 1996), and what we mean when we say one thing “causes” another. Across four content sections we move from the discipline's history (Section 1), through the inferential logic that organizes any study (Section 2), into formal models of causation (Section 3), and finally to the counterfactual framework that underpins modern causal inference (Section 4).
Learning Objectives
- Define epidemiology and explain its core purpose.
- Describe the historical evolution of causal thinking about disease.
- Recognize that epidemiology seeks to identify causal associations between exposures and outcomes.
Defining Epidemiology
Epidemiology is fundamentally about understanding the patterns, causes, and effects of health and disease in populations. Historically, epidemiologists have been concerned with identifying the "succession of events which result in the exposure of specific types of individuals to specific types of environment" — that is, the exposures and causal factors that drive disease.
Modern epidemiology aims to improve population health by integrating data from many disciplines and proposing interventions based on scientific evidence. The discipline focuses on identifying exposures — whether demographic factors, infectious agents, nutritional factors, toxins, or lifestyle elements — and evaluating their associations with health outcomes such as disease, quality of life, and mortality.
Core Insight
Epidemiology is a field-based discipline. It is only by studying exposure-disease associations under real-world conditions that we can begin to understand the web of causal relationships that affect health. The associations we find are part of a complex web of relationships involving organisms and all aspects of their environment.
A Brief History of Causal Thinking
The way we think about what causes disease has shifted dramatically over the centuries (Susser & Susser, 1996a, 1996b). Understanding this history helps us appreciate the complexity of modern causal models. The eight cards below trace this evolution chronologically — from environmental theories in ancient Greece, through miasma and germ theory, to the multifactorial frameworks we use today. As you click through them, watch for one recurring tension: each era pulls between explaining disease at the level of individual mechanism (microbe, gene, biomarker) and explaining it at the level of populations and their environments. Section 2 will pick up that tension when we turn to the logic of scientific inference itself.
Key Historical Milestones
The eight milestone cards are dated ~400 BC, 1750–1885, the mid-1800s, the late 1800s, the early 1900s, the mid-1900s, the 1970s, and the 21st century.
Why the History Matters
Throughout the history of epidemiology, there has been an ongoing tension between two perspectives: one oriented toward biology and mechanisms of causation, the other toward populations and their interactions with the environment. Both are essential. Epidemiologists accept that there are multiple causes for almost every outcome and that a single cause can have multiple effects.
Key Takeaways
- Epidemiology identifies causal associations between exposures and outcomes to improve population health.
- Causal thinking has evolved from single-cause models (miasma, germ theory) to multifactorial models embracing complexity.
- Modern epidemiology integrates social, biological, and environmental factors in understanding disease.
1. What is the primary goal of epidemiology?
2. What important principle did John Snow's cholera investigation demonstrate?
3. Modern epidemiology accepts that:
Scientific Inference & Key Research Components
⏱ Estimated reading time: 12 minutes
Introduction and Overview
Section 1 traced the discipline's history. This section moves from history to logic: how do epidemiologists actually reason from observed data to claims about cause and effect? The two forms of reasoning we cover here — induction and deduction — are the philosophical scaffolding underneath every study you'll design later in this course. We'll then map those modes of reasoning onto the concrete components of an epidemiologic study (Figure 1.1) and finish with directed acyclic graphs (DAGs), the modern formal tool for encoding causal assumptions before any data are touched.
Learning Objectives
- Distinguish between inductive and deductive reasoning.
- Explain the role of Bayesian thinking and scientific consensus in epidemiology.
- Identify the key components of epidemiologic research design.
Why Scientific Inference Matters
Epidemiology relies primarily on observational studies because many health-related problems cannot be studied under controlled laboratory conditions. Ethical concerns, practical limitations, and the complexity of real-world relationships all demand that we study humans in their natural environments. Drawing valid inferences from these studies requires both inductive and deductive reasoning.
Two Forms of Reasoning
The three tabs below define induction and deduction and add Bayesian thinking, which formalises how prior knowledge enters into our interpretation of any new study. Click through each tab and watch for the unifying point: no single mode of reasoning gives you certainty — epidemiologic claims always rest on a combination of observation, hypothesis testing, and consensus.
Inductive Reasoning
Inductive reasoning involves making generalized inferences about causation based on repeated observations. You observe specific instances and draw broader conclusions.
Francis Bacon (1620) first presented inductive reasoning as a method of making generalizations from careful observations. Classic examples include Edward Jenner's observation that milkmaids who developed cowpox didn't get smallpox — which led to the development of the smallpox vaccine. John Stuart Mill's canons (1843) formalized rules for inductive inference and helped shape our concepts of necessary and sufficient causes.
However, as David Hume noted, "there is no logical force to inductive reasoning" — we cannot perceive a causal connection, only a series of events. Repeated observations may be consistent with causation but do not prove it.
Deductive Reasoning
Deductive reasoning involves inferring that a general "law of nature" exists and testing specific hypotheses against observations to prove or refute them. This approach is closely linked to refutationism, attributed to Karl Popper.
Popper argued that scientists should not collect data to prove a hypothesis but rather should attempt to disprove it. Only by disproving hypotheses can we make scientific progress. This is why statistical analyses typically form hypotheses in the null (no association) and then attempt to refute them.
The key benefit: it helps narrow the scope of studies. We carefully review what is known and formulate a few specific, testable hypotheses rather than casting a wide net with hundreds of variables.
Bayesian Thinking & Scientific Consensus
Thomas Bayes (1764) noted that all inference is based on the validity of our premises and that no inference can be known with certainty. The information we have before making observations influences our interpretation of those observations. This gave rise to Bayesian analysis, which formally incorporates prior knowledge and updates it with new data.
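One way to make the prior-to-posterior step concrete is a beta-binomial update. The sketch below is illustrative only: the Beta(2, 8) prior and the sample of 30 cases among 100 people are made-up numbers, not data from this lesson.

```r
# Hypothetical prior belief about a disease risk: Beta(2, 8), prior mean 0.20.
prior_a <- 2
prior_b <- 8

# Hypothetical new study: 30 cases among 100 people.
cases <- 30
n     <- 100

# Conjugate update: add cases to the first parameter, non-cases to the second.
post_a <- prior_a + cases
post_b <- prior_b + (n - cases)

prior_mean  <- prior_a / (prior_a + prior_b)
sample_prop <- cases / n
post_mean   <- post_a / (post_a + post_b)

# 95% credible interval from the posterior Beta distribution.
cred_int <- qbeta(c(0.025, 0.975), post_a, post_b)

round(c(prior = prior_mean, data = sample_prop, posterior = post_mean), 3)
round(cred_int, 3)
```

The posterior mean lands between the prior mean and the observed proportion, and sits closer to the data because the sample (100 observations) carries more information than this prior (worth roughly 10 observations).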
Thomas Kuhn reminded us that although a single observation can disprove a hypothesis, the observation might be anomalous. Scientific communities therefore rely on consensus — paradigm shifts — when weighing the usefulness of theories, even if they cannot prove absolute truth.
Inductive, deductive, and Bayesian reasoning are abstract. The next subsection makes them concrete by walking through the components of an actual study and showing where each form of reasoning enters.
Key Components of Epidemiologic Research
The overall structure of an epidemiologic study involves several interrelated components, each of which must be carefully managed to produce valid results. Read Figure 1.1 below as a roadmap for the rest of HSCI 341 — every box in the diagram corresponds to a topic we'll cover in detail later (sampling in Lesson 3, exposure measurement and questionnaires in Lesson 4, confounding throughout the course, and so on).
Figure 1.1 — Key components of epidemiologic research. Research starts from a source population, samples a study group, measures exposures and outcomes, accounts for extraneous variables (confounders and biases), and ultimately draws causal inferences.
The Central Goal
The rationale for epidemiologic research is to identify potential causal associations between exposures and outcomes. In many instances the exposures are potential risk factors and the outcome is a disease of interest. Ultimately, we aim to make causal inferences about these relationships in the source population as a basis for developing policy and prevention programs.
Directed Acyclic Graphs (DAGs)
The diagram in Figure 1.1 is informal — it shows boxes and arrows but does not commit to a precise causal meaning. A directed acyclic graph (DAG) is the formal version of that picture (Pearl, 1995; Greenland & Brumback, 2002). It is the working tool that modern epidemiologists use to write down their assumptions about how the world works before they touch the data, and to read off — mechanically — what those assumptions imply for analysis.
What is a DAG?
A DAG is a picture made of nodes (variables) connected by directed arrows (causal effects), with no cycles — no variable can cause itself, even by going around the long way. Each arrow is a claim: “A directly affects B, controlling for the rest of the graph.” The absence of an arrow is just as much of a claim — it asserts no direct causal effect.
The Building Blocks
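Every DAG, no matter how large, is built from three structural pieces, and learning to recognise them by sight is the core skill: a chain (X → M → Y), where M transmits the effect of X; a fork (X ← C → Y), where C is a common cause of X and Y; and a collider (X → K ← Y), where K is a common effect of X and Y.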
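A small simulation can show how chains, forks, and colliders behave when you adjust for the middle variable. This is a sketch under simple assumptions (simulated standard-normal variables, linear effects of size 1, and a hypothetical helper called slope), not an analysis of real data.

```r
set.seed(341)
n <- 5000

# Helper: pull one slope out of a fitted linear model.
slope <- function(fit, term) unname(coef(fit)[term])

# Chain X -> M -> Y: all of X's effect on Y runs through M.
x1 <- rnorm(n); m1 <- x1 + rnorm(n); y1 <- m1 + rnorm(n)
chain_total    <- slope(lm(y1 ~ x1), "x1")        # total effect, about 1
chain_adjusted <- slope(lm(y1 ~ x1 + m1), "x1")   # adjusting for M wipes it out

# Fork X <- C -> Y: C causes both; X has no effect on Y.
c2 <- rnorm(n); x2 <- c2 + rnorm(n); y2 <- c2 + rnorm(n)
fork_crude    <- slope(lm(y2 ~ x2), "x2")         # spurious association
fork_adjusted <- slope(lm(y2 ~ x2 + c2), "x2")    # about 0 once C is adjusted

# Collider X -> K <- Y: X and Y are independent causes of K.
x3 <- rnorm(n); y3 <- rnorm(n); k3 <- x3 + y3 + rnorm(n)
collider_crude    <- slope(lm(y3 ~ x3), "x3")       # about 0
collider_adjusted <- slope(lm(y3 ~ x3 + k3), "x3")  # adjusting for K creates bias

round(c(chain_total = chain_total, chain_adjusted = chain_adjusted,
        fork_crude = fork_crude, fork_adjusted = fork_adjusted,
        collider_crude = collider_crude, collider_adjusted = collider_adjusted), 2)
```

The pattern to look for: adjustment helps only in the fork. In the chain it removes the effect you wanted to estimate, and in the collider it manufactures an association that does not exist.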
How to Build One
Drawing a DAG is a substantive exercise, not a statistical one. The arrows come from your subject-matter knowledge of the system, not from p-values. A workable recipe:
- Name the exposure (X) and outcome (Y). Put them on the page, with the exposure on the left.
- List every other variable that could plausibly affect either X or Y — demographic, biological, social, environmental. Include unmeasured variables (draw them with dashed nodes); they belong in the diagram even if you cannot put numbers on them.
- Draw an arrow from each variable to every variable it directly causes. Be ruthless about “direct” — if A → B only by going through C, you draw A → C and C → B, not A → B.
- Check for cycles. If A causes B and B causes A, you need to add a time index or break the loop with intermediate variables. DAGs forbid feedback loops.
- Read off the implications. Every “back-door” path from X to Y that does not pass through a collider must be blocked by adjustment. Mediators must be left alone if you want the total effect. Colliders must be left alone, full stop.
What DAGs Are For
A DAG plays four roles in an analysis. The first three are decided before you fit anything; the fourth is what makes it worth the trouble.
- It identifies confounders. Anything on a back-door path is a candidate for adjustment. Anything not on a back-door path is not — even if it is statistically “significant.”
- It flags variables you must not adjust for. Mediators (over-adjustment) and colliders (selection bias) ruin estimates if you control for them. The DAG tells you which is which.
- It makes assumptions criticisable. A reviewer can disagree with an arrow. They can’t disagree with a regression equation in the same way.
- It supports a transparent estimand. “The total effect of X on Y, adjusting for the back-door set {C1, C2}” is a precise target you can defend.
Mediation Analysis
A DAG tells you that a variable lies on the pathway from exposure to outcome. Mediation analysis is the next step: it puts numbers on how much of the exposure’s effect runs through the mediator versus around it.
The Question Mediation Asks
Given a chain X → M → Y, with X potentially also affecting Y directly (X → Y), how much of the total effect of X on Y is the indirect effect (through M) and how much is the direct effect (the part of X → Y that does not pass through M)? Total = Direct + Indirect.
The Classical Baron & Kenny (1986) Approach
Baron and Kenny’s causal-steps procedure is the recipe most students meet first. It runs three regressions and checks four conditions:
- Step 1 — Total effect. Regress Y on X. The slope (call it c) must be significant. This is the total X → Y effect.
- Step 2 — X predicts M. Regress M on X. The slope (a) must be significant. If X does not move M, M cannot be a mediator.
- Step 3 — M predicts Y, controlling for X. Regress Y on both X and M. The coefficient on M (call it b) must be significant.
- Step 4 — Compare c to c′. The X coefficient in Step 3 (c′) is the direct effect. If c′ is much smaller than c, the gap (c − c′, equivalently a×b) is the indirect effect through M. If c′ is essentially zero, the mediation is “complete”; otherwise it is “partial.”
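The four steps can be sketched with three lm() calls on simulated data. The effect sizes below (a = 0.5, b = 0.4, direct effect 0.3) are hypothetical values chosen for illustration, and the variable names are placeholders.

```r
set.seed(410)
n <- 1000

# Simulated data with a true partial-mediation structure (hypothetical values):
# a = 0.5 (X -> M), b = 0.4 (M -> Y), direct effect c' = 0.3.
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.3 * x + 0.4 * m + rnorm(n)

fit_total    <- lm(y ~ x)      # Step 1: total effect c
fit_mediator <- lm(m ~ x)      # Step 2: path a
fit_full     <- lm(y ~ x + m)  # Step 3: path b and direct effect c'

c_total <- unname(coef(fit_total)["x"])
a_path  <- unname(coef(fit_mediator)["x"])
b_path  <- unname(coef(fit_full)["m"])
c_prime <- unname(coef(fit_full)["x"])

indirect <- a_path * b_path

# Step 4: with OLS fitted on the same rows, c - c' equals a * b exactly.
round(c(total = c_total, direct = c_prime, indirect = indirect,
        difference = c_total - c_prime), 3)

# p-values for the "must be significant" checks in Steps 1 to 3.
summary(fit_total)$coefficients["x", "Pr(>|t|)"]
summary(fit_mediator)$coefficients["x", "Pr(>|t|)"]
summary(fit_full)$coefficients["m", "Pr(>|t|)"]
```

The equality of c − c′ and a×b is an algebraic property of linear least squares; it does not carry over automatically to logistic or other non-linear models.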
Beyond Baron & Kenny
Baron & Kenny is intuitive but limited. It assumes linear models, no exposure-mediator interaction, and no unmeasured confounding of the M–Y relationship. Modern alternatives that you should be aware of:
- Bootstrap confidence intervals for a×b (Preacher & Hayes): these replace the Sobel z-test, which is unreliable because it assumes the product a×b is normally distributed.
- Counterfactual / causal mediation (Imai, Pearl, VanderWeele): defines the “natural direct effect” and “natural indirect effect” without requiring linearity, handles binary outcomes and exposure-mediator interactions, and is implemented in the mediation and CMAverse R packages.
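The bootstrap idea can be sketched in base R without any mediation package. The block below is a hand-rolled percentile bootstrap on simulated data; the effect sizes, the number of replicates, and the helper name indirect_effect are all assumptions made for illustration, and dedicated packages offer more refined intervals.

```r
set.seed(2024)
n <- 500

# Hypothetical data: a = 0.5, b = 0.4, direct effect = 0.3.
dat <- data.frame(x = rnorm(n))
dat$m <- 0.5 * dat$x + rnorm(n)
dat$y <- 0.3 * dat$x + 0.4 * dat$m + rnorm(n)

# Indirect effect a * b estimated from one data set.
indirect_effect <- function(d) {
  a <- coef(lm(m ~ x, data = d))["x"]
  b <- coef(lm(y ~ x + m, data = d))["m"]
  unname(a * b)
}

# Resample rows with replacement and re-estimate a * b each time.
n_boot <- 1000
boot_est <- replicate(n_boot, {
  idx <- sample.int(nrow(dat), replace = TRUE)
  indirect_effect(dat[idx, ])
})

point_est <- indirect_effect(dat)
ci_95 <- quantile(boot_est, c(0.025, 0.975))  # percentile interval

round(c(estimate = point_est, lower = unname(ci_95[1]), upper = unname(ci_95[2])), 3)
```

If the interval excludes zero, the data are consistent with a non-zero indirect effect, provided the no-unmeasured-confounding assumptions of the mediation model hold.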
DAGs vs. Mediation Analysis — Related, Not the Same
It is easy to conflate the two because both involve arrows and pathways, but they answer different questions:
- A DAG is a qualitative tool. It encodes which variables cause which others and lets you read off, structurally, what should be adjusted for. It does not fit a model or estimate an effect.
- Mediation analysis is a quantitative procedure. It estimates the size of a direct and indirect effect, given a model that has already been specified.
- A DAG tells you whether mediation analysis is appropriate (is M really on the pathway? are there back-door paths between M and Y that need adjustment?) and which variables to put in the regressions. Mediation analysis tells you how much of the effect runs through the mediator. Running a Baron and Kenny analysis without the DAG-level thinking can give precise numbers for a misspecified pathway; drawing a DAG without follow-up estimation tells you the structure but not the magnitude.
- Put differently: every credible mediation analysis sits on top of a DAG. Not every DAG implies a mediation analysis.
With this scaffolding in place, the R exercise below puts the DAG side of the story into code. We will return to mediation explicitly in HSCI 410, where you will fit the same kind of model in R.
What you'll do: write a tiny smoking → CHD DAG in R using the dagitty package, then ask it (a) which variables you must adjust for and (b) what every causal/back-door path between exposure and outcome looks like. What to take away: the DAG is no longer a sketch on paper — it's a queryable object, and identifying confounders becomes a function call. We'll use this same toolkit throughout 341 (whenever you design a study) and 410 (when you fit the regression).
Modern causal epidemiology turns the diagram above into a formal object you can query. The dagitty package in R lets you draw a DAG, then ask it which variables you must adjust for to estimate a causal effect — without trial and error.
# One-time install (skip if you have done it before):
# install.packages(c("dagitty", "ggdag"))
library(dagitty)
library(ggdag)
# Smoking -> CHD, with age as a confounder of both.
g <- dagitty("dag {
smoking -> chd
age -> smoking
age -> chd
smoking [exposure]
chd [outcome]
}")
# Which variables do we need to adjust for, and which paths are open?
adjustmentSets(g) # minimal sufficient adjustment set(s)
paths(g, "smoking", "chd") # list every path between exposure and outcome
# Tidy plot
ggdag(g, layout = "circle") + theme_dag()
Why this matters. A DAG turns "I think age is a confounder" into a formal claim you can verify with code. adjustmentSets() tells you the minimum set of variables to control for; paths() lists every connection. We will use this same toolkit throughout 341 and 410.
Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console / plot before answering.
1. What variable(s) did adjustmentSets(g) tell you to control for, and why does that make sense given the smoking → CHD diagram you encoded?
adjustmentSets(g) returns {age}, the single confounder needed to identify the causal effect of smoking on CHD. This is the right answer because, in the DAG you encoded, age has arrows into both smoking (older adults are more likely to have started smoking decades ago) and CHD (age is an independent risk factor). It satisfies all three Rothman/Greenland conditions and sits on the only back-door path from smoking to CHD. Once age is conditioned on, the only open path from smoking to CHD is the direct causal arrow.
2. paths(g, "smoking", "chd") returned two paths. Which one is the direct causal path and which one is a back-door path? How can you tell from the arrow directions?
The direct causal path is smoking → chd: a single forward-pointing arrow, beginning at the exposure and ending at the outcome with no intermediate back-tracks. The back-door path is smoking ← age → chd: its first arrow points into smoking rather than out of it, indicating that age is a common cause, not an intermediate step. The tell-tale signature is the direction of the arrow touching the exposure node: if it points into smoking, the path is a back-door path and must be blocked.
3. If you removed the age -> smoking arrow from the DAG, what would adjustmentSets(g) return next time, and what would that imply about the need to adjust for age?
adjustmentSets(g) would return an empty set { }, meaning no adjustment is necessary for identification. Adjusting for age might still improve precision (it explains variation in CHD), but it is no longer required for causal identification. This is the structural distinction between confounders and prognostic factors that the DAG framework makes explicit.
Key Takeaways
- Inductive reasoning generalizes from observations; deductive reasoning tests specific hypotheses.
- Bayesian thinking incorporates prior knowledge into the interpretation of new evidence.
- Epidemiologic research involves defining a source population, sampling a study group, measuring exposures and outcomes, controlling for bias and confounding, and making causal inferences.
1. Which philosopher argued that scientists should attempt to disprove rather than prove their hypotheses?
2. Bayesian analysis is best described as:
3. Which of the following is a potential threat to validity when sampling from a source population?
Seeking Causes & Models of Causation
⏱ Estimated reading time: 15 minutes
Introduction and Overview
Section 2 gave us the inferential machinery and the DAG vocabulary. This section pushes deeper into what we actually mean when we draw an arrow on a DAG and call it a “causal effect.” Three classical models structure the discussion: the component-cause model (necessary, sufficient, component causes), causal complements (which explain why the same cause has different observed effects in different populations), and the causal-web model. The section closes with the population attributable fraction — a quantity that turns the abstract notion of cause into a number a public-health planner can use.
Learning Objectives
- Define what constitutes a "cause" in epidemiology.
- Explain the component-cause model including necessary, sufficient, and component causes.
- Describe how causal complements affect the strength of association.
- Understand the causal-web model and distinguish direct from indirect causes.
What Is a "Cause"?
For practical purposes in epidemiology, a cause is any factor that produces a change in the severity or frequency of an outcome. Some causes operate at the biological level within individuals (such as a specific microorganism), while others operate at the group or population level (such as lifestyle, nutrition, or weather).
Bradford Hill's (1965) nine viewpoints for judging whether an association is causal are strength, consistency, specificity, temporality (the only hard rule), gradient, plausibility, coherence, experiment, and analogy. They frame a structured judgment, not a checklist.
Epidemiology deals with groups of individuals because the methods for determining causality require it. Researchers take a holistic approach, striving to study and measure every suspected causal factor for the outcome of interest — while recognizing that not every factor can be captured in a single study.
Pragmatic Focus
Epidemiologists prefer to identify causal factors that can be manipulated to prevent disease. But some non-manipulable factors (like genetic predisposition) may also be crucial for understanding disease patterns in populations.
The Component-Cause Model
This foundational model, developed by Rothman (1976) and elaborated by Rothman & Greenland (2005), is based on the concepts of necessary and sufficient causes. The accordion below defines all three (necessary, sufficient, and component) in turn. Click each one open in order — the definitions stack on top of one another and only make sense in sequence.
A necessary cause is one without which the disease cannot occur. The factor will always be present if the disease occurs. For example, Mycobacterium tuberculosis is a necessary cause of tuberculosis — you cannot develop TB without the bacterium being present.
A sufficient cause is a set of conditions that, when present, will invariably produce the disease. In practice, very few single exposures are sufficient on their own. Instead, different groupings of factors combine to form sufficient causes.
A component cause is one of a number of factors that, in combination, constitutes a sufficient cause. The factors might be present at the same time or follow one another in a temporal chain. When there are a number of causal chains with one or more factors in common, we can conceptualize the web of causal chains as a causal web.
Example: Childhood Respiratory Disease (CRD)
Consider four risk factors for CRD: the bacterium Streptococcus pneumoniae (STREP), a virus (RSV), environmental stressors like cold weather, and other bacteria like Mycoplasma pneumoniae (MP). Different two-factor combinations of these can form sufficient causes:
| Component Causes | Sufficient Cause I | Sufficient Cause II | Sufficient Cause III | Sufficient Cause IV |
|---|---|---|---|---|
| STREP | + | + | | |
| RSV | + | | + | |
| Stressors | | + | + | + |
| Other organism (MP) | | | | + |
Key Points from this Model
No single factor is a necessary cause of CRD (none appears in every sufficient cause). STREP is a component of 2 of the 4 sufficient causes. A child exposed to any complete combination will develop CRD. And critically, because the causal complements (the other factors in a sufficient cause) can vary in prevalence, the observed strength of association between an exposure like STREP and CRD can change even though the underlying causal mechanism has not changed.
Causal Complements and Strength of Association
A critical insight from the component-cause model is that the prevalence of causal complements — the other factors needed to complete a sufficient cause — directly affects the strength of association we observe between an exposure and outcome. Even when the causal mechanism stays the same, changes in the distribution of co-factors in the population can make the association appear stronger or weaker.
Worked Example: How Co-Factor Prevalence Matters
Imagine STREP requires RSV or Stressors as a co-factor to cause CRD. In Population A, where RSV prevalence is 30%, the risk ratio for STREP is 4.83. In Population B, where RSV prevalence rises to 70%, the risk ratio drops to 2.93 — even though the causal relationship between STREP and CRD has not changed at all.
The difference is due entirely to the change in the frequency of the co-factor RSV. This is why strength of association is not a fixed measure and is considered "population specific."
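The two risk ratios can be reproduced with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state: stressors have a prevalence of 40% in both populations, and the factors occur independently of one another. It also ignores the MP pathway. The function name strep_rr is a placeholder.

```r
# Risk ratio for STREP when STREP needs RSV or Stressors to complete a
# sufficient cause, and RSV + Stressors is sufficient without STREP.
# Assumptions (not stated in the lesson): stressor prevalence = 0.40, and
# the factors occur independently of one another.
strep_rr <- function(p_rsv, p_stress = 0.40) {
  risk_exposed   <- 1 - (1 - p_rsv) * (1 - p_stress)  # P(RSV or Stressors)
  risk_unexposed <- p_rsv * p_stress                  # P(RSV and Stressors)
  risk_exposed / risk_unexposed
}

rr_pop_a <- strep_rr(p_rsv = 0.30)  # 0.58 / 0.12
rr_pop_b <- strep_rr(p_rsv = 0.70)  # 0.82 / 0.28

round(c(population_A = rr_pop_a, population_B = rr_pop_b), 2)
```

Under these assumptions the function returns 4.83 for Population A and 2.93 for Population B. Raising RSV prevalence increases risk in both groups, but it increases risk among children without STREP proportionally more, so the ratio shrinks.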
The component-cause model explains the abstract logic of multicausality. The causal-web model is the related diagrammatic tool for thinking about how those component causes interact with each other in the real world — and crucially, where you can intervene.
The Causal-Web Model
An alternative way to visualize how multiple factors combine to cause disease is the causal web, consisting of interconnected direct and indirect causal chains:
Direct (Proximal) Causes
A direct cause has no known intervening variable between it and the disease. Diagrammatically, the exposure is adjacent to the outcome. Examples often include specific microorganisms or toxins. However, in disease control, direct causes are not necessarily more valuable than indirect ones — many large-scale control efforts work by manipulating indirect rather than direct causes.
Indirect Causes
An indirect cause is one whose effects on the outcome are mediated through one or more intervening variables. For example, Stressors (cold weather) may make a child susceptible to STREP, RSV, and MP — so Stressors act as an indirect cause of CRD. Removing stress could reduce CRD even though stress itself is not a direct cause.
Implications of the Causal Web
The causal-web model complements the component-cause model but is not equivalent. It shows that we can control disease by preventing the action of direct causes (e.g., vaccination against RSV) or by removing indirect causes (e.g., reducing environmental stressors). The diagram also reveals gaps in our knowledge — apparent direct connections might actually reflect unmeasured intervening factors.
Proportion of Disease Explained
Using the concepts of necessary and sufficient causes, we can estimate the population attributable fraction (AFp): the proportion of disease in the population that is attributable to a given exposure. Because every sufficient cause has several components, and a case is prevented by removing any one of them, the same case counts toward the AFp of each component. The AFp values for all factors can therefore sum to more than 100%. This is not an error; it reflects the reality of multicausal disease.
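A short calculation shows how attributable fractions can add up to more than 100%. The sketch assumes that any two of STREP, RSV, and stressors form a sufficient cause (the MP pathway is ignored), that the factors are independent, and that their prevalences are 60%, 30%, and 40%; all three prevalences are hypothetical.

```r
# Hypothetical, independent prevalences for three CRD component causes.
# Any two of the three together form a sufficient cause.
p <- c(STREP = 0.60, RSV = 0.30, Stressors = 0.40)

# Risk of CRD = P(at least two of the three factors are present).
risk_two_of_three <- function(p) {
  s  <- p[["STREP"]]
  r  <- p[["RSV"]]
  st <- p[["Stressors"]]
  s * r + s * st + r * st - 2 * s * r * st
}

risk_now <- risk_two_of_three(p)

# AFp for a factor: share of current risk that disappears if that factor
# is removed from the population (its prevalence set to zero).
afp <- sapply(names(p), function(f) {
  p_removed <- p
  p_removed[[f]] <- 0
  (risk_now - risk_two_of_three(p_removed)) / risk_now
})

round(afp, 3)
sum(afp)  # greater than 1
```

Each case needs two factors, so removing either one prevents it, and the case is credited to both. That double counting is why the fractions sum to more than 1.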
The Prevention Paradox
Even when a factor has a high AFp (say, a vaccine with AFp = 50%), the benefit at the individual level may appear modest. If disease prevalence were 6%, universal vaccination would reduce it to 3%. Because 94% of the vaccinated population would not have gotten the disease anyway, few individuals see a direct benefit, yet the 3-percentage-point reduction (a halving of prevalence) is still a major population-level achievement. Moreover, half of those who would have gotten sick will still get the disease despite being vaccinated. This creates a paradox: the average person may not perceive the benefit that the population-level data show.
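Counting people makes the paradox easier to see. The sketch below applies the 6% prevalence and 50% AFp from the vaccine example to a hypothetical population of 1,000; the population size is an assumption chosen only to give round numbers.

```r
# Figures from the vaccine example: baseline prevalence 6%, AFp 50%.
population  <- 1000   # hypothetical population size, for counting only
prev_before <- 0.06
afp_vaccine <- 0.50

prev_after <- prev_before * (1 - afp_vaccine)

never_ill     <- population * (1 - prev_before)           # not ill with or without vaccine
cases_averted <- population * (prev_before - prev_after)  # illness prevented by vaccination
still_ill     <- population * prev_after                  # ill despite vaccination

round(c(never_ill = never_ill, cases_averted = cases_averted, still_ill = still_ill))
```

Of 1,000 vaccinated people, 940 would never have been ill, 30 are spared illness, and 30 become ill anyway, so roughly 33 people are vaccinated for each case averted.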
What you'll do: compute a 2×2 table-based risk ratio, then plug it into Levin's (1953) formula to get the population attributable fraction. What to take away: AFp is the bridge between an individual-level effect (the RR) and a population-level statement about how much disease would disappear if the exposure were eliminated. The same calculation will reappear in Lessons 5 and 7 when we work through measures of disease frequency and association.
The AFp answers: what fraction of disease in the population would disappear if the exposure were eliminated? Two equivalent formulas, both easy to compute in R.
# Suppose a 2x2 cross-tabulation from a population-based study:
# Disease+ Disease-
# Exposed 180 820
# Unexposed 60 940
a <- 180; b <- 820 # exposed: a = cases, b = non-cases
c <- 60; d <- 940 # unexposed: c = cases, d = non-cases (calls to c() still work; R skips non-functions when resolving a call)
risk_e <- a / (a + b) # risk in exposed
risk_u <- c / (c + d) # risk in unexposed
RR <- risk_e / risk_u
p_exp <- (a + b) / (a + b + c + d) # prevalence of exposure
# AFp = p_e * (RR - 1) / (1 + p_e * (RR - 1)) (Levin, 1953)
AFp <- p_exp * (RR - 1) / (1 + p_exp * (RR - 1))
round(c(RR = RR, AFp = AFp), 3)
Reading the result. RR = 3, and 50% of disease in this population is attributable to the exposure. Because each case arising from a multi-component sufficient cause can be attributed to every one of its components, AFp values for different exposures can sum to more than 100%. That is not a math error; it is the multicausal reality.
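The second, equivalent way to write AFp is as the excess risk in the whole population divided by the overall risk. The sketch below recomputes both forms from the same 2×2 counts so that it runs on its own; the variable names are placeholders rather than the a, b, c, d used in the exercise.

```r
# Same 2x2 table as the exercise, redefined so this block runs on its own.
cases_exp   <- 180; noncases_exp   <- 820
cases_unexp <- 60;  noncases_unexp <- 940
n_total <- cases_exp + noncases_exp + cases_unexp + noncases_unexp

risk_exp   <- cases_exp / (cases_exp + noncases_exp)        # risk in exposed
risk_unexp <- cases_unexp / (cases_unexp + noncases_unexp)  # risk in unexposed
risk_all   <- (cases_exp + cases_unexp) / n_total           # overall risk

rr    <- risk_exp / risk_unexp
p_exp <- (cases_exp + noncases_exp) / n_total               # prevalence of exposure

afp_levin <- p_exp * (rr - 1) / (1 + p_exp * (rr - 1))      # Levin's form
afp_risk  <- (risk_all - risk_unexp) / risk_all             # excess-risk form

all.equal(afp_levin, afp_risk)
```

Both expressions describe the same quantity: the share of the overall risk that exceeds what the population would experience if everyone had the unexposed risk.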
Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console before answering.
1. What did RR equal in your console output, and how do you interpret that number in plain language?
2. AFp came out to 0.500. In one sentence, what does an AFp of 50% say about how much disease in this population is attributable to the exposure?
3. If the prevalence of exposure (p_exp) were cut in half, would AFp go up or down? Re-run the formula with p_exp / 2 to confirm.
Key Takeaways
- A cause in epidemiology is any factor that changes disease severity or frequency.
- The component-cause model shows how different groupings of factors form sufficient causes, and why no single factor need be necessary for a disease.
- The strength of association can vary between populations even when the underlying causal mechanism is unchanged, due to differences in the prevalence of causal complements.
- The causal-web model distinguishes direct and indirect causes and guides study design and disease control strategies.
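- Population attributable fractions for different exposures can sum to more than 100% (no single AFp can exceed 100%), because component causes act together within sufficient causes and are shared across them, so one case can be attributed to several exposures.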
1. In the component-cause model, a "sufficient cause" is best described as:
2. Why can the strength of association between an exposure and disease change between populations?
3. An indirect cause of disease is one that:
4. Why can the population attributable fractions for all risk factors of a disease sum to more than 100%?
✦ Pass the knowledge check with 100% to continue
The Counterfactual Concept
⏱ Estimated reading time: 12 minutes
Introduction and Overview
Section 3 walked through two classical models of causation, the component-cause model (necessary and sufficient causes) and the causal web, together with the population attributable fraction. Those ideas are about the structure of causation. This final section is about the modern logic of causal inference: specifically, the potential-outcomes framework, also called the counterfactual model. By the end of the section you'll see how the main analytic techniques you'll meet later in HSCI 341 and HSCI 410 (propensity scores, regression, difference-in-differences, mediation) are different ways of approximating the same impossible quantity: what would have happened if the same person had not been exposed.
Learning Objectives
- Define the potential-outcomes (counterfactual) model for causal inference.
- Explain the fundamental problem of causal inference — why individual treatment effects cannot be directly observed.
- Describe how the average treatment effect (ATE) provides a tractable, group-level alternative.
- Describe how randomized experiments approximate the counterfactual ideal.
- Understand the concept of confounding and exchangeability.
- Recognize how counterfactual logic motivates the major analytic tools used in modern epidemiology — propensity score matching, regression, difference-in-differences, and mediation analysis.
What Is the Counterfactual?
The potential outcomes framework — also called the counterfactual model, and sometimes the Neyman-Rubin causal model after the statisticians who formalized it (Rubin, 1974) — is currently the most widely accepted conceptual basis for causal inference in epidemiology and across the modern health and social sciences. At its core, it asks a deceptively simple question: What would have happened to this same person if they had not been exposed?
For any individual i, the framework imagines two potential outcomes:
- Yi(1) — the outcome we would observe if person i were exposed (or treated)
- Yi(0) — the outcome we would observe if that same person were unexposed (or untreated)
The individual treatment effect is the difference between these two potential outcomes: Yi(1) − Yi(0). This is the quantity we would most like to know — but, as the next section makes clear, we can never observe both potential outcomes for the same person.
The Thought Experiment
Imagine you want to know if a vaccine protects against a disease. You observe a vaccinated person who develops the disease. If you could rewind time and observe the same person in the same period without vaccination, and they did NOT develop the disease, you would conclude the vaccine actually caused the disease in that individual. Conversely, if they still got the disease without the vaccine, the vaccine was not the cause.
This counterfactual individual does not exist — you can never observe the same person under two different exposure levels simultaneously. But this is the ideal that our research methods try to approximate.
The Fundamental Problem of Causal Inference
Because each person is either exposed or unexposed — never both simultaneously — we can only ever observe one of their two potential outcomes. The other one is missing. Holland (1986) called this the fundamental problem of causal inference: individual causal effects are unobservable. No clever measurement, no improved technology, and no more careful study design can fully solve this problem at the level of a single person (Hernán, 2004).
The standard response is to give up on the individual treatment effect and instead estimate something we can learn from data: the average treatment effect (ATE) in a population or sample.
The Average Treatment Effect (ATE)
Rather than asking “what is the effect for this person?” we ask “what is the average effect across a group of similar people?” Formally:
ATE = E[Y(1)] − E[Y(0)]
That is, the average outcome if everyone were exposed, minus the average outcome if no one were exposed. Equivalently, in epidemiologic notation:
- p(D_E+): the potential frequency of disease if all population members were exposed
- p(D_E-): the potential frequency of disease if none were exposed
If these two quantities differ, we infer a causal effect in the population — even though we cannot pinpoint which specific individuals were affected.
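The definitions above can be made concrete with a toy table. The sketch below writes out both potential outcomes for six hypothetical people, which is something no real study can do, and then shows what would actually be recorded. All values, and the column names `y1`, `y0`, and `exposed`, are assumptions made for illustration.

```r
# Hypothetical potential outcomes for six people (1 = disease, 0 = no disease).
# Real data never contain both columns; they appear here only to illustrate
# the definitions.
po <- data.frame(
  id      = 1:6,
  y1      = c(1, 1, 0, 1, 0, 1),   # Y_i(1): outcome if exposed
  y0      = c(0, 1, 0, 0, 0, 1),   # Y_i(0): outcome if unexposed
  exposed = c(1, 0, 1, 0, 1, 0)    # exposure actually received (assumed)
)

po$effect <- po$y1 - po$y0          # individual effects Y_i(1) - Y_i(0)
ATE <- mean(po$y1) - mean(po$y0)    # E[Y(1)] - E[Y(0)] = 4/6 - 2/6 = 1/3

# What a study actually records: one potential outcome per person.
observed <- data.frame(
  id      = po$id,
  exposed = po$exposed,
  y1      = ifelse(po$exposed == 1, po$y1, NA),   # missing for the unexposed
  y0      = ifelse(po$exposed == 0, po$y0, NA)    # missing for the exposed
)

ATE
observed
```

In `observed`, every row has exactly one missing potential outcome, so no individual effect can be computed from it. The ATE is well defined in the full table, and the question for any study design is how well it can be recovered from the observed half.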
Why This Matters
The shift from individual effects to group-level average effects is the conceptual move that powers most modern causal research. Randomized trials, observational studies, policy evaluations, and program evaluations all ultimately rest on the same logic: build two groups (or build a counterfactual comparison) that we can plausibly treat as exchangeable, and compare their average outcomes. Almost every major analytic tool covered later in this course is, at heart, a different strategy for doing exactly this.
The Role of Randomization
In a perfect experiment, we would randomly assign subjects to exposed and unexposed groups. Randomization creates exchangeability: the condition in which the same disease frequencies would be observed, apart from chance variation, if the exposure assignments of the two groups had been swapped. When the groups are exchangeable, a difference in outcomes beyond what chance would produce can be attributed to the exposure itself.
Why Randomization Works
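When groups are exchangeable, comparing p(D|E+) and p(D|E-) gives us the closest available estimate of the true counterfactual effect. The estimate is still approximate, because in a real trial the two risks come from two different subsets of subjects rather than from the same people under both conditions. The underlying assumption is that random assignment balances all known and unknown confounders between groups on average; in any single trial, and especially a small one, some imbalance can remain by chance.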
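A small simulation can make this tangible. The sketch below assumes a made-up data-generating process (an unmeasured "frailty" that raises baseline risk, plus an exposure that adds 10 percentage points of risk for everyone), assigns exposure by coin flip, and compares the observed risk difference with the true average effect. The seed, sample size, and all parameter values are arbitrary choices for illustration.

```r
set.seed(341)    # arbitrary seed so the run is reproducible
n <- 100000      # hypothetical study size

# Assumed data-generating process (not from the text):
frailty <- rbinom(n, 1, 0.3)          # unmeasured risk factor
risk0   <- 0.10 + 0.20 * frailty      # risk of disease if unexposed
risk1   <- risk0 + 0.10               # risk of disease if exposed
u  <- runif(n)
y0 <- as.numeric(u < risk0)           # potential outcome if unexposed
y1 <- as.numeric(u < risk1)           # potential outcome if exposed
true_ATE <- mean(y1) - mean(y0)       # knowable only inside a simulation

# Random assignment: a coin flip that ignores frailty and everything else.
exposed <- rbinom(n, 1, 0.5)
y_obs   <- ifelse(exposed == 1, y1, y0)   # only one potential outcome is seen

# Observed comparison of p(D|E+) and p(D|E-)
estimate <- mean(y_obs[exposed == 1]) - mean(y_obs[exposed == 0])

# Balance check: frailty should be about equally common in both groups.
balance <- tapply(frailty, exposed, mean)

round(c(true_ATE = true_ATE, estimate = estimate), 3)
round(balance, 3)
```

Because assignment ignores frailty, the two groups end up with nearly the same share of frail people, and the observed risk difference lands close to the true average effect. With a much smaller `n` the agreement would be looser, which is the chance imbalance that any single trial can carry.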
Confounding: A Threat to Causal Inference
A confounder is a variable that is associated with both the exposure and the outcome, is not itself a step on the causal pathway between them, and can distort the observed association between them (Greenland & Brumback, 2002). Consider a study of vaccination (E) and disease (D) in which a third variable, say a pre-existing health condition (C), independently predicts both who gets vaccinated and who gets the disease.
Confounding in Action
In Table 1.3 from the text, 20 subjects are studied. Looking at the raw data, p(D|E+) = 7/13 = 0.54 and p(D|E-) = 3/7 = 0.43, suggesting the exposure might increase disease risk. But when we stratify by the confounder C:
Among C+ subjects: p(D|E+) = 6/9 = 0.67 and p(D|E-) = 2/3 = 0.67
Among C- subjects: p(D|E+) = 1/4 = 0.25 and p(D|E-) = 1/4 = 0.25
Within each stratum, the exposure has NO effect on disease! The apparent association was entirely due to confounding by C. This is why controlling for confounders is essential in epidemiologic analysis.
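The arithmetic above can be reproduced in a few lines of R. The counts in this sketch are read directly off the fractions quoted in the example (6/9, 2/3, 1/4, 1/4); the data frame layout and the names `tab`, `crude`, and `prop_Cpos` are choices made here for illustration.

```r
# Counts taken from the fractions quoted in the example (20 subjects in total).
tab <- data.frame(
  C       = c("C+", "C+", "C-", "C-"),
  exposed = c("E+", "E-", "E+", "E-"),
  cases   = c(6, 2, 1, 1),
  n       = c(9, 3, 4, 4)
)

# Crude risks: collapse over C, as if the confounder had not been measured.
crude <- aggregate(cbind(cases, n) ~ exposed, data = tab, FUN = sum)
crude$risk <- crude$cases / crude$n

# Stratum-specific risks: keep C+ and C- separate.
tab$risk <- tab$cases / tab$n

# Why the crude comparison misleads: C+ is more common among the exposed.
prop_Cpos <- c(exposed = 9 / 13, unexposed = 3 / 7)

crude[, c("exposed", "risk")]
tab[, c("C", "exposed", "risk")]
round(prop_Cpos, 2)
```

The last line shows the source of the distortion: about 69% of exposed subjects are C+ compared with 43% of unexposed subjects, and C+ subjects have the higher disease risk (0.67 versus 0.25) regardless of exposure.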
Observational Studies and the Counterfactual
In observational studies, we cannot randomize. This means groups may not be exchangeable, and confounding is a major concern. Epidemiologists use several strategies to address this: restriction (limiting the study to one level of the confounder), matching, stratification, and multivariable statistical models. All of these aim to simulate the exchangeability that randomization would provide.
From Counterfactual Logic to the Modern Toolkit
Once you accept that causal inference is fundamentally about constructing a credible counterfactual comparison — an estimate of what would have happened in the absence of exposure — many of the methods you will encounter later in this course (and across modern epidemiology, biostatistics, health economics, and policy evaluation) start to look like variations on a single theme. Each is a different strategy for approximating the missing potential outcome (Vandenbroucke, Broadbent, & Pearce, 2016).
Below is a brief preview of four widely used tools. Each will be explored in much greater depth later in this series — here we are only flagging how each one is anchored in counterfactual logic.
A propensity score is the estimated probability of being exposed given a set of measured characteristics. Propensity score matching pairs each exposed person with one or more unexposed people who had a similar probability of being exposed. The matched unexposed group then serves as the counterfactual stand-in for the exposed group — mimicking what randomization would have done by balancing observed confounders. It is one of the most direct attempts to manufacture exchangeability from observational data.
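As a rough sketch of the idea, the code below simulates a hypothetical cohort in which age and comorbidity influence both exposure and disease, estimates propensity scores with logistic regression, and performs 1:1 nearest-neighbour matching with replacement in base R. Everything about the data (variable names, coefficients, the true risk difference of 0.10) is an assumption made for illustration, and dedicated matching packages offer more careful procedures than this hand-rolled version.

```r
set.seed(341)   # arbitrary seed for reproducibility
n <- 20000      # hypothetical cohort size

# Assumed data-generating process: age and comorbidity raise both the chance
# of exposure and the risk of disease; exposure adds 0.10 to everyone's risk.
age      <- rnorm(n, mean = 50, sd = 10)
comorbid <- rbinom(n, 1, 0.3)
exposed  <- rbinom(n, 1, plogis(-0.5 + 0.05 * (age - 50) + 1.5 * comorbid))
risk     <- 0.15 + 0.002 * (age - 50) + 0.25 * comorbid + 0.10 * exposed
y        <- rbinom(n, 1, risk)

# Step 1: estimate each person's probability of exposure.
ps_model <- glm(exposed ~ age + comorbid, family = binomial)
ps <- unname(fitted(ps_model))

ps_e <- ps[exposed == 1]; y_e <- y[exposed == 1]
ps_u <- ps[exposed == 0]; y_u <- y[exposed == 0]

# Step 2: for each exposed person, find the unexposed person with the
# closest propensity score (matching with replacement).
match_idx <- vapply(ps_e, function(p) which.min(abs(ps_u - p)), integer(1))

# Step 3: compare outcomes before and after matching.
crude   <- mean(y_e) - mean(y_u)              # confounded comparison
matched <- mean(y_e) - mean(y_u[match_idx])   # matched comparison

round(c(crude = crude, matched = matched), 3)
```

In this simulated setting the crude risk difference overstates the effect, while the matched comparison moves toward the value of 0.10 built into the simulation. The matching only balances the covariates included in the propensity model; an unmeasured confounder would remain unbalanced.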
Regression models (linear, logistic, Poisson, Cox, and others) estimate the average difference in outcome associated with exposure while statistically holding other variables constant. Conceptually, regression asks: among people who look the same on measured confounders, what is the average outcome difference between the exposed and the unexposed? When the model is correctly specified and all important confounders are measured and included, the regression coefficient on the exposure can be interpreted as an estimate of the exposure's causal effect on the scale of the model: a mean or risk difference in a linear model, a log odds ratio in a logistic model, a log rate ratio in a Poisson model, and a log hazard ratio in a Cox model. Regression is the workhorse of HSCI 410 and underlies many of the more advanced techniques.
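A minimal illustration of regression adjustment, using simulated data in which the truth is known. The outcome (systolic blood pressure, `sbp`), the confounder (`age`), and every coefficient below are hypothetical choices; the exposure is built to raise `sbp` by 5 units, and older people are made more likely to be exposed.

```r
set.seed(341)   # arbitrary seed for reproducibility
n <- 5000       # hypothetical sample size

# Assumed data-generating process: age raises both the chance of exposure
# and blood pressure; the exposure itself adds 5 units.
age     <- rnorm(n, mean = 50, sd = 10)
exposed <- rbinom(n, 1, plogis(0.08 * (age - 50)))
sbp     <- 120 + 0.5 * (age - 50) + 5 * exposed + rnorm(n, sd = 10)

# Unadjusted model: compares exposed and unexposed as they are.
crude <- unname(coef(lm(sbp ~ exposed))["exposed"])

# Adjusted model: compares exposed and unexposed people of the same age.
adjusted <- unname(coef(lm(sbp ~ exposed + age))["exposed"])

round(c(crude = crude, adjusted = adjusted), 2)
```

The crude coefficient mixes the effect of exposure with the effect of being older, while the adjusted coefficient sits close to the simulated value of 5. That recovery depends on the model matching how the data were generated, which a real analysis can never fully verify.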
Difference-in-differences is a quasi-experimental design used when an exposure (often a policy or program) is rolled out to one group but not another. Rather than comparing exposed and unexposed groups directly, it compares the change over time in the exposed group to the change over time in the unexposed group. The unexposed group's change serves as the counterfactual for what would have happened in the exposed group absent the intervention. This subtracts out stable group differences and shared time trends, isolating the effect of the exposure under the assumption of parallel trends.
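The arithmetic is simple enough to show in full. The four rates below are invented for illustration (say, hospitalizations per 1,000 people before and after a policy), and `get_rate()` is a helper name introduced here. The regression at the end is the same calculation expressed as an interaction term.

```r
# Hypothetical group-by-period rates; all values are assumptions.
rates <- data.frame(
  treated = c(1, 1, 0, 0),     # 1 = group that received the policy
  post    = c(0, 1, 0, 1),     # 1 = period after the policy began
  rate    = c(50, 42, 40, 38)
)

get_rate <- function(treated, post) {
  rates$rate[rates$treated == treated & rates$post == post]
}

change_treated    <- get_rate(1, 1) - get_rate(1, 0)   # 42 - 50 = -8
change_comparison <- get_rate(0, 1) - get_rate(0, 0)   # 38 - 40 = -2
did <- change_treated - change_comparison              # -8 - (-2) = -6

# Counterfactual implied by parallel trends: where the treated group would
# have ended up if it had followed the comparison group's change.
counterfactual_after <- get_rate(1, 0) + change_comparison   # 50 - 2 = 48

# Same estimate as the interaction coefficient in a regression.
fit <- lm(rate ~ treated * post, data = rates)
did_reg <- unname(coef(fit)["treated:post"])

c(did = did, did_reg = did_reg, counterfactual_after = counterfactual_after)
```

The treated group started 10 units higher than the comparison group, but that stable gap cancels out. Only the extra 6-unit drop beyond the shared trend is credited to the policy, and only if the two groups really would have moved in parallel without it.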
Mediation analysis decomposes the total effect of an exposure on an outcome into a direct effect and one or more indirect effects operating through intermediate variables (mediators). In counterfactual terms, it asks: what would the outcome be if exposure were changed but the mediator were held at its unexposed value, versus if both were changed? This allows researchers to estimate not just whether an exposure matters, but through what pathways — an essential step for designing targeted interventions.
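One simple way to see the decomposition is the classic product-of-coefficients approach with linear models. The sketch below uses simulated data and assumes a randomized binary exposure `x`, a continuous mediator `m`, a continuous outcome `y`, no exposure-mediator interaction, and no unmeasured confounding of the mediator-outcome relationship. Only under assumptions like these do the regression quantities line up with the counterfactual direct and indirect effects described above. All coefficients are hypothetical.

```r
set.seed(341)   # arbitrary seed for reproducibility
n <- 2000       # hypothetical sample size

# Assumed data-generating process:
x <- rbinom(n, 1, 0.5)                 # randomized exposure
m <- 0.5 * x + rnorm(n)                # exposure shifts the mediator by 0.5
y <- 0.3 * x + 0.8 * m + rnorm(n)      # direct effect 0.3; mediator effect 0.8

# Path a: exposure -> mediator
a <- unname(coef(lm(m ~ x))["x"])

# Outcome model with both exposure and mediator
fit_y  <- lm(y ~ x + m)
direct <- unname(coef(fit_y)["x"])     # effect of x with the mediator held fixed
b      <- unname(coef(fit_y)["m"])     # path b: mediator -> outcome

indirect <- a * b                              # effect transmitted through m
total    <- unname(coef(lm(y ~ x))["x"])       # overall effect of x on y

round(c(total = total, direct = direct, indirect = indirect), 3)
```

With ordinary least squares on the same sample, the total effect equals the direct effect plus the indirect effect exactly. Once models are nonlinear or the exposure and mediator interact, that tidy additivity no longer holds automatically, which is one reason counterfactual definitions of direct and indirect effects are used.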
The Common Thread
Each of these tools — propensity score matching, regression, difference-in-differences, and mediation — is a different answer to the same question: How do we build a credible counterfactual when we cannot randomize? Keep this lens in mind throughout the rest of the course. The methods will look very different on the surface, but the underlying logic — comparing observed outcomes to a thoughtfully constructed estimate of what would have happened otherwise — remains the same.
Reflection
Think of a research question in your area of interest. What would the ideal counterfactual comparison look like? What confounders might distort the observed association, and how might you control for them?
Key Takeaways
- The potential-outcomes (counterfactual) model asks: what would have happened to the same individual under a different exposure level?
- The fundamental problem of causal inference is that individual treatment effects are unobservable — we never see both Y(1) and Y(0) for the same person.
- To make progress, we estimate the average treatment effect (ATE) at the group level rather than the individual effect. This shift from individual to group is the foundation of modern causal research.
- Randomized experiments create exchangeability, allowing the ATE to be estimated by comparing average outcomes across groups.
- Confounding occurs when a third variable distorts the exposure-outcome association; controlling for it is essential for valid causal inference.
- Modern tools — propensity score matching, regression, difference-in-differences, and mediation analysis — are all strategies for constructing a credible counterfactual when randomization is not possible.
1. The counterfactual concept asks:
2. Exchangeability in a randomized experiment means:
3. A confounder is a variable that:
✦ Complete the reflection and pass the knowledge check with 100% to continue
Lesson Review & Final Assessment
⏱ Estimated time: 15 minutes
Bringing It All Together
This lesson laid the conceptual foundation for everything that follows in HSCI 341. You moved from a working definition of epidemiology and a brief history of causal thinking into the formal logic of scientific inference, then into the language epidemiologists use to talk about causes — component causes, sufficient causes, causal webs — and finally into the counterfactual framework that underwrites modern causal inference.
The threads pulled together here are deliberately abstract because the rest of the course is not. Sampling, questionnaire design, measures of disease frequency, screening, measures of association, and study design choices all assume you already know why a research question is causal and what a confounder is. As you work through the final assessment, treat the takeaways below as the vocabulary the next eleven lessons will keep using.
Key Takeaways from Lesson 1
- Epidemiology studies exposure–outcome associations in populations to identify modifiable causes of disease and inform prevention.
- Causal thinking has evolved from Hippocratic environmental theories through miasma and germ theory to today's multifactorial models combining biological, behavioural, and social drivers.
- Inductive, deductive, and Bayesian reasoning each play a distinct role in moving from observation to evidence; no single mode is sufficient on its own.
- The component–cause and causal–web models explain why most disease has many causes, why "strength of association" is population-specific, and why necessary causes are rare.
- The counterfactual (potential-outcomes) framework is the conceptual gold standard: causal effects are comparisons of what happened with what would have happened under a different exposure.
- Because individual counterfactuals are unobservable, epidemiologists estimate average treatment effects using randomization, matching, regression, and related tools — all of which depend on controlling confounding.
Final Reflection
Think about a health issue you are interested in studying. Identify a potential exposure-outcome relationship and sketch out what a component-cause model might look like. What would be a direct cause versus an indirect cause? What confounders would you need to consider?
Final Knowledge Assessment
Complete the following 15-question assessment. A score of 100% is required to complete the lesson. You may retake the assessment as many times as needed.
1. The primary goal of epidemiology is to:
2. John Snow's investigation of cholera demonstrated that:
3. Karl Popper's philosophy of refutationism holds that:
4. Bayesian analysis in epidemiology:
5. In epidemiology, a "cause" is defined as:
6. A sufficient cause in the component-cause model is:
7. The strength of association between an exposure and disease can vary between populations because:
8. An indirect cause of disease is one that:
9. The counterfactual model is based on comparing:
10. Exchangeability in a randomized trial means:
11. A confounder is a variable that:
12. Selection bias occurs when:
13. Population attributable fractions (AFp) for different exposures can sum to more than 100% because:
14. The prevention paradox refers to the fact that:
15. Which statement best reflects the overall message of this lesson?
✦ Complete the final reflection above before submitting