HSCI 341 — Lesson 1

Introduction & Causal Concepts

Fundamental Epidemiological Concepts and Approaches

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Trace the history of causal thinking in epidemiology
  • Understand component-cause and causal-web models
  • Describe the potential-outcomes (counterfactual) framework for estimating causal effects
  • Explain why individual effects are unobservable and how average treatment effects fill the gap
  • Recognize how counterfactual logic underpins propensity score matching, regression, difference-in-differences, and mediation analysis
  • Explain how observational studies and experiments seek causal evidence
  • Distinguish inductive and deductive reasoning in science
  • Identify the key components of epidemiologic research
  • Read and build a directed acyclic graph (DAG), and recognise the role of chains, forks, and colliders
  • Distinguish DAG-based causal reasoning from quantitative mediation analysis (Baron & Kenny)
  • Apply causal criteria to evaluate associations

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary — Key Terms, People & Concepts

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas
Cause: An event, condition, or characteristic that precedes a disease event and without which the disease event either would not have occurred at all or would not have occurred until some later time.
Effect: The change in an outcome that is attributable to a cause, typically defined as a contrast between potential outcomes under different exposure conditions.
Counterfactual: A “what would have happened” outcome, that is, the outcome the same individual would have experienced under a different (unobserved) exposure condition. Forms the conceptual basis for causal inference.
Potential Outcomes Framework: Formal causal inference framework (Neyman-Rubin) that defines the causal effect for an individual as the difference between the outcome they would have under treatment and under control. Only one of the two is ever observed (the “fundamental problem of causal inference”).
Average Treatment Effect (ATE): The expected difference in potential outcomes across the population: E[Y(1) − Y(0)]. Identifiable from data when exchangeability, positivity, and consistency hold.
Exchangeability: The condition that exposed and unexposed groups would have had the same risk of the outcome had they swapped exposures, i.e., no unmeasured confounding. Achieved by design via randomization or by adjustment in observational studies.
Sufficient Cause: A complete causal mechanism, that is, a minimal set of conditions and events that inevitably produces the disease. Rothman’s “causal pies” depict each sufficient cause as a pie made of component causes.
Component Cause: Any one of the factors that, together with others, makes up a sufficient cause. Each “slice” in a Rothman pie is a component cause.
Necessary Cause: A component that must be present in every sufficient cause for the disease to occur. Example: Mycobacterium tuberculosis is necessary for tuberculosis.
Causal Pie Model: Rothman’s schematic showing each sufficient cause as a pie composed of component causes. Disease can have multiple sufficient causes (multiple pies), and removing any single component breaks that pie.
Web of Causation: MacMahon and Pugh’s metaphor for the complex network of interconnected factors (biological, behavioural, environmental, and social) that contribute to disease.
Association vs. Causation: Association is a statistical relationship between two variables; causation is a relationship in which changing one would change the other. All causal relationships imply association, but not all associations are causal: chance, bias, and confounding can produce non-causal associations.
Confounding: Distortion of an exposure-outcome association by a third variable that is associated with the exposure and independently affects the outcome, without being on the causal pathway.
Directed Acyclic Graph (DAG): A graphical tool that encodes assumed causal relationships among variables using directed arrows, with no directed cycles. Used to identify confounders, mediators, and colliders for proper adjustment.
Collider: A variable on which two or more arrows in a DAG converge. Adjusting for a collider opens a non-causal path and can introduce selection bias (collider-stratification bias).
Mediator: A variable that lies on the causal pathway between exposure and outcome. Adjusting for a mediator removes part of the causal effect being estimated.
Induction vs. Deduction: Induction reasons from specific observations to general principles; deduction reasons from general principles to specific predictions. Modern epidemiology blends both: hypotheses are deduced from theory and tested against inductively gathered data.
Bradford Hill Considerations: Nine viewpoints (strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy) Hill proposed for judging whether an observed association is likely causal. Not a checklist; only temporality is a strict requirement.
Methods & Approaches
Propensity Score Matching: A method that matches exposed and unexposed units on their estimated probability of being exposed (given covariates) to mimic the balance achieved by randomization.
Difference-in-Differences: A quasi-experimental method that compares the change in outcome over time in a treated group to the change in an untreated control group, removing time-invariant confounding.
Mediation Analysis: A statistical approach that decomposes a total effect into direct and indirect (mediated) components. The classical Baron & Kenny approach has been refined by counterfactual-based methods.
Key People
Sir Austin Bradford Hill (1897–1991): British epidemiologist and statistician who articulated the nine considerations for causal inference (1965) and co-led the British Doctors Study linking smoking to lung cancer.
Kenneth Rothman: American epidemiologist who introduced the sufficient-component-cause (“causal pie”) model and authored foundational texts on modern epidemiology.
Judea Pearl: Computer scientist whose work formalized causal inference using directed acyclic graphs and the do-calculus, bridging statistics, AI, and epidemiology.
Miguel Hernán: Epidemiologist at Harvard whose work (with Robins) developed and popularized the potential-outcomes framework, target trial emulation, and modern causal-inference methods.
Sander Greenland: Epidemiologist and biostatistician whose work clarified confounding, effect modification, and the use of DAGs in observational research.
Mervyn Susser (1921–2014): South African-American epidemiologist who advanced concepts of causal thinking in epidemiology and championed eco-epidemiology and social determinants frameworks.
Donald Rubin: Statistician who developed the formal potential-outcomes (Rubin Causal Model) framework and propensity score methods.
Section 1

What Is Epidemiology?

⏱ Estimated reading time: 10 minutes

Introduction and Overview

Welcome to HSCI 341. If you've come from HSCI 230, you spent that course learning to read epidemiological research critically — to evaluate study designs, identify biases, and decide what evidence to trust. HSCI 341 picks up where that left off and asks you to do the work yourself: design valid studies, calculate measures of disease frequency and association, work through screening tests, and conduct surveillance and outbreak investigations. The bias inventory you built in 230 becomes the design checklist you'll use here. Lesson 1 sets up the conceptual foundation for the entire course — what epidemiology is, how scientific inference works (Susser & Susser, 1996), and what we mean when we say one thing “causes” another. Across four content sections we move from the discipline's history (Section 1), through the inferential logic that organizes any study (Section 2), into formal models of causation (Section 3), and finally to the counterfactual framework that underpins modern causal inference (Section 4).

Learning Objectives

  • Define epidemiology and explain its core purpose.
  • Describe the historical evolution of causal thinking about disease.
  • Recognize that epidemiology seeks to identify causal associations between exposures and outcomes.

Defining Epidemiology

Epidemiology is fundamentally about understanding the patterns, causes, and effects of health and disease in populations. Historically, epidemiologists have been concerned with identifying the "succession of events which result in the exposure of specific types of individuals to specific types of environment" — that is, the exposures and causal factors that drive disease.

Modern epidemiology aims to improve population health by integrating data from many disciplines and proposing interventions based on scientific evidence. The discipline focuses on identifying exposures — whether demographic factors, infectious agents, nutritional factors, toxins, or lifestyle elements — and evaluating their associations with health outcomes such as disease, quality of life, and mortality.

Core Insight

Epidemiology is a field-based discipline. It is only by studying exposure-disease associations under real-world conditions that we can begin to understand the web of causal relationships that affect health. The associations we find are part of a complex web of relationships involving organisms and all aspects of their environment.

A Brief History of Causal Thinking

The way we think about what causes disease has shifted dramatically over the centuries (Susser & Susser, 1996a, 1996b). Understanding this history helps us appreciate the complexity of modern causal models. The eight cards below trace this evolution chronologically — from environmental theories in ancient Greece, through miasma and germ theory, to the multifactorial frameworks we use today. As you click through them, watch for one recurring tension: each era pulls between explaining disease at the level of individual mechanism (microbe, gene, biomarker) and explaining it at the level of populations and their environments. Section 2 will pick up that tension when we turn to the logic of scientific inference itself.

Key Historical Milestones

The eight milestones, in chronological order:

  • Hippocrates (~400 BC)
  • Miasma Theory (1750–1885)
  • John Snow (mid-1800s)
  • Germ Theory (late 1800s)
  • Goldberger & Pellagra (early 1900s)
  • Framingham & Beyond (mid-1900s)
  • Agent-Host-Environment (1970s)
  • One Health (21st Century)

Why the History Matters

Throughout the history of epidemiology, there has been an ongoing tension between two perspectives: one oriented toward biology and mechanisms of causation, the other toward populations and their interactions with the environment. Both are essential. Epidemiologists accept that there are multiple causes for almost every outcome and that a single cause can have multiple effects.

Key Takeaways

  • Epidemiology identifies causal associations between exposures and outcomes to improve population health.
  • Causal thinking has evolved from single-cause models (miasma, germ theory) to multifactorial models embracing complexity.
  • Modern epidemiology integrates social, biological, and environmental factors in understanding disease.
Knowledge Check — Section 1

1. What is the primary goal of epidemiology?

Epidemiology focuses on identifying exposure-outcome associations at the population level to inform prevention and intervention.

2. What important principle did John Snow's cholera investigation demonstrate?

Snow identified contaminated water as the cause of cholera transmission roughly 30 years before the organism Vibrio cholerae was discovered.

3. Modern epidemiology accepts that:

Epidemiologists embrace multicausal models, recognizing that disease arises from complex webs of interacting factors.


Section 2

Scientific Inference & Key Research Components

⏱ Estimated reading time: 12 minutes

Introduction and Overview

Section 1 traced the discipline's history. This section moves from history to logic: how do epidemiologists actually reason from observed data to claims about cause and effect? The two forms of reasoning we cover here — induction and deduction — are the philosophical scaffolding underneath every study you'll design later in this course. We'll then map those modes of reasoning onto the concrete components of an epidemiologic study (Figure 1.1) and finish with directed acyclic graphs (DAGs), the modern formal tool for encoding causal assumptions before any data are touched.

Learning Objectives

  • Distinguish between inductive and deductive reasoning.
  • Explain the role of Bayesian thinking and scientific consensus in epidemiology.
  • Identify the key components of epidemiologic research design.

Why Scientific Inference Matters

Epidemiology relies primarily on observational studies because many health-related problems cannot be studied under controlled laboratory conditions. Ethical concerns, practical limitations, and the complexity of real-world relationships all demand that we study humans in their natural environments. Drawing valid inferences from these studies requires both inductive and deductive reasoning.

Two Forms of Reasoning

The three tabs below define induction and deduction and add Bayesian thinking, which formalises how prior knowledge enters into our interpretation of any new study. Click through each tab and watch for the unifying point: no single mode of reasoning gives you certainty — epidemiologic claims always rest on a combination of observation, hypothesis testing, and consensus.

Inductive Reasoning

Inductive reasoning involves making generalized inferences about causation based on repeated observations. You observe specific instances and draw broader conclusions.

Francis Bacon (1620) first presented inductive reasoning as a method of making generalizations from careful observations. Classic examples include Edward Jenner's observation that milkmaids who developed cowpox didn't get smallpox — which led to the development of the smallpox vaccine. John Stuart Mill's canons (1843) formalized rules for inductive inference and helped shape our concepts of necessary and sufficient causes.

However, as David Hume noted, "there is no logical force to inductive reasoning" — we cannot perceive a causal connection, only a series of events. Repeated observations may be consistent with causation but do not prove it.

Deductive Reasoning

Deductive reasoning involves inferring that a general "law of nature" exists and testing specific hypotheses against observations to prove or refute them. This approach is closely linked to refutationism, attributed to Karl Popper.

Popper argued that scientists should not collect data to prove a hypothesis but rather should attempt to disprove it. Only by disproving hypotheses can we make scientific progress. This is why statistical analyses typically form hypotheses in the null (no association) and then attempt to refute them.

The key benefit: it helps narrow the scope of studies. We carefully review what is known and formulate a few specific, testable hypotheses rather than casting a wide net with hundreds of variables.
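The refutationist habit of stating a null hypothesis and trying to reject it can be sketched in a few lines of R. The 2×2 counts below are invented purely for illustration, and the chi-square test is only one of several tests that could be used here.

```r
# Hypothetical 2x2 table (counts invented for illustration only).
tab <- matrix(c(30, 70,
                15, 85),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome  = c("disease", "no_disease")))

# Risk of disease in each exposure group, and the risk ratio.
risk <- tab[, "disease"] / rowSums(tab)
risk_ratio <- unname(risk["exposed"] / risk["unexposed"])

# Null hypothesis: exposure and outcome are independent (no association).
test <- chisq.test(tab, correct = FALSE)

risk_ratio      # observed contrast between the groups
test$statistic  # chi-square statistic under the null
test$p.value    # small values count as evidence against the null
```

A small p-value lets us reject the null of no association; it does not, on its own, show that the association is causal.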

Bayesian Thinking & Scientific Consensus

Thomas Bayes (1764) noted that all inference is based on the validity of our premises and that no inference can be known with certainty. The information we have before making observations influences our interpretation of those observations. This gave rise to Bayesian analysis, which formally incorporates prior knowledge and updates it with new data.
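A minimal sketch of Bayesian updating, assuming a Beta prior for a disease prevalence and binomial data. The prior parameters and the study counts are hypothetical, chosen only to show how the posterior lands between prior belief and new data.

```r
# Prior belief about a prevalence, expressed as Beta(a, b).
# Beta(2, 8) has mean 0.20; these numbers are illustrative assumptions.
prior_a <- 2
prior_b <- 8

# Hypothetical new study: 12 cases among 40 people sampled.
cases <- 12
n     <- 40

# Conjugate update: add cases to a and non-cases to b.
post_a <- prior_a + cases
post_b <- prior_b + (n - cases)

prior_mean <- prior_a / (prior_a + prior_b)
data_prop  <- cases / n
post_mean  <- post_a / (post_a + post_b)

# 95% credible interval from the posterior Beta distribution.
cred_int <- qbeta(c(0.025, 0.975), post_a, post_b)

c(prior = prior_mean, data = data_prop, posterior = post_mean)
cred_int
```

The posterior mean sits between the prior mean and the observed proportion, pulled toward the data because the study carries more information than this weak prior.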

Thomas Kuhn reminded us that although a single observation can disprove a hypothesis, the observation might be anomalous. Scientific communities therefore rely on consensus — paradigm shifts — when weighing the usefulness of theories, even if they cannot prove absolute truth.

Inductive, deductive, and Bayesian reasoning are abstract. The next subsection makes them concrete by walking through the components of an actual study and showing where each form of reasoning enters.

Key Components of Epidemiologic Research

The overall structure of an epidemiologic study involves several interrelated components, each of which must be carefully managed to produce valid results. Read Figure 1.1 below as a roadmap for the rest of HSCI 341 — every box in the diagram corresponds to a topic we'll cover in detail later (sampling in Lesson 3, exposure measurement and questionnaires in Lesson 4, confounding throughout the course, and so on).

[Figure 1.1 diagram. Boxes: Source Population; Sampling (selection bias?); Study Group; Exposure Variables; Outcome associations; Extraneous Variables (confounding & information bias); Analysis & Causal Inferences. Outcome types: continuous, dichotomous, nominal, count, time-to-event. Units: individuals, groups, areas.]

Figure 1.1 — Key components of epidemiologic research. Research starts from a source population, samples a study group, measures exposures and outcomes, accounts for extraneous variables (confounders and biases), and ultimately draws causal inferences.

The Central Goal

The rationale for epidemiologic research is to identify potential causal associations between exposures and outcomes. In many instances the exposures are potential risk factors and the outcome is a disease of interest. Ultimately, we aim to make causal inferences about these relationships in the source population as a basis for developing policy and prevention programs.

Directed Acyclic Graphs (DAGs)

The diagram in Figure 1.1 is informal — it shows boxes and arrows but does not commit to a precise causal meaning. A directed acyclic graph (DAG) is the formal version of that picture (Pearl, 1995; Greenland & Brumback, 2002). It is the working tool that modern epidemiologists use to write down their assumptions about how the world works before they touch the data, and to read off — mechanically — what those assumptions imply for analysis.

What is a DAG?

A DAG is a picture made of nodes (variables) connected by directed arrows (causal effects), with no cycles — no variable can cause itself, even by going around the long way. Each arrow is a claim: “A directly affects B, controlling for the rest of the graph.” The absence of an arrow is just as much of a claim — it asserts no direct causal effect.

The Building Blocks

Every DAG, no matter how large, is built from a small handful of structural pieces. Learning to recognise these by sight is the entire skill:

  • Chain: X → M → Y. M is a mediator. Don't adjust for M.
  • Fork: X ← C → Y. C is a confounder. Adjust for C.
  • Collider: X → Z ← Y. Z is a collider. Never condition on Z.
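The three adjustment rules for chains, forks, and colliders can be checked on simulated data. The sketch below uses base R only; all variable names, effect sizes, and the sample size are assumptions made for illustration, and linear regression stands in for "adjustment".

```r
set.seed(341)
n <- 20000

# Fork: C causes both X and Y; X has NO effect on Y.
c_var <- rnorm(n)
x_f   <- 0.8 * c_var + rnorm(n)
y_f   <- 0.8 * c_var + rnorm(n)
fork_crude    <- unname(coef(lm(y_f ~ x_f))["x_f"])          # biased away from 0
fork_adjusted <- unname(coef(lm(y_f ~ x_f + c_var))["x_f"])  # close to 0

# Chain: X -> M -> Y; the true total effect is 0.5 * 0.5 = 0.25.
x_c <- rnorm(n)
m   <- 0.5 * x_c + rnorm(n)
y_c <- 0.5 * m + rnorm(n)
chain_total    <- unname(coef(lm(y_c ~ x_c))["x_c"])      # recovers the total effect
chain_adjusted <- unname(coef(lm(y_c ~ x_c + m))["x_c"])  # adjusting for M removes it

# Collider: X and Y are independent, and both cause Z.
x_k <- rnorm(n)
y_k <- rnorm(n)
z   <- x_k + y_k + rnorm(n)
coll_crude    <- unname(coef(lm(y_k ~ x_k))["x_k"])      # close to 0, as it should be
coll_adjusted <- unname(coef(lm(y_k ~ x_k + z))["x_k"])  # spurious negative association

round(c(fork_crude = fork_crude, fork_adjusted = fork_adjusted,
        chain_total = chain_total, chain_adjusted = chain_adjusted,
        coll_crude = coll_crude, coll_adjusted = coll_adjusted), 2)
```

In each pair, one of the two estimates matches the true effect built into the simulation and the other does not. Which one is correct depends entirely on the structure, not on the data.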

How to Build One

Drawing a DAG is a substantive exercise, not a statistical one. The arrows come from your subject-matter knowledge of the system, not from p-values. A workable recipe:

  1. Name the exposure (X) and outcome (Y). Put them on the page, with the exposure on the left.
  2. List every other variable that could plausibly affect either X or Y — demographic, biological, social, environmental. Include unmeasured variables (draw them with dashed nodes); they belong in the diagram even if you cannot put numbers on them.
  3. Draw an arrow from each variable to every variable it directly causes. Be ruthless about “direct” — if A → B only by going through C, you draw A → C and C → B, not A → B.
  4. Check for cycles. If A causes B and B causes A, you need to add a time index or break the loop with intermediate variables. DAGs forbid feedback loops.
  5. Read off the implications. Every “back-door” path from X to Y that does not pass through a collider must be blocked by adjustment. Mediators must be left alone if you want the total effect. Colliders must be left alone, full stop.
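Step 4 of the recipe (checking for cycles) can be automated. The helper below is a hypothetical base-R sketch, not part of any package: it repeatedly removes variables that have no incoming arrows, and if some variables can never be removed, the graph contains a feedback loop.

```r
# Returns TRUE if the directed graph described by `edges` has no directed cycle.
# `edges` is a data frame with character columns `from` and `to`.
is_acyclic <- function(edges) {
  nodes <- unique(c(edges$from, edges$to))
  remaining <- edges
  repeat {
    # Nodes with no incoming arrow among the remaining edges can be removed.
    no_parents <- setdiff(nodes, remaining$to)
    if (length(no_parents) == 0) break
    nodes <- setdiff(nodes, no_parents)
    remaining <- remaining[!(remaining$from %in% no_parents), , drop = FALSE]
  }
  # If every node was removed, there was no cycle.
  length(nodes) == 0
}

# The smoking -> CHD diagram with age as a common cause.
smoking_dag <- data.frame(
  from = c("age", "age", "smoking"),
  to   = c("smoking", "chd", "chd"),
  stringsAsFactors = FALSE
)

# A hypothetical feedback loop: stress <-> smoking.
feedback <- data.frame(
  from = c("stress", "smoking"),
  to   = c("smoking", "stress"),
  stringsAsFactors = FALSE
)

is_acyclic(smoking_dag)
is_acyclic(feedback)
```

The first call should report a valid DAG and the second should not; the usual repair for the second is to index the variables by time (stress at time 1, smoking at time 2, and so on).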

What DAGs Are For

A DAG plays four roles in an analysis. The first three are decided before you fit anything; the fourth is what makes it worth the trouble.

  • It identifies confounders. Anything on a back-door path is a candidate for adjustment. Anything not on a back-door path is not — even if it is statistically “significant.”
  • It flags variables you must not adjust for. Mediators (over-adjustment) and colliders (selection bias) ruin estimates if you control for them. The DAG tells you which is which.
  • It makes assumptions criticisable. A reviewer can disagree with an arrow. They can’t disagree with a regression equation in the same way.
  • It supports a transparent estimand. “The total effect of X on Y, adjusting for the back-door set {C1, C2}” is a precise target you can defend.

Mediation Analysis

A DAG tells you that a variable lies on the pathway from exposure to outcome. Mediation analysis is the next step: it puts numbers on how much of the exposure’s effect runs through the mediator versus around it.

The Question Mediation Asks

Given a chain X → M → Y, with X potentially also affecting Y directly (X → Y), how much of the total effect of X on Y is the indirect effect (through M) and how much is the direct effect (the part of X → Y that does not pass through M)?  Total = Direct + Indirect.

The Classical Baron & Kenny (1986) Approach

Baron and Kenny’s causal-steps procedure is the recipe most students meet first. It runs three regressions and checks four conditions:

  1. Step 1 — Total effect. Regress Y on X. The slope (call it c) must be significant. This is the total X → Y effect.
  2. Step 2 — X predicts M. Regress M on X. The slope (a) must be significant. If X does not move M, M cannot be a mediator.
  3. Step 3 — M predicts Y, controlling for X. Regress Y on both X and M. The coefficient on M (call it b) must be significant.
  4. Step 4 — Compare c to c′. The X coefficient in Step 3 (c′) is the direct effect. If c′ is much smaller than c, the gap (c − c′, equivalently a×b) is the indirect effect through M. If c′ is essentially zero, the mediation is “complete”; otherwise it is “partial.”
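The four steps above can be sketched with three calls to lm() on simulated data. The data-generating values (0.5, 0.4, 0.3) and all object names are assumptions chosen for illustration; with real data you would also inspect summary() of each fit for the significance checks.

```r
set.seed(341)
n <- 5000

# Simulated data with a known structure (all values are assumptions):
# X -> M with slope 0.5; M -> Y with slope 0.4; direct X -> Y with slope 0.3.
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.3 * x + 0.4 * m + rnorm(n)

fit_total    <- lm(y ~ x)      # Step 1: slope on x is c (total effect)
fit_mediator <- lm(m ~ x)      # Step 2: slope on x is a
fit_direct   <- lm(y ~ x + m)  # Step 3: slope on m is b, slope on x is c'

c_total  <- unname(coef(fit_total)["x"])
a        <- unname(coef(fit_mediator)["x"])
b        <- unname(coef(fit_direct)["m"])
c_direct <- unname(coef(fit_direct)["x"])

# Step 4: the indirect effect, computed as a * b.
indirect <- a * b

round(c(total = c_total, direct = c_direct, indirect = indirect), 3)

# For linear least-squares fits on the same rows, c - c' equals a * b exactly.
all.equal(c_total - c_direct, indirect)
```

The identity in the last line holds only for linear models fitted by ordinary least squares on the same observations; with logistic or other non-linear models the two quantities generally differ.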

Beyond Baron & Kenny

Baron & Kenny is intuitive but limited. It assumes linear models, no exposure-mediator interaction, and no unmeasured confounding of the M–Y relationship. Modern alternatives that you should be aware of:

  • Bootstrap confidence intervals for a×b (Preacher & Hayes), which replace the unreliable Sobel z-test.
  • Counterfactual / causal mediation (Imai, Pearl, VanderWeele) — defines the “natural direct effect” and “natural indirect effect” without requiring linearity, handles binary outcomes and exposure-mediator interactions, and is implemented in the mediation and CMAverse R packages.
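A percentile bootstrap interval for a×b needs nothing beyond base R. This is a minimal sketch on simulated data: the effect sizes, sample size, number of resamples, and the helper name indirect_effect() are all assumptions, and the simple percentile interval is used here rather than the bias-corrected variants that dedicated packages offer.

```r
set.seed(341)
n <- 500

# Simulated data (assumed structure): X -> M -> Y plus a direct X -> Y effect.
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.3 * x + 0.4 * m + rnorm(n)
dat <- data.frame(x, m, y)

# Indirect effect a * b estimated on one data set.
indirect_effect <- function(d) {
  a <- coef(lm(m ~ x, data = d))["x"]
  b <- coef(lm(y ~ x + m, data = d))["m"]
  unname(a * b)
}

# Resample rows with replacement and re-estimate a * b each time.
n_boot <- 1000
boot_ab <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)
  indirect_effect(dat[idx, ])
})

estimate <- indirect_effect(dat)
ci <- quantile(boot_ab, c(0.025, 0.975))  # percentile bootstrap interval

estimate
ci
```

If the interval excludes zero, the data are compatible with a non-zero indirect effect, provided the DAG-level assumptions (in particular, no unmeasured M–Y confounding) hold.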

DAGs vs. Mediation Analysis — Related, Not the Same

It is easy to conflate the two because both involve arrows and pathways, but they answer different questions:

  • A DAG is a qualitative tool. It encodes which variables cause which others and lets you read off, structurally, what should be adjusted for. It does not fit a model or estimate an effect.
  • Mediation analysis is a quantitative procedure. It estimates the size of a direct and indirect effect, given a model that has already been specified.
  • A DAG tells you whether mediation analysis is appropriate (is M really on the pathway? are there back-door paths between M and Y that need adjustment?), and which variables to put in the regressions. Mediation analysis tells you how much of the effect runs through the mediator. Running a Baron & Kenny analysis without the DAG-level thinking can give precise numbers for a misspecified pathway; drawing a DAG without follow-up estimation tells you the structure but not the magnitude.
  • Put differently: every credible mediation analysis sits on top of a DAG. Not every DAG implies a mediation analysis.

With this scaffolding in place, the R exercise below puts the DAG side of the story into code. We will return to mediation explicitly in HSCI 410, where you will fit the same kind of model in R.

R exercise: Encode a causal DAG in R with the dagitty package

What you'll do: write a tiny smoking → CHD DAG in R using the dagitty package, then ask it (a) which variables you must adjust for and (b) what every causal/back-door path between exposure and outcome looks like. What to take away: the DAG is no longer a sketch on paper — it's a queryable object, and identifying confounders becomes a function call. We'll use this same toolkit throughout 341 (whenever you design a study) and 410 (when you fit the regression).

Modern causal epidemiology turns the diagram above into a formal object you can query. The dagitty package in R lets you draw a DAG, then ask it which variables you must adjust for to estimate a causal effect — without trial and error.

# One-time install (skip if you have done it before):
# install.packages(c("dagitty", "ggdag"))

library(dagitty)
library(ggdag)

# Smoking -> CHD, with age as a confounder of both.
g <- dagitty("dag {
  smoking -> chd
  age -> smoking
  age -> chd
  smoking [exposure]
  chd     [outcome]
}")

# Which variables do we need to adjust for, and which paths are open?
adjustmentSets(g)        # minimal sufficient adjustment set(s)
paths(g, "smoking", "chd")   # list every path between exposure and outcome

# Tidy plot
ggdag(g, layout = "circle") + theme_dag()
Console output
{ age }                       # adjustment set: condition on age
$paths
[1] "smoking -> chd"          # the (open) causal path
[2] "smoking <- age -> chd"   # the (open) backdoor path -- block it

Why this matters. A DAG turns "I think age is a confounder" into a formal claim you can verify with code. adjustmentSets() tells you the minimum set of variables to control for; paths() lists every connection. We will use this same toolkit throughout 341 and 410.

R exercise: Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. What variable(s) did adjustmentSets(g) tell you to control for, and why does that make sense given the smoking → CHD diagram you encoded?

Model answer: adjustmentSets(g) returns {age}, the single confounder needed to identify the causal effect of smoking on CHD. This is the right answer because in the DAG you encoded, age has arrows into both smoking (older adults more likely to have started smoking decades ago) and CHD (independent risk factor). It satisfies all three Rothman/Greenland conditions and sits on the only back-door path from smoking to CHD. Once age is conditioned on, the only open path from smoking to CHD is the direct causal arrow.

2. paths(g, "smoking", "chd") returned two paths. Which one is the direct causal path and which one is a back-door path? How can you tell from the arrow directions?

Model answer: The direct causal path is smoking → chd: a single forward-pointing arrow, beginning at the exposure and ending at the outcome with no intermediate back-tracks. The back-door path is smoking ← age → chd: one arrow points against the flow from smoking, indicating that age is a common cause, not an intermediate step. The tell-tale signature is the direction of the first arrow at the exposure node: if it points into smoking, the path is back-door and must be blocked.

3. If you removed the age -> smoking arrow from the DAG, what would adjustmentSets(g) return next time, and what would that imply about the need to adjust for age?

Model answer: Removing the age → smoking arrow disconnects age from the exposure side of the DAG, so age is no longer a confounder. It remains a cause of CHD but not of smoking, making it a prognostic factor for the outcome rather than a back-door variable. adjustmentSets(g) would return an empty set { }, meaning no adjustment is necessary for identification. Note that adjusting for age might still be useful for precision (it explains variation in CHD), but it is no longer required for causal identification. This is the structural distinction between confounders and prognostic factors that the DAG framework makes explicit.

Key Takeaways

  • Inductive reasoning generalizes from observations; deductive reasoning tests specific hypotheses.
  • Bayesian thinking incorporates prior knowledge into the interpretation of new evidence.
  • Epidemiologic research involves defining a source population, sampling a study group, measuring exposures and outcomes, controlling for bias and confounding, and making causal inferences.
Knowledge Check — Section 2

1. Which philosopher argued that scientists should attempt to disprove rather than prove their hypotheses?

Karl Popper's refutationism holds that science progresses by disproving hypotheses, not by collecting data to prove them.

2. Bayesian analysis is best described as:

Bayesian analysis formally updates prior probability estimates based on new evidence.

3. Which of the following is a potential threat to validity when sampling from a source population?

Selection bias occurs when the study group is not representative of the source population, threatening the validity of inferences.


Section 3

Seeking Causes & Models of Causation

⏱ Estimated reading time: 15 minutes

Introduction and Overview

Section 2 gave us the inferential machinery and the DAG vocabulary. This section pushes deeper into what we actually mean when we draw an arrow on a DAG and call it a “causal effect.” Three classical models structure the discussion: the component-cause model (necessary, sufficient, component causes), causal complements (which explain why the same cause has different observed effects in different populations), and the causal-web model. The section closes with the population attributable fraction — a quantity that turns the abstract notion of cause into a number a public-health planner can use.

Learning Objectives

  • Define what constitutes a "cause" in epidemiology.
  • Explain the component-cause model including necessary, sufficient, and component causes.
  • Describe how causal complements affect the strength of association.
  • Understand the causal-web model and distinguish direct from indirect causes.

What Is a "Cause"?

For practical purposes in epidemiology, a cause is any factor that produces a change in the severity or frequency of an outcome. Some causes operate at the biological level within individuals (such as a specific microorganism), while others operate at the group or population level (such as lifestyle, nutrition, or weather).

Interactive story: Hill's Causation Game Show. An 11-scene walkthrough of the Bradford Hill (1965) viewpoints: strength, consistency, specificity, temporality (the only hard rule), gradient, plausibility, coherence, experiment, and analogy. The viewpoints are framed as a structured judgment, not a checklist.

Epidemiology deals with groups of individuals because the methods for determining causality require it. Researchers take a holistic approach, striving to study and measure every suspected causal factor for the outcome of interest — while recognizing that not every factor can be captured in a single study.

Pragmatic Focus

Epidemiologists prefer to identify causal factors that can be manipulated to prevent disease. But some non-manipulable factors (like genetic predisposition) may also be crucial for understanding disease patterns in populations.

The Component-Cause Model

This foundational model, developed by Rothman (1976) and elaborated by Rothman & Greenland (2005), is based on the concepts of necessary and sufficient causes. The accordion below defines all three (necessary, sufficient, and component) in turn. Click each one open in order — the definitions stack on top of one another and only make sense in sequence.

Necessary Cause

A necessary cause is one without which the disease cannot occur. The factor will always be present if the disease occurs. For example, Mycobacterium tuberculosis is a necessary cause of tuberculosis — you cannot develop TB without the bacterium being present.

Sufficient Cause

A sufficient cause is a set of conditions that, when present, will invariably produce the disease. In practice, very few single exposures are sufficient on their own. Instead, different groupings of factors combine to form sufficient causes.

Component Cause

A component cause is one of a number of factors that, in combination, constitutes a sufficient cause. The factors might be present at the same time or follow one another in a temporal chain. When there are a number of causal chains with one or more factors in common, we can conceptualize the web of causal chains as a causal web.

Example: Childhood Respiratory Disease (CRD)

Consider four risk factors for CRD: the bacterium Streptococcus pneumoniae (STREP), a virus (RSV), environmental stressors like cold weather, and other bacteria like Mycoplasma pneumoniae (MP). Different two-factor combinations of these can form sufficient causes:

Component Causes       Sufficient Cause I   Sufficient Cause II   Sufficient Cause III   Sufficient Cause IV
STREP                  +                    +
RSV                    +                                          +
Stressors                                   +                     +                      +
Other organism (MP)                                                                      +

Key Points from this Model

No single factor is a necessary cause of CRD (none appears in every sufficient cause). STREP is a component of 2 of the 4 sufficient causes. A child exposed to any complete combination will develop CRD. And critically, because the causal complements (the other factors in a sufficient cause) can vary in prevalence, the observed strength of association between an exposure like STREP and CRD can change even though the underlying causal mechanism has not changed.

Causal Complements and Strength of Association

A critical insight from the component-cause model is that the prevalence of causal complements — the other factors needed to complete a sufficient cause — directly affects the strength of association we observe between an exposure and outcome. Even when the causal mechanism stays the same, changes in the distribution of co-factors in the population can make the association appear stronger or weaker.

Worked Example: How Co-Factor Prevalence Matters

Imagine STREP requires RSV or Stressors as a co-factor to cause CRD. In Population A, where RSV prevalence is 30%, the risk ratio for STREP is 4.83. In Population B, where RSV prevalence rises to 70%, the risk ratio drops to 2.93 — even though the causal relationship between STREP and CRD has not changed at all.

The difference is due entirely to the change in the frequency of the co-factor RSV. This is why strength of association is not a fixed measure and is considered "population specific."
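
The two risk ratios in the worked example can be reproduced with a few lines of R. This is a minimal sketch under assumptions that the example itself does not state: the factors occur independently of one another, Stressors have a prevalence of 40% in both populations, and only the three sufficient causes built from STREP, RSV, and Stressors operate (the MP pathway is ignored). The function name rr_strep is hypothetical.

```r
# Risk ratio for STREP when CRD requires any two of STREP, RSV, and Stressors.
# Assumptions (not stated in the lesson): independent factors, Stressors prevalence 0.40.
rr_strep <- function(p_rsv, p_stress = 0.40) {
  # With STREP present, either RSV or Stressors completes a sufficient cause
  risk_strep    <- 1 - (1 - p_rsv) * (1 - p_stress)
  # Without STREP, only RSV together with Stressors is sufficient
  risk_no_strep <- p_rsv * p_stress
  risk_strep / risk_no_strep
}

rr_A <- rr_strep(p_rsv = 0.30)   # Population A
rr_B <- rr_strep(p_rsv = 0.70)   # Population B
round(c(Population_A = rr_A, Population_B = rr_B), 2)
```

Under these assumptions the sketch returns 4.83 and 2.93. Nothing about how STREP causes disease differs between the two calls; only the prevalence of the causal complement RSV changes.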

The component-cause model explains the abstract logic of multicausality. The causal-web model is the related diagrammatic tool for thinking about how those component causes interact with each other in the real world — and crucially, where you can intervene.

The Causal-Web Model

An alternative way to visualize how multiple factors combine to cause disease is the causal web, consisting of interconnected direct and indirect causal chains:

Direct (Proximal) Causes

A direct cause has no known intervening variable between it and the disease. Diagrammatically, the exposure is adjacent to the outcome. Examples often include specific microorganisms or toxins. However, in disease control, direct causes are not necessarily more valuable than indirect ones — many large-scale control efforts work by manipulating indirect rather than direct causes.

Indirect Causes

An indirect cause is one whose effects on the outcome are mediated through one or more intervening variables. For example, Stressors (cold weather) may make a child susceptible to STREP, RSV, and MP — so Stressors act as an indirect cause of CRD. Removing stress could reduce CRD even though stress itself is not a direct cause.

Implications of the Causal Web

The causal-web model complements the component-cause model but is not equivalent. It shows that we can control disease by preventing the action of direct causes (e.g., vaccination against RSV) or by removing indirect causes (e.g., reducing environmental stressors). The diagram also reveals gaps in our knowledge — apparent direct connections might actually reflect unmeasured intervening factors.

Proportion of Disease Explained

Using the concepts of necessary and sufficient causes, we can estimate the population attributable fraction (AFp): the proportion of disease in the population that is attributable to a given exposure. Every case arises from a sufficient cause with several components, and a component can belong to more than one sufficient cause, so the same case can be attributed to each of its component causes. Summed over all factors, the AFp values can therefore exceed 100%. This is not an error; it reflects the reality of multicausal disease.

The Prevention Paradox

Even when a factor has a high AFp (say being unvaccinated, with AFp = 50%), the benefit of removing it may appear modest at the individual level. If the disease frequency were 6% without vaccination, universal vaccination would reduce it to 3%. Because 94% of the vaccinated population would not have gotten the disease anyway, most people see no personal benefit, yet the 3-percentage-point reduction (a halving of disease) is still a major population-level achievement. Moreover, half of those who would have gotten sick will still get the disease despite being vaccinated. This creates a paradox: the average person may not perceive the benefit that the population-level data show.
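
The arithmetic behind the paradox is easy to lay out per 1,000 people. The sketch below uses only the two figures from the example (6% disease frequency and AFp = 50%); the variable names are illustrative.

```r
baseline_risk <- 0.06   # disease frequency with no vaccination (from the example)
afp           <- 0.50   # share of disease that universal vaccination removes

risk_after <- baseline_risk * (1 - afp)   # disease frequency under universal vaccination

per_1000 <- 1000 * c(
  never_ill_anyway = 1 - baseline_risk,          # healthy with or without the vaccine
  case_prevented   = baseline_risk - risk_after, # would have been ill, now protected
  still_ill        = risk_after                  # ill despite vaccination
)
per_1000

# How many people must be vaccinated to prevent one case
vaccinations_per_case_prevented <- 1 / (baseline_risk - risk_after)
```

Out of 1,000 vaccinated people, 940 would have stayed healthy regardless, 30 are protected, and 30 still become ill, which works out to roughly 33 vaccinations per case prevented.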

R Population attributable fraction (AFp) by hand

What you'll do: compute a 2×2 table-based risk ratio, then plug it into Levin's (1953) formula to get the population attributable fraction. What to take away: AFp is the bridge between an individual-level effect (the RR) and a population-level statement about how much disease would disappear if the exposure were eliminated. The same calculation will reappear in Lessons 5 and 7 when we work through measures of disease frequency and association.

The AFp answers: what fraction of disease in the population would disappear if the exposure were eliminated? Two equivalent formulas, both easy to compute in R.

# Suppose a 2x2 cross-tabulation from a population-based study:
#                Disease+   Disease-
#  Exposed         180       820
#  Unexposed        60      940

a <- 180; b <- 820          # exposed: a = cases, b = non-cases
c <-  60; d <- 940          # unexposed

risk_e <- a / (a + b)             # risk in exposed
risk_u <- c / (c + d)             # risk in unexposed
RR     <- risk_e / risk_u

p_exp  <- (a + b) / (a + b + c + d)   # prevalence of exposure

# AFp = p_e * (RR - 1) / (1 + p_e * (RR - 1))     (Levin, 1953)
AFp    <- p_exp * (RR - 1) / (1 + p_exp * (RR - 1))
round(c(RR = RR, AFp = AFp), 3)
Console output
   RR   AFp
3.000 0.500

Reading the result. RR = 3, and 50% of disease in this population is attributable to the exposure. Because component causes appear in multiple sufficient causes, the AFp values for different exposures can sum to more than 100%. That is not a math error; it reflects multicausal reality.
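
The text above mentions two equivalent formulas, but the snippet computes only Levin's. A common second form, assumed here to be the one intended, compares the overall risk in the population with the risk in the unexposed: AFp = [p(D) − p(D|E-)] / p(D). A minimal sketch with the same 2x2 counts:

```r
a <- 180; b <- 820   # exposed: cases, non-cases
c <-  60; d <- 940   # unexposed: cases, non-cases

risk_total <- (a + c) / (a + b + c + d)   # overall risk in the whole population
risk_u     <- c / (c + d)                 # risk in the unexposed

# Fraction of the overall risk that exceeds the unexposed (baseline) risk
AFp_alt <- (risk_total - risk_u) / risk_total

# Levin's formula, recomputed here for comparison
p_exp     <- (a + b) / (a + b + c + d)
RR        <- (a / (a + b)) / risk_u
AFp_levin <- p_exp * (RR - 1) / (1 + p_exp * (RR - 1))

all.equal(AFp_alt, AFp_levin)
```

Both routes give 0.5 for this table. The second form makes the interpretation explicit: the overall risk is 12%, the risk would be 6% if nobody were exposed, so half of the disease is in excess of the baseline.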

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console before answering.

1. What did RR equal in your console output, and how do you interpret that number in plain language?

Model answer: The calculation gives RR = 3.0: exposed individuals had three times the cumulative risk of the outcome compared with unexposed individuals. In plain language: out of every 100 people exposed, you would expect 18 to develop the outcome; out of every 100 unexposed, you would expect 6. The tripling is a strong association, and it can be read as the effect of the exposure only if there is no confounding.

2. AFp came out to 0.500. In one sentence, what does an AFp of 50% say about how much disease in this population is attributable to the exposure?

Model answer: AFp = 0.500 means that in this population 50% of cases of the outcome would be eliminated if the exposure were entirely removed. It is a population-level measure that combines the individual relative risk with how common the exposure is: a high-RR but rare exposure can produce a small AFp, and a modest-RR but ubiquitous exposure can produce a large AFp.

3. If the prevalence of exposure (p_exp) were cut in half, would AFp go up or down? Re-run the formula with p_exp / 2 to confirm.

Model answer: AFp would decrease. In the formula AFp = p_exp(RR−1) / [1 + p_exp(RR−1)], halving p_exp halves the numerator (from 1.0 to 0.5) but lowers the denominator only from 2.0 to 1.5, so AFp drops from 0.50 to about 0.33. This is the public-health pivot: an intervention's population impact depends as much on how many people are exposed as on how strong the per-person effect is.
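
To check answers like this one, it can help to wrap Levin's formula in a small function and try several inputs. The helper name afp and the last two scenarios are illustrative choices, not part of the lesson's script.

```r
# Levin's formula as a reusable function (afp is a hypothetical helper name)
afp <- function(p_exp, rr) p_exp * (rr - 1) / (1 + p_exp * (rr - 1))

afp_table   <- afp(0.50, 3)     # prevalence and RR from the 2x2 table above
afp_halved  <- afp(0.25, 3)     # same RR, exposure half as common
afp_rare    <- afp(0.01, 10)    # strong but rare exposure
afp_common  <- afp(0.80, 1.5)   # modest but very common exposure

round(c(table = afp_table, halved = afp_halved,
        rare = afp_rare, common = afp_common), 3)
```

The first two values are 0.5 and about 0.333, matching the answer above. The last two show the point made in question 2: an RR of 10 with 1% exposure prevalence gives an AFp of about 0.083, while an RR of 1.5 with 80% prevalence gives about 0.286.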

Key Takeaways

  • A cause in epidemiology is any factor that changes disease severity or frequency.
  • The component-cause model shows how different groupings of factors form sufficient causes, and why no single factor need be necessary for a disease.
  • The strength of association can vary between populations even when the underlying causal mechanism is unchanged, due to differences in the prevalence of causal complements.
  • The causal-web model distinguishes direct and indirect causes and guides study design and disease control strategies.
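  • The population attributable fractions for different factors can sum to more than 100%, because a single case can be attributed to every component of its sufficient cause and components are shared across sufficient causes. No single AFp can exceed 100%.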
Knowledge Check — Section 3

1. In the component-cause model, a "sufficient cause" is best described as:

A sufficient cause is formed by a specific combination of component causes that together guarantee disease occurrence.

2. Why can the strength of association between an exposure and disease change between populations?

Even when the causal relationship is unchanged, the distribution of other factors that complete sufficient causes can alter the observed strength of association.

3. An indirect cause of disease is one that:

Indirect causes are mediated through intervening variables. They can be very important for disease control — many large-scale control efforts target indirect rather than direct causes.

4. Why can the population attributable fractions for all risk factors of a disease sum to more than 100%?

Since components can appear in multiple sufficient causes, attributing disease to each cause separately leads to sums greater than 100%.

✦ Pass the knowledge check with 100% to continue

Section 4

The Counterfactual Concept

⏱ Estimated reading time: 12 minutes

Introduction and Overview

Section 3 walked through the classical tools for describing causation: the component-cause model (necessary, sufficient, and component causes), the causal web, and the population attributable fraction. Those tools are about the structure of causation. This section is about the modern logic of causal inference: specifically, the potential-outcomes framework, also called the counterfactual model. By the end of the section you'll see how the main analytic techniques you'll meet later in HSCI 341 and HSCI 410 (propensity scores, regression, difference-in-differences, mediation) are different ways of approximating the same unobservable quantity: what would have happened if the same person had not been exposed.

Learning Objectives

  • Define the potential-outcomes (counterfactual) model for causal inference.
  • Explain the fundamental problem of causal inference — why individual treatment effects cannot be directly observed.
  • Describe how the average treatment effect (ATE) provides a tractable, group-level alternative.
  • Describe how randomized experiments approximate the counterfactual ideal.
  • Understand the concept of confounding and exchangeability.
  • Recognize how counterfactual logic motivates the major analytic tools used in modern epidemiology — propensity score matching, regression, difference-in-differences, and mediation analysis.

What Is the Counterfactual?

The potential outcomes framework — also called the counterfactual model, and sometimes the Neyman-Rubin causal model after the statisticians who formalized it (Rubin, 1974) — is currently the most widely accepted conceptual basis for causal inference in epidemiology and across the modern health and social sciences. At its core, it asks a deceptively simple question: What would have happened to this same person if they had not been exposed?

For any individual i, the framework imagines two potential outcomes:

  • Y_i(1): the outcome we would observe if person i were exposed (or treated)
  • Y_i(0): the outcome we would observe if that same person were unexposed (or untreated)

The individual treatment effect is the difference between these two potential outcomes: Y_i(1) − Y_i(0). This is the quantity we would most like to know, but, as the next section makes clear, we can never observe both potential outcomes for the same person.

The Thought Experiment

Imagine you want to know if a vaccine protects against a disease. You observe a vaccinated person who develops the disease. If you could rewind time and observe the same person in the same period without vaccination, and they did NOT develop the disease, you would conclude the vaccine actually caused the disease in that individual. Conversely, if they still got the disease without the vaccine, the vaccine was not the cause. The same logic runs in the protective direction: if a vaccinated person stays healthy but would have developed the disease without the vaccine, the vaccine prevented the disease in that individual.

This counterfactual individual does not exist — you can never observe the same person under two different exposure levels simultaneously. But this is the ideal that our research methods try to approximate.

The Fundamental Problem of Causal Inference

Because each person is either exposed or unexposed — never both simultaneously — we can only ever observe one of their two potential outcomes. The other one is missing. Holland (1986) called this the fundamental problem of causal inference: individual causal effects are unobservable. No clever measurement, no improved technology, and no more careful study design can fully solve this problem at the level of a single person (Hernán, 2004).

The standard response is to give up on the individual treatment effect and instead estimate something we can learn from data: the average treatment effect (ATE) in a population or sample.

The Average Treatment Effect (ATE)

Rather than asking “what is the effect for this person?” we ask “what is the average effect across a group of similar people?” Formally:

ATE = E[Y(1)] − E[Y(0)]

That is, the average outcome if everyone were exposed, minus the average outcome if no one were exposed. Equivalently, in epidemiologic notation:

  • p(D_{E+}): the potential frequency of disease if all population members were exposed
  • p(D_{E-}): the potential frequency of disease if none were exposed

If these two quantities differ, we infer a causal effect in the population — even though we cannot pinpoint which specific individuals were affected.
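
A toy table makes the distinction between individual effects and the ATE concrete. The six people and all of their potential outcomes below are invented for illustration; in a real study only one of the two potential outcomes would ever be recorded for each person.

```r
# Hypothetical potential outcomes for six people (1 = disease, 0 = no disease)
po <- data.frame(
  id      = 1:6,
  y1      = c(1, 0, 1, 0, 1, 0),   # outcome if exposed, Y(1)
  y0      = c(0, 0, 1, 0, 0, 0),   # outcome if unexposed, Y(0)
  exposed = c(1, 0, 1, 0, 0, 1)    # exposure each person actually received
)

po$effect <- po$y1 - po$y0               # individual effects: never observable in practice
ate_true  <- mean(po$y1) - mean(po$y0)   # ATE = E[Y(1)] - E[Y(0)]

# What a study records: the potential outcome that matches the received exposure
po$y_obs <- ifelse(po$exposed == 1, po$y1, po$y0)
ate_obs  <- mean(po$y_obs[po$exposed == 1]) - mean(po$y_obs[po$exposed == 0])

c(ate_true = ate_true, ate_obs = ate_obs)
```

In this invented table the true ATE is 1/3, but the observed contrast between exposed and unexposed people is 2/3. The gap arises because the people who happened to be exposed are not a fair stand-in for the people who were not, which is the problem the rest of this section addresses.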

Why This Matters

The shift from individual effects to group-level average effects is the conceptual move that powers most modern causal research. Randomized trials, observational studies, policy evaluations, and program evaluations all ultimately rest on the same logic: build two groups (or build a counterfactual comparison) that we can plausibly treat as exchangeable, and compare their average outcomes. Almost every major analytic tool covered later in this course is, at heart, a different strategy for doing exactly this.

The Role of Randomization

In a perfect experiment, we would randomly assign subjects to exposed and unexposed groups. Randomization creates exchangeability (in expectation): the condition in which the same disease frequencies would have been observed had the groups' exposure assignments been swapped, because the two groups would have had the same disease frequency under the same exposure. This means any difference in outcomes beyond chance can be attributed to the exposure itself.

Why Randomization Works

When groups are exchangeable, comparing p(D|E+) and p(D|E-) gives us the closest possible estimate of the true counterfactual effect. In real trials, however, the two frequencies come from two different subsets of subjects, so the estimate is approximate. Random assignment balances known and unknown confounders between groups on average; chance imbalances can remain in any single trial, particularly a small one.
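
A short simulation can illustrate why coin-flip assignment recovers the average effect. Everything here is an assumption made for the sketch: the risk factor called frail, its 40% prevalence, the outcome risks, the sample size, and the seed.

```r
set.seed(341)   # arbitrary seed, used only to make the sketch reproducible
n <- 10000

frail <- rbinom(n, 1, 0.4)                  # hypothetical risk factor
y0 <- rbinom(n, 1, 0.10 + 0.20 * frail)     # potential outcome if unexposed
y1 <- rbinom(n, 1, 0.20 + 0.20 * frail)     # potential outcome if exposed
ate_true <- mean(y1) - mean(y0)             # knowable only because this is simulated

exposed <- rbinom(n, 1, 0.5)                # random assignment, independent of frailty
y_obs   <- ifelse(exposed == 1, y1, y0)     # only one potential outcome is recorded

ate_rct <- mean(y_obs[exposed == 1]) - mean(y_obs[exposed == 0])
balance <- tapply(frail, exposed, mean)     # share of frail people in each arm

round(c(ate_true = ate_true, ate_rct = ate_rct), 3)
round(balance, 3)
```

The two arms should contain a similar share of frail people, and the difference in observed risks should land close to the true average effect (about 0.10 by construction), even though no individual's effect was ever observed.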

Confounding: A Threat to Causal Inference

A confounder is a variable that is associated with both the exposure and the outcome and can distort the observed association between them (Greenland & Brumback, 2002). Consider a study of vaccination (E) and disease (D) where a third variable — say a pre-existing health condition (C) — independently predicts both who gets vaccinated and who gets the disease.

Confounding in Action

In Table 1.3 from the text, 20 subjects are studied. Looking at the raw data, p(D|E+) = 7/13 = 0.54 and p(D|E-) = 3/7 = 0.43, suggesting the exposure might increase disease risk. But when we stratify by the confounder C:

Among C+ subjects: p(D|E+) = 6/9 = 0.67 and p(D|E-) = 2/3 = 0.67
Among C- subjects: p(D|E+) = 1/4 = 0.25 and p(D|E-) = 1/4 = 0.25

Within each stratum, the exposure has NO effect on disease! The apparent association was entirely due to confounding by C. This is why controlling for confounders is essential in epidemiologic analysis.
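
The crude and stratified risks above can be recomputed from the four cell counts quoted in the example. The data frame layout and variable names below are illustrative.

```r
# Counts quoted in the example (20 subjects in total)
tab <- data.frame(
  C     = c("C+", "C+", "C-", "C-"),
  E     = c("E+", "E-", "E+", "E-"),
  cases = c(6, 2, 1, 1),
  n     = c(9, 3, 4, 4)
)

# Stratum-specific risks: p(D | E, C)
tab$risk <- tab$cases / tab$n

# Crude risks: collapse over C, giving p(D | E)
crude <- aggregate(cbind(cases, n) ~ E, data = tab, FUN = sum)
crude$risk <- crude$cases / crude$n

# Why the crude risks differ: exposure is more common where baseline risk is higher
exposed_share <- with(tab, tapply(n * (E == "E+"), C, sum) / tapply(n, C, sum))

tab
crude
exposed_share
```

Exposed subjects make up 75% of the C+ stratum (where risk is 0.67) but only 50% of the C- stratum (where risk is 0.25), so the exposed group inherits more of the high-risk stratum. That imbalance, not the exposure, produces the crude difference.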

Observational Studies and the Counterfactual

In observational studies, we cannot randomize. This means groups may not be exchangeable, and confounding is a major concern. Epidemiologists use several strategies to address this: restriction (limiting the study to one level of the confounder), matching, stratification, and multivariable statistical models. All of these aim to simulate the exchangeability that randomization would provide.

From Counterfactual Logic to the Modern Toolkit

Once you accept that causal inference is fundamentally about constructing a credible counterfactual comparison — an estimate of what would have happened in the absence of exposure — many of the methods you will encounter later in this course (and across modern epidemiology, biostatistics, health economics, and policy evaluation) start to look like variations on a single theme. Each is a different strategy for approximating the missing potential outcome (Vandenbroucke, Broadbent, & Pearce, 2016).

Below is a brief preview of four widely used tools. Each will be explored in much greater depth later in this series — here we are only flagging how each one is anchored in counterfactual logic.

Propensity Score Matching

A propensity score is the estimated probability of being exposed given a set of measured characteristics. Propensity score matching pairs each exposed person with one or more unexposed people who had a similar probability of being exposed. The matched unexposed group then serves as the counterfactual stand-in for the exposed group — mimicking what randomization would have done by balancing observed confounders. It is one of the most direct attempts to manufacture exchangeability from observational data.
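
The matching idea can be sketched in base R. This is an illustration under invented assumptions, not the procedure used later in the course: age is the only confounder, the exposure truly has no effect, and matching is 1:1 nearest-neighbour with replacement on the estimated score. Dedicated packages offer more careful implementations.

```r
set.seed(410)   # arbitrary seed so the sketch is reproducible
n   <- 20000
age <- rnorm(n, mean = 50, sd = 10)                 # hypothetical measured confounder
exposed <- rbinom(n, 1, plogis(-5 + 0.10 * age))    # older people are exposed more often
y       <- rbinom(n, 1, plogis(-4 + 0.06 * age))    # outcome depends on age only (null effect)

# Step 1: estimate each person's propensity score, p(exposed | age)
ps <- fitted(glm(exposed ~ age, family = binomial))

# Step 2: for each exposed person, find the unexposed person with the closest score
treated     <- which(exposed == 1)
controls    <- which(exposed == 0)
ps_controls <- ps[controls]
matched_control <- sapply(treated, function(i) {
  controls[which.min(abs(ps_controls - ps[i]))]
})

# Step 3: compare risks and confounder balance before and after matching
crude_diff      <- mean(y[treated]) - mean(y[controls])
matched_diff    <- mean(y[treated]) - mean(y[matched_control])
age_gap_crude   <- mean(age[treated]) - mean(age[controls])
age_gap_matched <- mean(age[treated]) - mean(age[matched_control])

round(c(crude_diff = crude_diff, matched_diff = matched_diff,
        age_gap_crude = age_gap_crude, age_gap_matched = age_gap_matched), 3)
```

Before matching, the exposed group is older and shows a higher risk even though the exposure does nothing. After matching, the age gap should nearly vanish and the risk difference should move toward zero. Matching can only balance what was measured, so an unmeasured confounder would remain a problem.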

Regression (Regression Adjustment)

Regression models (linear, logistic, Poisson, Cox, and others) estimate the average difference in outcome associated with exposure while statistically holding other variables constant. Conceptually, regression asks: among people who look the same on measured confounders, what is the average outcome difference between the exposed and the unexposed? When the model is correctly specified and all important confounders are measured, the regression coefficient on the exposure can be interpreted as an estimate of the causal effect on the model's own scale (for example, a mean or risk difference from a linear model, or a log odds ratio from a logistic model). Regression is the workhorse of HSCI 410 and underlies many of the more advanced techniques.
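
As a small illustration, the Table 1.3 counts from the confounding example earlier in this section can be fed to a logistic regression with and without the confounder C. The 0/1 coding and object names are choices made for this sketch, and 20 subjects are far too few for real inference; the point is only to show what adjustment does to the exposure coefficient.

```r
# Table 1.3 counts, one row per combination of exposure (E) and confounder (C)
tab <- data.frame(
  E        = c(1, 0, 1, 0),
  C        = c(1, 1, 0, 0),
  cases    = c(6, 2, 1, 1),
  noncases = c(3, 1, 3, 3)
)

# Grouped logistic regression: the response is cbind(successes, failures)
crude_fit    <- glm(cbind(cases, noncases) ~ E,     family = binomial, data = tab)
adjusted_fit <- glm(cbind(cases, noncases) ~ E + C, family = binomial, data = tab)

# Exponentiated exposure coefficients are odds ratios
or_crude    <- exp(coef(crude_fit)[["E"]])
or_adjusted <- exp(coef(adjusted_fit)[["E"]])

round(c(or_crude = or_crude, or_adjusted = or_adjusted), 2)
```

The crude odds ratio is about 1.56, while the odds ratio adjusted for C is 1 (up to numerical precision), mirroring the stratified analysis: once subjects are compared within levels of C, the exposure shows no association with disease.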

Difference-in-Differences (DiD)

Difference-in-differences is a quasi-experimental design used when an exposure (often a policy or program) is rolled out to one group but not another. Rather than comparing exposed and unexposed groups directly, it compares the change over time in the exposed group to the change over time in the unexposed group. The unexposed group's change serves as the counterfactual for what would have happened in the exposed group absent the intervention. This subtracts out stable group differences and shared time trends, isolating the effect of the exposure under the assumption of parallel trends.
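
The calculation itself is two subtractions. The rates below are invented for illustration (cases per 1,000 before and after a hypothetical program), and the helper name change is likewise hypothetical.

```r
# Hypothetical disease rates per 1,000 (illustrative numbers, not from the lesson)
rates <- data.frame(
  group  = c("exposed", "exposed", "unexposed", "unexposed"),
  period = c("before",  "after",   "before",    "after"),
  rate   = c(50, 40, 45, 42)
)

# Change over time within one group (after minus before)
change <- function(g) {
  with(rates, rate[group == g & period == "after"] -
              rate[group == g & period == "before"])
}

change_exposed   <- change("exposed")      # change in the group that got the program
change_unexposed <- change("unexposed")    # shared time trend (the counterfactual change)

did <- change_exposed - change_unexposed   # difference-in-differences estimate

# Implied counterfactual: where the exposed group would be without the program
counterfactual_after <- 50 + change_unexposed

c(change_exposed = change_exposed, change_unexposed = change_unexposed,
  did = did, counterfactual_after = counterfactual_after)
```

With these numbers the exposed group fell by 10 per 1,000 and the unexposed group by 3, so the estimate is 7 fewer cases per 1,000. The estimate is only as credible as the parallel-trends assumption: without the program, the exposed group is assumed to have fallen by the same 3 per 1,000.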

Mediation Analysis

Mediation analysis decomposes the total effect of an exposure on an outcome into a direct effect and one or more indirect effects operating through intermediate variables (mediators). In counterfactual terms, it asks: what would the outcome be if exposure were changed but the mediator were held at its unexposed value, versus if both were changed? This allows researchers to estimate not just whether an exposure matters, but through what pathways — an essential step for designing targeted interventions.

The Common Thread

Each of these tools — propensity score matching, regression, difference-in-differences, and mediation — is a different answer to the same question: How do we build a credible counterfactual when we cannot randomize? Keep this lens in mind throughout the rest of the course. The methods will look very different on the surface, but the underlying logic — comparing observed outcomes to a thoughtfully constructed estimate of what would have happened otherwise — remains the same.

Reflection

Think of a research question in your area of interest. What would the ideal counterfactual comparison look like? What confounders might distort the observed association, and how might you control for them?

Model answer: Pick a concrete question (e.g., does ultra-processed food intake increase risk of type-2 diabetes in adults aged 25–45?). The ideal counterfactual is the same individuals followed under two scenarios: actual diet and a forced-substitution diet matched on calories but with no ultra-processed items. Because that counterfactual is unobservable, you approximate it with an experimental or quasi-experimental contrast (RCT of dietary substitution, target-trial emulation on cohort data). Confounders: SES (correlates with food access and outcomes), physical activity, smoking, family history of diabetes, and structural factors (food environment, working hours). Control strategies: DAG-guided regression adjustment on the minimal sufficient adjustment set, propensity score weighting with active-comparator framing (initiators of low-NOVA dietary patterns vs. initiators of unchanged habits), instrumental-variable approaches using neighbourhood food-environment instruments, and sensitivity analyses for unmeasured confounding (E-values).

Key Takeaways

  • The potential-outcomes (counterfactual) model asks: what would have happened to the same individual under a different exposure level?
  • The fundamental problem of causal inference is that individual treatment effects are unobservable — we never see both Y(1) and Y(0) for the same person.
  • To make progress, we estimate the average treatment effect (ATE) at the group level rather than the individual effect. This shift from individual to group is the foundation of modern causal research.
  • Randomized experiments create exchangeability, allowing the ATE to be estimated by comparing average outcomes across groups.
  • Confounding occurs when a third variable distorts the exposure-outcome association; controlling for it is essential for valid causal inference.
  • Modern tools — propensity score matching, regression, difference-in-differences, and mediation analysis — are all strategies for constructing a credible counterfactual when randomization is not possible.
Knowledge Check — Section 4

1. The counterfactual concept asks:

The counterfactual concept is based on comparing what actually happened with what would have happened under an alternative exposure scenario for the same individuals.

2. Exchangeability in a randomized experiment means:

Exchangeability means that the groups are comparable in all ways except exposure status, so any outcome difference can be attributed to the exposure.

3. A confounder is a variable that:

Confounders are independently associated with both the exposure and outcome, and their presence can make a non-causal association appear causal (or hide a real one).

✦ Complete the reflection and pass the knowledge check with 100% to continue

Section 5

Lesson Review & Final Assessment

⏱ Estimated time: 15 minutes

Bringing It All Together

This lesson laid the conceptual foundation for everything that follows in HSCI 341. You moved from a working definition of epidemiology and a brief history of causal thinking into the formal logic of scientific inference, then into the language epidemiologists use to talk about causes — component causes, sufficient causes, causal webs — and finally into the counterfactual framework that underwrites modern causal inference.

The threads pulled together here are deliberately abstract because the rest of the course is not. Sampling, questionnaire design, measures of disease frequency, screening, measures of association, and study design choices all assume you already know why a research question is causal and what a confounder is. As you work through the final assessment, treat the takeaways below as the vocabulary the next eleven lessons will keep using.

Key Takeaways from Lesson 1

  • Epidemiology studies exposure–outcome associations in populations to identify modifiable causes of disease and inform prevention.
  • Causal thinking has evolved from Hippocratic environmental theories through miasma and germ theory to today's multifactorial models combining biological, behavioural, and social drivers.
  • Inductive, deductive, and Bayesian reasoning each play a distinct role in moving from observation to evidence; no single mode is sufficient on its own.
  • The component-cause and causal-web models explain why most disease has many causes, why "strength of association" is population-specific, and why necessary causes are rare.
  • The counterfactual (potential-outcomes) framework is the conceptual gold standard: causal effects are comparisons of what happened with what would have happened under a different exposure.
  • Because individual counterfactuals are unobservable, epidemiologists estimate average treatment effects using randomization, matching, regression, and related tools — all of which depend on controlling confounding.

Final Reflection

Think about a health issue you are interested in studying. Identify a potential exposure-outcome relationship and sketch out what a component-cause model might look like. What would be a direct cause versus an indirect cause? What confounders would you need to consider?

Model answer: Example: shift work and metabolic syndrome in adults 25–55. A component-cause (Rothman) model: a sufficient cause ‘A’ might combine shift work + low fibre + sleep debt + genetic predisposition to insulin resistance; a sufficient cause ‘B’ might combine high stress + low physical activity + smoking; both produce the outcome. Direct cause: shift work disrupts circadian glucose handling, the mechanism that connects exposure to outcome with no intermediate variable. Indirect causes: through reduced exercise (shift workers exercise less) or through diet quality (limited healthy food at night). Confounders to consider: age (older workers do different shifts and have different risk), pre-existing conditions, occupation (shift work clusters by sector with its own SES and exposure profile), and selection into shift work (people who tolerate it may have different baseline risk). The component-cause model is useful because it explains why interventions on a single component sometimes have surprisingly large or small effects: removing one component removes that sufficient cause, but other causal pathways may still produce the outcome.

Final Knowledge Assessment

Complete the following 15-question assessment. A score of 100% is required to complete the lesson. You may retake the assessment as many times as needed.

Final Assessment — 15 Questions

1. The primary goal of epidemiology is to:

Epidemiology seeks to identify causal associations to guide prevention and intervention at the population level.

2. John Snow's investigation of cholera demonstrated that:

Snow identified contaminated water as the transmission route before the cholera bacterium was discovered.

3. Karl Popper's philosophy of refutationism holds that:

Popper argued that hypotheses should be subjected to rigorous attempts at disproof rather than confirmation.

4. Bayesian analysis in epidemiology:

Bayesian analysis formally combines prior knowledge with new evidence to produce updated probability estimates.

5. In epidemiology, a "cause" is defined as:

A cause is broadly defined as any factor that changes the severity or frequency of the outcome.

6. A sufficient cause in the component-cause model is:

Sufficient causes are composed of multiple component causes that, when all present together, guarantee disease occurrence.

7. The strength of association between an exposure and disease can vary between populations because:

Even with an unchanged causal mechanism, different co-factor distributions affect the observed association strength.

8. An indirect cause of disease is one that:

Indirect causes are mediated through intervening variables. Many effective public health interventions target indirect causes.

9. The counterfactual model is based on comparing:

The counterfactual asks: what would have happened to these same individuals if their exposure status were different?

10. Exchangeability in a randomized trial means:

Exchangeability ensures that the groups are comparable in all ways except exposure, so differences in outcomes reflect the exposure effect.

11. A confounder is a variable that:

Confounders independently predict both exposure and outcome, and can distort the association in either direction.

12. Selection bias occurs when:

Selection bias threatens validity when the method of selecting participants makes the study group non-representative of the source population.

13. The population attributable fractions (AFp) for all risk factors of a disease can sum to more than 100% because:

Since the same component can be part of several sufficient causes, attributing disease to each cause independently results in overlap.

14. The prevention paradox refers to the fact that:

Population-level benefits (e.g., reducing disease from 6% to 3%) may go unnoticed by most individuals who would not have gotten the disease anyway.

15. Which statement best reflects the overall message of this lesson?

The lesson emphasizes multifactorial causation, the importance of causal models, and the counterfactual framework for making valid causal inferences from both experimental and observational studies.

✦ Complete the final reflection above before submitting