Introduction & Causal Concepts

Fundamental Epidemiological Concepts and Approaches

Learning objectives for this lesson:

Trace the history of causal thinking in epidemiology
Understand component-cause and causal-web models
Describe the potential-outcomes (counterfactual) framework for estimating causal effects
Explain why individual effects are unobservable and how average treatment effects fill the gap
Recognize how counterfactual logic underpins propensity score matching, regression, difference-in-differences, and mediation analysis
Explain how observational studies and experiments seek causal evidence
Distinguish inductive and deductive reasoning in science
Identify the key components of epidemiologic research
Read and build a directed acyclic graph (DAG), and recognise the role of chains, forks, and colliders
Distinguish DAG-based causal reasoning from quantitative mediation analysis (Baron & Kenny)
Apply causal criteria to evaluate associations

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Cause An event, condition, or characteristic that precedes a disease event and without which the disease event either would not have occurred at all or would not have occurred until some later time.

Effect The change in an outcome that is attributable to a cause, typically defined as a contrast between potential outcomes under different exposure conditions.

Counterfactual A “what would have happened” outcome: the outcome the same individual would have experienced under a different (unobserved) exposure condition. Forms the conceptual basis for causal inference.

Potential Outcomes Framework Formal causal inference framework (Neyman-Rubin) that defines the causal effect for an individual as the difference between the outcome they would have under treatment and the outcome they would have under control. Only one of the two is ever observed for a given individual; this is the “fundamental problem of causal inference.”

Average Treatment Effect (ATE) The expected difference in potential outcomes across the population: E[Y(1) − Y(0)]. Identifiable from data when exchangeability, positivity, and consistency hold.

Exchangeability The condition that exposed and unexposed groups would have had the same risk of the outcome had they swapped exposures (i.e., no unmeasured confounding). Achieved by design via randomization or by adjustment in observational studies.

Sufficient Cause A complete causal mechanism: a minimal set of conditions and events that inevitably produces the disease. Rothman’s “causal pies” depict each sufficient cause as a pie made of component causes.

Component Cause Any one of the factors that, together with others, makes up a sufficient cause. Each “slice” in a Rothman pie is a component cause.

Necessary Cause A component that must be present in every sufficient cause for the disease to occur. Example: Mycobacterium tuberculosis is necessary for tuberculosis.

Causal Pie Model Rothman’s schematic showing each sufficient cause as a pie composed of component causes. Disease can have multiple sufficient causes (multiple pies), and removing any single component breaks that pie.

Web of Causation MacMahon and Pugh’s metaphor for the complex network of interconnected factors (biological, behavioural, environmental, and social) that contribute to disease.

Association vs. Causation Association is a statistical relationship between two variables; causation is a relationship in which changing one would change the other. All causal relationships imply association, but not all associations are causal; chance, bias, and confounding can produce non-causal associations.

Confounding Distortion of an exposure-outcome association by a third variable that is associated with the exposure and independently affects the outcome, without being on the causal pathway.

Directed Acyclic Graph (DAG) A graphical tool that encodes assumed causal relationships among variables using directed arrows, with no directed cycles. Used to identify confounders, mediators, and colliders for proper adjustment.

Collider A variable on which two or more arrows in a DAG converge. Adjusting for a collider opens a non-causal path and can introduce selection bias (collider-stratification bias).

Mediator A variable that lies on the causal pathway between exposure and outcome. Adjusting for a mediator removes part of the causal effect being estimated.

Induction vs. Deduction Induction reasons from specific observations to general principles; deduction reasons from general principles to specific predictions. Modern epidemiology blends both: hypotheses are deduced from theory and tested against inductively gathered data.

Bradford Hill Considerations Nine viewpoints (strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy) Hill proposed for judging whether an observed association is likely causal. Not a checklist; only temporality is a strict requirement.

Methods & Approaches

Propensity Score Matching A method that matches exposed and unexposed units on their estimated probability of being exposed (given covariates) to mimic the balance achieved by randomization.

Difference-in-Differences A quasi-experimental method that compares the change in outcome over time in a treated group to the change in an untreated control group. Under the parallel-trends assumption, this removes time-invariant confounding.

Mediation Analysis A statistical approach that decomposes a total effect into direct and indirect (mediated) components. The classical Baron & Kenny approach has been refined by counterfactual-based methods.

Key People

Sir Austin Bradford Hill (1897–1991) British epidemiologist and statistician who articulated the nine considerations for causal inference (1965) and co-led the British Doctors Study linking smoking to lung cancer.

Kenneth Rothman American epidemiologist who introduced the sufficient-component-cause (“causal pie”) model and authored foundational texts on modern epidemiology.

Judea Pearl Computer scientist whose work formalized causal inference using directed acyclic graphs and the do-calculus, bridging statistics, AI, and epidemiology.

Miguel Hernán Epidemiologist at Harvard whose work (with Robins) developed and popularized the potential-outcomes framework, target trial emulation, and modern causal-inference methods.

Sander Greenland Epidemiologist and biostatistician whose work clarified confounding, effect modification, and the use of DAGs in observational research.

Mervyn Susser (1921–2014) South African-American epidemiologist who advanced concepts of causal thinking in epidemiology and championed eco-epidemiology and social determinants frameworks.

Donald Rubin Statistician who developed the formal potential-outcomes (Rubin Causal Model) framework and propensity score methods.

Section 2

Scientific Inference & Key Research Components

⏱ Estimated reading time: 12 minutes

Section 2 of 4

Inductive and deductive reasoning, the structure of an epidemiologic study, directed acyclic graphs, and mediation.

Two modes of reasoning

Induction and deduction

Inductive (Bacon, Jenner)

From specific observations to general conclusions. Useful for discovery, but as Hume noted, repeated association does not prove causation.

Deductive (Popper)

From general hypothesis to specific test. Science progresses by attempting to disprove the null, not by accumulating confirmations.

Bayesian thinking

Prior knowledge and the weight of new evidence

Bayesian analysis formalizes how prior knowledge is updated by new observations. No single study is read in isolation.

All inference is based on the validity of our premises and no inference can be known with certainty. (Thomas Bayes, 1764)

Study structure

From source population to causal inference

DAGs

Directed acyclic graphs: formalizing your assumptions

A DAG encodes what causes what before you touch the data. Arrows represent direct causal effects; their absence asserts none.

Chain: X → M → Y (mediation)
Fork: X ← C → Y (confounding)
Collider: X → K ← Y (conditioning opens a spurious path)

Mediation

Putting numbers on DAG pathways

Total effect decomposition

\[ \color{#0B7B6B}{\text{Total Effect}} = \underbrace{\color{#C2410C}{c'}}_\text{Direct} + \underbrace{\color{#6D28D9}{a \times b}}_\text{Indirect} \]

Total Effect: effect of X on Y; c': direct effect (not through M); a×b: indirect effect through mediator M

Baron & Kenny (1986): three regressions, four conditions. Modern counterfactual mediation (Imai, Pearl, VanderWeele) handles binary outcomes and exposure-mediator interaction without requiring linearity.

Carry forward

What to take into the next section

Induction + deduction + Bayesian updating together move from observation to evidence.
A study's validity depends on how well every step of the research diagram is executed.
A DAG makes causal assumptions explicit, criticizable, and queryable before data analysis begins.

Introduction and Overview

An earlier section traced the discipline's history. This section moves from history to logic: how do epidemiologists actually reason from observed data to claims about cause and effect? The two forms of reasoning we cover here, induction and deduction, are the philosophical scaffolding underneath every study you'll design later in this course. We'll then map those modes of reasoning onto the concrete components of an epidemiologic study (Figure 1.1) and finish with directed acyclic graphs (DAGs), the modern formal tool for encoding causal assumptions before any data are touched.

Learning Objectives

Distinguish between inductive and deductive reasoning.
Explain the role of Bayesian thinking and scientific consensus in epidemiology.
Identify the key components of epidemiologic research design.

Why Scientific Inference Matters

Epidemiology relies primarily on observational studies because many health-related problems cannot be studied under controlled laboratory conditions. Ethical concerns, practical limitations, and the complexity of real-world relationships all demand that we study humans in their natural environments. Drawing valid inferences from these studies requires both inductive and deductive reasoning.

Two Forms of Reasoning

The three tabs below define induction and deduction and add Bayesian thinking, which formalises how prior knowledge enters into our interpretation of any new study. Click through each tab and watch for the unifying point: no single mode of reasoning gives you certainty; epidemiologic claims always rest on a combination of observation, hypothesis testing, and consensus.

Inductive Reasoning

Inductive reasoning involves making generalized inferences about causation based on repeated observations. You observe specific instances and draw broader conclusions.

Francis Bacon (1620) first presented inductive reasoning as a method of making generalizations from careful observations. Classic examples include Edward Jenner's observation that milkmaids who developed cowpox didn't get smallpox, which led to the development of the smallpox vaccine. John Stuart Mill's canons (1843) formalized rules for inductive inference and helped shape our concepts of necessary and sufficient causes.

However, as David Hume noted, "there is no logical force to inductive reasoning"; we cannot perceive a causal connection, only a series of events. Repeated observations may be consistent with causation but do not prove it.

Deductive Reasoning

Deductive reasoning involves inferring that a general "law of nature" exists and testing specific hypotheses against observations to prove or refute them. This approach is closely linked to refutationism, attributed to Karl Popper.

Popper argued that scientists should collect data to attempt to disprove a hypothesis rather than to prove it. Only by disproving hypotheses can we make scientific progress. This is why statistical analyses typically form hypotheses in the null (no association) and then attempt to refute them.

The key benefit: it helps narrow the scope of studies. We carefully review what is known and formulate a few specific, testable hypotheses rather than casting a wide net with hundreds of variables.

Bayesian Thinking & Scientific Consensus

Thomas Bayes (1764) noted that all inference is based on the validity of our premises and that no inference can be known with certainty. The information we have before making observations influences our interpretation of those observations. This gave rise to Bayesian analysis, which formally incorporates prior knowledge and updates it with new data.
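As a minimal sketch of what "updating prior knowledge with new data" can look like in practice, the base-R snippet below uses a conjugate Beta-binomial model for a disease prevalence. The prior parameters and the study counts are made-up illustrative values, not data from this lesson, and the Beta-binomial model is only one simple choice among many.

```r
# Illustrative Beta-binomial update (all numbers are hypothetical).
# Prior belief about a prevalence: Beta(2, 8), i.e. a prior mean of 0.20.
prior_a <- 2
prior_b <- 8

# New (hypothetical) study: 30 cases among 100 people sampled.
cases <- 30
n     <- 100

# Conjugate update: add the cases to a and the non-cases to b.
post_a <- prior_a + cases
post_b <- prior_b + (n - cases)

prior_mean <- prior_a / (prior_a + prior_b)
post_mean  <- post_a / (post_a + post_b)

# 95% credible interval from the posterior Beta distribution.
post_ci <- qbeta(c(0.025, 0.975), post_a, post_b)

cat("Prior mean:       ", round(prior_mean, 3), "\n")
cat("Sample proportion:", round(cases / n, 3), "\n")
cat("Posterior mean:   ", round(post_mean, 3), "\n")
cat("95% interval:     ", round(post_ci, 3), "\n")
```

The posterior mean lands between the prior mean and the sample proportion, which is the numerical version of the point above: the new study is not read in isolation, but it does shift what we believed beforehand.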

Thomas Kuhn reminded us that although a single observation can disprove a hypothesis, the observation might be anomalous. Scientific communities therefore rely on consensus, and on paradigm shifts, when weighing the usefulness of theories, even if they cannot prove absolute truth.

Inductive, deductive, and Bayesian reasoning are abstract. The next subsection makes them concrete by walking through the components of an actual study and showing where each form of reasoning enters.

Key Components of Epidemiologic Research

The overall structure of an epidemiologic study involves several interrelated components, each of which must be carefully managed to produce valid results. Read Figure 1.1 below as a roadmap for the rest of this course: every box in the diagram corresponds to a topic we'll cover in detail later (sampling in a later lesson, exposure measurement and questionnaires in a later lesson, confounding throughout the course, and so on).

Figure 1.1. Key components of epidemiologic research. Research starts from a source population, samples a study group, measures exposures and outcomes, accounts for extraneous variables (confounders and biases), and ultimately draws causal inferences.

The Central Goal

The rationale for epidemiologic research is to identify potential causal associations between exposures and outcomes. In many instances the exposures are potential risk factors and the outcome is a disease of interest. Ultimately, we aim to make causal inferences about these relationships in the source population as a basis for developing policy and prevention programs.

Directed Acyclic Graphs (DAGs)

The diagram in Figure 1.1 is informal: it shows boxes and arrows but does not commit to a precise causal meaning. A directed acyclic graph (DAG) is the formal version of that picture (Pearl, 1995; Greenland & Brumback, 2002). It is the working tool that modern epidemiologists use to write down their assumptions about how the world works before they touch the data, and to read off, mechanically, what those assumptions imply for analysis.

What is a DAG?

A DAG is a picture made of nodes (variables) connected by directed arrows (causal effects), with no cycles, so no variable can cause itself, even by going around the long way. Each arrow is a claim: “A directly affects B, controlling for the rest of the graph.” The absence of an arrow is just as much of a claim: it asserts no direct causal effect.

The Building Blocks

Every DAG, no matter how large, is built from a small handful of structural pieces. Learning to recognise these by sight is the entire skill:

Chain

M is a mediator

Don't adjust for M

Fork

C is a confounder

Adjust for C

Collider

Z is a collider

Never condition on Z
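To see why the three building blocks call for different adjustment decisions, here is a small base-R simulation sketch. Everything in it is an assumption made for illustration: the variables are simulated standard-normal noise, every causal effect is set to 1, and ordinary linear regression stands in for "adjustment."

```r
set.seed(1)
n <- 10000

# Fork: C causes both X and Y; X has NO effect on Y.
C  <- rnorm(n)
Xf <- C + rnorm(n)
Yf <- C + rnorm(n)
fork_crude    <- unname(coef(lm(Yf ~ Xf))["Xf"])      # confounded, about 0.5
fork_adjusted <- unname(coef(lm(Yf ~ Xf + C))["Xf"])  # about 0 once C is adjusted for

# Chain: X -> M -> Y; the whole effect of X on Y runs through M.
Xc <- rnorm(n)
M  <- Xc + rnorm(n)
Yc <- M + rnorm(n)
chain_crude    <- unname(coef(lm(Yc ~ Xc))["Xc"])      # total effect, about 1
chain_adjusted <- unname(coef(lm(Yc ~ Xc + M))["Xc"])  # about 0: adjusting for M removes it

# Collider: X and Y are independent causes of Z.
Xk <- rnorm(n)
Yk <- rnorm(n)
Z  <- Xk + Yk + rnorm(n)
coll_crude    <- unname(coef(lm(Yk ~ Xk))["Xk"])      # about 0 (correct)
coll_adjusted <- unname(coef(lm(Yk ~ Xk + Z))["Xk"])  # spurious, about -0.5

results <- data.frame(
  structure = c("fork", "chain", "collider"),
  crude     = round(c(fork_crude, chain_crude, coll_crude), 2),
  adjusted  = round(c(fork_adjusted, chain_adjusted, coll_adjusted), 2)
)
print(results)
```

In this simulated setting, adjustment repairs the estimate only for the fork. For the chain it removes the effect you were trying to measure, and for the collider it creates an association that does not exist.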

How to Build One

Drawing a DAG is a substantive exercise, not a statistical one. The arrows come from your subject-matter knowledge of the system, not from p-values. A workable recipe:

Name the exposure (X) and outcome (Y). Put them on the page, with the exposure on the left.
List every other variable that could plausibly affect either X or Y, whether demographic, biological, social, or environmental. Include unmeasured variables (draw them with dashed nodes); they belong in the diagram even if you cannot put numbers on them.
Draw an arrow from each variable to every variable it directly causes. Be strict about “direct”: if A → B only by going through C, you draw A → C and C → B, not A → B.
Check for cycles. If A causes B and B causes A, you need to add a time index or break the loop with intermediate variables. DAGs forbid feedback loops.
Read off the implications. Every “back-door” path from X to Y that does not pass through a collider must be blocked by adjustment. Mediators must be left alone if you want the total effect. Colliders must be left alone, full stop.
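The "check for cycles" step in the recipe above can be sketched in a few lines of base R. The function name has_cycle, the edge-list layout, and the stress/smoking example variables are all hypothetical choices made for this illustration; the idea is simply to peel away variables that receive no arrows until either nothing is left (acyclic) or nothing can be peeled (a loop).

```r
# Returns TRUE if the directed graph described by `edges` contains a cycle.
# `edges` is a two-column character matrix: column 1 = cause, column 2 = effect.
has_cycle <- function(edges) {
  remaining <- edges
  while (nrow(remaining) > 0) {
    # "Roots": nodes that still send arrows out but receive none.
    roots <- setdiff(remaining[, 1], remaining[, 2])
    if (length(roots) == 0) return(TRUE)   # no root left, so the arrows loop back
    # Remove every arrow that leaves a root, then look again.
    remaining <- remaining[!(remaining[, 1] %in% roots), , drop = FALSE]
  }
  FALSE                                    # all arrows peeled away: acyclic
}

# A valid DAG: age -> smoking, age -> chd, smoking -> chd.
dag_ok <- rbind(c("age", "smoking"), c("age", "chd"), c("smoking", "chd"))

# A feedback loop (hypothetical): stress -> smoking -> stress.
feedback <- rbind(c("stress", "smoking"), c("smoking", "stress"))

# The same loop broken with a time index.
time_indexed <- rbind(c("stress_t1", "smoking_t2"), c("smoking_t2", "stress_t3"))

has_cycle(dag_ok)        # FALSE
has_cycle(feedback)      # TRUE
has_cycle(time_indexed)  # FALSE
```

The third example shows the fix suggested in the recipe: once each measurement carries a time index, earlier stress can affect later smoking and later smoking can affect still-later stress without any variable causing itself.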

What DAGs Are For

A DAG plays four roles in an analysis. The first three are decided before you fit anything; the fourth is what makes it worth the trouble.

It identifies confounders. Anything on an open back-door path is a candidate for adjustment. Anything not on a back-door path is not, even if it is statistically “significant.”
It flags variables you must not adjust for. Mediators (over-adjustment) and colliders (selection bias) ruin estimates if you control for them. The DAG tells you which is which.
It makes assumptions criticisable. A reviewer can disagree with an arrow. They can’t disagree with a regression equation in the same way.
It supports a transparent estimand. “The total effect of X on Y, adjusting for the back-door set {C₁, C₂}” is a precise target you can defend.

Mediation Analysis

A DAG tells you that a variable lies on the pathway from exposure to outcome. Mediation analysis is the next step: it puts numbers on how much of the exposure’s effect runs through the mediator versus around it.

The Question Mediation Asks

Given a chain X → M → Y, with X potentially also affecting Y directly (X → Y), how much of the total effect of X on Y is the indirect effect (through M) and how much is the direct effect (the part of X → Y that does not pass through M)? Total = Direct + Indirect.

The Classical Baron & Kenny (1986) Approach

Baron and Kenny’s causal-steps procedure is the recipe most students meet first. It runs three regressions and checks four conditions:

Step 1. Total effect. Regress Y on X. The slope (call it c) must be significant. This is the total X → Y effect.
Step 2. X predicts M. Regress M on X. The slope (a) must be significant. If X does not move M, M cannot be a mediator.
Step 3. M predicts Y, controlling for X. Regress Y on both X and M. The coefficient on M (call it b) must be significant.
Step 4. Compare c to c′. The X coefficient in Step 3 (c′) is the direct effect. If c′ is much smaller than c, the gap (c − c′, equivalently a×b) is the indirect effect through M. If c′ is essentially zero, the mediation is “complete”; otherwise it is “partial.”
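The four steps can be sketched in base R on simulated data. The data-generating values below (a = 0.5, b = 0.4, c′ = 0.3) are assumptions chosen for illustration, as are the variable names; with real data you would substitute your own exposure, mediator, and outcome, and you would still need the DAG-level reasoning to justify the model.

```r
set.seed(42)
n <- 2000

# Simulated linear system (assumed true values: a = 0.5, b = 0.4, c' = 0.3).
X <- rnorm(n)
M <- 0.5 * X + rnorm(n)
Y <- 0.3 * X + 0.4 * M + rnorm(n)

fit_total    <- lm(Y ~ X)      # Step 1: total effect c
fit_mediator <- lm(M ~ X)      # Step 2: path a
fit_outcome  <- lm(Y ~ X + M)  # Step 3: path b, and the direct effect c'

c_total  <- unname(coef(fit_total)["X"])
a        <- unname(coef(fit_mediator)["X"])
b        <- unname(coef(fit_outcome)["M"])
c_direct <- unname(coef(fit_outcome)["X"])
indirect <- a * b

# p-values for the three "must be significant" checks.
p_total <- summary(fit_total)$coefficients["X", "Pr(>|t|)"]
p_a     <- summary(fit_mediator)$coefficients["X", "Pr(>|t|)"]
p_b     <- summary(fit_outcome)$coefficients["M", "Pr(>|t|)"]

# Step 4: compare c with c'; the gap equals a * b in linear OLS models.
cat("c  (total):       ", round(c_total, 3), " p =", signif(p_total, 2), "\n")
cat("a  (X -> M):      ", round(a, 3), " p =", signif(p_a, 2), "\n")
cat("b  (M -> Y | X):  ", round(b, 3), " p =", signif(p_b, 2), "\n")
cat("c' (direct):      ", round(c_direct, 3), "\n")
cat("a*b (indirect):   ", round(indirect, 3), "\n")
cat("c - c':           ", round(c_total - c_direct, 3), "\n")
```

When all three regressions are ordinary least squares fits on the same rows, c − c′ and a×b agree exactly, not just approximately. That equality can break down with logistic models or when rows with missing values are dropped differently across the three fits.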

Beyond Baron & Kenny

Baron & Kenny is intuitive but limited. It assumes linear models, no exposure-mediator interaction, and no unmeasured confounding of the M–Y relationship. Modern alternatives that you should be aware of:

Bootstrap confidence intervals for a×b (Preacher & Hayes), which replace the unreliable Sobel z-test.
Counterfactual / causal mediation (Imai, Pearl, VanderWeele), which defines the “natural direct effect” and “natural indirect effect” without requiring linearity, handles binary outcomes and exposure-mediator interactions, and is implemented in the mediation and CMAverse R packages.
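As a rough sketch of the first alternative, the snippet below builds a simple percentile bootstrap interval for a×b in base R. It is not the Preacher & Hayes macro and not the mediation package; the simulated data, the helper name indirect_effect, and the choice of 1000 resamples are assumptions made for illustration.

```r
set.seed(7)
n <- 500

# Simulated data (assumed true indirect effect: 0.5 * 0.4 = 0.2).
X <- rnorm(n)
M <- 0.5 * X + rnorm(n)
Y <- 0.3 * X + 0.4 * M + rnorm(n)
dat <- data.frame(X, M, Y)

# Hypothetical helper: estimate a * b from one data set.
indirect_effect <- function(d) {
  a <- coef(lm(M ~ X, data = d))["X"]
  b <- coef(lm(Y ~ X + M, data = d))["M"]
  unname(a * b)
}

# Resample rows with replacement and re-estimate a * b each time.
n_boot  <- 1000
boot_ab <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)
  indirect_effect(dat[idx, ])
})

point_est <- indirect_effect(dat)
ci <- quantile(boot_ab, c(0.025, 0.975))

cat("Indirect effect a*b:", round(point_est, 3), "\n")
cat("95% percentile bootstrap CI:", round(ci, 3), "\n")
```

The interval comes straight from the resampled distribution of a×b, so it does not lean on the normal approximation that the Sobel test requires. An interval that excludes zero is the bootstrap analogue of a "significant" indirect effect.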

DAGs vs. Mediation Analysis: Related, Not the Same

It is easy to conflate the two because both involve arrows and pathways, but they answer different questions:

A DAG is a qualitative tool. It encodes which variables cause which others and lets you read off, structurally, what should be adjusted for. It does not fit a model or estimate an effect.
Mediation analysis is a quantitative procedure. It estimates the size of a direct and indirect effect, given a model that has already been specified.
A DAG tells you whether mediation analysis is appropriate (is M really on the pathway? are there back-door paths between M and Y that need adjustment?), and which variables to put in the regressions. Mediation analysis tells you how much of the effect runs through the mediator. Running a Baron-Kenny analysis without the DAG-level thinking can give precise numbers for a misspecified pathway; drawing a DAG without follow-up estimation tells you the structure but not the magnitude.
Put differently: every credible mediation analysis sits on top of a DAG. Not every DAG implies a mediation analysis.

With this scaffolding in place, the R exercise below puts the DAG side of the story into code. We will return to mediation explicitly in a later course, where you will fit the same kind of model in R.

Encode a causal DAG in R with the dagitty package

What you'll do: write a tiny smoking → CHD DAG in R using the dagitty package, then ask it (a) which variables you must adjust for and (b) what every causal/back-door path between exposure and outcome looks like. What to take away: the DAG is no longer a sketch on paper; it is a queryable object, and identifying confounders becomes a function call. We'll use this same toolkit throughout 341 (whenever you design a study) and 410 (when you fit the regression).

Modern causal epidemiology turns the diagram above into a formal object you can query. The dagitty package in R lets you draw a DAG, then ask it which variables you must adjust for to estimate a causal effect, without trial and error.

# One-time install (skip if you have done it before):
# install.packages(c("dagitty", "ggdag"))

library(dagitty)
library(ggdag)

# Smoking -> CHD, with age as a confounder of both.
g <- dagitty("dag {
  smoking -> chd
  age -> smoking
  age -> chd
  smoking [exposure]
  chd     [outcome]
}")

# Which variables do we need to adjust for, and which paths are open?
adjustmentSets(g)        # minimal sufficient adjustment set(s)
paths(g, "smoking", "chd")   # list every path between exposure and outcome

# Tidy plot
ggdag(g, layout = "circle") + theme_dag()

Console output

{ age }                        # adjustment set: condition on age
$paths
[1] "smoking -> chd"           # the (open) causal path
[2] "smoking <- age -> chd"    # the (open) backdoor path -- block it

Why this matters. A DAG turns "I think age is a confounder" into a formal claim you can verify with code. adjustmentSets() tells you the minimum set of variables to control for; paths() lists every connection. We will use this same toolkit throughout 341 and 410.

Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. What variable(s) did adjustmentSets(g) tell you to control for, and why does that make sense given the smoking → CHD diagram you encoded?

Model answer: adjustmentSets(g) returns { age }, the single confounder needed to identify the causal effect of smoking on CHD. This is the right answer because in the DAG you encoded, age has arrows into both smoking (older adults are more likely to have started smoking decades ago) and CHD (age is an independent risk factor). Age meets the three classical confounder criteria described by Rothman and Greenland (it is associated with the exposure, it independently affects the outcome, and it is not on the causal pathway), and it sits on the only back-door path from smoking to CHD. Once age is conditioned on, the only open path from smoking to CHD is the direct causal arrow.

2. paths(g, "smoking", "chd") returned two paths. Which one is the direct causal path and which one is a back-door path? How can you tell from the arrow directions?

Model answer: The direct causal path is smoking → chd: a single forward-pointing arrow that begins at the exposure and ends at the outcome. The back-door path is smoking ← age → chd: its first arrow points into smoking rather than out of it, which marks age as a common cause, not an intermediate step. The tell-tale signature is the direction of the arrow attached to the exposure node: if it points into smoking, the path is a back-door path and must be blocked.

3. If you removed the age -> smoking arrow from the DAG, what would adjustmentSets(g) return next time, and what would that imply about the need to adjust for age?

Model answer: Removing the age → smoking arrow disconnects age from the exposure side of the DAG, so age is no longer a confounder; it remains a cause of CHD but not of smoking, making it a pure prognostic factor (a cause of the outcome only) rather than a back-door variable. adjustmentSets(g) would return an empty set { }, meaning no adjustment is necessary for identification. Note that adjusting for age might still be useful for precision (it explains variation in CHD), but it is no longer required for causal identification. This is the structural distinction between confounders and prognostic factors that the DAG framework makes explicit.

Key Takeaways

Inductive reasoning generalizes from observations; deductive reasoning tests specific hypotheses.
Bayesian thinking incorporates prior knowledge into the interpretation of new evidence.
Epidemiologic research involves defining a source population, sampling a study group, measuring exposures and outcomes, controlling for bias and confounding, and making causal inferences.

Section 3

Seeking Causes & Models of Causation

⏱ Estimated reading time: 15 minutes

Section 3 of 4

Component-cause and causal-web models, the Bradford Hill viewpoints, and the population attributable fraction.

What is a cause?

Any factor that changes disease frequency or severity

Causes operate at different levels: biological (organism, gene, toxin), behavioural, and social or environmental. Not all need to be manipulable, but modifiable causes are most actionable for prevention.

Bradford Hill viewpoints (1965)

Strength · Consistency · Specificity · Temporality (the only hard rule) · Gradient · Plausibility · Coherence · Experiment · Analogy

Component-cause model

Rothman (1976): necessary, sufficient, component

Necessary cause

Must be present for disease to occur. M. tuberculosis for TB. Rare in chronic disease.

Sufficient cause

A complete set of conditions that invariably produce disease. Almost never a single factor alone.

Component cause

One factor within a sufficient cause. May appear in multiple sufficient-cause combinations.

Causal complements

Why strength of association is population-specific

Four risk factors for childhood respiratory disease form different sufficient-cause combinations:

STREP · RSV · Stressors · MP

When RSV prevalence rises from 30% to 70% in the population, the risk ratio for STREP drops from 4.83 to 2.93, even though the causal mechanism is unchanged.

Risk Ratio shifts with co-factor prevalence

\[ \color{#0B7B6B}{\text{RR}_{\text{STREP}}} = f\!\left(\color{#C2410C}{\Pr[\text{RSV}]}\right) \]

RR_STREP: risk ratio for the STREP exposure; Pr[RSV]: prevalence of the co-factor RSV

Same biology, different epidemiology.

Causal-web model

Direct and indirect causes in a web

Population attributable fraction

Levin's formula (1953)

Population Attributable Fraction

\[ \color{#0B7B6B}{AF_p} = \frac{\color{#C2410C}{p_e}(\color{#6D28D9}{RR} - 1)}{\color{#C2410C}{p_e}(\color{#6D28D9}{RR}-1) + 1} \]

Here AF_p is the population attributable fraction, p_e is the prevalence of the exposure, and RR is the risk ratio for the exposure. Summed over all risk factors of a disease, the AF_p values can exceed 100%, because a case produced by a sufficient cause with several components is attributable to every one of those components.
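Levin's formula is short enough to write as a one-line R function. The function name paf_levin and the example inputs (30% exposure prevalence, risk ratio of 2) are hypothetical values chosen for illustration.

```r
# Population attributable fraction from exposure prevalence and risk ratio.
paf_levin <- function(p_e, rr) {
  stopifnot(all(p_e >= 0), all(p_e <= 1), all(rr > 0))
  p_e * (rr - 1) / (p_e * (rr - 1) + 1)
}

# Hypothetical example: 30% of the population exposed, RR = 2.
paf_levin(0.30, 2)       # 0.3 / 1.3, about 0.231

# The same RR matters more when the exposure is common.
prevalences <- c(0.05, 0.30, 0.60)
round(paf_levin(prevalences, 2), 3)

# No excess risk (RR = 1) means nothing is attributable to the exposure.
paf_levin(0.30, 1)       # 0
```

Holding the risk ratio fixed and raising the exposure prevalence raises the attributable fraction, which is why a weak but common risk factor can account for more cases than a strong but rare one.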

Carry forward

What to take into the next section

Most disease has many component causes; necessary causes are rare.
Causal complements explain why the same exposure appears stronger or weaker across populations.
The population attributable fraction bridges individual-level risk ratios and population-level burden.
Managing this complexity requires the formal counterfactual logic of a later section.

Introduction and Overview

An earlier section gave us the inferential machinery and the DAG vocabulary. This section pushes deeper into what we actually mean when we draw an arrow on a DAG and call it a “causal effect.” Three classical models structure the discussion: the component-cause model (necessary, sufficient, component causes), causal complements (which explain why the same cause has different observed effects in different populations), and the causal-web model. The section closes with the population attributable fraction, a quantity that turns the abstract notion of cause into a number a public-health planner can use.

Learning Objectives

Define what constitutes a "cause" in epidemiology.
Explain the component-cause model including necessary, sufficient, and component causes.
Describe how causal complements affect the strength of association.
Understand the causal-web model and distinguish direct from indirect causes.

What Is a "Cause"?

For practical purposes in epidemiology, a cause is any factor that produces a change in the severity or frequency of an outcome. Some causes operate at the biological level within individuals (such as a specific microorganism), while others operate at the group or population level (such as lifestyle, nutrition, or weather).

▸ INTERACTIVE STORY: HILL'S CAUSATION GAME SHOW

An 11-scene game-show walkthrough of the Bradford Hill (1965) viewpoints: strength, consistency, specificity, temporality (the only hard rule), gradient, plausibility, coherence, experiment, and analogy, framed as a structured judgment rather than a checklist.

Epidemiology deals with groups of individuals because the methods for determining causality require it. Researchers take a holistic approach, striving to study and measure every suspected causal factor for the outcome of interest, while recognizing that not every factor can be captured in a single study.

Pragmatic Focus

Epidemiologists prefer to identify causal factors that can be manipulated to prevent disease. But some non-manipulable factors (like genetic predisposition) may also be important for understanding disease patterns in populations.

The Component-Cause Model

This foundational model, developed by Rothman (1976) and elaborated by Rothman & Greenland (2005), is based on the concepts of necessary and sufficient causes. The accordion below defines all three (necessary, sufficient, and component) in turn. Click each one open in order; the definitions stack on top of one another and only make sense in sequence.

Necessary Cause

A necessary cause is one without which the disease cannot occur. The factor will always be present if the disease occurs. For example, Mycobacterium tuberculosis is a necessary cause of tuberculosis, since you cannot develop TB without the bacterium being present.

Sufficient Cause

A sufficient cause is a set of conditions that, when present, will invariably produce the disease. In practice, very few single exposures are sufficient on their own. Instead, different groupings of factors combine to form sufficient causes.

Component Cause

A component cause is one of a number of factors that, in combination, constitutes a sufficient cause. The factors might be present at the same time or follow one another in a temporal chain. When there are a number of causal chains with one or more factors in common, we can conceptualize the web of causal chains as a causal web.

Example: Childhood Respiratory Disease (CRD)

Consider four risk factors for CRD: the bacterium Streptococcus pneumoniae (STREP), a virus (RSV), environmental stressors like cold weather, and other bacteria like Mycoplasma pneumoniae (MP). Different two-factor combinations of these can form sufficient causes:

Component Causes	Sufficient Cause I	Sufficient Cause II	Sufficient Cause III	Sufficient Cause IV
STREP	+	+
RSV	+		+
Stressors		+	+	+
Other organism (MP)				+

Key Points from this Model

No single factor is a necessary cause of CRD (none appears in every sufficient cause). STREP is a component of 2 of the 4 sufficient causes. A child exposed to any complete combination will develop CRD. And critically, because the causal complements (the other factors in a sufficient cause) can vary in prevalence, the observed strength of association between an exposure like STREP and CRD can change even though the underlying causal mechanism has not changed.
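The key points above can be checked mechanically. The sketch below transcribes the CRD table into a logical matrix and tests whether any factor appears in every sufficient cause. The object names (`suff`, `n_causes`, `necessary`) are illustrative and do not come from the lesson.

```r
# Rows = component causes, columns = sufficient causes.
# TRUE means the factor is part of that sufficient cause (transcribed from the CRD table).
suff <- matrix(
  c(TRUE,  TRUE,  FALSE, FALSE,   # STREP
    TRUE,  FALSE, TRUE,  FALSE,   # RSV
    FALSE, TRUE,  TRUE,  TRUE,    # Stressors
    FALSE, FALSE, FALSE, TRUE),   # Other organism (MP)
  nrow = 4, byrow = TRUE,
  dimnames = list(c("STREP", "RSV", "Stressors", "MP"),
                  c("I", "II", "III", "IV"))
)

n_causes  <- rowSums(suff)            # number of sufficient causes each factor belongs to
necessary <- n_causes == ncol(suff)   # a necessary cause would appear in all four

n_causes
any(necessary)
```

Because no row is TRUE in every column, `any(necessary)` is `FALSE`, which is the first key point in matrix form.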

Causal Complements and Strength of Association

A critical insight from the component-cause model is that the prevalence of causal complements, the other factors needed to complete a sufficient cause, directly affects the strength of association we observe between an exposure and outcome. Even when the causal mechanism stays the same, changes in the distribution of co-factors in the population can make the association appear stronger or weaker.

Worked Example: How Co-Factor Prevalence Matters

Imagine STREP requires RSV or Stressors as a co-factor to cause CRD. In Population A, where RSV prevalence is 30%, the risk ratio for STREP is 4.83. In Population B, where RSV prevalence rises to 70%, the risk ratio drops to 2.93, even though the causal relationship between STREP and CRD has not changed at all.

Here is the intuition for why the ratio shrinks. When RSV becomes more common, more children develop CRD through pathways that do not involve STREP at all (for example RSV together with stressors), which raises the disease risk in the STREP-unexposed group. Because the risk ratio divides the risk in the exposed by the risk in the unexposed, a larger unexposed risk pulls the ratio down toward 1, even though STREP itself is doing exactly what it always did.

The difference is due entirely to the change in the frequency of the co-factor RSV. This is why strength of association is not a fixed measure and is considered "population specific."
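The two risk ratios quoted above can be reproduced in a few lines of R under assumptions that the text does not state: only sufficient causes I to III are in play, stressor prevalence is 40%, and the co-factors are independent of STREP and of each other. These values are assumptions chosen because they reproduce the quoted figures; the function name `rr_strep` is hypothetical.

```r
# Assumptions (not stated in the lesson text): stressor prevalence = 0.40,
# RSV and stressors independent of each other and of STREP,
# and only sufficient causes I (STREP+RSV), II (STREP+Stressors), III (RSV+Stressors).
rr_strep <- function(p_rsv, p_stress = 0.40) {
  # STREP-exposed children get CRD if RSV or stressors (or both) are present
  risk_exposed <- 1 - (1 - p_rsv) * (1 - p_stress)
  # STREP-unexposed children get CRD only through RSV together with stressors
  risk_unexposed <- p_rsv * p_stress
  risk_exposed / risk_unexposed
}

round(c(pop_A = rr_strep(0.30), pop_B = rr_strep(0.70)), 2)
```

Only `p_rsv` differs between the two calls, so the drop in the risk ratio comes entirely from the larger unexposed risk (0.12 versus 0.28 under these assumptions).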

The component-cause model explains the abstract logic of multicausality. The causal-web model is the related diagrammatic tool for thinking about how those component causes interact with each other in the real world, and where you can intervene.

The Causal-Web Model

An alternative way to visualize how multiple factors combine to cause disease is the causal web, consisting of interconnected direct and indirect causal chains:

Direct (Proximal) Causes

A direct cause has no known intervening variable between it and the disease. Diagrammatically, the exposure is adjacent to the outcome. Examples often include specific microorganisms or toxins. However, in disease control, direct causes are not necessarily more valuable than indirect ones; many large-scale control efforts work by manipulating indirect rather than direct causes.

Indirect Causes

An indirect cause is one whose effects on the outcome are mediated through one or more intervening variables. For example, Stressors (cold weather) may make a child susceptible to STREP, RSV, and MP, so Stressors act as an indirect cause of CRD. Removing stress could reduce CRD even though stress itself is not a direct cause.
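One way to make the direct/indirect distinction concrete is to store the web as an adjacency matrix. The sketch below assumes a simplified web in which Stressors point to each organism and each organism points to CRD; that structure is an illustration based on the example above, not a diagram from the lesson, and it only looks for two-step paths.

```r
# Hypothetical causal web: Stressors -> {STREP, RSV, MP} -> CRD
nodes <- c("Stressors", "STREP", "RSV", "MP", "CRD")
web <- matrix(0, nrow = 5, ncol = 5, dimnames = list(nodes, nodes))  # rows = from, cols = to
web["Stressors", c("STREP", "RSV", "MP")] <- 1
web[c("STREP", "RSV", "MP"), "CRD"] <- 1

# Direct causes: an arrow straight into CRD
direct <- names(which(web[, "CRD"] == 1))

# Indirect causes: reach CRD through exactly one intervening variable
# (entries of web %*% web count two-step paths)
two_step <- web %*% web
indirect <- setdiff(names(which(two_step[, "CRD"] > 0)), direct)

direct
indirect
```

Under this assumed structure the three organisms are direct causes and Stressors is an indirect cause, acting through all three of them.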

Implications of the Causal Web

The causal-web model complements the component-cause model but is not equivalent to it. It shows that we can control disease by preventing the action of direct causes (e.g., vaccination against RSV) or by removing indirect causes (e.g., reducing environmental stressors). The diagram also reveals gaps in our knowledge: apparent direct connections might actually reflect unmeasured intervening factors.

Proportion of Disease Explained

Using the concepts of necessary and sufficient causes, we can estimate the population attributable fraction (AF_p), the proportion of disease in the population that is attributable to a given exposure. Because component causes can appear in multiple sufficient causes, the AF_p for all factors can sum to more than 100%. This is not an error; it reflects the reality of multicausal disease.
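To see why the fractions can sum past 100%, the sketch below takes sufficient causes I to III from the CRD example, assigns hypothetical prevalences to the three factors (0.5, 0.3, and 0.4, chosen only for illustration), treats them as independent, and removes one factor at a time. The helper `risk_crd` is an illustrative name, not part of the lesson.

```r
# Hypothetical, independent prevalences (assumptions for illustration only)
p <- c(STREP = 0.5, RSV = 0.3, Stressors = 0.4)

# Risk of CRD when any two of the three factors complete a sufficient cause:
# P(at least two present) = sr + st + rt - 2*srt under independence
risk_crd <- function(p) {
  s <- p[["STREP"]]; r <- p[["RSV"]]; t <- p[["Stressors"]]
  s * r + s * t + r * t - 2 * s * r * t
}

risk_all <- risk_crd(p)

# AF_p for each factor: share of cases that disappears if that factor alone is removed
af <- sapply(names(p), function(f) {
  p_removed <- p
  p_removed[[f]] <- 0
  (risk_all - risk_crd(p_removed)) / risk_all
})

round(af, 3)
round(sum(af), 3)
```

Each individual fraction stays below 1, but their sum does not, because every case arises from a pair of factors and removing either member of the pair prevents it.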

The Prevention Paradox

Even when a factor has a high AF_p (say, being unvaccinated has AF_p = 50%), the benefit at the individual level may appear modest. If disease frequency were 6% without vaccination, universal vaccination would reduce it to 3%. Although 94% of the vaccinated population would not have gotten the disease anyway, the 3-percentage-point reduction (a halving of cases) is still a major population-level achievement. However, half of those who would have gotten sick will still get the disease despite being vaccinated. This creates a paradox: the average person may not perceive the benefit that the population-level data show.
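The arithmetic behind the paradox is easier to see in counts than in percentages. The sketch below uses the 6% and 50% figures from the paragraph above and a hypothetical population of 1,000 people; the population size and variable names are assumptions for illustration.

```r
n           <- 1000   # hypothetical population size (assumption)
risk_before <- 0.06   # disease frequency without vaccination (from the text)
af_p        <- 0.50   # fraction of cases removed by universal vaccination (from the text)

cases_before <- n * risk_before
cases_after  <- cases_before * (1 - af_p)

paradox <- c(
  never_ill_anyway = n - cases_before,            # vaccinated but would have stayed well
  cases_prevented  = cases_before - cases_after,  # the people who actually benefit
  still_ill        = cases_after                  # ill despite vaccination
)
paradox
```

Out of every 1,000 people vaccinated, only the `cases_prevented` group experiences a different outcome, which is why the individual-level benefit can feel small even when the population-level effect is large.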

R Population attributable fraction (AF_p) by hand

What you'll do: compute a 2×2 table-based risk ratio, then plug it into Levin's (1953) formula to get the population attributable fraction. What to take away: AF_p is the bridge between an individual-level effect (the RR) and a population-level statement about how much disease would disappear if the exposure were eliminated. The same calculation will reappear in later lessons when we work through measures of disease frequency and association.

The AF_p answers: what fraction of disease in the population would disappear if the exposure were eliminated? Levin's formula, used below, is algebraically equivalent to (overall risk − risk in unexposed) / overall risk; both are easy to compute in R.

# Suppose a 2x2 cross-tabulation from a population-based study:
#                Disease+   Disease-
#  Exposed         180       820
#  Unexposed        60      940

a <- 180; b <- 820          # exposed: a = cases, b = non-cases
c <-  60; d <- 940          # unexposed

risk_e <- a / (a + b)             # risk in exposed
risk_u <- c / (c + d)             # risk in unexposed
RR     <- risk_e / risk_u

p_exp  <- (a + b) / (a + b + c + d)   # prevalence of exposure

# AFp = p_e * (RR - 1) / (1 + p_e * (RR - 1))     (Levin, 1953)
AFp    <- p_exp * (RR - 1) / (1 + p_exp * (RR - 1))
round(c(RR = RR, AFp = AFp), 3)

Console output

   RR   AFp
3.000 0.500

Reading the result. RR = 3, and 50% of disease in this population is attributable to the exposure. Because component causes appear in multiple sufficient causes, the AF_p values for different exposures can sum to more than 100%, which is not a math error but a feature of multicausal reality.
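As a cross-check, the same fraction can be obtained without the risk ratio, by comparing the overall risk in the population with the risk in the unexposed. The sketch below reuses the 2×2 counts from the example; `AFp_levin` and `AFp_alt` are illustrative names.

```r
# Same 2x2 counts as the worked example
a <- 180; b <- 820   # exposed: cases, non-cases
c <-  60; d <- 940   # unexposed: cases, non-cases

risk_e     <- a / (a + b)
risk_u     <- c / (c + d)
risk_total <- (a + c) / (a + b + c + d)   # overall risk in the population
p_exp      <- (a + b) / (a + b + c + d)
RR         <- risk_e / risk_u

AFp_levin <- p_exp * (RR - 1) / (1 + p_exp * (RR - 1))   # Levin's formula
AFp_alt   <- (risk_total - risk_u) / risk_total          # excess-risk form

all.equal(AFp_levin, AFp_alt)
```

The two expressions agree because the overall risk is a prevalence-weighted average of the exposed and unexposed risks.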

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console before answering.

1. What did RR equal in your console output, and how do you interpret that number in plain language?

Model answer: The console shows RR = 3.0: exposed individuals had three times the cumulative risk of the outcome compared with unexposed. In plain language, about 18 of every 100 exposed people developed the outcome (180 of 1,000), against about 6 of every 100 unexposed (60 of 1,000), and 18 divided by 6 is 3. A tripling of risk like this is a strong, clinically meaningful association.

2. AFp came out to 0.500. In one sentence, what does an AF_p of 50% say about how much disease in this population is attributable to the exposure?

Model answer: AFp = 0.500 means that in this population 50% of cases of the outcome would be eliminated if the exposure were entirely removed. It is the population-level ‘preventable fraction’ that combines the individual relative risk with how common the exposure is: a high-RR but rare exposure can produce a small AFp, and a modest-RR but ubiquitous exposure can produce a large AFp.

3. If the prevalence of exposure (p_exp) were cut in half, would AF_p go up or down? Re-run the formula with p_exp / 2 to confirm.

Model answer: AFp would decrease. In AFp = p_exp(RR−1) / [1 + p_exp(RR−1)], prevalence appears in both the numerator and the denominator: halving p_exp (0.50 to 0.25) halves the numerator (1.0 to 0.5) while the denominator falls only from 2.0 to 1.5, so AFp drops from 0.50 to about 0.33. This is the public-health pivot: an intervention's population impact depends as much on how many people are exposed as on how strong the per-person effect is.
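A quick way to confirm the answer is to wrap Levin's formula in a function and call it twice. The function name `levin_afp` is hypothetical; the inputs are the RR and exposure prevalence from the worked example.

```r
# Levin's formula as a reusable function (name is illustrative)
levin_afp <- function(p_exp, RR) {
  p_exp * (RR - 1) / (1 + p_exp * (RR - 1))
}

RR    <- 3     # from the worked example
p_exp <- 0.5   # 1,000 exposed out of 2,000

afp_check <- c(original = levin_afp(p_exp, RR),
               halved   = levin_afp(p_exp / 2, RR))
round(afp_check, 3)
```

Halving the prevalence lowers AFp by a third rather than by half, because the denominator shrinks along with the numerator.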

Key Takeaways

A cause in epidemiology is any factor that changes disease severity or frequency.
The component-cause model shows how different groupings of factors form sufficient causes, and why no single factor need be necessary for a disease.
The strength of association can vary between populations even when the underlying causal mechanism is unchanged, due to differences in the prevalence of causal complements.
The causal-web model distinguishes direct and indirect causes and guides study design and disease control strategies.
The population attributable fractions for different factors can sum to more than 100% because component causes are shared across multiple sufficient causes; no single AF_p can exceed 100%.

HSCI 341 · Lesson 1
