Confounding & Statistical Inference

Evaluating Epidemiological Research

Learning objectives for this lesson:

Explain how confounding distorts exposure–outcome associations and identify classic examples
Distinguish confounding by indication and time-varying confounding from conventional confounding
Recognize model misspecification and multicollinearity as threats to valid inference
Interpret Simpson’s paradox, ecological fallacy, atomistic fallacy, and the modifiable areal unit problem
Describe how missing data mechanisms (MCAR, MAR, MNAR) influence analytic validity
Critically evaluate whether epidemiological studies have adequately addressed confounding and inferential threats to validity

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Confounding & Causal Structure

Confounder: A variable that causes both the exposure and the outcome and is not on the causal pathway between them. Failing to adjust for a confounder distorts estimated effects.

Mediator: A variable on the causal pathway between exposure and outcome. Adjusting for it changes the question from a total effect to a direct effect, and risks overadjustment bias.

Collider: A variable caused by both the exposure and the outcome (or by their causes). Adjusting for a collider opens a non-causal path between them and creates spurious associations.

Confounding by Indication: A pharmacoepidemiology problem: patients are prescribed a treatment because of features of their condition that also predict outcome, so treated and untreated groups differ in baseline risk.

Time-Varying Confounding: Confounders whose values change over time and which are themselves affected by prior exposure. Standard regression cannot handle these properly; g-methods are required.

Residual Confounding: Confounding that remains after adjustment because the confounder was measured imprecisely, categorized too coarsely, or only partially captured.

Intersectionality: A framework (Crenshaw, 1989) for analyzing how systems of power based on race, gender, class, and other social positions combine to shape exposure and outcome. Pushes back on variable-by-variable adjustment as a complete causal strategy.

Adjustment & Statistical Tools

Stratification: Splitting the analysis into homogeneous subgroups defined by a confounder, then summarizing across strata. The conceptual ancestor of regression adjustment.

Standardization: A reweighting technique (direct or indirect) that produces a summary measure of effect or risk under a specified distribution of confounders. Common in age- and sex-adjusted rates.

Multivariable Regression: A modelling approach that estimates the association between exposure and outcome while holding other variables constant. Powerful but vulnerable to misspecification, multicollinearity, and adjusting for the wrong variables.

Model Misspecification: A model that does not match the true data-generating process (wrong functional form, missing interactions, or wrong link function), leading to biased estimates and misleading inferences.

Multicollinearity: High correlation among predictors that inflates standard errors and destabilizes coefficient estimates without (typically) biasing predictions.

Simpson’s Paradox: A reversal of an association when data are aggregated across a confounder. The classic warning that “summing up” subgroup data without accounting for structure can flip conclusions.

Ecological Fallacy: Mistakenly inferring individual-level relationships from group-level (ecological) data. The classic illustration: country-level wine consumption and heart disease.

Atomistic Fallacy: The mirror error: inferring group-level patterns from purely individual-level data, ignoring contextual effects.

Modifiable Areal Unit Problem (MAUP): In spatial analysis, the same data analyzed at different geographic scales or zonations can produce different (even opposite) results.

Missing Data Mechanisms: A taxonomy (Rubin) of how data come to be missing: MCAR (missing completely at random), MAR (missing at random conditional on observed variables), and MNAR (missing not at random). Each requires a different analytic strategy.

Statistical Inference

Null Hypothesis Significance Testing: A framework that asks how unusual observed data would be if the null hypothesis were true. Foundational but routinely misinterpreted, especially around p-values.

P-Value: The probability, assuming the null hypothesis is true, of observing data at least as extreme as those obtained. It is not the probability that the null is true, nor a measure of effect size.

Confidence Interval: A range of values consistent with the data under the assumed model. A 95% CI means that, in repeated sampling, 95% of similarly constructed intervals would contain the true parameter.

Type I Error (α): A “false positive”: rejecting the null hypothesis when it is in fact true. Conventionally controlled at 5%.

Type II Error (β): A “false negative”: failing to reject the null when it is false. Statistical power equals 1 − β.

Statistical Power: The probability of correctly rejecting the null when a specified alternative is true. Underpowered studies routinely miss real effects and exaggerate them when they do find them.

Statistical vs. Clinical Significance: A finding can be statistically significant (p < 0.05) yet too small to matter for patients or policy, and clinically important effects can be statistically nonsignificant when samples are small.

Key People

Judea Pearl (1936–): Computer scientist whose work on DAGs and the do-calculus formalized confounding, mediation, and adjustment as causal questions.

James Robins: Epidemiologist and biostatistician who developed g-methods (g-formula, marginal structural models, structural nested models) for time-varying confounding.

Miguel Hernán: Epidemiologist (Harvard) whose target-trial framework operationalizes adjustment for time-varying confounding in observational drug and policy studies.

Sander Greenland: Epidemiologist who has spent decades clarifying the meaning of confounding, collapsibility, and the (mis)use of statistical significance.

Donald Rubin: Statistician whose causal-inference framework (potential outcomes) and missing-data taxonomy underpin much of modern observational analysis.

Kimberlé Crenshaw: Legal scholar who coined “intersectionality” in 1989 to name how race, gender, and other systems of power compound, a foundational concept for population health equity.

Section 1 of 3

Confounding

⏱ Estimated reading time: 20 minutes

The third leg of the bias triad: definition, classic cases, and the structural limits of adjustment.

The definition

Three conditions for confounding

Confounding triangle

\[ \color{#6D28D9}{C} \longrightarrow \color{#0B7B6B}{E}, \quad \color{#6D28D9}{C} \longrightarrow \color{#C2410C}{Y}, \quad \color{#6D28D9}{C} \not\in \color{#0B7B6B}{E} \to \color{#C2410C}{Y} \]

C: confounder; E: exposure; Y: outcome. C causes both E and Y but is not on the E→Y path.

1. Linked to the exposure

C predicts or determines who receives the exposure E.

2. Independent risk factor

C affects the outcome Y on a path that does not run through E.

3. Not a mediator

C is not a step on the causal path from E to Y. Adjusting for a mediator induces bias rather than removing it.

Classic case

Hormone replacement therapy and the Women's Health Initiative

Twenty years of observational data said hormone therapy protected the heart. The randomized trial said the opposite.

Women who chose hormone therapy in observational studies were wealthier, better educated, leaner, and more engaged with preventive care, all independently linked to lower heart disease risk.

Trial result (2002)

Combined estrogen-progestin: hazard ratio 1.29 (95% confidence interval 1.02 to 1.63) for coronary heart disease. Harmful, not protective.

Simulation

The trap in miniature: R activity

What the simulation shows

\[ \underbrace{\color{#C2410C}{\widehat{\text{OR}}_{\text{crude}} \approx 0.61}}_{\text{confounded}} \quad \longrightarrow \quad \underbrace{\color{#0B7B6B}{\widehat{\text{OR}}_{\text{adjusted}} \approx 1.0}}_{\text{truth}} \]

Crude OR ≈ 0.61: confounded by SES. Adjusted OR ≈ 1.0: true null after adjustment.

Socioeconomic status simultaneously raises hormone therapy use (from 20% to 60%) and lowers heart disease risk (from 15% to 5%). The therapy itself is coded with no effect. The crude odds ratio misleads; stratifying and adjusting recover the truth of no effect.

Pharmacoepidemiology

Confounding by indication

The indication, disease severity, causes both the treatment and the worse outcomes. A naive comparison makes the treatment look harmful. Propensity scoring, active-comparator designs, and restriction help, but cannot remove indication factors you never measured.

Time-varying confounding

When a confounder is also a mediator

The feedback loop

\[ \color{#0B7B6B}{\text{Therapy}_{t}} \longrightarrow \underbrace{\color{#6D28D9}{\text{CD4}_{t+1}}}_{\text{mediator and confounder}} \longrightarrow \color{#0B7B6B}{\text{Therapy}_{t+2}} \longrightarrow \color{#C2410C}{\text{Survival}} \]

Therapy: time-varying treatment; CD4: both mediator and confounder; Survival: outcome.

The CD4 immune-cell count is at once caused by earlier treatment and predictive of future treatment and survival. Standard regression adjustment over-corrects; ignoring it leaves confounding. Marginal structural models, using inverse probability weighting, resolve the dilemma.

Structural confounders

Race is not a confounder; racism is an exposure

The adjustment problem

Controlling for income and education when studying racial disparities partials out the very channels through which structural racism produces those disparities. Adjusting does not explain the disparity away; it reroutes it.

Intersectionality

Crenshaw (1989): a Black woman's experience of discrimination is not the sum of “being Black” plus “being a woman.” Additive regression models implicitly assume it is.

A defensible study states its theoretical commitments explicitly and distinguishes “holding a variable constant” (a statistical operation) from “what would happen if we changed it” (a causal claim).

Carry forward

What to take into the next section

Three conditions define a confounder; meeting them requires causal reasoning, not statistical tests.
The hormone therapy reversal is the classic warning that observational consistency does not guarantee validity.
Confounding by indication threatens any non-randomized drug comparison; design solutions help but do not fully replace the trial.
Time-varying confounding with feedback loops breaks standard regression; marginal structural models are the standard fix.
Structural variables like race index processes, not attributes: adjusting for them embeds a theoretical claim.

Introduction and Overview

Earlier lessons worked through bias one category at a time. This lesson finishes the inventory by addressing the third leg of the canonical bias triad, confounding, and then steps back to the statistical-inference issues that turn even unbiased estimates into wrong conclusions. The two content sections divide the work cleanly. This section takes confounding from the textbook definition through pharmacoepidemiology's standard nightmare (confounding by indication), through the more advanced problem of time-varying confounding, into the harder theoretical question of whether constructs like race or income are confounders to be controlled or structural exposures to be measured. A later section turns to the analytic issues that haunt even well-adjusted models: model misspecification, multicollinearity, Type I/II errors, Simpson's paradox, the ecological fallacy, the modifiable areal unit problem, and missing data. By the end, the toolkit needed for a later lesson's integrated appraisal will be complete.

Learning Objectives

State the three formal conditions a variable must satisfy to be a confounder, and distinguish confounders from mediators.
Use the WHI / HRT story to explain why observational and randomized estimates can point in opposite directions.
Identify confounding by indication in pharmacoepidemiologic studies and propose adjustment, restriction, or active-comparator strategies.
Define time-varying confounding with treatment-confounder feedback and explain why standard regression adjustment fails.
Articulate the conceptual debate over whether race, gender, and SES are confounders to be controlled or structural exposures to be measured.

What Is Confounding?

Confounding occurs when a third variable, the confounder, is associated with both the exposure and the outcome, distorting the observed relationship between them. Unlike mediators (which lie on the causal pathway), confounders represent alternative explanations for an association. If unaddressed, confounding can make a harmful exposure appear protective, a beneficial treatment appear harmful, or a null relationship appear significant.

Three Conditions for Confounding

A variable C is a confounder of the exposure–outcome relationship if it satisfies three conditions: (1) C is associated with the exposure, (2) C is an independent risk factor for the outcome (it raises or lowers outcome risk on its own, not only through the exposure), and (3) C is not on the causal pathway between the exposure and the outcome. If C is a mediator rather than a confounder, adjusting for it introduces bias rather than removing it. The modern formal definition is given by VanderWeele & Shpitser (2013), building on Greenland and Robins' (1986) foundational link between confounding, identifiability, and exchangeability; Hernán et al. (2002) further showed that confounder identification requires causal knowledge and not statistical association alone.

Classic Case: Hormone Replacement Therapy

One of the most consequential examples of confounding in modern epidemiology involves hormone replacement therapy (HRT) and cardiovascular disease (CVD) risk in postmenopausal women. For decades, observational studies consistently suggested that HRT reduced CVD risk by 30–50%. These findings influenced clinical guidelines worldwide.

Case Study: The Women’s Health Initiative (WHI)

The WHI was a large randomized controlled trial launched in 1991. When results were published by the Writing Group for the WHI Investigators (2002), they revealed that combined estrogen–progestin therapy actually increased the risk of coronary heart disease (HR 1.29, 95% CI: 1.02–1.63), stroke, and pulmonary embolism. The discrepancy was explained by confounding by socioeconomic status and health behaviors: women who chose HRT in observational studies tended to be wealthier, better educated, leaner, more physically active, and more engaged with preventive healthcare, all factors independently associated with lower CVD risk. Reading the trial estimate: the 95% confidence interval (1.02 to 1.63) lies entirely above 1, the value that marks no effect, so the data are compatible only with increased risk, not with protection.
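The reading of the trial estimate above can be checked with a little arithmetic. The sketch below assumes the reported interval is a standard Wald interval that is symmetric on the log scale (an assumption made for illustration, not a detail taken from the trial report). Under that assumption, the standard error of the log hazard ratio can be backed out from the interval limits.

```r
# Reported WHI estimate for coronary heart disease (values from the text above)
hr    <- 1.29
lower <- 1.02
upper <- 1.63

# Assumption: Wald interval, symmetric on the log scale
log_hr <- log(hr)
se_log <- (log(upper) - log(lower)) / (2 * qnorm(0.975))

# Approximate test statistic and two-sided p-value against HR = 1
z       <- log_hr / se_log
p_value <- 2 * pnorm(-abs(z))

round(c(log_hr = log_hr, se = se_log, z = z, p = p_value), 3)
```

Because the lower limit sits only just above 1, the implied p-value is only modestly below 0.05: the interval excludes "no effect", but not by a wide margin.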

R Stratified analysis: confounding revealed and removed

What you'll do: simulate a 5,000-person dataset in which HRT has zero effect on CVD, then watch how an SES confounder produces a misleading crude OR that disappears when you look within strata or adjust in regression. What to take away: stratification is the most direct demonstration of confounding: if the within-stratum estimates agree with each other but disagree with the crude estimate, you have confounding by the stratifying variable, and the within-stratum value is the unbiased one.

The classic remedy for confounding is to look within levels of the suspected confounder. The simulation below builds a true-null exposure-outcome relationship that's polluted by an SES confounder. The crude OR misleads; the stratum-specific ORs show the truth.

set.seed(230)
n <- 5000
ses <- rbinom(n, 1, 0.5)                                  # 1 = high SES
hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2)) # high SES uses HRT more
# CVD risk: lower in high SES, NOT affected by HRT (true null)
cvd <- rbinom(n, 1, prob = ifelse(ses == 1, 0.05, 0.15))

# Crude (unadjusted) OR -- looks "protective"
exp(coef(glm(cvd ~ hrt, family = binomial))["hrt"])

# Within each SES stratum -- the true effect: ~1.0
tapply(seq_len(n), ses, function(i) {
  exp(coef(glm(cvd[i] ~ hrt[i], family = binomial))[2])
})

# Adjusted OR, controlling for SES
exp(coef(glm(cvd ~ hrt + ses, family = binomial))["hrt"])

Console output

 hrt
0.61        # crude (confounded, looks protective)
   0    1
0.99 1.05   # within-stratum (truth: ~1)
 hrt
1.02        # adjusted

This is the WHI lesson in miniature. The crude estimate (about 0.61) sits well below 1; since an odds ratio below 1 means lower odds of disease, it reads as protection, exactly the kind of number that fed twenty years of observational HRT enthusiasm. Adjusting for SES recovers the true null, an odds ratio near 1. Later courses formalize both steps: stratified analysis with Mantel-Haenszel summaries, and adjustment with multivariable regression.
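The Mantel-Haenszel summary mentioned above can be previewed on the same simulated data. This is a minimal sketch, not part of the original activity: it re-runs the simulation with the same seed, builds a 2 × 2 × 2 table, and uses base R's mantelhaen.test() to pool the stratum-specific odds ratios into one SES-adjusted estimate.

```r
# Re-create the simulated data (same seed and calls as the activity above)
set.seed(230)
n   <- 5000
ses <- rbinom(n, 1, 0.5)
hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2))
cvd <- rbinom(n, 1, prob = ifelse(ses == 1, 0.05, 0.15))

# 2 x 2 x 2 table: exposure by outcome, one layer per SES stratum
tab <- table(hrt = hrt, cvd = cvd, ses = ses)

# Crude OR from the table collapsed over SES
crude_tab <- apply(tab, c(1, 2), sum)
crude_or  <- (crude_tab[1, 1] * crude_tab[2, 2]) /
             (crude_tab[1, 2] * crude_tab[2, 1])

# Mantel-Haenszel common odds ratio pooled across SES strata
mh    <- mantelhaen.test(tab)
mh_or <- unname(mh$estimate)

round(c(crude = crude_or, mantel_haenszel = mh_or), 2)
```

The pooled estimate should land close to the regression-adjusted odds ratio, since both are summarizing the same within-stratum comparisons.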

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console before answering.

1. The crude OR for HRT vs CVD was about 0.61 (well below 1), but the simulation built HRT with zero effect on CVD. Walk through the three conditions of confounding and explain how SES, as set up in the simulation, satisfies each one (associated with exposure, associated with outcome, not on the causal pathway).

Model answer: All three conditions are satisfied by construction. (1) Associated with the exposure: the line hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2)) hard-wires high-SES women to use HRT three times more often than low-SES women, so SES is strongly associated with HRT use. (2) Independent risk factor for the outcome: the CVD line gives high-SES women a 5% risk and low-SES women a 15% risk, so SES affects CVD even when HRT is held fixed (because HRT does not enter the cvd line at all). (3) Not on the causal pathway: SES is set before HRT in the data-generating process and is never caused by HRT, so it sits as a common cause, not a mediator. All three Rothman/Greenland conditions hold, which is exactly why the crude OR is biased and the SES-adjusted OR is unbiased.

2. The stratum-specific ORs (0.99 in low SES and 1.05 in high SES) and the SES-adjusted OR (1.02) are all close to 1. Why are they nearly identical to each other? What does that tell you about whether SES is acting as a confounder vs. an effect modifier in this simulated dataset?

Model answer: They are nearly identical because confounding is the only thing producing the bias. When you condition on SES, either by stratifying (0.99 and 1.05) or by entering SES as a covariate (1.02), you remove the spurious association and recover the true null. The fact that the two stratum-specific ORs agree with each other tells you SES is acting as a confounder, not an effect modifier: effect modification would show different ORs across strata (e.g. 0.5 in low SES, 1.8 in high SES). Here the within-stratum effect is the same (≈1) at both levels of SES, so a single summary estimate is appropriate, the classic signature of pure confounding.

3. The Women's Health Initiative trial overturned 20+ years of observational HRT findings. Using your simulation results, explain to a skeptical clinician why an RCT was needed even though the observational evidence was large and consistent.

Model answer: The simulation reproduces the exact failure mode of the pre-WHI observational literature: large samples, internally consistent findings, and a strongly ‘protective’ crude OR of about 0.61 driven entirely by a confounder (SES) that no one fully measured. Observational studies cannot rule out unmeasured confounding, no matter how big they are; adjusting for the SES you can see does nothing for the SES you cannot, and there are always lifestyle, health-seeking, and access variables that travel with HRT use. Randomization breaks the link between exposure and confounders by design: in the WHI, the assignment of HRT was independent of every measured and unmeasured baseline characteristic, so any systematic difference in CVD between arms could be attributed to the drug rather than to who chose it. The simulation shows in miniature what the WHI showed in practice: that an OR well below 1 can come from confounding alone, and only randomization can guarantee that the comparison groups are exchangeable.

Why This Matters

The HRT story illustrates how even large, well-conducted observational studies can produce misleading results when confounding is not adequately addressed. It was the randomized design of the WHI, which balanced measured and unmeasured confounders across groups, that revealed the true direction of effect. This case became a defining moment in evidence-based medicine.

The HRT case is the textbook example of confounding by lifestyle and SES. The next form, confounding by indication, is the most pervasive issue in observational pharmacoepidemiology, and the reason any drug-effect estimate from administrative data should be read with care.

Confounding by Indication

Confounding by indication is a specific form of confounding that arises in pharmacoepidemiological studies when the reason a treatment is prescribed (the “indication”) is itself a risk factor for the outcome. Because sicker patients are more likely to receive treatment, naive comparisons of treated vs. untreated patients systematically overestimate harm or underestimate benefit. The three flip cards below show the standard scenario, its mirror image (confounding by contraindication), and the design tools used to address both.

Statin Mortality Example

Confounding by Contraindication

Design Solutions
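Confounding by indication can be reproduced with the same kind of simulation used for the HRT example. The sketch below is illustrative only: the variable names (severe, treated, death) and all probabilities are hypothetical, and the drug is coded with no effect on death. Disease severity drives both who gets treated and who dies, so the naive comparison makes the treatment look harmful.

```r
set.seed(1)
n <- 20000

# Hypothetical indication: 30% of patients have severe disease
severe  <- rbinom(n, 1, 0.3)
# Sicker patients are more likely to be treated
treated <- rbinom(n, 1, prob = ifelse(severe == 1, 0.7, 0.2))
# Death depends on severity only; the treatment has NO effect (true null)
death   <- rbinom(n, 1, prob = ifelse(severe == 1, 0.30, 0.05))

# Naive treated-vs-untreated comparison
crude_or <- unname(exp(coef(glm(death ~ treated, family = binomial))["treated"]))

# Comparison that conditions on the (here fully measured) indication
adj_or <- unname(exp(coef(glm(death ~ treated + severe, family = binomial))["treated"]))

round(c(crude = crude_or, adjusted = adj_or), 2)
```

Adjustment works here only because severity is measured perfectly and is the sole indication. In real administrative data the indication is usually captured only in part, which is why design tools such as active comparators and restriction matter.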

Confounding by indication is hard but at least the confounder sits at baseline. Time-varying confounders make the problem one step worse: they evolve, and their values feed back from prior treatment.

Time-Varying Confounding

Time-varying confounding occurs when a confounder changes over time and is simultaneously affected by prior treatment and predictive of future treatment and outcome. This creates a feedback loop that standard regression cannot resolve.

Case Study: HIV Treatment and CD4 Counts

In HIV care, treatment decisions depend on CD4 cell counts (a marker of immune function). A patient’s CD4 count at time t influences whether antiretroviral therapy (ART) is initiated or modified at time t+1. However, prior ART use also affects CD4 counts. This creates a feedback loop: CD4 count is both a confounder (it predicts treatment and mortality) and a consequence of prior treatment.

Standard regression that adjusts for CD4 count blocks the part of the treatment effect that runs through CD4 (over-adjustment) and can open non-causal paths through unmeasured causes of both CD4 and survival (collider bias), while failure to adjust leaves confounding uncontrolled. Marginal structural models (MSMs), using inverse probability of treatment weighting (IPTW), can break this cycle by creating a pseudo-population where treatment is unconfounded by time-varying factors.

The problem with standard adjustment: When you include a time-varying confounder (like CD4 count) in a conventional regression model, you block part of the treatment’s causal effect that operates through CD4 counts. This is because CD4 count is simultaneously a confounder and a mediator. Standard regression cannot simultaneously adjust for confounding and preserve the treatment effect that flows through the mediator.

Hernán, Brumback, and Robins (2000) demonstrated that standard Cox regression produced biased estimates of ART effectiveness, often failing to detect substantial survival benefits that were recovered by MSMs.

How MSMs work: Marginal structural models use a two-step process: (1) estimate the probability of receiving the observed treatment at each time point given past covariates (the “propensity”), then (2) weight each observation by the inverse of that probability. This creates a pseudo-population in which treatment is unconfounded by time-varying factors.

The resulting weighted analysis estimates the causal effect of a treatment strategy (e.g., “always treat when CD4 < 350”) rather than the observational association between treatment and outcome.
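The two-step process described above can be sketched for a two-visit study. Everything in this example is an assumption made for illustration: the variable names (l0, a0, l1, a1, y), the coefficients, and the use of stabilized weights with a logistic outcome model. It is a toy version of the idea, not the analysis from Hernán, Brumback, and Robins (2000).

```r
set.seed(42)
n <- 50000

l0 <- rbinom(n, 1, 0.5)                                  # low CD4 at baseline
a0 <- rbinom(n, 1, plogis(-1 + 1.5 * l0))                # sicker patients start therapy
l1 <- rbinom(n, 1, plogis(-0.5 + 1.0 * l0 - 1.5 * a0))   # therapy improves later CD4
a1 <- rbinom(n, 1, plogis(-1 + 1.5 * l1 + 1.0 * a0))     # later CD4 drives later therapy
y  <- rbinom(n, 1, plogis(-2 + 1.0 * l0 + 1.5 * l1 - 0.5 * a0 - 0.5 * a1))  # death

# Step 1: probability of the treatment each person actually received
p_obs <- function(fit, a) {
  p <- fitted(fit)
  ifelse(a == 1, p, 1 - p)
}
den0 <- p_obs(glm(a0 ~ l0, family = binomial), a0)            # given covariates
den1 <- p_obs(glm(a1 ~ a0 + l0 + l1, family = binomial), a1)
num0 <- p_obs(glm(a0 ~ 1, family = binomial), a0)             # stabilizing numerators
num1 <- p_obs(glm(a1 ~ a0, family = binomial), a1)

# Step 2: stabilized inverse probability weights (product over visits)
sw <- (num0 * num1) / (den0 * den1)

# Weighted outcome model (the MSM) vs. conventional covariate adjustment
msm <- glm(y ~ a0 + a1, family = quasibinomial, weights = sw)
adj <- glm(y ~ a0 + a1 + l0 + l1, family = binomial)

round(c(mean_weight = mean(sw),
        msm_a0      = unname(coef(msm)["a0"]),
        adjusted_a0 = unname(coef(adj)["a0"])), 2)
```

In this setup the conventional model holds l1 fixed, so its a0 coefficient reflects only the part of early therapy's effect that does not pass through later CD4, while the weighted model keeps that pathway. The two coefficients also differ in kind (marginal versus conditional), so they are not expected to match exactly. Standard errors from the weighted fit are not valid as printed; robust (sandwich) or bootstrap variance estimates would be needed in a real analysis.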

Feature	Standard Regression	Marginal Structural Model
Time-varying confounding	Introduces collider bias if adjusted	Properly handled via weighting
Treatment-confounder feedback	Cannot resolve	Explicitly modeled
Estimate type	Conditional on covariates	Marginal (population-level)
Causal interpretation	Limited without assumptions	Causal under exchangeability
Complexity	Simple to implement	Requires careful weight estimation

Marginal structural models handle the technical problem of time-varying confounding within a single causal pathway. The next subsection addresses a deeper question that no analytic technique alone can resolve: whether some of the variables we routinely treat as confounders should be reframed as the very exposures we ought to be measuring.

Beyond “Controlling For”: Intersectionality and the Limits of Variable-by-Variable Adjustment

So far this lesson has treated confounding as a problem with a technical solution: identify the confounders, adjust for them, and proceed to the “true” effect. That logic works well for the kind of variable a randomised trial would have balanced, such as a clinical comorbidity, a baseline lab value, or a discrete behaviour. It works less well, and sometimes badly, when the “confounder” is not really a variable at all but a structural process that produces both the exposure and the outcome over the lifecourse.

A theoretical caution

Putting race, sex, or income in a regression as a covariate is a modelling choice with theoretical content. It treats those constructs as if they were stable, individual-level attributes whose effect on the outcome can be additively isolated from the exposure of interest. Many social epidemiologists argue that this framing misrepresents what these variables actually index, namely chronic exposure to racism, patriarchy, and economic deprivation, and produces estimates whose policy meaning is unclear (Krieger, 2014; VanderWeele & Robinson, 2014).

Race is not a confounder; racism is an exposure

A standard Table 1 in an epidemiological paper presents race/ethnicity as a baseline characteristic to be adjusted for. But race itself does not have a biological mechanism that causes hypertension, low birthweight, or COVID-19 mortality. What does the causing is racism, experienced across a lifecourse as residential segregation, job market discrimination, biased policing, differential medical treatment, and chronic vigilance, all of which become biologically embodied (Williams, Lawrence, & Davis, 2019; Krieger, 2014).

When researchers “control for race,” they often inadvertently obscure the structural process they should be measuring. Worse, adjusting for downstream consequences of racism (income, education, neighbourhood) can constitute over-adjustment bias, partialling out the very mediators through which the structural exposure operates (VanderWeele & Robinson, 2014; Schisterman, Cole, & Platt, 2009). The variable is in the model; the explanation has been removed.

Intersectionality and the additivity assumption

Crenshaw’s (1989) concept of intersectionality began in legal theory with a simple observation: a Black woman’s experience of discrimination is not the sum of “being Black” plus “being a woman.” The intersections produce qualitatively distinct exposures that neither single-axis category, nor an additive combination of them, can capture.

Standard regression, by default, models adjustment as additive: the coefficient on race is interpreted as the effect “holding gender, class, and other variables constant.” Bauer (2014) shows why this is theoretically inadequate for studying inequality. Holding gender constant while estimating a racial effect implicitly imagines a population in which racial categorisation is detached from gendered experience, a counterfactual that does not, in any meaningful sense, exist. Quantitative researchers can address this with explicit interaction terms, stratified analyses, intersectional MAIHDA (multilevel analysis of individual heterogeneity and discriminatory accuracy; Evans et al., 2018), or descriptive analyses that report rates within intersecting groups rather than coefficients adjusted across them.
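The additivity assumption can be made concrete with a small descriptive check. The sketch below uses entirely hypothetical data: two binary social positions (axis_a, axis_b) and assumed outcome risks in which the intersection carries far more risk than the two single-axis increments combined. It follows the descriptive option named above, reporting rates within intersecting groups and comparing the doubly marginalized group with what an additive model would predict.

```r
set.seed(7)
n <- 40000
axis_a <- rbinom(n, 1, 0.5)   # hypothetical marginalized position on axis A
axis_b <- rbinom(n, 1, 0.5)   # hypothetical marginalized position on axis B

# Assumed risks: each single axis adds 2 points; the intersection adds far more
risk <- ifelse(axis_a == 1 & axis_b == 1, 0.25,
        ifelse(axis_a == 1 | axis_b == 1, 0.12, 0.10))
y <- rbinom(n, 1, risk)

# Outcome rate within each intersecting group (rows = axis A, columns = axis B)
rates <- tapply(y, list(a = axis_a, b = axis_b), mean)

# What a purely additive (risk-difference) model predicts for the intersection
additive_expected <- rates["1", "0"] + rates["0", "1"] - rates["0", "0"]
observed          <- rates["1", "1"]

round(c(additive_expected = additive_expected, observed = observed), 3)
```

A model with only main effects for the two axes would be forced toward the additive prediction; the gap between the two numbers is what an interaction term or a within-group report is able to show.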

Worked example: maternal mortality

In the United States, Black women die from pregnancy-related causes at roughly three times the rate of White women (Petersen et al., 2019). A conventional analysis might fit a model of maternal mortality with race, age, education, income, insurance, and parity as covariates and report a residual race coefficient that is smaller than the crude difference. The headline often becomes: “most of the disparity is explained by socioeconomic factors.”

Interpreted through a fundamental-causes lens (Phelan, Link, & Tehranifar, 2010), this is the wrong reading. Education, income, and insurance are mechanisms through which structural racism produces the disparity; adjusting for them does not explain the disparity away; it merely reroutes it. An intersectional analysis instead asks how Black women specifically experience obstetric care, what dismissal and pain under-recognition look like at this intersection, and what changes when interventions are designed for Black mothers rather than for “women” or for “low-income patients” in general. Different theories produce different analyses; different analyses produce different policy recommendations.

When biomedical models fall short

The biomedical model is comfortable with confounders that are themselves biomedical or behavioural: smoking confounding the coffee–CHD relationship, age confounding most things. It becomes unstable when the “confounder” is the social structure itself, because that confounder operates over decades, through dozens of mediating mechanisms, with effects that change as societal conditions change. Fundamental cause theory predicts exactly this: as one mechanism is closed off (e.g., smoking is reduced), the social gradient in mortality reappears through whatever mechanism is currently relevant (e.g., obesity, opioid overdose). The pattern is robust to mechanism-by-mechanism adjustment because the cause is upstream of any specific mechanism (Link & Phelan, 1995; Phelan et al., 2010).

What this means for appraisal

When a paper reports an effect estimate “adjusted for race, income, and education,” ask: What is the underlying theory of how these variables relate to the exposure and outcome? Are they confounders to be partialled out, or mediators that carry the causal effect of structural conditions? Would an intersectional or stratified analysis tell a different story? A defensible study makes its theoretical commitments explicit and is honest about the difference between “the effect of X holding Y constant” (a statistical operation) and “what would happen if we changed X” (a causal claim that depends on what Y actually represents in the world).

Key Takeaways: Confounding

Confounding produces spurious associations or masks real ones; the HRT/WHI case demonstrates how observational and experimental results can diverge
Confounding by indication is pervasive in pharmacoepidemiology; sicker patients receiving treatment biases naive comparisons
Time-varying confounding requires specialized methods (e.g., MSMs) because standard regression cannot simultaneously adjust for confounding and preserve causal pathways
Randomization remains the most powerful tool for addressing confounding, but advanced analytic methods can approximate causal inference from observational data
Treating race, gender, or income as ordinary confounders to be “controlled for” embeds a theoretical claim that may obscure the structural processes (racism, patriarchy, fundamental causes) those variables actually index
Intersectionality and ecosocial frameworks invite analyses that go beyond additive adjustment, recognising that what gets measured and modelled shapes what knowledge is produced and which interventions become thinkable

Section 2 of 3

Statistical Inference & Model Issues

⏱ Estimated reading time: 25 minutes


Seven analytic problems that persist even when confounding has been addressed.

Model form

Misspecification: forcing the wrong shape

A straight line fitted to a J-shaped relationship reads as either uniformly harmful or uniformly protective depending on where the sample sits on the curve. Splines or categorical bins capture the real bend.

Correlated predictors

Multicollinearity inflates uncertainty

Variance inflation

\[ \color{#C2410C}{\text{Var}(\hat{\beta}_j)} = \frac{\color{#1D4ED8}{\sigma^2}}{\color{#6D28D9}{\text{SS}_{x_j}}(1 - \color{#BE185D}{R_j^2})} \]

Var(β̂_j): variance of the coefficient
σ²: residual variance
SS_xj: spread of the predictor
R²_j: how well other predictors explain x_j (collinearity)

When predictor j is nearly a combination of the others, \(R_j^2\) approaches 1 and the variance of its coefficient explodes: the estimate is right on average but unreliable in any one sample.

Air quality example

Fine particulate matter, ozone, NO₂, and SO₂ share traffic and industrial sources. Entering all four at once produces unstable, sign-flipping coefficients. Solutions: principal component analysis, single-pollutant models, or sensitivity analyses.

False positives and negatives

Type one and Type two errors

The trade-off

\[ \color{#C2410C}{\alpha} = P(\text{reject }H_0 \mid H_0 \text{ true}), \quad \color{#1D4ED8}{\beta} = P(\text{fail to reject }H_0 \mid H_1 \text{ true}) \]

α: Type I error, a false positive
β: Type II error, a false negative (power = 1 − β)

Type one (alpha)

False positive. Conventionally held at 5%. Lowered by stricter thresholds, but that raises Type two error in small samples.

Type two (beta)

False negative. Power is one minus beta. Underpowered studies miss real effects, and overstate them when they do detect one: the winner's curse.

Figure. Power curves showing statistical power rising with sample size, faster for larger effect sizes, with an 80 percent power reference line. For a fixed effect size, power climbs with sample size; small effects need far larger samples to reach the conventional 80% threshold. Underpowered studies that do reach significance overstate the effect.

Sign reversal

Simpson's paradox: aggregation can flip direction

Overall

Treatment X: 55/100 (55%)
Standard: 63/100 (63%)
Standard appears better.

Mild cases

Treatment X: 27/30 (90%)
Standard: 51/60 (85%)
Treatment X is better.

Severe cases

Treatment X: 28/70 (40%)
Standard: 12/40 (30%)
Treatment X is better here too.

Treatment X was given disproportionately to severe cases. Within each severity group it performs better (90% vs. 85% in mild cases, 40% vs. 30% in severe cases), but the pooled result reverses. Stratification reveals the truth; aggregation hides it.

Levels of analysis

Ecological and atomistic fallacies

Ecological fallacy

Inferring individual effects from group-level data. State-level inequality correlates with mortality, but that does not mean a high-income person in an unequal state has worse health.

Atomistic fallacy

Assuming individual-level relationships hold at the group level. Individual income predicts health, but population health also depends on distribution, infrastructure, and social cohesion, which are group-level factors.

Both are resolved by multilevel models that separate variation within and between groups, rather than forcing one level's associations onto another.

Spatial analysis

Modifiable areal unit problem

Same data, same disease, different geographic unit, different conclusion. Reporting sensitivity to spatial aggregation is standard practice in spatial epidemiology.

Rubin's taxonomy

Missing data: three mechanisms

Completely at random

The chance of being missing is unrelated to any data. Complete-case analysis is unbiased but loses power. Example: an accidentally dropped sample.

At random

Missingness depends on observed variables, not the missing value itself. Multiple imputation is valid. Example: younger participants skipping a questionnaire.

Not at random

Missingness depends on the unobserved value. No standard fix; sensitivity analyses required. Example: sicker patients dropping out because they are sick.

Carry forward

Seven reasons valid data can still mislead

Model misspecification: the wrong shape produces wrong conclusions.
Multicollinearity: correlated predictors inflate coefficient uncertainty.
Low power and the winner's curse: small samples miss real effects and inflate detected ones.
Simpson's paradox: aggregation can reverse a treatment comparison.

Ecological and atomistic fallacies: group trends do not transfer to individuals, and individual patterns do not transfer to groups.
Modifiable areal unit problem: spatial results depend on the geographic unit chosen.
Missing not at random: no standard method corrects missingness that depends on the unobserved value.

Introduction and Overview

An earlier section closed the bias inventory. Even after every bias has been addressed, an analysis can still produce wrong conclusions if the statistical model itself does not fit the data, or if the inferential framework is misused. This section walks through seven such issues in order: model misspecification, multicollinearity, Type I/II errors, Simpson's paradox (with a hands-on simulator), the ecological and atomistic fallacies (callbacks to an earlier lesson), the modifiable areal unit problem, and missing data. Each is short on its own, but together they constitute most of the analytic mistakes that survive peer review.

Learning Objectives

Recognize model misspecification (e.g., linear models forced onto J-shaped relationships) and choose appropriate functional forms.
Diagnose multicollinearity and apply remedies such as variable selection, principal components, or single-pollutant analyses.
Distinguish Type I and Type II errors and explain the “winner’s curse” that inflates effect sizes in low-powered studies.
Identify Simpson’s paradox, the ecological fallacy, the atomistic fallacy, and the modifiable areal unit problem in published work.
Classify missing data as MCAR, MAR, or MNAR and select an appropriate handling strategy (complete-case, multiple imputation, sensitivity analysis).

Model Misspecification

A statistical model is misspecified when the assumed functional form does not reflect the true relationship between variables. One of the most common errors is assuming a linear relationship when the true relationship is nonlinear.

Case Study: Alcohol and All-Cause Mortality

The relationship between alcohol consumption and mortality is often described as J-shaped: light-to-moderate drinkers appear to have lower mortality than both abstainers and heavy drinkers. If a researcher incorrectly fits a linear model to this data, they might conclude either that alcohol is uniformly harmful (positive slope) or uniformly protective (negative slope), depending on the distribution of consumption in their sample.

More recent analyses (e.g., Stockwell et al., 2016) have shown that the apparent protective effect of moderate drinking largely disappears when studies correct for “sick quitter” bias (former drinkers misclassified as abstainers) and use appropriate nonlinear models. Model specification can reverse a study’s conclusions, so it is far from a mere statistical nicety.
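
The effect of functional form can be illustrated with a small simulation. The sketch below is not drawn from any study cited here: the J-shaped log-odds curve, the sample size, and the variable names (`drinks`, `died`) are all assumptions chosen for illustration. It fits a linear term and a natural cubic spline to the same simulated data and compares them.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(1)
n <- 2000
drinks <- runif(n, 0, 6)  # hypothetical drinks per day

# Assumed J-shaped truth: log-odds of death are lowest near 1.5 drinks/day
log_odds <- -2 + 0.15 * (drinks - 1.5)^2
died <- rbinom(n, 1, plogis(log_odds))

fit_linear <- glm(died ~ drinks, family = binomial)
fit_spline <- glm(died ~ ns(drinks, df = 3), family = binomial)

# Lower AIC indicates a better trade-off between fit and complexity
AIC(fit_linear, fit_spline)

# Predicted risk at 0, 1.5, and 5 drinks/day under each model
grid <- data.frame(drinks = c(0, 1.5, 5))
cbind(grid,
      linear = predict(fit_linear, grid, type = "response"),
      spline = predict(fit_spline, grid, type = "response"))
```

The linear model can only rise or fall across the whole range, so it cannot show the dip among light drinkers that the spline is free to recover. Categorical bins would be another reasonable choice.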

Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. This does not bias coefficient estimates but dramatically inflates their standard errors, making individual effects unstable and difficult to interpret.

Example: Environmental Pollutant Studies

In studies of air quality and respiratory disease, researchers often include multiple pollutants (PM2.5, ozone, NO2, SO2) simultaneously. Because these pollutants share common sources (traffic, industry), they are often highly correlated. A regression model including all of them may produce coefficients that flip sign, lose significance, or vary wildly between samples, even though the pollutants truly affect health. Solutions include principal component analysis, variable selection, or analyzing one pollutant at a time with mutual adjustment in sensitivity analyses.
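
A minimal simulated sketch of this instability follows. The two "pollutants," their shared source, and the effect sizes are invented for illustration; the variance inflation factor is computed by hand as 1 / (1 − R²) rather than with an add-on package.

```r
set.seed(2)
n <- 500
traffic <- rnorm(n)                    # hypothetical shared source
pm25 <- traffic + rnorm(n, sd = 0.3)   # two pollutants driven by the same source
no2  <- traffic + rnorm(n, sd = 0.3)
symptoms <- 0.5 * pm25 + 0.5 * no2 + rnorm(n)

# VIF for pm25: 1 / (1 - R^2), where R^2 comes from regressing pm25 on the other predictor
r2_pm25  <- summary(lm(pm25 ~ no2))$r.squared
vif_pm25 <- 1 / (1 - r2_pm25)
vif_pm25

# Standard error of the pm25 coefficient: joint model vs. single-pollutant model
se_joint  <- summary(lm(symptoms ~ pm25 + no2))$coefficients["pm25", "Std. Error"]
se_single <- summary(lm(symptoms ~ pm25))$coefficients["pm25", "Std. Error"]
c(joint = se_joint, single = se_single)
```

The single-pollutant model has the smaller standard error, but its coefficient also absorbs part of the NO2 effect, which is why single-pollutant results are usually read alongside sensitivity analyses rather than on their own.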

Type I and Type II Errors

A Type I error (false positive) is concluding that an effect exists when it does not; its probability is alpha. A Type II error (false negative) is failing to detect a real effect; its probability is beta. These errors trade off: for a fixed sample size, making it harder to achieve significance (lower alpha) reduces Type I error but increases Type II error, especially in small samples.

Power and the Winner’s Curse in Rare Disease Studies

Rare disease studies are particularly vulnerable because small sample sizes mean low statistical power. A study of a disease affecting 1 in 100,000 people may have only 50 cases, yielding power of 20–30% for moderate effect sizes, meaning 70–80% of true treatment effects will go undetected. Paradoxically, the studies that do find significant results in low-power settings tend to overestimate effect sizes, a phenomenon called the “winner’s curse.” (Multiple testing, p-hacking, and the broader integrity issues these problems create are addressed in an earlier lesson.)
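
The winner's curse can be reproduced in a few lines. The sketch below assumes a two-arm trial with 25 participants per arm and a true standardized difference of 0.3; both numbers are illustrative choices, not values taken from a real rare-disease study.

```r
set.seed(3)
true_diff <- 0.3   # assumed true standardized mean difference
n_per_arm <- 25    # assumed small trial (50 participants in total)

# Analytic power of a two-sample t-test under these assumptions
power.t.test(n = n_per_arm, delta = true_diff, sd = 1, sig.level = 0.05)$power

# Simulate many such trials, keeping the estimate and p-value from each
sims <- replicate(5000, {
  treated <- rnorm(n_per_arm, mean = true_diff)
  control <- rnorm(n_per_arm, mean = 0)
  c(estimate = mean(treated) - mean(control),
    p_value  = t.test(treated, control)$p.value)
})

significant <- sims["p_value", ] < 0.05
mean(significant)                     # empirical power
mean(sims["estimate", significant])   # average estimate among "significant" trials
```

Only trials whose estimate happens to land far from zero cross the significance threshold, so the average among the "winners" sits well above the assumed true difference.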

Simpson’s Paradox

Simpson’s paradox occurs when a trend that appears in aggregated data reverses when the data are stratified by a confounding variable. The paradox highlights the danger of drawing causal conclusions from aggregate statistics without considering underlying group structure.

Case Study: Treatment Effectiveness by Disease Severity

Consider a new treatment tested at two hospitals. Hospital A treats mostly mild cases; Hospital B treats mostly severe cases. Overall, Treatment X appears to have a lower success rate than the standard. But when stratified by severity:

Figure. Each point is a patient. Within every severity group the trend line rises, with more Treatment X going with better recovery, yet the dashed pooled line falls. The reversal happens because Treatment X was given disproportionately to severe cases (Hospital B), who recover worse no matter what. This is the picture behind the table below: stratum-specific success rates favour Treatment X, but the aggregate hides the confounding by severity. It is also the “Simpson’s paradox (sign flip)” preset in the interactive simulator further down: each stratum-specific risk ratio points one way while the crude RR points the other.

Group	Treatment X Success	Standard Success	Interpretation
Overall	55/100 (55%)	63/100 (63%)	Standard appears better
Mild cases	27/30 (90%)	51/60 (85%)	Treatment X is better
Severe cases	28/70 (40%)	12/40 (30%)	Treatment X is better, but is given to more severe cases

The reversal occurs because Treatment X was disproportionately assigned to severe cases (which have worse outcomes regardless of treatment). The aggregated data hides this confounding by severity, producing a misleading conclusion. The correct causal interpretation requires stratification.
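
The arithmetic in the table can be checked directly. The sketch below enters the four cells from the table and computes the stratum-specific and pooled success rates; only the object and column names are introduced here.

```r
trial <- data.frame(
  severity  = c("Mild", "Mild", "Severe", "Severe"),
  arm       = c("Treatment X", "Standard", "Treatment X", "Standard"),
  successes = c(27, 51, 28, 12),
  patients  = c(30, 60, 70, 40)
)

# Stratum-specific success rates: Treatment X is higher in both strata
trial$rate <- trial$successes / trial$patients
trial

# Pooled success rates: collapsing over severity reverses the comparison
pooled <- aggregate(cbind(successes, patients) ~ arm, data = trial, FUN = sum)
pooled$rate <- pooled$successes / pooled$patients
pooled
```

Seventy of the 100 Treatment X patients were severe cases, compared with 40 of the 100 standard-care patients, which is what drags the pooled Treatment X rate down.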

Hands-on: Confounding & Simpson's Paradox

What you'll do: the simulator below builds a two-stratum population and lets you set how strongly the confounder drives both treatment assignment and outcome risk. What to take away: within-stratum risk ratios and the crude (unadjusted) risk ratio can disagree dramatically and, with the right combination of strengths, can fall on opposite sides of the null (RR = 1), which is the reversal that defines Simpson's paradox. The simulator complements the R-box from an earlier section: there you stratified to recover a true null; here you can build sign-reversal by hand.

🎲 Interactive: Confounding & Simpson’s Paradox

A two-strata population (e.g., mild vs. severe cases). Set how strongly the confounder C drives treatment assignment and how strongly C drives outcome. Watch the crude RR drift away from the stratum-specific RR. Push both sliders hard and you can flip the sign, Simpson’s paradox in action.

Simulator panels: “Stratum-specific vs. crude RR” (green bars = within-stratum RRs; red bar = naive crude RR ignoring the confounder) and “Risk by treatment within each stratum.”

Default slider settings: true (within-stratum) RR 1.50; C→treatment assignment 0.70; C→outcome (baseline risk in C+) 0.60; baseline risk in C− 0.10; prevalence of C+ 0.40.

Outputs reported: RR within C+, RR within C−, and crude (unadjusted) RR.

Try the Simpson’s paradox preset: each stratum shows a treatment benefit (RR < 1), but the crude RR shows treatment harm (RR > 1). The confounder steered treatment toward the high-risk stratum and the aggregate hides it.
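
For readers working without the simulator, the same reversal can be computed by hand. The function below is a hypothetical sketch, not the simulator's actual code: it assumes a common risk ratio in both strata and takes the treatment probability in each stratum as a separate argument (`p_treat_c1`, `p_treat_c0`), which may not match how the sliders are parameterized.

```r
# Hypothetical helper: expected crude RR given a common within-stratum RR
crude_vs_stratified <- function(rr, p_treat_c1, p_treat_c0,
                                risk0_c1, risk0_c0, prev_c1) {
  # Expected share of the population in each stratum-by-treatment cell
  w <- c(c1_trt = prev_c1 * p_treat_c1,
         c1_ctl = prev_c1 * (1 - p_treat_c1),
         c0_trt = (1 - prev_c1) * p_treat_c0,
         c0_ctl = (1 - prev_c1) * (1 - p_treat_c0))
  # Outcome risk in each cell: untreated baseline, multiplied by rr if treated
  risk <- c(c1_trt = risk0_c1 * rr, c1_ctl = risk0_c1,
            c0_trt = risk0_c0 * rr, c0_ctl = risk0_c0)
  risk_trt <- sum((w * risk)[c("c1_trt", "c0_trt")]) / sum(w[c("c1_trt", "c0_trt")])
  risk_ctl <- sum((w * risk)[c("c1_ctl", "c0_ctl")]) / sum(w[c("c1_ctl", "c0_ctl")])
  c(stratum_rr = rr, crude_rr = risk_trt / risk_ctl)
}

# Treatment is protective in both strata (RR = 0.8) but is given mostly to C+ (high-risk) people
crude_vs_stratified(rr = 0.8, p_treat_c1 = 0.8, p_treat_c0 = 0.2,
                    risk0_c1 = 0.6, risk0_c0 = 0.1, prev_c1 = 0.4)
```

With these assumed inputs the crude RR lands above 1 even though the RR inside each stratum is 0.8. Setting `p_treat_c1` equal to `p_treat_c0` removes the confounding, and the crude RR returns to the stratum-specific value.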

Simpson's paradox is what happens when stratification reveals confounding at the individual level. The complementary problem is what happens when we move between levels of analysis, from groups to individuals or vice versa. Lesson 6 introduced this material in detail; the next two subsections recall the essentials.

Ecological Fallacy and Atomistic Fallacy

Ecological fallacy: inferring individual-level effects from group-level data.

Atomistic fallacy: assuming that individual-level relationships hold at the group level.
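
A short simulation shows how associations at the two levels can point in opposite directions. Everything in the sketch is assumed for illustration: 20 groups of 50 people, group means of `x` and `y` that rise together, and a negative relationship between `x` and `y` inside each group.

```r
set.seed(4)
n_groups <- 20
n_per_group <- 50
group <- rep(seq_len(n_groups), each = n_per_group)

# Assumed structure: group means of x and y rise together,
# but within each group higher x goes with lower y
group_level <- rnorm(n_groups, sd = 2)
x_dev <- rnorm(n_groups * n_per_group)   # individual deviation from the group mean of x
x <- group_level[group] + x_dev
y <- group_level[group] - 0.5 * x_dev + rnorm(n_groups * n_per_group, sd = 0.5)

# Between-group (ecological) correlation of the group means
between <- cor(tapply(x, group, mean), tapply(y, group, mean))

# Average within-group (individual-level) correlation
within <- mean(sapply(split(data.frame(x, y), group), function(d) cor(d$x, d$y)))

c(between = between, within = within)
```

Reading the between-group correlation as an individual-level effect would be the ecological fallacy; reading the within-group correlation as a statement about groups would be the atomistic fallacy.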

Modifiable Areal Unit Problem (MAUP)

The modifiable areal unit problem is a form of the ecological fallacy specific to spatial analysis. When individual-level data are aggregated into geographic units (census tracts, counties, provinces), the choice of unit size and boundary definitions can alter statistical results, sometimes dramatically.

Example: Disease Clustering

A study examining cancer incidence near an industrial facility might find a significant cluster when data are aggregated at the postal code level but not at the health region level. Conversely, aggregating at a smaller level might produce unstable estimates due to small case counts. Neither result is “wrong”; both are artifacts of the chosen boundaries. The MAUP means that spatial epidemiological conclusions depend partly on arbitrary geographic decisions rather than solely on underlying disease patterns.
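
The scale side of the MAUP can be sketched with simulated data. The example below is an assumption-laden toy: 1,200 people placed along a hypothetical 120 km transect, a smooth regional trend shared by exposure and disease, and zones defined simply by cutting the transect into equal widths.

```r
set.seed(5)
n <- 1200
location <- sort(runif(n, 0, 120))   # position along a hypothetical 120 km transect
regional <- sin(location / 20)       # smooth spatial trend shared by both variables
exposure <- regional + rnorm(n)
disease  <- regional + rnorm(n)

# Correlation of zone means when the same individuals are grouped into zones of a given width
zone_correlation <- function(width_km) {
  zone <- cut(location, breaks = seq(0, 120, by = width_km), include.lowest = TRUE)
  cor(tapply(exposure, zone, mean), tapply(disease, zone, mean))
}

sapply(c(small = 2, medium = 10, large = 30), zone_correlation)
cor(exposure, disease)   # individual-level correlation for comparison
```

The individuals never change; only the zone width does. Averaging within larger zones cancels individual noise and leaves the shared regional trend, so the apparent association strengthens as the units grow. Shifting the boundaries at a fixed width (the zoning side of the MAUP) is not shown here.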

The ecological-fallacy and MAUP issues arise from how the data were assembled. The last analytic problem in this section is what to do when those data have holes in them, a near-universal problem whose handling can either preserve or wreck a study's conclusions.

Missing Data

Missing data are ubiquitous in epidemiological research. The validity of analysis depends critically on the mechanism underlying missingness; the modern taxonomy of MCAR / MAR / MNAR was set out by Rubin (1976):

Missing Completely At Random (MCAR)

Data are MCAR when the probability of being missing is unrelated to both observed and unobserved data. Example: a lab sample is accidentally dropped. Complete case analysis is unbiased under MCAR but loses power.

Missing At Random (MAR)

Data are MAR when missingness depends on observed variables but not on the missing values themselves, after conditioning on observed data. Example: younger participants are more likely to skip a depression questionnaire, but among people of the same age, missingness is unrelated to depression severity. Multiple imputation and maximum likelihood methods produce valid estimates under MAR.

Missing Not At Random (MNAR)

Data are MNAR when missingness depends on the unobserved values themselves. Example: people with severe depression are less likely to complete follow-up questionnaires because of their depression. No standard analytic method can fully correct MNAR; sensitivity analyses with different assumptions about the missing data mechanism are essential.

Complete Case Analysis Under Non-MCAR

When data are MAR or MNAR, restricting analysis to complete cases can introduce selection bias, because the remaining sample may no longer be representative of the study population. For example, if sicker patients are more likely to drop out of a clinical trial, complete case analysis tends to overestimate treatment effectiveness. Multiple imputation, which generates plausible values for missing data based on observed relationships, is preferred for MAR data.
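
The three mechanisms can be compared on simulated data where the truth is known. In the sketch below the cohort, the age–score relationship, and the missingness models are all assumptions. The final step uses a single regression on age to standardize back to the full cohort, which is a simpler stand-in for multiple imputation rather than multiple imputation itself.

```r
set.seed(6)
n <- 10000
age <- rnorm(n, mean = 50, sd = 10)
# Assumed truth: the depression score rises with age
score <- 10 + 0.2 * (age - 50) + rnorm(n, sd = 3)

# Three missingness mechanisms, each removing roughly half of the scores
miss_mcar <- rbinom(n, 1, 0.5) == 1                        # unrelated to anything
miss_mar  <- rbinom(n, 1, plogis(-(age - 50) / 5)) == 1    # younger people skip more
miss_mnar <- rbinom(n, 1, plogis((score - 10) / 2)) == 1   # higher scores skip more

complete_case_mean <- function(missing) mean(score[!missing])

means <- c(full = mean(score),
           mcar = complete_case_mean(miss_mcar),
           mar  = complete_case_mean(miss_mar),
           mnar = complete_case_mean(miss_mnar))
means

# Under MAR, modelling the score on the observed driver of missingness (age)
# and predicting for the whole cohort recovers the full-data mean
fit <- lm(score ~ age, subset = !miss_mar)
mar_adjusted <- mean(predict(fit, newdata = data.frame(age = age)))
mar_adjusted
```

The complete-case mean is close to the truth under MCAR and biased under both MAR and MNAR. Conditioning on age repairs the MAR case, but no analogous repair is available for MNAR because the driver of missingness is the unobserved score itself.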

Reflection

Think of a health study you have encountered (in this course or elsewhere). Identify one potential statistical or analytic issue discussed in this section (model misspecification, multicollinearity, low power, Simpson’s paradox, ecological fallacy, MAUP, or missing data) that could threaten its conclusions. Explain why the issue applies and how it might have been addressed.

Model answer: A strong response names a specific study (e.g., the Lancet alcohol-and-all-cause-mortality meta-analysis, the original Honolulu Heart Program coffee-CHD finding, a JUPITER-style trial sub-analysis, or a regional COVID-mortality comparison), picks one of the seven issues, and applies it concretely. Good examples: (a) a J-shaped alcohol-mortality curve forced through a linear model, which is misspecification, addressable with splines or category dummies; (b) state-level smoking-vs-lung-cancer correlations interpreted at the individual level, which is ecological fallacy, addressable with multilevel modelling on individual data; (c) a 40-person rare-disease trial reporting a null result, which is low power or Type II error, addressable with prospective sample-size calculation, multicentre pooling, or Bayesian designs that report posterior probabilities rather than reject/fail-to-reject; (d) census-tract obesity rates that change when tracts are aggregated to counties, which is the MAUP, addressable by reporting sensitivity to spatial unit; (e) a trial with 18% loss-to-follow-up handled by complete-case analysis where dropouts were sicker, which is non-MCAR missingness, addressable with multiple imputation or inverse-probability-of-censoring weights. Weak responses name a generic study and an issue without showing how the issue applies.


Section 3 of 3

Final Assessment

⏱ Estimated time: 20 minutes

Bringing It All Together

This lesson completed the bias inventory of this course by closing out confounding (an earlier section) and the analytic issues that survive even unbiased data (an earlier section). The HRT / WHI story made vivid how confounding can reverse the direction of a clinical recommendation, while confounding by indication and time-varying confounding showed why specialised tools, including restriction, active comparators, and marginal structural models, are part of the modern epidemiologist's repertoire. The lesson also pushed past the conventional “variable-by-variable adjustment” framing to ask harder theoretical questions about whether race, gender, and SES belong in models as confounders to be controlled or as structural exposures to be measured.

An earlier section then walked through the analytic problems that remain even after confounding has been handled: model misspecification, multicollinearity, Type I/II error and the winner's curse, Simpson's paradox, the ecological and atomistic fallacies, the modifiable areal unit problem, and missing data. These issues are the most common reasons a result fails to replicate, and recognising them is the last skill the course owes you before the integrated appraisal of a later lesson. The reflection below asks you to apply the full bias inventory to a study of your choice; the final assessment then tests the conceptual material before a later lesson integrates the entire course into a single critical-appraisal exercise.

Key Takeaways from this lesson

Confounding requires a third variable to be associated with the exposure, an independent risk factor for the outcome, and not a mediator on the causal pathway.
The HRT / WHI reversal showed that even highly consistent observational findings can be erased by randomization, which balances unmeasured confounders.
Confounding by indication is the standing threat to observational drug studies; restriction, active-comparator designs, and propensity methods address it but cannot fully replace randomisation.
Time-varying confounding with treatment-confounder feedback breaks standard regression; marginal structural models with IPTW are the canonical fix.
Treating race, gender, and SES only as nuisance confounders can hide the structural exposures that actually drive disparities.
Model misspecification, multicollinearity, Simpson's paradox, ecological and atomistic fallacies, the MAUP, and missing-data mechanisms distort inference even when bias has been controlled.
Statistical inference is only as good as the question, the design, and the model behind it; the appraisal skill from a later lesson starts here.

R Activity: Crude vs stratified vs adjusted: an HRT/SES confounding demo

The companion R script r-activities/HSCI_230_Lesson_11_Confounding_and_Statistical_Inference.R simulates a 5,000-person cohort in which SES drives both HRT use and CVD risk, but HRT itself does nothing. You will see the crude OR look strongly “protective,” the stratum-specific ORs sit near 1.0 (the truth), and the SES-adjusted OR recover the null, a hands-on reproduction of the HRT/WHI reversal that anchored an earlier section.

set.seed(230)
n   <- 5000
ses <- rbinom(n, 1, 0.5)                                   # 1 = high SES
hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2))     # high SES uses HRT more

# CVD risk: lower in high SES, NOT affected by HRT (true null)
cvd <- rbinom(n, 1, prob = ifelse(ses == 1, 0.05, 0.15))

# Crude (unadjusted) OR -- looks "protective"
exp(coef(glm(cvd ~ hrt, family = binomial))["hrt"])

# Within each SES stratum -- the true effect: ~1.0
tapply(seq_len(n), ses, function(i) {
  exp(coef(glm(cvd[i] ~ hrt[i], family = binomial))[2])
})

# Adjusted OR, controlling for SES
exp(coef(glm(cvd ~ hrt + ses, family = binomial))["hrt"])

Reflection

Of the confounding-type and statistical-inference threats covered in this lesson, which do you believe poses the greatest practical risk to a typical observational study you might encounter in the public-health literature? Explain your reasoning, drawing on at least two specific examples from the lesson.

Model answer: There is no single right answer; the strongest responses argue for a specific threat using the lesson's mechanisms, not generalities. A defensible case for confounding by indication as the highest practical risk: it is structurally guaranteed in routine clinical data (sicker patients get treated more), it cannot be solved by larger samples, and the standard fix, multivariable adjustment, can make the bias worse if the indication is mis-measured. A defensible case for missing data under MNAR: it cannot be diagnosed from the observed data, multiple imputation does not fix it, and it silently inflates apparent treatment effectiveness in any study with dropout correlated to outcome (the WHI example here, but also virtually all pragmatic trials). A defensible case for Simpson's paradox: it routinely reverses the direction of effect estimates in stratified analyses (kidney stones, UC Berkeley admissions), is invisible without the right DAG, and is the single most embarrassing failure mode in policy-facing research. Whatever threat is chosen, the response should name two lesson examples and explain why the threat is hard to remove with standard methods.


Final Knowledge Assessment


A later lesson, Integrated Appraisal of Epidemiological Research, is the capstone of this course. It pulls everything from earlier lessons together: the foundational framing of an earlier lesson, the systematic-reviews scaffolding of an earlier lesson, the four observational designs of earlier lessons, the measurement and causal-specification work of an earlier lesson, and the full bias inventory of earlier lessons. The lesson uses standardised reporting checklists and worked examples of full appraisals so that, by the end, you can read any epidemiological paper systematically rather than impressionistically.

HSCI 230, Lesson 11
