HSCI 230 — Lesson 11

Confounding & Statistical Inference

Evaluating Epidemiological Research — HSCI 230

Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Explain how confounding distorts exposure–outcome associations and identify classic examples
  • Distinguish confounding by indication and time-varying confounding from conventional confounding
  • Recognize model misspecification and multicollinearity as threats to valid inference
  • Interpret Simpson’s paradox, ecological fallacy, atomistic fallacy, and the modifiable areal unit problem
  • Describe how missing data mechanisms (MCAR, MAR, MNAR) influence analytic validity
  • Critically evaluate whether epidemiological studies have adequately addressed confounding and inferential threats to validity
Reference

Glossary — Key Terms, People & Concepts

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Confounding & Causal Structure
Confounder: A variable that causes both the exposure and the outcome and is not on the causal pathway between them. Failing to adjust for a confounder distorts estimated effects.
Mediator: A variable on the causal pathway between exposure and outcome. Adjusting for it changes the question from a total effect to a direct effect, and risks overadjustment bias.
Collider: A variable caused by both the exposure and the outcome (or by their causes). Adjusting for a collider opens a non-causal path between them and creates spurious associations.
Confounding by Indication: A pharmacoepidemiology problem: patients are prescribed a treatment because of features of their condition that also predict outcome, so treated and untreated groups differ in baseline risk.
Time-Varying Confounding: Confounders whose values change over time and which are themselves affected by prior exposure. Standard regression cannot handle these properly; g-methods are required.
Residual Confounding: Confounding that remains after adjustment because the confounder was measured imprecisely, categorized too coarsely, or only partially captured.
Intersectionality: A framework (Crenshaw, 1989) for analyzing how systems of power based on race, gender, class, and other social positions combine to shape exposure and outcome. Pushes back on variable-by-variable adjustment as a complete causal strategy.
Adjustment & Statistical Tools
Stratification: Splitting the analysis into homogeneous subgroups defined by a confounder, then summarizing across strata. The conceptual ancestor of regression adjustment.
Standardization: A reweighting technique (direct or indirect) that produces a summary measure of effect or risk under a specified distribution of confounders. Common in age- and sex-adjusted rates.
Multivariable Regression: A modelling approach that estimates the association between exposure and outcome while holding other variables constant. Powerful but vulnerable to misspecification, multicollinearity, and adjusting for the wrong variables.
Model Misspecification: A model that does not match the true data-generating process (wrong functional form, missing interactions, or wrong link function), leading to biased estimates and misleading inferences.
Multicollinearity: High correlation among predictors that inflates standard errors and destabilizes coefficient estimates without (typically) biasing predictions.
Simpson’s Paradox: A reversal of an association when data are aggregated across a confounder. The classic warning that “summing up” subgroup data without accounting for structure can flip conclusions.
Ecological Fallacy: Mistakenly inferring individual-level relationships from group-level (ecological) data. The classic illustration: country-level wine consumption and heart disease.
Atomistic Fallacy: The mirror error: inferring group-level patterns from purely individual-level data, ignoring contextual effects.
Modifiable Areal Unit Problem (MAUP): In spatial analysis, the same data analyzed at different geographic scales or zonations can produce different (even opposite) results.
Missing Data Mechanisms: A taxonomy (Rubin) of how data come to be missing: MCAR (missing completely at random), MAR (missing at random conditional on observed variables), and MNAR (missing not at random). Each requires a different analytic strategy.
Statistical Inference
Null Hypothesis Significance Testing: A framework that asks how unusual observed data would be if the null hypothesis were true. Foundational but routinely misinterpreted, especially around p-values.
P-Value: The probability, assuming the null hypothesis is true, of observing data at least as extreme as those obtained. It is not the probability that the null is true, nor a measure of effect size.
Confidence Interval: A range of values consistent with the data under the assumed model. A 95% CI means that, in repeated sampling, 95% of similarly constructed intervals would contain the true parameter.
Type I Error (α): A “false positive”: rejecting the null hypothesis when it is in fact true. Conventionally controlled at 5%.
Type II Error (β): A “false negative”: failing to reject the null when it is false. Statistical power equals 1 − β.
Statistical Power: The probability of correctly rejecting the null when a specified alternative is true. Underpowered studies routinely miss real effects and exaggerate them when they do find them.
Statistical vs. Clinical Significance: A finding can be statistically significant (p < 0.05) yet too small to matter for patients or policy, and clinically important effects can be statistically nonsignificant when samples are small.
Key People
Judea Pearl (1936–): Computer scientist whose work on DAGs and the do-calculus formalized confounding, mediation, and adjustment as causal questions.
James Robins: Epidemiologist and biostatistician who developed g-methods (g-formula, marginal structural models, structural nested models) for time-varying confounding.
Miguel Hernán: Epidemiologist (Harvard) whose target-trial framework operationalizes adjustment for time-varying confounding in observational drug and policy studies.
Sander Greenland: Epidemiologist who has spent decades clarifying the meaning of confounding, collapsibility, and the (mis)use of statistical significance.
Donald Rubin: Statistician whose causal-inference framework (potential outcomes) and missing-data taxonomy underpin much of modern observational analysis.
Kimberlé Crenshaw: Legal scholar who coined “intersectionality” in 1989 to name how race, gender, and other systems of power compound, a foundational concept for population health equity.
Section 1 of 3

Confounding

⏱ Estimated reading time: 20 minutes

Introduction and Overview

Lessons 7–10 worked through bias one category at a time. This lesson finishes the inventory by addressing the third leg of the canonical bias triad — confounding — and then steps back to the statistical-inference issues that turn even unbiased estimates into wrong conclusions. The two content sections divide the work cleanly. Section 1 takes confounding from the textbook definition through pharmacoepidemiology's standard nightmare (confounding by indication), through the more advanced problem of time-varying confounding, into the harder theoretical question of whether constructs like race or income are confounders to be controlled or structural exposures to be measured. Section 2 turns to the analytic issues that haunt even well-adjusted models: model misspecification, multicollinearity, Type I/II errors, Simpson's paradox, the ecological fallacy, the modifiable areal unit problem, and missing data. By the end, the toolkit needed for Lesson 12's integrated appraisal will be complete.

Learning Objectives

  • State the three formal conditions a variable must satisfy to be a confounder, and distinguish confounders from mediators.
  • Use the WHI / HRT story to explain why observational and randomized estimates can point in opposite directions.
  • Identify confounding by indication in pharmacoepidemiologic studies and propose adjustment, restriction, or active-comparator strategies.
  • Define time-varying confounding with treatment-confounder feedback and explain why standard regression adjustment fails.
  • Articulate the conceptual debate over whether race, gender, and SES are confounders to be controlled or structural exposures to be measured.

What Is Confounding?

Confounding occurs when a third variable—the confounder—is associated with both the exposure and the outcome, distorting the observed relationship between them. Unlike mediators (which lie on the causal pathway), confounders represent alternative explanations for an association. If unaddressed, confounding can make a harmful exposure appear protective, a beneficial treatment appear harmful, or a null relationship appear significant.

Three Conditions for Confounding

A variable C is a confounder of the exposure–outcome relationship if it satisfies three conditions: (1) C is associated with the exposure, (2) C is an independent risk factor for the outcome, and (3) C is not on the causal pathway between the exposure and the outcome. If C is a mediator rather than a confounder, adjusting for it introduces bias rather than removing it. The modern formal definition is given by VanderWeele & Shpitser (2013), building on Greenland and Robins' (1986) foundational link between confounding, identifiability, and exchangeability; Hernán et al. (2002) further showed that confounder identification requires causal knowledge, not just statistical association.

Classic Case: Hormone Replacement Therapy

One of the most consequential examples of confounding in modern epidemiology involves hormone replacement therapy (HRT) and cardiovascular disease (CVD) risk in postmenopausal women. For decades, observational studies consistently suggested that HRT reduced CVD risk by 30–50%. These findings influenced clinical guidelines worldwide.

Case Study: The Women’s Health Initiative (WHI)

The WHI was a large randomized controlled trial launched in 1991. When the Writing Group for the WHI Investigators (2002) published its results, they showed that combined estrogen–progestin therapy actually increased the risk of coronary heart disease (HR 1.29, 95% CI: 1.02–1.63), stroke, and pulmonary embolism. The discrepancy has been attributed largely to confounding by socioeconomic status and health behaviors: women who chose HRT in observational studies tended to be wealthier, better educated, leaner, more physically active, and more engaged with preventive healthcare, all factors independently associated with lower CVD risk.

R Stratified analysis: confounding revealed and removed

What you'll do: simulate a 5,000-person dataset in which HRT has zero effect on CVD, then watch how an SES confounder produces a misleading crude OR that disappears when you look within strata or adjust in regression.

What to take away: stratification is the most direct demonstration of confounding. If the within-stratum estimates agree with each other but disagree with the crude estimate, you have confounding by the stratifying variable, and (absent other confounders) the within-stratum value is the unbiased one.

The classic remedy for confounding is to look within levels of the suspected confounder. The simulation below builds a true-null exposure-outcome relationship that's polluted by an SES confounder. The crude OR misleads; the stratum-specific ORs show the truth.

set.seed(230)
n <- 5000
ses <- rbinom(n, 1, 0.5)                                  # 1 = high SES
hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2)) # high SES uses HRT more
# CVD risk: lower in high SES, NOT affected by HRT (true null)
cvd <- rbinom(n, 1, prob = ifelse(ses == 1, 0.05, 0.15))

# Crude (unadjusted) OR -- looks "protective"
exp(coef(glm(cvd ~ hrt, family = binomial))["hrt"])

# Within each SES stratum -- the true effect: ~1.0
tapply(seq_len(n), ses, function(i) {
  exp(coef(glm(cvd[i] ~ hrt[i], family = binomial))[2])
})

# Adjusted OR, controlling for SES
exp(coef(glm(cvd ~ hrt + ses, family = binomial))["hrt"])
Console output
# crude (confounded - looks protective)
 hrt
0.42
# within-stratum (truth: ~1)
   0    1
0.99 1.05
# adjusted
 hrt
1.02

This is the WHI lesson in miniature. The crude estimate (0.42) is exactly the kind of number that fed twenty years of observational HRT enthusiasm. Adjusting for SES recovers the true null. In HSCI 341 you will build this intuition with Mantel-Haenszel summaries; in HSCI 410 you will use multivariable regression.
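As a cross-check on the simulation, the crude and stratum-specific odds ratios implied by the data-generating probabilities can be worked out directly, with no random sampling involved. The sketch below is an illustration added here, not part of the original exercise; the object names (`p_ses`, `p_hrt`, `p_cvd`) are assumptions. Any single simulated dataset will scatter around these expected values, and the exact figure printed depends on the random draw.

```r
# Probabilities copied from the simulation set-up
p_ses <- c(low = 0.5, high = 0.5)     # P(SES stratum)
p_hrt <- c(low = 0.2, high = 0.6)     # P(HRT = 1 | SES)
p_cvd <- c(low = 0.15, high = 0.05)   # P(CVD = 1 | SES); HRT plays no role

# CVD risk among users and non-users: the SES-specific risk averaged over
# the SES mix found in each exposure group
risk_users    <- sum(p_ses * p_hrt * p_cvd) / sum(p_ses * p_hrt)
risk_nonusers <- sum(p_ses * (1 - p_hrt) * p_cvd) / sum(p_ses * (1 - p_hrt))

odds <- function(p) p / (1 - p)
expected_crude_or <- odds(risk_users) / odds(risk_nonusers)

# Within a stratum, users and non-users share the same risk, so the OR is 1
expected_stratum_or <- odds(p_cvd) / odds(p_cvd)

round(expected_crude_or, 2)   # about 0.61
expected_stratum_or
```

Under these inputs the expected crude OR is well below 1 even though HRT has no effect in either stratum: users are mostly high-SES, non-users mostly low-SES, and the crude comparison is really a comparison of SES groups.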

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console before answering.

1. The crude OR for HRT vs CVD was 0.42, but the simulation built HRT with zero effect on CVD. Walk through the three conditions of confounding and explain how SES, as set up in the simulation, satisfies each one (associated with exposure, associated with outcome, not on the causal pathway).

Model answer: All three conditions are satisfied by construction. (1) Associated with the exposure: the line hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2)) hard-wires high-SES women to use HRT three times more often than low-SES women, so SES is strongly associated with HRT use. (2) Independent risk factor for the outcome: the CVD line gives high-SES women a 5% risk and low-SES women a 15% risk, so SES affects CVD even when HRT is held fixed (because HRT does not enter the cvd line at all). (3) Not on the causal pathway: SES is set before HRT in the data-generating process and is never caused by HRT, so it sits as a common cause, not a mediator. All three Rothman/Greenland conditions hold, which is exactly why the crude OR is biased and the SES-adjusted OR is unbiased.

2. The stratum-specific ORs (0.99 in low SES and 1.05 in high SES) and the SES-adjusted OR (1.02) are all close to 1. Why are they nearly identical to each other? What does that tell you about whether SES is acting as a confounder vs. an effect modifier in this simulated dataset?

Model answer: They are nearly identical because confounding is the only thing producing the bias. When you condition on SES, either by stratifying (0.99 and 1.05) or by entering SES as a covariate (1.02), you remove the spurious association and recover the true null. The fact that the two stratum-specific ORs agree with each other tells you SES is acting as a confounder, not an effect modifier: effect modification would show different ORs across strata (e.g. 0.5 in low SES, 1.8 in high SES). Here the within-stratum effect is the same (≈1) at both levels of SES, so a single summary estimate is appropriate. That is the classic signature of pure confounding.

3. The Women's Health Initiative trial overturned 20+ years of observational HRT findings. Using your simulation results, explain to a skeptical clinician why an RCT was needed even though the observational evidence was large and consistent.

Model answer: The simulation reproduces the failure mode of the pre-WHI observational literature: large samples, internally consistent findings, and a strongly ‘protective’ crude OR of 0.42 driven entirely by a confounder (SES) that no one fully measured. Observational studies cannot rule out unmeasured confounding, no matter how big they are. Adjusting for the SES you can see does nothing for the SES you cannot, and there are always lifestyle, health-seeking, and access variables that travel with HRT use. Randomization breaks the link between exposure and confounders by design: in the WHI, the assignment of HRT was independent of every measured and unmeasured baseline characteristic, so any difference in CVD between arms beyond chance could be attributed to the drug. The simulation shows in miniature what the WHI showed in practice: a 0.42 OR can come from confounding alone, and only randomization makes the comparison groups exchangeable in expectation.

Why This Matters

The HRT story illustrates how even large, well-conducted observational studies can produce misleading results when confounding is not adequately addressed. It was the randomized design of the WHI—which balanced measured and unmeasured confounders across groups—that revealed the true direction of effect. This case became a defining moment in evidence-based medicine.

The HRT case is the textbook example of confounding by lifestyle and SES. The next form — confounding by indication — is the most pervasive issue in observational pharmacoepidemiology, and the reason any drug-effect estimate from administrative data should be read with care.

Confounding by Indication

Confounding by indication is a specific form of confounding that arises in pharmacoepidemiological studies when the reason a treatment is prescribed (the “indication”) is itself a risk factor for the outcome. Because sicker patients are more likely to receive treatment, naive comparisons of treated vs. untreated patients systematically overestimate harm or underestimate benefit. The three flip cards below show the standard scenario, its mirror image (confounding by contraindication), and the design tools used to address both.

  • 💊 Statin Mortality Example
  • ⚠️ Confounding by Contraindication
  • 🛠 Design Solutions
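The standard scenario can be sketched with a small simulation in the style of the HRT exercise. Everything here is hypothetical: the variable names (`severe`, `treated`, `death`) and all probabilities are assumptions chosen for illustration, including the "true" treatment odds ratio of 0.5. Because sicker patients are treated more often, the crude comparison makes a beneficial drug look harmful; adjusting for the measured indication recovers the benefit.

```r
set.seed(230)
n <- 10000
severe  <- rbinom(n, 1, 0.4)                            # 1 = severe disease at baseline
treated <- rbinom(n, 1, ifelse(severe == 1, 0.8, 0.2))  # sicker patients are treated more often

# Assumed truth: treatment halves the odds of death (OR = 0.5);
# severe disease raises the log-odds of death by 2
death <- rbinom(n, 1, plogis(-3 + 2 * severe + log(0.5) * treated))

# Crude OR: treated vs untreated, ignoring the indication
crude_or <- exp(coef(glm(death ~ treated, family = binomial))["treated"])

# Adjusted OR: conditioning on baseline severity
adjusted_or <- exp(coef(glm(death ~ treated + severe, family = binomial))["treated"])

round(c(crude = unname(crude_or), adjusted = unname(adjusted_or)), 2)
```

With these inputs the crude OR is expected to sit above 1 (roughly 1.6 from the probabilities) and the adjusted OR near 0.5. The adjustment works here only because severity is fully measured; in real administrative data the indication is usually captured imperfectly, which is why restriction and active-comparator designs are often preferred.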

Confounding by indication is hard but at least the confounder sits at baseline. Time-varying confounders make the problem one step worse: they evolve, and their values feed back from prior treatment.

Time-Varying Confounding

Time-varying confounding occurs when a confounder changes over time and is simultaneously affected by prior treatment and predictive of future treatment and outcome. This creates a feedback loop that standard regression cannot resolve.

Case Study: HIV Treatment and CD4 Counts

In HIV care, treatment decisions depend on CD4 cell counts (a marker of immune function). A patient’s CD4 count at time t influences whether antiretroviral therapy (ART) is initiated or modified at time t+1. However, prior ART use also affects CD4 counts. This creates a feedback loop: CD4 count is both a confounder (it predicts treatment and mortality) and is affected by prior treatment.

Standard regression faces a dilemma here: adjusting for CD4 count blocks the part of the treatment effect that operates through CD4 (over-adjustment) and can induce collider bias when unmeasured factors affect both CD4 and the outcome, while failing to adjust leaves confounding uncontrolled. Marginal structural models (MSMs), fitted with inverse probability of treatment weighting (IPTW), can break this cycle by creating a pseudo-population where treatment is unconfounded by time-varying factors.

The problem with standard adjustment: When you include a time-varying confounder (like CD4 count) in a conventional regression model, you block part of the treatment’s causal effect that operates through CD4 counts. This is because CD4 count is simultaneously a confounder and a mediator. Standard regression cannot simultaneously adjust for confounding and preserve the treatment effect that flows through the mediator.

Hernán, Brumback, and Robins (2000) demonstrated that standard Cox regression produced biased estimates of ART effectiveness, often failing to detect substantial survival benefits that were recovered by MSMs.

How MSMs work: Marginal structural models use a two-step process: (1) estimate the probability of receiving the observed treatment at each time point given past covariates (the “propensity”), then (2) weight each observation by the inverse of that probability. This creates a pseudo-population in which treatment is unconfounded by time-varying factors.

The resulting weighted analysis estimates the causal effect of a treatment strategy (e.g., “always treat when CD4 < 350”) rather than the observational association between treatment and outcome.
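The two-step process can be sketched in base R with a hypothetical two-visit simulation. All names and coefficients below are assumptions for illustration: `L0` and `L1` stand for low CD4 at each visit, `A0` and `A1` for treatment, and `Y` for a continuous outcome where higher is worse. By construction each treatment period lowers `Y` by 1 directly, and earlier treatment also helps by improving later CD4, so the total effect of `A0` is about −1.6. The simulation has no unmeasured common causes, so the bias it shows in the conventional model is the blocked-mediator kind only.

```r
set.seed(230)
n  <- 20000
L0 <- rbinom(n, 1, 0.5)                                 # 1 = low CD4 at baseline
A0 <- rbinom(n, 1, plogis(-1 + 2 * L0))                 # low CD4 prompts treatment
L1 <- rbinom(n, 1, plogis(-0.5 + 1.5 * L0 - 1.5 * A0))  # treatment improves later CD4
A1 <- rbinom(n, 1, plogis(-1 + 2 * L1 + A0))            # later treatment depends on L1
Y  <- 2 * L0 + 2 * L1 - A0 - A1 + rnorm(n)              # higher = worse

# Step 1: probability of the treatment actually received at each visit
p_obs <- function(fit, a) ifelse(a == 1, fitted(fit), 1 - fitted(fit))
den0 <- p_obs(glm(A0 ~ L0,           family = binomial), A0)
den1 <- p_obs(glm(A1 ~ A0 + L0 + L1, family = binomial), A1)
num0 <- p_obs(glm(A0 ~ 1,            family = binomial), A0)
num1 <- p_obs(glm(A1 ~ A0,           family = binomial), A1)

# Step 2: stabilized inverse-probability-of-treatment weights
sw <- (num0 * num1) / (den0 * den1)

# Conventional adjustment conditions on L1 and blocks the A0 -> L1 -> Y path
adjusted <- coef(lm(Y ~ A0 + A1 + L0 + L1))
# Weighted marginal structural model for the joint effect of A0 and A1
msm <- coef(lm(Y ~ A0 + A1, weights = sw))

round(rbind(adjusted = adjusted[c("A0", "A1")], msm = msm[c("A0", "A1")]), 2)
```

The covariate-adjusted model should return something close to the direct effect of `A0` (about −1), while the weighted model should land near the total effect. A real analysis would also need robust standard errors for the weighted fit and a check of the weight distribution; both are omitted from this sketch.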

Feature | Standard Regression | Marginal Structural Model
Time-varying confounding | Adjusting blocks mediated effects and can induce collider bias | Properly handled via weighting
Treatment-confounder feedback | Cannot resolve | Explicitly modeled
Estimate type | Conditional on covariates | Marginal (population-level)
Causal interpretation | Limited without assumptions | Causal under exchangeability
Complexity | Simple to implement | Requires careful weight estimation

Marginal structural models handle the technical problem of time-varying confounding within a single causal pathway. The next subsection addresses a deeper question that no analytic technique alone can resolve: whether some of the variables we routinely treat as confounders should be reframed as the very exposures we ought to be measuring.

Beyond “Controlling For”: Intersectionality and the Limits of Variable-by-Variable Adjustment

So far this lesson has treated confounding as a problem with a technical solution: identify the confounders, adjust for them, and proceed to the “true” effect. That logic works well for the kind of variable a randomised trial would have balanced—a clinical comorbidity, a baseline lab value, a discrete behaviour. It works less well, and sometimes badly, when the “confounder” is not really a variable at all but a structural process that produces both the exposure and the outcome over the lifecourse.

A theoretical caution

Putting race, sex, or income in a regression as a covariate is a modelling choice with theoretical content. It treats those constructs as if they were stable, individual-level attributes whose effect on the outcome can be additively isolated from the exposure of interest. Many social epidemiologists argue that this framing misrepresents what these variables actually index—chronic exposure to racism, patriarchy, and economic deprivation—and produces estimates whose policy meaning is unclear (Krieger, 2014; VanderWeele & Robinson, 2014).

Race is not a confounder; racism is an exposure

A standard Table 1 in an epidemiological paper presents race/ethnicity as a baseline characteristic to be adjusted for. But race itself does not have a biological mechanism that causes hypertension, low birthweight, or COVID-19 mortality. What does the causing is racism—experienced across a lifecourse as residential segregation, job market discrimination, biased policing, differential medical treatment, and chronic vigilance—all of which become biologically embodied (Williams, Lawrence, & Davis, 2019; Krieger, 2014).

When researchers “control for race,” they often inadvertently obscure the structural process they should be measuring. Worse, adjusting for downstream consequences of racism (income, education, neighbourhood) can constitute over-adjustment bias, partialling out the very mediators through which the structural exposure operates (VanderWeele & Robinson, 2014; Schisterman, Cole, & Platt, 2009). The variable is in the model; the explanation has been removed.

Intersectionality and the additivity assumption

Crenshaw’s (1989) concept of intersectionality began in legal theory with a simple observation: a Black woman’s experience of discrimination is not the sum of “being Black” plus “being a woman.” The intersections produce qualitatively distinct exposures that neither single-axis category, nor an additive combination of them, can capture.

Standard regression, by default, models adjustment as additive: the coefficient on race is interpreted as the effect “holding gender, class, and other variables constant.” Bauer (2014) shows why this is theoretically inadequate for studying inequality. Holding gender constant while estimating a racial effect implicitly imagines a population in which racial categorisation is detached from gendered experience—a counterfactual that does not, in any meaningful sense, exist. Quantitative researchers can address this with explicit interaction terms, stratified analyses, intersectional MAIHDA (multilevel analysis of individual heterogeneity and discriminatory accuracy; Evans et al., 2018), or descriptive analyses that report rates within intersecting groups rather than coefficients adjusted across them.
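The contrast between additive adjustment and reporting within intersecting groups can be sketched with two generic binary positions. The names `axis_a` and `axis_b` and the risks of 10% and 30% are hypothetical; the simulation assumes, purely for illustration, that risk is elevated only where the two positions intersect.

```r
set.seed(230)
n <- 8000
axis_a <- rbinom(n, 1, 0.5)   # hypothetical binary social position A
axis_b <- rbinom(n, 1, 0.5)   # hypothetical binary social position B

# Assumed truth: risk is elevated only at the intersection of A and B
risk    <- ifelse(axis_a == 1 & axis_b == 1, 0.30, 0.10)
outcome <- rbinom(n, 1, risk)

# Descriptive rates within the four intersecting groups
observed <- tapply(outcome, list(A = axis_a, B = axis_b), mean)

# Additive linear probability model versus one with an interaction term
additive         <- lm(outcome ~ axis_a + axis_b)
with_interaction <- lm(outcome ~ axis_a * axis_b)

grid <- expand.grid(axis_a = 0:1, axis_b = 0:1)
round(cbind(grid,
            observed         = as.vector(observed),
            additive         = predict(additive, grid),
            with_interaction = predict(with_interaction, grid)), 3)
```

The additive model spreads the intersection-specific excess across both main effects, so it understates risk in the doubly positioned group and misstates it elsewhere. The interaction model reproduces the four group rates exactly because it is saturated. An interaction term is only one of the options listed above and does not by itself make an analysis intersectional in Crenshaw's sense.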

Worked example: maternal mortality

In the United States, Black women die from pregnancy-related causes at roughly three times the rate of White women (Petersen et al., 2019). A conventional analysis might fit a model of maternal mortality with race, age, education, income, insurance, and parity as covariates and report a residual race coefficient that is smaller than the crude difference. The headline often becomes: “most of the disparity is explained by socioeconomic factors.”

Interpreted through a fundamental-causes lens (Phelan, Link, & Tehranifar, 2010), this is the wrong reading. Education, income, and insurance are mechanisms through which structural racism produces the disparity—adjusting for them does not explain the disparity away, it merely reroutes it. An intersectional analysis instead asks how Black women specifically experience obstetric care, what dismissal and pain under-recognition look like at this intersection, and what changes when interventions are designed for Black mothers rather than for “women” or for “low-income patients” in general. Different theories produce different analyses; different analyses produce different policy recommendations.

When biomedical models fall short

The biomedical model is comfortable with confounders that are themselves biomedical or behavioural: smoking confounding the coffee–CHD relationship, age confounding most things. It becomes unstable when the “confounder” is the social structure itself, because that confounder operates over decades, through dozens of mediating mechanisms, with effects that change as societal conditions change. Fundamental cause theory predicts exactly this: as one mechanism is closed off (e.g., smoking is reduced), the social gradient in mortality reappears through whatever mechanism is currently relevant (e.g., obesity, opioid overdose). The pattern is robust to mechanism-by-mechanism adjustment because the cause is upstream of any specific mechanism (Link & Phelan, 1995; Phelan et al., 2010).

What this means for appraisal

When a paper reports an effect estimate “adjusted for race, income, and education,” ask: What is the underlying theory of how these variables relate to the exposure and outcome? Are they confounders to be partialled out, or mediators that carry the causal effect of structural conditions? Would an intersectional or stratified analysis tell a different story? A defensible study makes its theoretical commitments explicit and is honest about the difference between “the effect of X holding Y constant” (a statistical operation) and “what would happen if we changed X” (a causal claim that depends on what Y actually represents in the world).

Key Takeaways: Confounding

  • Confounding produces spurious associations or masks real ones; the HRT/WHI case demonstrates how observational and experimental results can diverge
  • Confounding by indication is pervasive in pharmacoepidemiology—sicker patients receiving treatment biases naive comparisons
  • Time-varying confounding requires specialized methods (e.g., MSMs) because standard regression cannot simultaneously adjust for confounding and preserve causal pathways
  • Randomization remains the most powerful tool for addressing confounding, but advanced analytic methods can support causal inference from observational data when their assumptions are stated and plausible
  • Treating race, gender, or income as ordinary confounders to be “controlled for” embeds a theoretical claim that may obscure the structural processes—racism, patriarchy, fundamental causes—those variables actually index
  • Intersectionality and ecosocial frameworks invite analyses that go beyond additive adjustment, recognising that what gets measured and modelled shapes what knowledge is produced and which interventions become thinkable
Knowledge Check — Section 1

1. In the Women’s Health Initiative, observational studies and the RCT produced opposite conclusions about HRT and cardiovascular risk. What best explains this discrepancy?

Women who chose HRT in observational studies were systematically healthier, wealthier, and more engaged in preventive care. These confounders independently lowered CVD risk, creating the illusion that HRT was protective. The WHI randomized participants, balancing these confounders across groups and revealing the true harmful effect.

2. A study finds that patients prescribed a new analgesic have worse pain outcomes than those not prescribed it. Which bias most likely explains this finding?

Confounding by indication occurs when the reason for prescribing treatment (more severe pain) is itself associated with the outcome (worse pain outcomes). Patients receiving the analgesic had more severe baseline pain, making their outcomes appear worse regardless of drug efficacy.

3. In HIV treatment studies, why can standard regression not adequately adjust for CD4 count when estimating the effect of antiretroviral therapy on survival?

CD4 count is affected by prior treatment (making it a mediator) and also predicts future treatment and survival (making it a confounder). Adjusting for it in standard regression blocks the treatment effect that operates through CD4 and can introduce collider bias; leaving it out leaves confounding uncontrolled. Marginal structural models address this by using inverse probability weighting.
Section 2 of 3

Statistical Inference & Model Issues

⏱ Estimated reading time: 25 minutes

Introduction and Overview

Section 1 closed the bias inventory. Even after every bias has been addressed, an analysis can still produce wrong conclusions if the statistical model itself does not fit the data, or if the inferential framework is misused. This section walks through seven such issues in order: model misspecification, multicollinearity, Type I/II errors, Simpson's paradox (with a hands-on simulator), the ecological and atomistic fallacies (callbacks to Lesson 6), the modifiable areal unit problem, and missing data. Each is short on its own, but together they constitute most of the analytic mistakes that survive peer review.

Learning Objectives

  • Recognize model misspecification (e.g., linear models forced onto J-shaped relationships) and choose appropriate functional forms.
  • Diagnose multicollinearity and apply remedies such as variable selection, principal components, or single-pollutant analyses.
  • Distinguish Type I and Type II errors and explain the “winner’s curse” that inflates effect sizes in low-powered studies.
  • Identify Simpson’s paradox, the ecological fallacy, the atomistic fallacy, and the modifiable areal unit problem in published work.
  • Classify missing data as MCAR, MAR, or MNAR and select an appropriate handling strategy (complete-case, multiple imputation, sensitivity analysis).

Model Misspecification

A statistical model is misspecified when the assumed functional form does not reflect the true relationship between variables. One of the most common errors is assuming a linear relationship when the true relationship is nonlinear.

Case Study: Alcohol and All-Cause Mortality

The relationship between alcohol consumption and mortality is often described as J-shaped: light-to-moderate drinkers appear to have lower mortality than both abstainers and heavy drinkers. If a researcher incorrectly fits a linear model to these data, they might conclude either that alcohol is uniformly harmful (positive slope) or uniformly protective (negative slope), depending on the distribution of consumption in their sample.

More recent analyses (e.g., Stockwell et al., 2016) have shown that the apparent protective effect of moderate drinking largely disappears when studies correct for “sick quitter” bias (former drinkers misclassified as abstainers) and use appropriate nonlinear models. Model specification is not just a statistical nicety—it can reverse a study’s conclusions.
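The cost of forcing a straight line onto a curved relationship can be sketched with simulated data. The J-shape below is assumed purely for illustration (lowest risk at a hypothetical 2 drinks per day) and is not an estimate of the real alcohol–mortality curve, which the analyses cited above call into question. The variable names and coefficients are likewise assumptions.

```r
set.seed(230)
n <- 5000
drinks <- runif(n, 0, 6)   # hypothetical drinks per day

# Assumed J-shaped log-odds of death, lowest at 2 drinks per day
death <- rbinom(n, 1, plogis(-2 + 0.15 * (drinks - 2)^2))

linear    <- glm(death ~ drinks, family = binomial)                 # misspecified
quadratic <- glm(death ~ drinks + I(drinks^2), family = binomial)   # matches the curve

# Lower AIC indicates the better-fitting functional form
AIC(linear, quadratic)

# Fitted risk at 0, 2, and 5 drinks per day under each model
grid <- data.frame(drinks = c(0, 2, 5))
round(cbind(grid,
            linear    = predict(linear, grid, type = "response"),
            quadratic = predict(quadratic, grid, type = "response")), 3)
```

The linear fit can only rise or fall, so it cannot show the dip at moderate intake; here it reports a uniformly increasing risk. In practice, splines or fractional polynomials are common alternatives when the shape is unknown in advance.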

Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. In a correctly specified model this does not bias coefficient estimates, but it can greatly inflate their standard errors, making individual effects unstable and difficult to interpret.

Example: Environmental Pollutant Studies

In studies of air quality and respiratory disease, researchers often include multiple pollutants (PM2.5, ozone, NO2, SO2) simultaneously. Because these pollutants share common sources (traffic, industry), they are often highly correlated. A regression model including all of them may produce coefficients that flip sign, lose significance, or vary wildly between samples—even though the pollutants truly affect health. Solutions include principal component analysis, variable selection, or analyzing one pollutant at a time with mutual adjustment in sensitivity analyses.
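A short simulation can make the standard-error inflation concrete. The sketch below rests on stated assumptions: two hypothetical pollutants driven by a shared traffic source, equal true effects of 0.5 on a made-up symptom score, and a variance inflation factor computed by hand as 1 / (1 − R²) rather than with an add-on package.

```r
set.seed(230)
n       <- 500
traffic <- rnorm(n)                       # shared source (assumed)
pm25    <- traffic + rnorm(n, sd = 0.3)   # both pollutants track traffic
no2     <- traffic + rnorm(n, sd = 0.3)

# Assumed truth: each pollutant has the same real effect (0.5) on a symptom score
symptoms <- 0.5 * pm25 + 0.5 * no2 + rnorm(n)

cor(pm25, no2)                            # correlation between the two predictors

# Variance inflation factor for pm25, computed by hand: 1 / (1 - R^2)
r2_pm25  <- summary(lm(pm25 ~ no2))$r.squared
vif_pm25 <- 1 / (1 - r2_pm25)
vif_pm25

fit_single <- lm(symptoms ~ pm25)         # single-pollutant model
fit_joint  <- lm(symptoms ~ pm25 + no2)   # multi-pollutant model

se_single <- summary(fit_single)$coefficients["pm25", "Std. Error"]
se_joint  <- summary(fit_joint)$coefficients["pm25", "Std. Error"]
c(single = se_single, joint = se_joint)

coef(fit_single)["pm25"]                  # also absorbs part of the NO2 effect
coef(fit_joint)[c("pm25", "no2")]
```

The single-pollutant coefficient is more precise, but in this setup it also picks up part of the NO2 effect. That trade-off is one reason single-pollutant results are usually paired with mutually adjusted sensitivity analyses.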

Type I and Type II Errors

A Type I error (false positive) is concluding that an effect exists when it does not; its probability is alpha (α), the significance level. A Type II error (false negative) is failing to detect a real effect; its probability is beta (β), and statistical power is 1 − β. For a fixed sample size these errors trade off: making it harder to achieve significance (lower alpha) reduces the Type I error rate but increases the Type II error rate, especially in small samples.

Power and the Winner’s Curse in Rare Disease Studies

Rare disease studies are particularly vulnerable because small sample sizes mean low statistical power. A study of a disease affecting 1 in 100,000 people may have only 50 cases, yielding power of 20–30% for moderate effect sizes—meaning 70–80% of true treatment effects will go undetected. Paradoxically, the studies that do find significant results in low-power settings tend to overestimate effect sizes—a phenomenon called the “winner’s curse.” (Multiple testing, p-hacking, and the broader integrity issues these problems create are addressed in Lesson 1.)
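Low power and the winner's curse can both be reproduced by simulating many small trials of the same true effect. The sketch below is illustrative only: the effect size (0.4 standard deviations), the 25 participants per arm, and the function name `one_trial` are assumptions, and `power.t.test()` is used for the analytic power of a two-sample t-test.

```r
set.seed(230)
true_effect <- 0.4      # assumed standardised mean difference
n_per_arm   <- 25       # 50 participants in total

# Analytic power of a two-sided, two-sample t-test at alpha = 0.05
analytic_power <- power.t.test(n = n_per_arm, delta = true_effect, sd = 1)$power
analytic_power

# Simulate one small trial and return its estimate and p-value
one_trial <- function() {
  treated <- rnorm(n_per_arm, mean = true_effect)
  control <- rnorm(n_per_arm, mean = 0)
  c(estimate = mean(treated) - mean(control),
    p_value  = t.test(treated, control)$p.value)
}
trials <- as.data.frame(t(replicate(5000, one_trial())))

significant <- trials$p_value < 0.05
mean(significant)                         # empirical power
mean(trials$estimate)                     # all trials: close to the true effect
mean(trials$estimate[significant])        # significant trials only: inflated
```

Averaged over every simulated trial the estimate is roughly unbiased; the inflation appears only after conditioning on statistical significance, which is what a literature of published "positive" small studies effectively does.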

Simpson’s Paradox

Simpson’s paradox occurs when a trend that appears in aggregated data reverses when the data are stratified by a confounding variable. The paradox highlights the danger of drawing causal conclusions from aggregate statistics without considering underlying group structure.

Case Study: Treatment Effectiveness by Disease Severity

Consider a new treatment tested at two hospitals. Hospital A treats mostly mild cases; Hospital B treats mostly severe cases. Overall, Treatment X appears to have a lower success rate than the standard. But when stratified by severity:

[Figure: “The same patients, two stories.” Axes: better patient outcome; more Treatment X received. Trend lines: mild cases (Hospital A), rising; severe cases (Hospital B), rising; pooled (aggregate) trend, falling. Annotations: within mild cases, treatment helps; within severe cases, treatment helps; pooled trend, treatment appears worse.]
Figure. Each point is a patient. Within every severity group the trend line rises — more Treatment X goes with better recovery — yet the dashed pooled line falls. The reversal happens because Treatment X was given disproportionately to severe cases (Hospital B), who recover worse no matter what. This is the picture behind the table below: stratum-specific success rates favour Treatment X, but the aggregate hides the confounding by severity. It is also the “Simpson’s paradox (sign flip)” preset in the interactive simulator further down — each stratum-specific risk ratio points one way while the crude RR points the other.
Group | Treatment X success | Standard success | Interpretation
Overall | 60/100 (60%) | 70/100 (70%) | Standard appears better
Mild cases | 18/20 (90%) | 64/80 (80%) | Treatment X is better
Severe cases | 42/80 (52.5%) | 6/20 (30%) | Treatment X is better; most Treatment X patients are severe cases

The reversal occurs because Treatment X was disproportionately assigned to severe cases (which have worse outcomes regardless of treatment). The aggregated data hides this confounding by severity, producing a misleading conclusion. The correct causal interpretation requires stratification.
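The reversal can be checked with a few lines of arithmetic. The sketch below uses hypothetical counts chosen so that both strata favour Treatment X while the pooled comparison favours the standard; the counts, the data-frame layout, and the use of direct standardisation as the summary are all assumptions for illustration.

```r
# Hypothetical counts (assumed for illustration, not from a real trial)
counts <- data.frame(
  severity  = c("mild", "mild", "severe", "severe"),
  treatment = c("X", "standard", "X", "standard"),
  successes = c(18, 64, 42, 6),
  patients  = c(20, 80, 80, 20)
)

# Stratum-specific success rates
counts$rate <- counts$successes / counts$patients
counts

# Pooled (crude) success rates: collapse over severity
pooled <- aggregate(cbind(successes, patients) ~ treatment,
                    data = counts, FUN = sum)
pooled$rate <- pooled$successes / pooled$patients
pooled

# Direct standardisation: apply each arm's stratum rates to the same
# severity mix (the combined population of both arms)
severity_mix <- tapply(counts$patients, counts$severity, sum) / sum(counts$patients)
standardised <- sapply(split(counts, counts$treatment), function(arm) {
  sum(arm$rate * severity_mix[arm$severity])
})
standardised
```

Once both arms are evaluated on the same severity mix, the standardised rates agree with the stratum-specific comparison rather than with the crude one.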

Hands-on: Confounding & Simpson's Paradox

What you'll do: the simulator below builds a two-stratum population and lets you set how strongly the confounder drives both treatment assignment and outcome risk. What to take away: the within-stratum risk ratios and the crude (unadjusted) risk ratio can disagree dramatically and, with the right combination of strengths, can point in opposite directions (one below 1, the other above 1), which is the defining feature of Simpson's paradox. The simulator complements the R-box from Section 1: there you stratified to recover a true null; here you can build the reversal by hand.

🎲 Interactive: Confounding & Simpson’s Paradox

A two-strata population (e.g., mild vs. severe cases). Set how strongly the confounder C drives treatment assignment and how strongly C drives outcome. Watch the crude RR drift away from the stratum-specific RR. Push both sliders hard and you can flip the sign — Simpson’s paradox in action.

Try the Simpson’s paradox preset: each stratum shows a treatment benefit (RR < 1), but the crude RR shows treatment harm (RR > 1). The confounder steered treatment toward the high-risk stratum and the aggregate hides it.
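For readers working from a static copy of this page, the calculation behind a simulator of this kind can be sketched in R. This is not the simulator's actual code: the function name `crude_vs_stratified`, its arguments, and the default values are all hypothetical. It computes the expected crude risk ratio in a two-stratum population where the true risk ratio is the same in both strata.

```r
# Expected crude and stratum-specific risk ratios in a two-stratum population.
# All argument names and default values are illustrative assumptions.
crude_vs_stratified <- function(p_c        = 0.5,   # share of population with C+
                                p_treat_c1 = 0.9,   # P(treated | C+)
                                p_treat_c0 = 0.1,   # P(treated | C-)
                                risk_c1    = 0.6,   # untreated risk in C+
                                risk_c0    = 0.1,   # untreated risk in C-
                                rr_within  = 0.8) { # true RR in both strata
  p_stratum  <- c(p_c, 1 - p_c)
  p_treat    <- c(p_treat_c1, p_treat_c0)
  risk_untrt <- c(risk_c1, risk_c0)
  risk_trt   <- rr_within * risk_untrt

  # Stratum mix among treated and untreated people (Bayes' rule)
  mix_trt   <- p_stratum * p_treat       / sum(p_stratum * p_treat)
  mix_untrt <- p_stratum * (1 - p_treat) / sum(p_stratum * (1 - p_treat))

  crude_rr <- sum(mix_trt * risk_trt) / sum(mix_untrt * risk_untrt)
  c(rr_within = rr_within, crude_rr = crude_rr)
}

crude_vs_stratified()                                     # strong confounding
crude_vs_stratified(p_treat_c1 = 0.5, p_treat_c0 = 0.5)   # C unrelated to treatment
```

With the assumed defaults the treatment is beneficial within each stratum yet the crude risk ratio exceeds 1; when treatment assignment no longer depends on C, the crude and within-stratum risk ratios coincide.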

Simpson's paradox is what happens when stratification reveals confounding at the individual level. The complementary problem is what happens when we move between levels of analysis — from groups to individuals or vice versa. Lesson 6 introduced this material in detail; the next two subsections recall the essentials.

Ecological Fallacy and Atomistic Fallacy

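🌎 Ecological Fallacy: assuming that a relationship observed at the group level also holds for the individuals within those groups.
👤 Atomistic Fallacy: assuming that a relationship observed among individuals also holds at the group or population level.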
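A simulated example can show how group-level and individual-level associations come apart. The sketch below is built on assumptions chosen to make the contrast stark: ten hypothetical groups whose average exposure and average outcome rise together, while inside every group the individuals with more exposure have lower outcomes.

```r
set.seed(230)
n_groups  <- 10
per_group <- 200
group     <- rep(seq_len(n_groups), each = per_group)

# Between groups: higher average exposure goes with higher average outcome.
# Within groups: individuals with above-average exposure have LOWER outcomes.
deviation <- rnorm(n_groups * per_group)
exposure  <- group + deviation
outcome   <- group - 0.8 * deviation + rnorm(n_groups * per_group, sd = 0.5)
people    <- data.frame(group, exposure, outcome)

# Group-level (ecological) correlation of the group means
group_means    <- aggregate(cbind(exposure, outcome) ~ group,
                            data = people, FUN = mean)
cor_ecological <- cor(group_means$exposure, group_means$outcome)

# Individual-level correlation inside each group
cor_within <- sapply(split(people, people$group),
                     function(g) cor(g$exposure, g$outcome))

cor_ecological      # strongly positive
range(cor_within)   # negative in every group
```

Reading the group-level correlation as a statement about individuals would be the ecological fallacy here; reading the within-group correlation as a statement about how groups compare would be the atomistic fallacy.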

Modifiable Areal Unit Problem (MAUP)

The modifiable areal unit problem is an aggregation problem closely related to the ecological fallacy and specific to spatial analysis. When individual-level data are aggregated into geographic units (census tracts, counties, provinces), the choice of unit size and boundary definitions can alter statistical results, sometimes dramatically.

Example: Disease Clustering

A study examining cancer incidence near an industrial facility might find a significant cluster when data are aggregated at the postal code level but not at the health region level. Conversely, aggregating at a smaller level might produce unstable estimates due to small case counts. Neither result is “wrong”—both are artifacts of the chosen boundaries. The MAUP means that spatial epidemiological conclusions depend partly on arbitrary geographic decisions rather than solely on underlying disease patterns.
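The dependence on geographic units can be illustrated with simulated small areas. The sketch below assumes a 20 × 20 grid of cells, a shared signal at the level of 5 × 5 blocks, and heavy cell-level noise; the zoning schemes and the helper `area_cor` are hypothetical. It computes the exposure-incidence correlation at the cell level and under two different aggregations of the same cells.

```r
set.seed(230)
cells <- expand.grid(row = 1:20, col = 1:20)          # 400 small areas

# Zoning 1: sixteen 5 x 5 square regions
cells$region <- paste((cells$row - 1) %/% 5, (cells$col - 1) %/% 5, sep = "-")
# Zoning 2: twenty north-south strips, one per column
cells$strip  <- cells$col

# Assumed process: a shared regional signal plus heavy cell-level noise
signal <- setNames(rnorm(16), unique(cells$region))
cells$exposure  <- unname(signal[cells$region]) + rnorm(400, sd = 2)
cells$incidence <- unname(signal[cells$region]) + rnorm(400, sd = 2)

# Correlation of zone means under a given zoning scheme
area_cor <- function(zone) {
  means <- aggregate(cbind(exposure, incidence) ~ zone,
                     data = cbind(cells, zone = zone), FUN = mean)
  cor(means$exposure, means$incidence)
}

cor_cell   <- cor(cells$exposure, cells$incidence)
cor_region <- area_cor(cells$region)
cor_strip  <- area_cor(cells$strip)
c(cell = cor_cell, region = cor_region, strip = cor_strip)
```

The underlying cells never change; only the boundaries do. Under these assumptions, averaging within regions removes most of the cell-level noise, so the regional correlation is much stronger than the cell-level one, and the strip zoning gives a different answer again.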

The ecological-fallacy and MAUP issues arise from how the data were assembled. The last analytic problem in this section is what to do when those data have holes in them — a near-universal problem whose handling can either preserve or wreck a study's conclusions.

Missing Data

Missing data are ubiquitous in epidemiological research. The validity of analysis depends critically on the mechanism underlying missingness; the modern taxonomy of MCAR / MAR / MNAR was set out by Rubin (1976):

Missing Completely At Random (MCAR)

Data are MCAR when the probability of being missing is unrelated to both observed and unobserved data. Example: a lab sample is accidentally dropped. Complete case analysis is unbiased under MCAR but loses power.

Missing At Random (MAR)

Data are MAR when missingness depends on observed variables but not on the missing values themselves, after conditioning on observed data. Example: younger participants are more likely to skip a depression questionnaire, but among people of the same age, missingness is unrelated to depression severity. Multiple imputation and maximum likelihood methods produce valid estimates under MAR.

Missing Not At Random (MNAR)

Data are MNAR when missingness depends on the unobserved values themselves. Example: people with severe depression are less likely to complete follow-up questionnaires because of their depression. No standard analytic method can fully correct MNAR; sensitivity analyses with different assumptions about the missing data mechanism are essential.

Complete Case Analysis Under Non-MCAR

When data are MAR or MNAR, restricting analysis to complete cases can introduce selection bias, because the remaining sample may no longer be representative of the study population. For example, if sicker patients are more likely to drop out of a clinical trial, complete case analysis can overestimate treatment effectiveness. Multiple imputation, which generates plausible values for missing data based on observed relationships, is preferred for MAR data.
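The three mechanisms can be compared in a simulation where the full data are known. The sketch below is built on assumptions: a hypothetical depression score that depends on standardised age, three invented missingness models, and a regression-adjusted mean that uses observed age. That adjusted mean is a simplified stand-in for methods that are valid under MAR; it is not an implementation of multiple imputation.

```r
set.seed(230)
n     <- 20000
age_z <- rnorm(n)                                  # standardised age (assumed)
dep   <- 10 - 2 * age_z + rnorm(n, sd = 3)         # depression score, true mean 10

# Three assumed missingness mechanisms for the depression score
miss_mcar <- rbinom(n, 1, 0.3) == 1                            # unrelated to anything
miss_mar  <- rbinom(n, 1, plogis(-1 - 1.5 * age_z)) == 1       # younger people skip more
miss_mnar <- rbinom(n, 1, plogis(-1 + 0.5 * (dep - 10))) == 1  # more depressed skip more

# Complete-case mean versus a regression-adjusted mean that uses age
estimate_means <- function(missing) {
  dep_obs <- ifelse(missing, NA, dep)
  fit     <- lm(dep_obs ~ age_z, data = data.frame(dep_obs, age_z))  # NA rows dropped
  c(complete_case = mean(dep_obs, na.rm = TRUE),
    age_adjusted  = mean(predict(fit, newdata = data.frame(age_z = age_z))))
}

results <- rbind(MCAR = estimate_means(miss_mcar),
                 MAR  = estimate_means(miss_mar),
                 MNAR = estimate_means(miss_mnar))
round(results - mean(dep), 2)                      # bias relative to the full-data mean
```

Under these assumptions the complete-case mean is close to the truth only for MCAR; using observed age removes the bias under MAR; and under MNAR neither estimate recovers the truth, which is why sensitivity analyses are needed there.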

Reflection

Think of a health study you have encountered (in this course or elsewhere). Identify one potential statistical or analytic issue discussed in this section (model misspecification, multicollinearity, low power, Simpson’s paradox, ecological fallacy, MAUP, or missing data) that could threaten its conclusions. Explain why the issue applies and how it might have been addressed.

Model answer: A strong response names a specific study (e.g., the Lancet alcohol-and-all-cause-mortality meta-analysis, the original Honolulu Heart Program coffee-CHD finding, a JUPITER-style trial sub-analysis, or a regional COVID-mortality comparison), picks one of the seven issues, and applies it concretely. Good examples:

  • (a) A J-shaped alcohol-mortality curve forced through a linear model: misspecification, addressable with splines or category dummies.
  • (b) State-level smoking-vs-lung-cancer correlations interpreted at the individual level: ecological fallacy, addressable with multilevel modelling on individual data.
  • (c) A 40-person rare-disease trial reporting a null result: low power / Type II error, addressable with prospective sample-size calculation, multicentre pooling, or Bayesian designs that report posterior probabilities rather than reject/fail-to-reject.
  • (d) Census-tract obesity rates that change when tracts are aggregated to counties: the MAUP, addressable by reporting sensitivity to spatial unit.
  • (e) A trial with 18% loss-to-follow-up handled by complete-case analysis where dropouts were sicker: non-MCAR missingness, addressable with multiple imputation or inverse-probability-of-censoring weights.

Weak responses name a generic study and an issue without showing how the issue applies.
Knowledge Check — Section 2

1. A researcher models the relationship between alcohol consumption and mortality using a linear regression and concludes that alcohol is uniformly protective. What error has likely occurred?

The relationship between alcohol and mortality is nonlinear (J-shaped). A linear model cannot capture the increased mortality at both extremes of consumption. Forcing a linear fit may show only the downward slope, producing a misleading conclusion of uniform protection.

2. A treatment appears worse than the standard in overall data, but better when results are stratified by disease severity. This is an example of:

Simpson’s paradox occurs when a trend present in aggregated data reverses upon stratification by a confounding variable (here, disease severity). The treatment was disproportionately given to severe cases, making it appear worse overall despite being superior within each severity group.

3. In a clinical trial for a rare disease with only 40 participants, which statement is most accurate regarding statistical error?

Small sample sizes produce low statistical power (high probability of Type II error). With only 40 participants, the study may fail to detect genuine treatment effects. The Type I error rate is set by the alpha level and is not directly increased by small sample size, though significant findings from underpowered studies are more likely to be overestimates (the winner’s curse).
Section 3 of 3

Final Assessment

⏱ Estimated time: 20 minutes

Bringing It All Together

This lesson completed the bias inventory of HSCI 230 by closing out confounding (Section 1) and the analytic issues that survive even unbiased data (Section 2). The HRT / WHI story made vivid how confounding can reverse the direction of a clinical recommendation, while confounding by indication and time-varying confounding showed why specialised tools — restriction, active comparators, marginal structural models — are part of the modern epidemiologist's repertoire. The lesson also pushed past the conventional “variable-by-variable adjustment” framing to ask harder theoretical questions about whether race, gender, and SES belong in models as confounders to be controlled or as structural exposures to be measured.

Section 2 then walked through the analytic problems that remain even after confounding has been handled: model misspecification, multicollinearity, Type I/II error and the winner's curse, Simpson's paradox, the ecological and atomistic fallacies, the modifiable areal unit problem, and missing data. These issues are the most common reasons a result fails to replicate, and recognising them is the last skill the course owes you before the integrated appraisal of Lesson 12. The reflection below asks you to apply the full bias inventory to a study of your choice; the final assessment then tests the conceptual material before Lesson 12 integrates the entire course into a single critical-appraisal exercise.

Key Takeaways from Lesson 11

  • Confounding requires a third variable that is associated with the exposure, is an independent risk factor for the outcome, and is not a mediator on the causal pathway.
  • The HRT / WHI reversal showed that even highly consistent observational findings can be erased by randomization, which balances unmeasured confounders.
  • Confounding by indication is the standing threat to observational drug studies; restriction, active-comparator designs, and propensity methods address it but cannot fully replace randomisation.
  • Time-varying confounding with treatment-confounder feedback breaks standard regression; marginal structural models with IPTW are the canonical fix.
  • Treating race, gender, and SES only as nuisance confounders can hide the structural exposures that actually drive disparities.
  • Model misspecification, multicollinearity, Simpson's paradox, ecological and atomistic fallacies, the MAUP, and missing-data mechanisms distort inference even when bias has been controlled.
  • Statistical inference is only as good as the question, the design, and the model behind it — the appraisal skill from Lesson 12 starts here.
R Activity — Crude vs stratified vs adjusted: an HRT/SES confounding demo

The companion R script r-activities/HSCI_230_Lesson_11_Confounding_and_Statistical_Inference.R simulates a 5,000-person cohort in which SES drives both HRT use and CVD risk, but HRT itself does nothing. You will see the crude OR look strongly “protective,” the stratum-specific ORs sit near 1.0 (the truth), and the SES-adjusted OR recover the null — a hands-on reproduction of the HRT/WHI reversal that anchored Section 1.

set.seed(230)
n   <- 5000
ses <- rbinom(n, 1, 0.5)                                   # 1 = high SES
hrt <- rbinom(n, 1, prob = ifelse(ses == 1, 0.6, 0.2))     # high SES uses HRT more

# CVD risk: lower in high SES, NOT affected by HRT (true null)
cvd <- rbinom(n, 1, prob = ifelse(ses == 1, 0.05, 0.15))

# Crude (unadjusted) OR -- looks "protective"
exp(coef(glm(cvd ~ hrt, family = binomial))["hrt"])

# Within each SES stratum -- the true effect: ~1.0
tapply(seq_len(n), ses, function(i) {
  exp(coef(glm(cvd[i] ~ hrt[i], family = binomial))[2])
})

# Adjusted OR, controlling for SES
exp(coef(glm(cvd ~ hrt + ses, family = binomial))["hrt"])

Final Reflection

Of the confounding-type and statistical-inference threats covered in this lesson, which do you believe poses the greatest practical risk to a typical observational study you might encounter in the public-health literature? Explain your reasoning, drawing on at least two specific examples from the lesson.

Model answer: There is no single right answer; the strongest responses argue for a specific threat using the lesson's mechanisms, not generalities.

  • A defensible case for confounding by indication as the highest practical risk: it is structurally guaranteed in routine clinical data (sicker patients get treated more), it cannot be solved by larger samples, and the standard fix (multivariable adjustment) can make the bias worse if the indication is mis-measured.
  • A defensible case for missing data under MNAR: it cannot be diagnosed from the observed data, multiple imputation does not fix it, and it can silently inflate apparent treatment effectiveness in any study with dropout correlated to outcome (the WHI example here, but also virtually all pragmatic trials).
  • A defensible case for Simpson's paradox: it can reverse the direction of effect estimates in stratified analyses (kidney stones, UC Berkeley admissions), is invisible without the right DAG, and is the single most embarrassing failure mode in policy-facing research.

Whatever threat is chosen, the response should name two lesson examples and explain why the threat is hard to remove with standard methods.
Final Assessment — Lesson 11 (10 Questions)

1. A variable is a confounder if it is associated with the exposure, is an independent risk factor for the outcome, and:

A confounder must not lie on the causal pathway (that would make it a mediator). Confounders represent common causes or alternative pathways, not intermediate steps in the causal chain.

2. The WHI trial showed that HRT increased cardiovascular risk, whereas prior observational studies suggested a benefit. The key difference was:

Randomization in the WHI balanced measured and unmeasured confounders across treatment groups. In observational studies, women who chose HRT were healthier and wealthier, creating a spurious protective association due to uncontrolled confounding.

3. Confounding by indication is best described as:

Confounding by indication occurs when the clinical indication for treatment (e.g., disease severity) is itself associated with the outcome. Sicker patients receive treatment and have worse outcomes regardless, creating spurious associations between treatment and harm.

4. Marginal structural models address time-varying confounding by:

MSMs use IPTW to weight observations by the inverse of the probability of receiving observed treatment, creating a pseudo-population in which treatment assignment is independent of time-varying confounders, enabling causal effect estimation.

5. A study models alcohol consumption as a linear predictor of mortality and concludes alcohol is uniformly protective. The most likely issue is:

The alcohol–mortality relationship is J-shaped. Forcing a linear model onto this relationship captures only part of the curve, leading to incorrect conclusions. Including polynomial or spline terms would better capture the nonlinear pattern.

6. In environmental health research, including PM2.5, ozone, NO2, and SO2 simultaneously in a regression model may produce unstable coefficients because of:

These pollutants share common sources (traffic, industry) and are highly correlated. Including all of them inflates standard errors and produces unstable, potentially misleading coefficients—hallmarks of multicollinearity.

7. States with higher income inequality have higher mortality rates. Concluding that individual inequality exposure harms individual health illustrates:

The ecological fallacy occurs when group-level relationships are assumed to hold for individuals. State-level correlations between inequality and mortality may reflect contextual factors (public service underfunding) rather than individual-level exposure to inequality.

8. Concluding that raising every individual’s income will proportionally improve population health, based on individual-level data showing income predicts health, illustrates:

The atomistic fallacy occurs when individual-level associations are assumed to hold at the group level. Population health depends on contextual factors (income distribution, social cohesion, infrastructure) not captured by individual income, so scaling up individual-level findings ignores these group-level effects.

9. A cancer cluster study finds significant clustering at the postal code level but not at the health region level. This reflects:

The MAUP means that spatial analysis results depend on the size and boundary definitions of geographic units. Changing from postal codes to health regions alters both the exposure estimates and the statistical stability of results, demonstrating how arbitrary spatial choices influence conclusions.

10. Sicker patients drop out of a clinical trial more frequently. Analyzing only those who complete the study most likely produces:

When sicker patients are more likely to drop out (data are MAR or MNAR, not MCAR), the remaining completers are healthier than the original study population. Analyzing only completers overestimates treatment effectiveness because the sickest patients—who may have responded poorly—are excluded.

Lesson 11 Complete!

You have successfully completed Confounding & Statistical Inference. Your responses have been downloaded.

Lesson 12 — Integrated Appraisal of Epidemiological Research — is the capstone of HSCI 230. It pulls everything from Lessons 1–11 together: the foundational framing of Lesson 1, the systematic-reviews scaffolding of Lesson 2, the four observational designs of Lessons 3–6, the measurement and causal-specification work of Lesson 7, and the full bias inventory of Lessons 8–11. The lesson uses standardised reporting checklists and worked examples of full appraisals so that, by the end, you can read any epidemiological paper systematically rather than impressionistically.