Sampling, Selection Processes & External Validity

Evaluating Epidemiological Research

Learning objectives for this lesson:

Explain Berkson’s bias and identify when hospital-based sampling distorts exposure-outcome associations
Describe the healthy worker effect and interpret standardized mortality ratios in occupational epidemiology
Distinguish attrition bias from nonresponse bias and evaluate strategies for mitigating each
Recognize prevalence-incidence (Neyman) bias and explain how cross-sectional designs miss rapidly fatal cases
Identify survivorship bias in cohort studies and its implications for causal inference
Evaluate transportability of study findings to target populations with differing characteristics
Critically assess whether epidemiological studies have adequately addressed selection-related threats to validity

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University.

Reference

Glossary: Key Terms, People & Concepts


This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Sampling & Population Concepts

Target Population: The full population to which a study’s conclusions are intended to apply (e.g., “Canadian adults aged 18–64”). Inferences should be argued explicitly back to this group.

Study Population: The subset of the target population from which the sample is actually drawn, limited by who is reachable through the sampling frame.

Sampling Frame: The list or operational definition used to enumerate the study population (e.g., a voter registry, hospital records, a phone-number bank). Coverage gaps in the frame are a primary source of selection bias.

Probability Sampling: Sampling in which every member of the frame has a known, non-zero probability of selection. Enables unbiased inference and valid standard errors.

Non-Probability Sampling: Sampling without known selection probabilities (convenience, snowball, opt-in panels). Cheap and fast, but inferences require strong assumptions about who is and is not represented.

Simple Random Sampling: Each unit in the frame has the same probability of selection, drawn independently. The benchmark against which other sampling designs are evaluated.

Stratified Sampling: The frame is split into subgroups (strata) and a random sample is drawn within each. Improves precision and can guarantee representation of small groups.

Cluster Sampling: Naturally occurring groups (schools, neighbourhoods, clinics) are sampled, then individuals within selected clusters are studied. Cheaper to field but reduces precision because units within clusters are correlated.

Generalizability: The extent to which findings from a study sample apply to a broader population. Requires both internal validity and an argument that the sample resembles the target on relevant effect modifiers.

Internal Validity: Whether the estimated association reflects the true effect within the studied sample (free of confounding, selection, and information bias). A precondition for any external claim.

External Validity: Whether findings from the study sample transfer to other populations, settings, or time periods. Distinct from internal validity and harder to establish.

Transportability: Formal generalization of study results to a new target population, accounting for differences in effect modifiers between samples and targets.

Selection Biases

Selection Bias: A distortion of the exposure–outcome association caused by who ends up in (or stays in) the study. Arises whenever inclusion or retention depends on both exposure and outcome.

Berkson’s Bias: A selection bias that arises in hospital-based studies because admission probabilities differ across exposure–outcome combinations. Can fabricate or reverse associations seen in the source population.

Healthy Worker Effect: Workers tend to be healthier than the general population because illness selects people out of work. Comparing occupational cohorts to the general population therefore biases occupational mortality estimates downward, masking or understating real hazards.

Attrition Bias: Bias in longitudinal studies caused by differential loss to follow-up that depends on both exposure and outcome (e.g., the sickest exposed participants drop out fastest).

Nonresponse Bias: Bias arising in cross-sectional surveys when those who respond differ systematically from those who do not, on variables related to the outcome.

Response Bias: A broad term for distortions introduced by how respondents engage with surveys, such as agreeing, providing socially desirable answers, or skipping items in patterned ways.

Prevalence–Incidence (Neyman) Bias: In cross-sectional or prevalent-case studies, rapidly fatal or rapidly resolving cases are missed, so prevalent cases are unrepresentative of all incident cases. Distorts exposure–outcome estimates.

Survivorship Bias: A bias arising when only survivors are observed, e.g., a cohort study of older adults that misses those already dead from the exposure’s effects.

Self-Selection Bias: A form of selection bias where participants opt themselves in or out of a study in ways correlated with both exposure and outcome (e.g., motivated participants in a smoking-cessation trial).

Standardized Mortality Ratio (SMR): The ratio of observed deaths in a study population to the deaths expected based on age- and sex-specific rates from a reference population. A staple of occupational epidemiology, but vulnerable to the healthy worker effect.

Key People

Joseph Berkson (1899–1982): Statistician at the Mayo Clinic whose 1946 note showed that hospital-based case-control studies could fabricate associations, the bias now bearing his name.

Jerzy Neyman (1894–1981): Statistician who articulated the prevalence–incidence bias and made foundational contributions to sampling theory and confidence intervals.

Miguel Hernán: Epidemiologist whose work has reframed selection bias and external validity using causal diagrams and target-trial logic.

Elias Bareinboim: Computer scientist who, with Judea Pearl, developed the formal theory of transportability, namely when and how causal effects estimated in one population can be carried to another.


Section 1 of 4

Selection Bias Mechanisms

⏱ Estimated reading time: 20 minutes

How enrollment itself can distort exposure–outcome associations before data collection even begins.

The core insight

Where selection bias lives

Bigger samples do not resolve selection bias. The question is always whether missingness depends on both exposure and outcome at once.

Definition

What selection bias is

The observed association diverges from the true association because the selection process, not any third variable, controls who enters the data.

Confounding: a third variable distorts the association.

Selection bias: the selection mechanism itself distorts the association.

Key consequence

Because it is structural, it cannot be fixed by adding more participants drawn by the same biased process.

Berkson, 1946

Berkson’s bias

In a hospital sample, anyone with conditions A and B is more likely to be there than someone with A alone or B alone. Even if A and B are independent in the population, they appear positively associated in the hospital.

Classic case: diabetes and cholecystitis looked associated in hospital studies but not in population studies.

McMichael, 1976

The healthy worker effect

Workers are, on average, healthier than the general population, because obtaining and keeping employment pre-selects for health.

Standardized Mortality Ratio

\[ \color{#0B7B6B}{\text{SMR}} = \frac{\color{#C2410C}{\text{Observed deaths}}}{\color{#6D28D9}{\text{Expected deaths}}} \]

Where SMR is the standardized mortality ratio, Observed deaths are the deaths counted in the cohort, and Expected deaths are the deaths predicted at general-population rates.

An SMR < 1.0 for all-cause mortality in an occupational cohort does not mean the workplace is safe. It reflects selection, not protection.

Reading the data

SMRs in asbestos workers

All causes

SMR ≈ 0.85

Healthy worker effect masks overall risk

Cardiovascular

SMR ≈ 0.78

Strong selection effect for non-occupational causes

Lung cancer

SMR ≈ 1.45

Elevated despite selection; true risk higher still

Mesothelioma

SMR ≈ 8.20

Exposure-specific signal overcomes the filter

Values are illustrative of patterns from occupational asbestos studies.

Carry forward

Design remedies and what comes next

Berkson’s bias: use population-based controls, not hospital controls.
Healthy worker effect: internal comparisons, cause-specific outcomes, lagged analyses.
Both biases arise at enrollment, and a later section asks what happens to a sound recruitment after participants start leaving.

Introduction and Overview

An earlier lesson worked on what we measure and how we model causation. This lesson turns to the third great source of bias in observational research: who ends up in the study at all. Even with perfect measurement and a correctly specified DAG, a study can produce a wrong answer if its sample systematically differs from the target population, either because of how participants were chosen at the outset, because of who dropped out during follow-up, or because the people available to study are themselves a survival-filtered subgroup. Across three content sections we work through the canonical mechanisms in order: this section covers selection biases that arise at enrollment (Berkson's bias and the healthy worker effect); a later section covers attrition during follow-up and nonresponse at data collection; a later section covers prevalence–incidence (Neyman) bias, survivorship bias, and the related question of transportability, whether a study's findings hold in a different population. By the end of the lesson, you should have a working vocabulary for “the sample is wrong” that complements the “the variables are wrong” vocabulary from an earlier lesson, ready to be combined with the information-bias inventory of a later lesson. The unifying structural account, selection bias as conditioning on a common effect of exposure and outcome, comes from Hernán, Hernández-Díaz, & Robins (2004).

Learning Objectives

Define selection bias as a structural problem in how data are generated, distinct from confounding.
Explain the mechanism of Berkson’s bias and recognize when hospital-based case-control designs produce spurious associations.
Describe the healthy worker effect and interpret SMRs in occupational cohort studies.
Identify design strategies (internal comparisons, lagged analyses, cause-specific outcomes) that mitigate enrollment-stage selection bias.

What Is Selection Bias?

Selection bias occurs when the relationship between exposure and outcome differs between study participants and the target population because of the process by which individuals were selected into (or remained in) the study. Unlike confounding, which involves a third variable, selection bias distorts the exposure-outcome relationship through the very mechanics of who ends up being studied.

Key Concept: Selection Bias

Selection bias arises when the association observed in the study sample systematically differs from the association in the target population, due to the procedures used to select participants or factors that influence study participation. It is a structural problem in how data are generated, not a statistical problem that can be fixed by larger sample sizes (Greenland, 2003).

In this section, we examine two classic mechanisms of selection bias: Berkson’s bias and the healthy worker effect. Both illustrate how the process of selecting participants into a study can create spurious associations or mask real ones.

Berkson’s Bias

Case Study: Diabetes and Cholecystitis

In the 1940s, hospital-based case-control studies observed a strong association between diabetes mellitus and cholecystitis (gallbladder inflammation). Researchers hypothesized a biological mechanism linking the two conditions. However, when population-based studies were later conducted, the association largely disappeared. What went wrong?

The answer lies in Berkson’s bias, first described by Berkson (1946). This form of selection bias occurs specifically in hospital-based case-control studies. The core insight is that people with two conditions are more likely to be hospitalized than people with only one condition. When you draw both cases and controls from a hospital population, you artificially inflate the co-occurrence of conditions.

The Mechanism of Berkson’s Bias

Consider two conditions, A and B, that are independent in the general population. A person with condition A alone has some probability of hospitalization (p_A), and a person with condition B alone has probability p_B. If each condition leads to admission independently of the other, a person with both conditions is hospitalized with probability 1 − (1 − p_A)(1 − p_B) = p_A + p_B − p_A × p_B, which exceeds either probability alone whenever both lie strictly between 0 and 1. This differential hospitalization can create a spurious association between the two conditions within the hospital sample (positive in the classic diabetes–cholecystitis example), even when none exists in the population.
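The mechanism can be traced with expected counts. The sketch below is an illustration under stated assumptions, not a reconstruction of Berkson's own table: it adds a hypothetical third condition C from which hospital controls are drawn, and every prevalence and admission probability is invented. Conditions A, B, and C are independent in the population, and each condition a person holds gives a separate, independent chance of admission.

```r
# Hypothetical prevalences and admission probabilities (illustrative assumptions only)
prev  <- c(A = 0.10, B = 0.05, C = 0.10)   # population prevalence of each condition
admit <- c(A = 0.30, B = 0.10, C = 0.50)   # chance that holding the condition leads to admission

# Enumerate all eight combinations of having / not having A, B, and C
grid <- expand.grid(A = 0:1, B = 0:1, C = 0:1)

# Population share of each combination under independence
grid$pop <- with(grid,
  dbinom(A, 1, prev[["A"]]) * dbinom(B, 1, prev[["B"]]) * dbinom(C, 1, prev[["C"]]))

# Admission probability: 1 minus the chance of escaping admission for every condition held
grid$p_hosp <- with(grid,
  1 - (1 - admit[["A"]])^A * (1 - admit[["B"]])^B * (1 - admit[["C"]])^C)

# Expected share of the population that is hospitalized with each combination
grid$hosp <- grid$pop * grid$p_hosp

# Odds ratio from weights w, an exposure indicator, and a case indicator
odds_ratio <- function(w, exposed, case) {
  n11 <- sum(w[exposed & case]);  n10 <- sum(w[exposed & !case])
  n01 <- sum(w[!exposed & case]); n00 <- sum(w[!exposed & !case])
  (n11 * n00) / (n10 * n01)
}

# Population: A versus B across everyone
or_pop <- odds_ratio(grid$pop, grid$A == 1, grid$B == 1)

# Hospital case-control study: cases have B; controls have C but not B
in_study <- grid$B == 1 | grid$C == 1
or_hosp  <- odds_ratio(grid$hosp[in_study],
                       grid$A[in_study] == 1,
                       grid$B[in_study] == 1)

round(c(population_OR = or_pop, hospital_OR = or_hosp), 2)
```

With these invented inputs the population odds ratio is 1 by construction, while the hospital-based odds ratio is well above 1. The size and even the direction of the distortion depend on the relative admission probabilities chosen for B and C, which is one reason hospital controls are hard to trust.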

Berkson's bias is the textbook hospital-sampling problem. The next mechanism is its occupational cousin: selection that happens not at the moment of recruitment but at the moment people enter (or stay in) the labour force.

The Healthy Worker Effect

Case Study: Asbestos Exposure and Mortality

Occupational epidemiologists studying asbestos-exposed workers in the mid-20th century found a puzzling result: the overall mortality rate among asbestos workers was lower than that of the general population, despite known hazardous exposure. The standardized mortality ratio (SMR) for all-cause mortality was consistently below 1.0 in many studies. How could workers exposed to a known carcinogen have lower mortality?

This paradox is explained by the healthy worker effect (McMichael, 1976), a form of selection bias inherent to occupational cohort studies. Workers are a selected subgroup of the population: they must be healthy enough to obtain and maintain employment. People who are chronically ill, disabled, or otherwise frail are less likely to enter the workforce.

How it works: The general population includes people who are too ill to work, institutionalized, or otherwise selected out of the labor force. When we compare workers to this general population, we are comparing a “healthier” group to a more heterogeneous one. The result: overall mortality appears lower among workers, masking the true hazard of occupational exposures.

Importantly, the healthy worker effect is strongest for causes of death unrelated to the occupational exposure (such as cardiovascular disease) and weakens or reverses for exposure-specific outcomes (such as mesothelioma in asbestos workers).

The standardized mortality ratio (SMR) compares observed deaths in a worker cohort to expected deaths based on general population rates. As a reading guide, an SMR of 0.85 means the cohort had about 15% fewer deaths than those rates predict, while 1.45 means about 45% more. An SMR < 1.0 does not mean the workplace is safe; it reflects the selection of healthier individuals into the workforce.

Cause of Death	SMR Among Asbestos Workers	Interpretation
All causes	0.85	Healthy worker effect masks overall risk
Cardiovascular disease	0.78	Strong healthy worker effect for non-occupational causes
Lung cancer	1.45	Elevated despite healthy worker effect; true risk is likely higher
Mesothelioma	8.20	Very strong exposure-specific signal overcomes selection bias

Note: These values are illustrative based on patterns observed in occupational asbestos studies.
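To make the SMR arithmetic concrete, the following sketch computes expected deaths by applying reference rates to the cohort's person-years within age strata, then divides observed by expected. All person-years, rates, and the observed count are hypothetical, chosen only to give an SMR near the illustrative all-cause value above. The interval uses the standard exact Poisson limits for the observed count, treating expected deaths as fixed.

```r
# Hypothetical occupational cohort (all numbers invented for illustration)
strata <- data.frame(
  age_group    = c("40-49", "50-59", "60-69"),
  person_years = c(5000, 4000, 2000),      # follow-up time accrued in the cohort
  ref_rate     = c(0.002, 0.006, 0.015)    # general-population deaths per person-year
)
observed <- 54                             # deaths actually counted in the cohort

# Expected deaths if the cohort had died at general-population rates
strata$expected <- strata$person_years * strata$ref_rate
expected <- sum(strata$expected)

smr <- observed / expected

# Exact Poisson 95% limits for the observed count, scaled by expected deaths
ci <- c(lower = qchisq(0.025, 2 * observed) / 2,
        upper = qchisq(0.975, 2 * (observed + 1)) / 2) / expected

round(c(expected = expected, SMR = smr, ci), 2)
```

An SMR below 1.0 from a calculation like this says only that the cohort died at a lower rate than the reference population; it cannot by itself separate a safe workplace from a healthier-than-average workforce.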

Researchers have developed several strategies to mitigate the healthy worker effect:

Internal comparisons: Compare exposed workers to unexposed workers within the same workforce, rather than to the general population
Healthy worker survivor effect adjustment: Account for the fact that workers who remain employed over time are progressively healthier (those who become ill leave the workforce)
Cause-specific analyses: Focus on outcomes with known biological links to the exposure, where the effect of occupational hazard exceeds the healthy worker selection
Lagged analyses: Introduce exposure lag periods to account for latency between exposure and disease onset

Hands-on: Selection & Recruitment Bias Simulator

What you'll do: the simulator below holds a true population fixed and lets you set the participation probability for each combination of exposure and outcome. The presets reproduce the classic mechanisms: Berkson's bias, healthy worker effect, volunteer self-selection, and loss to follow-up. What to take away: the same true population can produce a 2×2 table whose odds ratio is double, half, or even reversed compared with the truth, depending on who chooses to enroll. You are not adjusting any analysis; you are watching the bias appear in the data themselves. Try the “Berkson's bias” preset first to reproduce the diabetes–cholecystitis story you just read.

🎯 Interactive: Selection & Recruitment Bias Simulator

A true population of 2,000 people with a known exposure (E) and outcome (Y). You set the participation probability for each subgroup. When participation depends on both E and Y, the observed exposure–outcome association drifts away from the truth. That is selection bias.

[Interactive simulator. Display: the true population and the observed sample shown as tiles (each tile = a person; bright = enrolled in study, faded = excluded; color encodes the E/Y combination), alongside true (population) and observed (sample) 2×2 tables of E+/E− by Y+/Y− and three summaries: true OR, observed OR, and bias factor. Controls and default settings: P(enroll | E+, Y+) = 0.80; P(enroll | E+, Y−) = 0.80; P(enroll | E−, Y+) = 0.80; P(enroll | E−, Y−) = 0.80; true OR (population) = 2.00; true exposure prevalence = 0.30. Presets are available for the classic mechanisms.]

Try the Berkson’s preset: cases (Y+) and exposed (E+) are each more likely to be hospitalized; both happening together inflates participation in the E+/Y+ cell. The observed OR departs from the true OR even when the true association is null.
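For readers without access to the interactive version, the same logic can be sketched in a few lines. In expectation, when enrollment depends only on the exposure and outcome cell, the observed odds ratio equals the true odds ratio multiplied by a bias factor, (s11 × s00) / (s10 × s01), where each s is a cell's participation probability. The function name and the cell counts below are assumptions made for illustration: 2,000 people, exposure prevalence 0.30, true OR of 2.00, and an assumed 20% outcome risk among the unexposed.

```r
# Expected observed odds ratio after selection, from population cell counts (n)
# and cell-specific participation probabilities (s). Hypothetical helper.
selection_or <- function(n11, n10, n01, n00, s11, s10, s01, s00) {
  true_or <- (n11 * n00) / (n10 * n01)
  obs_or  <- (n11 * s11 * n00 * s00) / (n10 * s10 * n01 * s01)
  c(true_or = true_or, observed_or = obs_or, bias_factor = obs_or / true_or)
}

# Assumed population: 600 exposed (200 with Y), 1400 unexposed (280 with Y)
# n11 = E+Y+, n10 = E+Y-, n01 = E-Y+, n00 = E-Y-

# Equal participation in every cell: no bias, whatever the participation rate
equal <- selection_or(200, 400, 280, 1120, s11 = 0.8, s10 = 0.8, s01 = 0.8, s00 = 0.8)

# Participation raised only in the exposed-and-diseased cell
skewed <- selection_or(200, 400, 280, 1120, s11 = 0.9, s10 = 0.5, s01 = 0.5, s00 = 0.5)

round(rbind(equal, skewed), 2)
```

The first row shows that low participation alone does no harm to the odds ratio; the second shows that uneven participation across cells moves it, here by a factor of 1.8.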

R Simulate the healthy-worker effect with a convenience sample

What you'll do: build a 10,000-person "population" with a known true mean BMI of 27. Draw a simple random sample of 200 people and compare it to a convenience sample of 200 gym-goers, where the probability of being a gym-goer is higher for people with lower BMI. Then replicate each sampling strategy 1,000 times and visualise the sampling distributions.

What to take away: the convenience sample produces a biased mean that does not improve with more replicates or a larger sample; selection bias is not a sample-size problem.

set.seed(230)

# 10,000-person "population" with true mean BMI = 27
N        <- 10000
bmi      <- rnorm(N, mean = 27, sd = 5)
gym_goer <- rbinom(N, 1, prob = plogis(2 - 0.15*bmi))   # lower BMI -> more likely

mean(bmi)                              # truth: ~27

# (1) Simple random sample of 200
srs <- sample(bmi, size = 200)
mean(srs)

# (2) Convenience sample: 200 gym-goers
conv <- sample(bmi[gym_goer == 1], size = 200)
mean(conv)

# Stretch: 1000 replicates of each strategy
srs_means  <- replicate(1000, mean(sample(bmi, 200)))
conv_means <- replicate(1000, mean(sample(bmi[gym_goer == 1], 200)))

par(mfrow = c(1, 2))
hist(srs_means,  main = "Simple Random Sample", xlim = c(20, 30), xlab = "Mean BMI")
abline(v = 27, col = "red", lwd = 2)
hist(conv_means, main = "Convenience (gym)",    xlim = c(20, 30), xlab = "Mean BMI")
abline(v = 27, col = "red", lwd = 2)

Console output (approx.)

[1] 26.99   # true population mean
[1] 26.93   # SRS mean -- centred near truth
[1] 24.41   # convenience (gym) mean -- biased low
# srs_means histogram: centred near 27 (truth)
# conv_means histogram: centred near 24.4, well below the red line at 27

Reading the two histograms. The SRS sampling distribution is centred at the red line (truth). The convenience-sample distribution sits roughly 2.5 to 3 BMI units to the left of the red line, no matter how many replicates you run. This is the healthy-worker effect in miniature: the workers (gym-goers) are systematically healthier than the general population.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console and histograms before answering.

1. The true population mean BMI was ~27. What was mean(srs) and what was mean(conv)? By how many BMI units did the convenience sample miss the true mean, and in which direction?

Model answer: With the seeded simulation, mean(srs) sits very close to 27 (the population truth), usually 26.8–27.2 depending on sampling jitter, while mean(conv) drops several BMI units, typically to ~24–25. So the convenience sample misses the truth by 2–3 BMI units in the downward direction. The point of the simulation is that the SRS is centred on the right answer and varies only by sampling error, while the convenience sample is centred on the wrong answer by construction.

2. The line gym_goer <- rbinom(N, 1, prob = plogis(2 - 0.15*bmi)) sets the probability of being a gym-goer. Trace through what happens for a person with BMI = 20 vs BMI = 35. How does this code mechanically generate the bias you saw in mean(conv)?

Model answer: For BMI = 20, plogis(2 - 0.15*20) = plogis(2 - 3) = plogis(-1) ≈ 0.27. For BMI = 35, plogis(2 - 0.15*35) = plogis(2 - 5.25) = plogis(-3.25) ≈ 0.037. So a lean person (BMI 20) has a ~27% chance of being a gym-goer (and so of being available to the convenience design), while an obese person (BMI 35) has only a ~4% chance. When you sample only from the gym, you systematically over-represent low-BMI individuals, so the mean of the sampled subset is pulled below the true population mean. The bias is mechanical: it is built into the selection probability function, not introduced by chance.

3. The two replicated histograms have similar widths but very different centres. In one sentence, explain why doubling the convenience sample size from 200 to 400 would NOT fix this problem, and connect it to the asbestos-worker SMR example from earlier in this section.

Model answer: Doubling n from 200 to 400 narrows the histogram (smaller sampling variance) but does not shift its centre; the convenience sample is still drawn under the same selection rule that under-represents high-BMI individuals, so the average of any sample drawn that way remains biased. This is exactly the lesson of the asbestos-worker SMR example: the all-cause SMR below 1.0 is not noise that bigger studies fix; it is the structural healthy worker effect, and the only remedy is a different comparison group or sampling frame (or analytic adjustment for the selection mechanism), not more of the same.
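The claim in the model answer above can be checked directly. This short sketch reuses the population and selection rule from the exercise and compares the replicated convenience-sample means at n = 200 and n = 400. The helper name summarise_means is introduced here for illustration and is not part of the original exercise.

```r
set.seed(230)

# Same population and gym-goer selection rule as in the exercise
N        <- 10000
bmi      <- rnorm(N, mean = 27, sd = 5)
gym_goer <- rbinom(N, 1, prob = plogis(2 - 0.15 * bmi))
gym_bmi  <- bmi[gym_goer == 1]

# Centre and spread of the sampling distribution of the convenience-sample mean
summarise_means <- function(n, reps = 1000) {
  m <- replicate(reps, mean(sample(gym_bmi, n)))
  c(n = n, centre = mean(m), spread = sd(m))
}

res <- rbind(summarise_means(200), summarise_means(400))
round(res, 2)
```

The spread column shrinks when n doubles, but the centre column stays put, well below the true mean of 27.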


Key Takeaways

Berkson’s bias creates spurious associations in hospital-based case-control studies because hospitalization probability depends on having multiple conditions
The healthy worker effect masks true occupational risks because workers are systematically healthier than the general population
Both biases are structural; they arise from how participants are selected, not from measurement or confounding
Population-based designs and internal comparisons are primary strategies for avoiding these biases

Section 2 of 4

Attrition & Nonresponse Bias

⏱ Estimated reading time: 20 minutes

Selection biases that develop after recruitment, in longitudinal follow-up and in survey participation.

Longitudinal studies

Attrition bias

The question is never simply “how many were lost?” It is whether loss depended on both exposure and outcome.

Non-differential

Loss unrelated to the exposure-outcome relationship. Estimates remain unbiased; precision suffers.

Differential

Loss tied to both the exposure and the outcome (or to factors strongly related to both). The association in the remaining sample is distorted.

Framingham Heart Study

When high-risk participants leave

Smoking

Retained: 28%
Lost: 42%

Systolic BP

Retained: 132 mmHg
Lost: 141 mmHg

BMI

Retained: 26.4 kg/m²
Lost: 28.1 kg/m²

Education

Retained: 72% with high school education
Lost: 54% with high school education

Values are illustrative of patterns from Framingham attrition analyses. Because loss was tied to both risk factors and outcome risk, associations were attenuated in the retained group.

Cross-sectional surveys

Nonresponse bias

Survey weights fix the part of nonresponse driven by observed demographics. They cannot fix the part driven by the outcome itself.

Missing at random (MAR)

Nonresponse depends on observed variables used in weighting. Weights can correct this.

Missing not at random (MNAR)

Nonresponse depends on the outcome itself. No weighting fully removes this bias.

MNAR in practice

When nonresponse is driven by the outcome

Stigmatized conditions

HIV, substance use, mental illness: those most affected are least likely to participate.

Health-conscious responders

Healthier individuals may be more willing to take part, inflating apparent health literacy or preventive behaviour rates.

The implication

Survey-based prevalence estimates for stigmatized or disabling conditions are structural undercounts.

Carry forward

Mitigations and what comes next

Attrition: baseline comparisons of completers vs. dropouts; IPCW; multiple imputation; sensitivity analyses.
Nonresponse: post-stratification weighting for MAR; sensitivity analysis for MNAR; transparency about response rates.
A later section asks about people who were gone before the study began, the survival filter that produces Neyman and survivorship bias.

Introduction and Overview

An earlier section covered selection bias at the moment of enrollment. The mechanisms in this section operate later: attrition reshapes the cohort over time as participants drop out, and nonresponse hollows out a survey before any follow-up has even started. Both are forms of selection bias, but they are easy to miss because the original recruitment was sound. The key diagnostic question for both is the same: does the dropout (or non-participation) depend on both the exposure and the outcome?

Learning Objectives

Distinguish attrition bias (post-enrollment loss to follow-up) from nonresponse bias (failure to participate at the outset).
Recognize when loss to follow-up is differential with respect to exposure and outcome, and explain why that, not the absolute attrition rate, is what matters.
Evaluate strategies for detecting and adjusting for attrition (baseline comparisons, sensitivity analyses, IPCW, multiple imputation).
Identify common drivers of nonresponse and assess how they distort prevalence and exposure-outcome estimates in survey-based research.

Attrition Bias in Longitudinal Studies

Longitudinal studies follow participants over time, but not everyone stays. When loss to follow-up is related to both the exposure and the outcome, the resulting estimates become biased. This is attrition bias, a form of selection bias that occurs after study enrollment, gradually reshaping the study sample in ways that distort associations.

Case Study: The Framingham Heart Study

The Framingham Heart Study, initiated in 1948, is one of the most influential longitudinal studies in cardiovascular epidemiology. Over decades of follow-up, researchers noticed that participants lost to follow-up had systematically higher risk profiles; they were more likely to smoke, have higher blood pressure, and have lower socioeconomic status. Because these same factors predict cardiovascular events, the loss of high-risk participants led to underestimation of risk factor–outcome associations in certain analyses.

When Attrition Bias Matters Most

Attrition bias is most problematic when the probability of dropping out depends on both the exposure and the outcome (or factors strongly related to both). If loss to follow-up is random with respect to the exposure-outcome relationship, estimates remain unbiased even with substantial attrition. The critical question is not “how many participants were lost?” but “is loss to follow-up differential with respect to the exposure and outcome?”

Mechanisms of Differential Attrition

Several mechanisms can produce differential attrition in epidemiological studies:

Illness-related dropout: Participants who become sicker may be unable or unwilling to continue (e.g., advanced cancer patients missing follow-up visits)
Exposure-related migration: People who experience adverse effects of an exposure may relocate (e.g., workers who develop respiratory symptoms leaving a polluted area)
Competing mortality: Participants who die from causes related to the study exposure are lost from the sample, removing the most affected individuals
Socioeconomic barriers: Disadvantaged participants often face transportation, childcare, or work-schedule barriers to continued study participation

Quantifying and Addressing Attrition Bias

Epidemiologists use several strategies to detect and address attrition bias:

Compare baseline characteristics: Compare those retained versus those lost to follow-up on key exposure, outcome, and confounder variables
Sensitivity analyses: Re-estimate the association under worst-case and best-case assumptions about the missing outcomes among those lost to follow-up
Inverse probability of censoring weighting (IPCW): Weight remaining participants to represent those who were lost, based on predictors of attrition
Multiple imputation: Use statistical models to fill in plausible values for missing outcome data

None of these methods perfectly eliminates attrition bias, but they provide important evidence about its potential magnitude and direction.
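As a rough illustration of how inverse probability of censoring weighting works, the sketch below simulates a cohort in which dropout depends on the exposure and on a measured baseline severity indicator that also raises outcome risk. Every variable name and coefficient is a hypothetical choice. Because this is a simulation, the outcomes of dropouts are known and serve as a benchmark; in a real study they would be missing, and the weights help only to the extent that the retention model captures the real drivers of dropout.

```r
set.seed(1)
n <- 100000

# Simulated cohort (all parameters are illustrative assumptions)
E <- rbinom(n, 1, 0.4)                            # exposure
L <- rbinom(n, 1, 0.3)                            # measured baseline severity
Y <- rbinom(n, 1, plogis(-2 + 0.7 * E + 1.5 * L)) # outcome, known for everyone here

# Retention is less likely for exposed and for more severe participants
retained <- rbinom(n, 1, plogis(2 - 1.0 * E - 1.5 * L))
keep     <- retained == 1

# Model the probability of staying in the study, then weight by its inverse
fit <- glm(retained ~ E * L, family = binomial)
w   <- 1 / fitted(fit)

# Risk difference (exposed minus unexposed), optionally weighted
risk_diff <- function(y, e, w = rep(1, length(y))) {
  weighted.mean(y[e == 1], w[e == 1]) - weighted.mean(y[e == 0], w[e == 0])
}

rd_full  <- risk_diff(Y, E)                       # benchmark: no one lost
rd_naive <- risk_diff(Y[keep], E[keep])           # completers only
rd_ipcw  <- risk_diff(Y[keep], E[keep], w[keep])  # completers, reweighted

round(c(full = rd_full, naive = rd_naive, ipcw = rd_ipcw), 3)
```

In this setup the completers-only estimate is attenuated relative to the full-cohort benchmark, and the weighted estimate moves back toward it. If dropout also depended on something unmeasured, the weights would not repair the bias.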

Framingham Heart Study: Lessons Learned

Analysis of Framingham participants who were lost to follow-up revealed:

Characteristic	Retained Participants	Lost to Follow-Up
Current smoking (%)	28%	42%
Mean systolic BP (mmHg)	132	141
Mean BMI (kg/m²)	26.4	28.1
High school education (%)	72%	54%

Values are illustrative of patterns reported in Framingham attrition analyses.

Because those lost to follow-up were more likely to have the exposures and more likely to develop cardiovascular outcomes, the estimated associations between risk factors and heart disease were attenuated in the retained sample.

Nonresponse Bias in Cross-Sectional Surveys

Attrition is selection that unfolds during follow-up in a longitudinal cohort. Cross-sectional studies and surveys face the same problem at the point of data collection, where it is called nonresponse bias (Galea & Tracy, 2007). When people who choose not to respond differ systematically from those who do respond, the resulting data do not represent the target population.

Case Study: NHANES and Health Survey Nonresponse

The National Health and Nutrition Examination Survey (NHANES) is designed to be nationally representative of the U.S. population. However, response rates have declined over time. Research comparing early responders to late responders (a proxy for nonrespondents) and using linked administrative data has shown that nonrespondents tend to be less healthy; they have higher rates of smoking, obesity, and chronic disease. Standard survey weights partially correct for this, but cannot fully account for unmeasured differences between respondents and nonrespondents.

The direction of nonresponse bias depends on how nonresponse relates to the variables being studied:

Health behavior surveys: If people with unhealthier behaviors (smoking, sedentary lifestyles) are less likely to participate, prevalence estimates will underestimate the true burden of these behaviors
Stigmatized conditions: People with HIV, substance use disorders, or mental illness may avoid surveys, leading to underestimation of prevalence
Health-conscious responders: Conversely, people interested in health may be more likely to participate, inflating apparent health literacy or preventive behavior prevalence

Surveys like NHANES use post-stratification weights to adjust for nonresponse. These weights are calibrated to known population totals (from census data) on variables like age, sex, race/ethnicity, and geography.

The logic: if young men are underrepresented in the survey relative to the census, each young man who did respond receives a higher weight. However, this only corrects for nonresponse that is explained by the weighting variables. If nonresponse is driven by unmeasured factors (like health status itself), weighting alone is insufficient.
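The weighting logic can be sketched in a few lines of R. All numbers here are hypothetical (four age-sex cells, invented census shares, respondent counts, and smoking rates); real NHANES weights involve additional design and nonresponse adjustments that this sketch does not attempt to reproduce.

```r
# Hypothetical census shares and respondent counts for four age-sex cells
census_share <- c(young_men = 0.25, young_women = 0.25,
                  older_men = 0.25, older_women = 0.25)
n_resp       <- c(young_men = 100, young_women = 250,
                  older_men = 300, older_women = 350)

# Post-stratification weight = population share / sample share
sample_share <- n_resp / sum(n_resp)
ps_weight    <- census_share / sample_share
round(ps_weight, 2)   # underrepresented young men receive the largest weight

# Hypothetical smoking prevalence observed among respondents in each cell
smoke_rate <- c(young_men = 0.40, young_women = 0.25,
                older_men = 0.30, older_women = 0.15)

unweighted <- sum(n_resp * smoke_rate) / sum(n_resp)
weighted   <- sum(n_resp * ps_weight * smoke_rate) / sum(n_resp * ps_weight)
c(unweighted = unweighted, weighted = weighted)
```

Because young men smoke more in this invented example and are underrepresented, the weighted estimate is higher than the unweighted one. The correction works only because the within-cell rates among respondents are assumed to match those of nonrespondents in the same cell.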

Key limitation: Weighting on observed variables can correct only for nonresponse that is “missing at random” (MAR), where nonresponse depends only on the observed variables used in the weighting model. If nonresponse is “missing not at random” (MNAR), that is, still related to the outcome itself after accounting for those variables, no amount of weighting will fully remove the bias.

For example, if people who are severely depressed are less likely to complete a mental health survey because of their depression, no adjustment for age, sex, or socioeconomic status can recover the true depression prevalence. This is a fundamental limitation of survey-based research.
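The depression example can be made concrete with a simulation. This is a minimal sketch with hypothetical numbers: response is assumed to depend on age group (observed, so it can be weighted on) and on depression itself (the MNAR component).

```r
set.seed(1)
n   <- 100000
age <- rbinom(n, 1, 0.5)                            # 1 = younger, 0 = older (hypothetical)
dep <- rbinom(n, 1, ifelse(age == 1, 0.20, 0.10))   # true prevalence is about 0.15

# Response depends on age (observed) AND on depression itself (MNAR)
p_resp <- ifelse(age == 1, 0.5, 0.8) * ifelse(dep == 1, 0.5, 1)
resp   <- rbinom(n, 1, p_resp) == 1

true_prev  <- mean(dep)
naive_prev <- mean(dep[resp])

# Post-stratify respondents back to the population age distribution
cell_prev     <- tapply(dep[resp], age[resp], mean)
pop_share     <- prop.table(table(age))
weighted_prev <- sum(as.numeric(cell_prev) * as.numeric(pop_share[names(cell_prev)]))

round(c(true = true_prev, naive = naive_prev, age_weighted = weighted_prev), 3)
```

Weighting by age moves the estimate slightly toward the truth, but it stays far too low because the remaining nonresponse is driven by the outcome.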

Reflection

Consider a longitudinal cohort study of cannabis use and psychotic symptoms among young adults. Over 5 years, 30% of participants are lost to follow-up, and those lost are more likely to be heavy cannabis users. How might this attrition bias the study’s findings? What strategies would you recommend to assess or mitigate this bias?

Model answer: A 30% loss to follow-up that is differentially higher among heavy cannabis users threatens the study at three levels. First, the remaining cohort under-represents heavy users, so prevalence and incidence of both exposure and outcome are biased downwards. Second, if heavy users who drop out are precisely the ones who would have developed psychotic symptoms, the apparent exposure-outcome association is attenuated, a classical case of informative censoring. Third, depending on whether dropout is driven by symptom severity or by exposure intensity, the bias direction can reverse. Mitigations: (i) build a tracing protocol with alternate contacts and provincial data linkage to recover outcomes for dropouts; (ii) compare baseline characteristics of completers vs. dropouts and report explicitly; (iii) use inverse-probability-of-censoring weights or multiple imputation under MAR; (iv) run sensitivity analyses for plausible MNAR scenarios (e.g., assume all dropouts who were heavy users had the outcome and show how the effect estimate moves).

Section 3 of 4

Survivorship Bias & Transportability

⏱ Estimated reading time: 25 minutes

Biases from upstream survival filters, and the question of whether valid findings travel to new populations.

Neyman, 1955

Prevalence–incidence (Neyman) bias

Prevalence–incidence relationship

\[ P \approx I \times D \]

P = prevalence; I = incidence rate; D = average disease duration

For a stable condition with low prevalence, prevalence (P) approximately equals incidence (I) times average disease duration (D). If a risk factor raises I but shortens D through higher fatality, its net effect on P may be small or zero, making it appear unimportant in cross-sectional data.

Bias direction: toward the null for factors that increase case fatality; away from the null for factors that improve survival.

Survivorship bias

The ghost population you cannot see

Studying those who “survived” a selection process produces conclusions about survivors, not about the original population.

Examples beyond mortality: patients who persist on a drug for two years; cancer survivors sampled for quality-of-life research; elderly cohorts that exclude those who died before old age.

Case study

HIV long-term non-progressors

Characteristic	Survivors	All HIV-infected
CCR5–Δ32	15–20%	~10%
CD8 response	80%	~40%
Younger at infection	65%	~45%
Access to care	90%	~50%

Values illustrative of patterns in early HIV natural-history studies.

External validity

Transportability

Internal validity does not guarantee external validity. Findings travel only when effect modifiers are distributed comparably between the study and the target population.

Internal validity

Is the observed association causal within this study?

Transportability

Would the same causal effect hold in a different target population?

Wrapping up

Addressing transportability and closing the lesson

IP selection weighting: reweight the study sample to the target population’s covariate distribution.
Standardization: propagate subgroup effects forward using the target population’s distribution.
Target trial emulation: replicate the trial design in observational data from the target population.
The reflection and knowledge check below draw on the full inventory before the final assessment.

Introduction and Overview

Earlier sections covered selection biases that arise during the study: how participants were enrolled, who dropped out. This section addresses biases that arise from a more upstream filter: by the time we study a population, some people are already gone (dead, recovered, lost to history). When the people we end up with are the survival-filtered residue of the people we wish we could study, conventional analysis answers a different question than we think it answers. The section closes by zooming out one step further to ask the related external-validity question: even when our internal estimates are unbiased, do they apply to anyone outside our sample?

Learning Objectives

Explain prevalence–incidence (Neyman) bias and use the P ≈ I × D relationship to predict its direction.
Identify survivorship bias across mortality, treatment persistence, and successful-aging contexts and recognize when conclusions overgeneralize.
Distinguish internal validity from external validity and articulate the conditions under which study findings transport to a new target population.
Choose study designs (incident-case, nested case-control, restricted target population) that reduce survival-filter and transportability problems.

Prevalence-Incidence (Neyman) Bias

One of the most subtle forms of selection bias occurs when we use cross-sectional data to study risk factors for diseases. The problem: cross-sectional studies capture prevalent cases, people who currently have the disease, rather than incident cases, people who are newly developing it. If some cases die quickly while others survive for years, the cross-sectional sample will overrepresent long-surviving cases and underrepresent rapidly fatal ones.

Case Study: Myocardial Infarction Risk Factors

Early cross-sectional studies of myocardial infarction (MI) survivors examined which risk factors were associated with having had an MI. However, patients with the most severe risk profiles, particularly those with extremely high cholesterol or severe hypertension, were more likely to die from their initial MI before they could be included in a prevalence sample. The result: cross-sectional studies underestimated the strength of these risk factors because the most affected individuals were already dead and absent from the sample.

The Neyman Bias Mechanism

Named after Jerzy Neyman (1955), prevalence-incidence bias occurs because prevalent cases are a subset of all incident cases, specifically those who survived long enough to be sampled. If the risk factor under study is associated with case fatality (more severe disease or faster death), it will appear less strongly associated with the disease in cross-sectional data than it truly is. The bias is toward the null (making a real association look weaker or absent) for risk factors that increase case fatality, and away from the null (making it look stronger) for factors that improve survival.
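The mechanism can be illustrated with a simulation that compares the association among all incident cases with the association seen in a cross-section of survivors. This is a sketch with hypothetical values: the exposure prevalence, disease risks, and case-fatality proportions are invented for illustration.

```r
set.seed(2)
n       <- 200000
exposed <- rbinom(n, 1, 0.3)
# Exposure doubles the risk of becoming a case (hypothetical risks)
case    <- rbinom(n, 1, ifelse(exposed == 1, 0.10, 0.05))
# Among cases only, exposure also raises early fatality (hypothetical)
died    <- rbinom(n, 1, ifelse(case == 1, ifelse(exposed == 1, 0.60, 0.20), 0))

# Odds ratio from a 2x2 table of exposure (rows) by disease (columns)
odds_ratio <- function(e, d) {
  tab <- table(e, d)
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}

or_incident  <- odds_ratio(exposed, case)                # all incident cases counted
alive        <- died == 0
or_prevalent <- odds_ratio(exposed[alive], case[alive])  # cross-section of survivors only
round(c(incident = or_incident, prevalent = or_prevalent), 2)
```

With these inputs the incident-case odds ratio is roughly 2, while the survivor-only odds ratio sits close to 1: the exposure looks almost harmless because the exposed cases are disproportionately missing.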

Neyman bias is the cross-sectional version of the survival-filter problem. The same logic shows up in cohort studies whenever the sample is built around “people still here,” whether that means people still alive, still on treatment, or still showing up for follow-up.

Survivorship Bias in Cohort Studies

Interactive story: Wald's planes

The WWII bomber-armor problem associated with Abraham Wald is the best-known illustration of survivorship bias. Bullet holes were mapped on the bombers that returned, and the obvious-but-wrong conclusion was to reinforce the areas with the most holes. The planes that did not return are the ghost population: they were hit in the places where returning planes show no damage, so the armor belongs where the holes are not.

Case Study: HIV Long-Term Survivors

In the early years of the HIV epidemic, before effective antiretroviral therapy, researchers studied “long-term non-progressors”, individuals who remained healthy for years despite HIV infection. Studies of this group identified genetic factors (such as CCR5-delta32 mutations) and immune characteristics associated with slow progression. However, drawing general conclusions about HIV pathogenesis from these survivors was problematic: they represented a highly selected subset of all HIV-infected individuals. The majority of infected individuals had already died, and the survivors differed systematically in ways beyond the factors being studied.

Survivorship bias occurs when we analyze only those who “survived” a process, whether survival means remaining alive, staying in a study, or maintaining a condition, and draw conclusions that we mistakenly generalize to the full original population.

Survivorship Bias Beyond Mortality

While the term “survivorship” evokes mortality, this bias extends to any selective retention process:

Treatment persistence studies: Patients who remain on a medication for 2 years are systematically different from those who discontinued; they had fewer side effects, better response, and likely better adherence behaviors. Analyzing only those who persisted overestimates treatment efficacy.
Cancer survivor cohorts: Studies of quality of life among “cancer survivors” may overestimate well-being because those who died (often with the worst quality of life) are excluded.
Successful aging studies: Research on cognitive function in elderly cohorts necessarily excludes those who died before reaching old age, potentially missing the very exposures that most strongly impair cognition.

HIV Cohort Example: What Survivors Tell Us (and Don’t)

Characteristic	Long-Term Survivors	All HIV-Infected (Estimated)
CCR5-delta32 heterozygote (%)	15–20%	~10%
Strong CD8+ T-cell response (%)	80%	~40%
Younger age at infection (%)	65%	~45%
Access to care (%)	90%	~50%

Values are illustrative of patterns described in early HIV natural history studies.

Because long-term survivors differed on multiple dimensions (genetic, immunological, demographic, and socioeconomic), findings from survivor cohorts could not be straightforwardly generalized to all people living with HIV.

Survivorship and Neyman-style selection can also be read structurally as collider bias: conditioning on “made it into the sample” opens a non-causal path between exposure and outcome (Cole et al., 2010; Munafò et al., 2018). In causal-diagram terms, being in the sample is a shared effect of exposure and outcome, and selecting on a shared effect links the two even when neither one causes the other. Neyman bias and survivorship bias are about whether the sample we have represents the population we are trying to study. The last topic of the section steps further out: even when the sample does represent its source population, the study's findings may not apply elsewhere. This is the external-validity question that the rest of evidence-based practice depends on.
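The collider reading of selection can be checked directly in a simulation. In this sketch the exposure and outcome are generated independently, and selection into the sample depends on both; the prevalences and the selection coefficients are hypothetical.

```r
set.seed(3)
n        <- 200000
exposure <- rbinom(n, 1, 0.3)
outcome  <- rbinom(n, 1, 0.2)   # generated independently of exposure

# Selection into the sample is a shared effect of exposure and outcome
p_select <- plogis(-2 + 1.5 * exposure + 1.5 * outcome)
selected <- rbinom(n, 1, p_select) == 1

# Odds ratio from a 2x2 table of exposure (rows) by outcome (columns)
odds_ratio <- function(e, d) {
  tab <- table(e, d)
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}

or_population <- odds_ratio(exposure, outcome)                      # close to 1 by construction
or_sample     <- odds_ratio(exposure[selected], outcome[selected])  # distorted by selection
round(c(population = or_population, sample = or_sample), 2)
```

In the full population the odds ratio is near 1, as it must be. Among the selected, an inverse association appears even though neither variable affects the other, which is the signature of conditioning on a shared effect.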

Transportability of Study Results

Even a perfectly internally valid study may yield misleading conclusions when its findings are applied to a different population. This is the problem of transportability, also called external validity or generalizability, and it is increasingly recognized as a critical challenge in evidence-based practice (Pearl & Bareinboim, 2014; Westreich, Edwards, Lesko, Stuart, & Cole, 2017).

Case Study: Antihypertensive Therapy Across Populations

The ALLHAT trial (Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial) found that thiazide-type diuretics were as effective as newer, more expensive antihypertensives in reducing cardiovascular events. However, ALLHAT participants were predominantly older adults with established hypertension and multiple comorbidities. When clinicians applied these results to younger patients with fewer comorbidities, the effectiveness of different drug classes sometimes differed, illustrating a target population mismatch.

Transportability vs. Internal Validity

Internal validity asks: “Is the observed association causal within this study?” Transportability asks: “Would the same causal effect hold in a different target population?” A study can have perfect internal validity yet poor transportability if the study population differs from the target population in ways that modify the effect of interest. Effect modification is the key concept linking internal validity to external generalizability.

Study results are transportable to a target population when:

The causal mechanism operates the same way in both populations
There are no effect modifiers whose distribution differs between the study and target populations
The versions of treatment (or exposure) are comparable across settings
The outcome measurement is equivalent in both populations

If any of these conditions are violated, the effect observed in the study may not replicate in the target population, not because the study was wrong, but because the populations differ in ways that matter.

Effect modification is the core mechanism that threatens transportability. If the treatment effect differs across subgroups (e.g., by age, comorbidity, or genetics), and the study and target populations have different compositions of these subgroups, the average treatment effect will differ between populations.

Population	% With Comorbidities	Mean Treatment Effect (BP Reduction)
ALLHAT trial participants	78%	−12 mmHg
Young adults (<40 years)	15%	−8 mmHg
Elderly with renal disease	92%	−14 mmHg

Values are illustrative of how treatment effects can vary across populations due to effect modification.

Researchers have developed formal methods for assessing and improving transportability:

Inverse probability of selection weighting: Reweight the study sample so that its covariate distribution matches the target population (Westreich et al., 2017)
Standardization: Estimate the treatment effect within subgroups defined by effect modifiers, then average across the target population’s subgroup distribution
Sensitivity analysis for transportability: Assess how much unmeasured effect modification would be needed to qualitatively change conclusions
Target trial emulation: Use observational data from the target population to emulate the trial design and compare results
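The first two methods in this list can be sketched on simulated trial data. The setup is hypothetical: a single binary effect modifier (comorbidity), invented subgroup effects of −13 and −7 mmHg chosen so that the averages resemble the illustrative table, and a target population in which 15% have comorbidities.

```r
set.seed(4)
n         <- 20000
comorbid  <- rbinom(n, 1, 0.78)                  # trial composition: 78% with comorbidities
treated   <- rbinom(n, 1, 0.5)                   # randomized assignment
true_eff  <- ifelse(comorbid == 1, -13, -7)      # hypothetical subgroup effects (mmHg)
bp_change <- -2 + treated * true_eff + rnorm(n, sd = 10)

target_share <- c("0" = 0.85, "1" = 0.15)        # hypothetical target population

# (Weighted) difference in mean BP change, treated minus control
diff_means <- function(y, t, w = rep(1, length(y))) {
  weighted.mean(y[t == 1], w[t == 1]) - weighted.mean(y[t == 0], w[t == 0])
}

study_effect <- diff_means(bp_change, treated)   # average effect in the trial population

# Standardization: stratum-specific effects averaged over the target distribution
stratum_eff <- sapply(c("0", "1"), function(s) {
  keep <- comorbid == as.numeric(s)
  diff_means(bp_change[keep], treated[keep])
})
standardized <- sum(stratum_eff * target_share[names(stratum_eff)])

# IP selection weighting: weight = target share / study share of each stratum
study_share <- prop.table(table(comorbid))
w <- as.numeric(target_share[as.character(comorbid)] /
                study_share[as.character(comorbid)])
ip_weighted <- diff_means(bp_change, treated, w)

round(c(study = study_effect, standardized = standardized, ip_weighted = ip_weighted), 1)
```

The two transported estimates agree closely with each other and differ from the trial average, purely because the target population has a different mix of the effect modifier. Both rest on the assumption that comorbidity is the only effect modifier whose distribution differs, which cannot be verified from the trial data alone.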

Reflection

A randomized controlled trial of a new diabetes medication was conducted exclusively among patients aged 50–75 at academic medical centers. A community health clinic serving a predominantly young, uninsured population wants to adopt this medication. What factors would you consider when evaluating whether the trial results are transportable to this new setting?

Model answer: Transportability is a separate question from internal validity. Factors to consider: (a) baseline risk: the community population has different absolute diabetes risk than the trial population, so even a true relative effect translates to different absolute benefit (and absolute risk reduction is what policy needs). (b) Effect modification: age and uninsured status are likely modifiers (different comorbidities, medication adherence, baseline glycemic control); if the trial enrolled only 50–75-year-olds, the effect in 25–45-year-olds is extrapolation. (c) Healthcare context: academic medical centres deliver the intervention with specialist follow-up; a community clinic with under-resourced staffing may not reproduce the trial's adherence support. (d) Comorbidities and concomitant medications: uninsured young adults often have different patterns of co-occurring conditions and access to other therapies. Methodological remedy: run a formal transportability analysis (Pearl & Bareinboim) or a target-trial emulation in the community population, and ask the trialists to publish subgroup effects that can be propagated forward.

Section 4 of 4

Final Assessment

⏱ Estimated time: 25 minutes

Bringing It All Together

This lesson built a working inventory of selection-related biases in the order they typically arise. Section 1 handled biases that operate at the moment of enrollment, such as Berkson’s bias in hospital-based case-control designs and the healthy worker effect in occupational cohorts. Section 2 turned to losses that happen after enrollment: differential attrition in longitudinal studies and nonresponse in surveys, both invisible unless you ask whether the loss depends on both exposure and outcome. Section 3 tackled the upstream survival filter, prevalence–incidence (Neyman) bias and survivorship bias, then stepped outside the sample altogether to ask the external-validity question: when does an internally valid finding transport?

Read together, these mechanisms form a structured checklist for "the sample is wrong." Combined with an earlier lesson’s vocabulary for "the variables are wrong" and the information-bias inventory ahead in a later lesson, you are building the appraisal toolkit you will use for the rest of the course. The diagnostic question that runs through all three sections is the same: who is in the data, who is missing, and does that missingness depend on the very relationship we are trying to estimate?

The final reflection asks you to apply the full inventory to a single hypothetical study; the 15-question assessment then checks the conceptual material directly. From here, a later lesson turns to information bias, what happens when the people in the study are right but the measurements on them are wrong.

Key Takeaways from this lesson

Berkson’s bias: hospital-based sampling inflates co-occurrence of conditions because being hospitalized is itself a selection filter.
Healthy worker effect: workforce participation pre-selects healthier people, so SMRs below 1.0 against the general population can mask real occupational hazard.
Attrition and nonresponse: what matters is not the absolute rate of loss but whether loss is differential with respect to exposure and outcome; only then does it bias the estimate.
Neyman bias and survivorship bias: cross-sectional and "survivor"-based samples are filtered by prior survival, biasing risk-factor estimates in directions predictable from P ≈ I × D.
Transportability: internal validity does not guarantee external validity; findings travel only when effect-modifier distributions are comparable between study and target populations.
Diagnostic discipline: for every study, ask who got into the sample, who left, and whether that depends on both exposure and outcome; structural problems cannot be fixed by larger sample sizes.

R Activity: Convenience sampling and the healthy-worker effect

The companion R script r-activities/HSCI_230_Lesson_8_Sampling_Selection_and_External_Validity.R simulates a 10,000-person population with a known true mean BMI of 27, then draws a simple random sample and a convenience sample of gym-goers (whose enrollment probability depends on BMI). You compare each sample mean to the truth, then replicate 1,000 times to see why selection bias shifts the entire sampling distribution, the same structural problem behind the healthy-worker effect.

set.seed(230)

# 10,000-person "population" with true mean BMI = 27
N        <- 10000
bmi      <- rnorm(N, mean = 27, sd = 5)
gym_goer <- rbinom(N, 1, prob = plogis(2 - 0.15*bmi))   # lower BMI -> more likely

mean(bmi)                              # truth: ~27

# (1) Simple random sample of 200
srs <- sample(bmi, size = 200)
mean(srs)

# (2) Convenience sample: 200 gym-goers
conv <- sample(bmi[gym_goer == 1], size = 200)
mean(conv)

## -----------------------------------------------------------------------------
## Stretch: replicate 1000 times and look at the sampling distribution
## -----------------------------------------------------------------------------
srs_means  <- replicate(1000, mean(sample(bmi, 200)))
conv_means <- replicate(1000, mean(sample(bmi[gym_goer == 1], 200)))

par(mfrow = c(1, 2))
hist(srs_means,  main = "Simple Random Sample", xlim = c(20, 30), xlab = "Mean BMI")
abline(v = 27, col = "red", lwd = 2)
hist(conv_means, main = "Convenience (gym)",     xlim = c(20, 30), xlab = "Mean BMI")
abline(v = 27, col = "red", lwd = 2)
par(mfrow = c(1, 1))

Reflection

You are designing a study to estimate the effect of air pollution exposure on childhood asthma in a large metropolitan area. Describe at least three different selection biases that could threaten your study and explain one specific design decision you would make to address each. Be sure to distinguish between biases that affect internal validity versus external validity (transportability).

Model answer: Three selection biases for a child-asthma / air-pollution study, with one design decision each. (1) Recruitment / participation bias (internal validity): families in high-pollution neighbourhoods who participate may be more health-aware than non-participants. Fix: use a probability sample of the metropolitan census frame with active follow-up and reporting of refusal rates by neighbourhood. (2) Differential loss-to-follow-up (internal validity): high-pollution-neighbourhood families are more likely to move, biasing 5-year incidence estimates downward in the exposed group. Fix: build administrative-data linkage (provincial health records) so attrition does not erase outcomes; report exposure-stratified attrition. (3) Generalizability / transportability (external validity): the cohort, even if internally clean, may not transport to a different metro with different pollution mix and age structure. Fix: pre-specify the target population, characterise covariate distributions, and produce both within-study and transported-effect estimates. Distinguish clearly: (1)–(2) threaten the effect here; (3) threatens the effect elsewhere.

Final Knowledge Assessment

Lesson complete!

Congratulations! You have successfully completed the lesson on Sampling, Selection Processes, and External Validity.

You demonstrated understanding of Berkson’s bias, the healthy worker effect, attrition and nonresponse bias, prevalence-incidence bias, survivorship bias, and transportability.

A later lesson stays inside the same broad inventory but pivots to its third leg. Where an earlier lesson covered measurement and causal-specification problems and this lesson covered selection problems, that later lesson, Information Bias and Data Quality, turns to information bias: the systematic errors that arise from how exposure and outcome are measured once participants are in the study. The three-part bias triad (information, selection, confounding) will be complete by the end of that lesson.

HSCI 230, Lesson 8
