Conceptualization, Measurement & Causal Specification

Evaluating Epidemiological Research

Learning objectives for this lesson:

Explain construct validity and identify threats such as measurement non-invariance and construct-irrelevant variance
Distinguish between reliability and validity and describe how measurement error attenuates epidemiological associations
Recognize when ordinal variables are inappropriately treated as interval-level data and the consequences for study findings
Use directed acyclic graphs (DAGs) to identify collider bias, overadjustment bias, and confounding
Explain the obesity paradox and smoking-birth weight paradox as examples of causal specification errors
Distinguish residual confounding, reverse causation, and simultaneity bias using empirical examples
Critically evaluate whether epidemiological studies have adequately addressed measurement and causal specification issues

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Construct A theoretical concept (e.g., depression, socioeconomic status, health) that cannot be observed directly and must be inferred from observable indicators. Constructs are the targets of measurement in epidemiology.

Operationalization The process of translating an abstract construct into a concrete, measurable variable: deciding which questions, scales, or indicators will stand in for the construct in your data.

Construct Validity The degree to which a measurement instrument actually captures the theoretical concept it is intended to measure. A depression scale has strong construct validity only if it reflects depression rather than anxiety, fatigue, or social desirability.

Face Validity The most basic form of validity: at face value, do the items appear to measure what they claim to measure? A weak but necessary first check.

Content Validity The extent to which a measure’s items adequately cover the full domain of the construct. A depression scale missing items on sleep or appetite would have weak content validity.

Criterion Validity How well a measure predicts or correlates with an external gold-standard criterion (concurrent or predictive validity). For example, does a depression screener track clinician diagnoses?

Reliability The consistency or reproducibility of a measurement, i.e., whether the instrument produces the same answer on repeated administration, across raters, or across items. High reliability is necessary but not sufficient for validity.

Measurement Error The discrepancy between the measured value and the true value of a variable. Error can be random (noise that attenuates associations) or systematic (bias that distorts them in a particular direction).

Random (Non-Differential) Measurement Error Mistakes that occur unpredictably and independently of exposure or outcome status. In simple cases this pulls associations toward the null rather than exaggerating them or reversing their direction.

Systematic (Differential) Measurement Error Error that depends on exposure or outcome status and shifts estimates in a predictable direction. Recall bias is a classic example.

Attenuation Bias The pull of an estimate toward the null caused by random measurement error in the exposure. Poorly measured exposures (e.g., self-reported diet) systematically understate true effects.

Measurement Invariance The property that a scale measures the same construct in the same way across groups (e.g., across racial/ethnic groups, languages, or time). Without invariance, group comparisons may reflect measurement differences rather than true differences.

Differential Item Functioning (DIF) When individuals from different groups with the same underlying level of a trait have different probabilities of endorsing an item. DIF is a sign of measurement non-invariance.

Ordinal vs. Interval Scales Ordinal scales rank responses without assuming equal spacing (e.g., Likert agreement); interval scales assume equal distances between values. Treating ordinal data as interval can produce biased estimates and even reverse the direction of effects.

Exposure The factor whose effect on health you want to estimate (e.g., smoking, air pollution, an intervention). Defining and measuring exposure carefully is the first step of any epidemiological study.

Outcome The health state or event you are trying to explain or predict (e.g., disease incidence, mortality, symptom severity).

Causal Specification The set of decisions about which variables to treat as exposures, outcomes, confounders, mediators, or colliders, and how they relate. Misspecification produces biased estimates even when measurement is excellent.

Reverse Causation When the supposed outcome actually causes the supposed exposure (e.g., early disease symptoms reduce physical activity, making inactivity look like a cause of disease).

Simultaneity Bias A bias arising when exposure and outcome influence each other contemporaneously. Standard regression cannot disentangle direction in such bidirectional systems.

Residual Confounding Confounding that remains after adjustment because the confounder was measured imprecisely, categorized too coarsely, or only partially captured. Even “adjusted” estimates can carry meaningful confounding.

Biomedical Model A framework that locates disease in individual bodies and explains population patterns as the sum of individual risk factors. Powerful for some questions but tends to under-measure social and structural conditions.

Social Determinants of Health The conditions in which people are born, grow, live, work, and age, including income, education, housing, working conditions, and discrimination, that shape population health and inequities.

Fundamental Cause Theory Link & Phelan’s argument that socioeconomic status is a “fundamental cause” of health inequalities because it gives access to flexible resources (knowledge, money, power, beneficial social connections) that protect health no matter what the proximate risks are.

Health Equity The absence of unfair and avoidable differences in health among groups defined socially, economically, demographically, or geographically (Braveman, 2014). Equity is about which differences are unjust, a question distinct from whether averages differ.

Methods, Biases & Study Designs

Directed Acyclic Graph (DAG) A diagram of presumed causal relationships using arrows to encode cause-and-effect, with no feedback loops. DAGs help identify confounders, mediators, and colliders and decide which variables to adjust for.

Confounder A variable that causes both the exposure and the outcome. Adjusting for a confounder removes bias.

Mediator A variable on the causal pathway between exposure and outcome. Adjusting for a mediator blocks part of the very effect you are trying to estimate.

Collider A variable caused by both the exposure and the outcome (or by variables associated with each). Adjusting for a collider induces spurious associations where none existed.

Collider Stratification Bias A bias introduced by restricting analysis to, or stratifying on, a collider. The obesity paradox is a canonical example.

Overadjustment Bias Bias caused by adjusting for a mediator (or collider). Overadjustment can underestimate the total effect of an exposure or generate spurious paradoxes such as the birth weight paradox.

Obesity Paradox The counterintuitive finding that overweight or obese patients with a chronic disease appear to have better survival than normal-weight patients. Often explained by collider stratification bias rather than a true protective effect.

Birth Weight Paradox The puzzling finding that maternal smoking appears protective for low-birth-weight infants. Hernandez-Diaz et al. (2006) showed it is an artifact of overadjustment when birth weight is treated as a confounder rather than a mediator.

Key People

Judea Pearl (1936–) Computer scientist whose work on causal graphs and the do-calculus formalized DAGs as tools for causal inference in epidemiology and the social sciences.

Miguel Hernán Epidemiologist (Harvard) who, with James Robins, has pushed modern epidemiology toward explicit causal questions, target trial emulation, and rigorous use of DAGs.

Sander Greenland Epidemiologist whose writing on confounding, collapsibility, and the misuse of statistical significance has shaped how the field thinks about bias and causal inference.

Paula Braveman Physician and population health scientist whose work has clarified the definition of health equity and pushed measurement toward upstream social conditions.

Bruce Link & Jo Phelan Sociologists who developed Fundamental Cause Theory, arguing that socioeconomic status persistently produces health inequalities because it provides flexible resources that protect health.

Section 1 of 4

Construct Validity & Measurement

⏱ Estimated reading time: 20 minutes

From the theoretical origins of what gets measured, to differential item functioning, scale level, and attenuation bias.

Why this matters

The unstated assumption behind every study design

Other lessons showed how study designs produce measures of association. This section asks whether those measures mean what we think they mean.

Three questions run underneath the work that follows: are the instruments right, is the causal model right, and what biases survive even when both are?

The core concept

Construct validity

What it asks

Does the instrument capture the intended construct rather than something adjacent, like fatigue, anxiety, or social desirability?

The prior question

Where did the construct come from? Every measure inherits the theoretical assumptions of the framework that defined it.

Theory-laden measurement

Four frameworks that widen the lens

Social determinants

Fundamental causes

Ecosocial theory

Health equity

Each framework shifts measurement priorities toward upstream conditions that purely individual-level instruments cannot see: neighbourhood context, income trajectories, chronic stress, discrimination, and power.

Krieger (1994, 2001, 2011); Link & Phelan (1995); Braveman (2014).

Case study

The CES-D and differential item functioning

The Center for Epidemiologic Studies Depression Scale shows significant differential item functioning across racial and ethnic groups. The same underlying depression level produces different item responses in different populations.

Comparing group scores without testing measurement invariance may reflect measurement differences rather than true health differences.

The key point

Identical scores can carry different meanings in different populations. Cross-group comparisons require prior confirmation of measurement invariance.

Iwata et al., 2002; Kim et al., 2011.

Case study

Self-rated health: powerful but problematic

Its strength

A single self-rated item predicts mortality remarkably well in large cohort studies.

Its problems

Lower-SES individuals rate health more favourably relative to objective indicators; predictive validity varies across racial/ethnic groups; cultural response styles differ across nations.

Sen (2002); Franks et al. (2003); Jylhä et al. (1998).

Scale level & reliability

Ordinal is not interval; noise attenuates

Ordinal ≠ interval

Treating Likert categories as equally spaced can bias estimates and even reverse effect directions (Liddell & Kruschke, 2018).

Attenuation bias

Random measurement error in the exposure pulls the estimated slope toward zero. Reliability coefficients as low as 0.4–0.6 on food-frequency questionnaires help explain decades of apparent null findings in nutritional epidemiology.

Figure: Scatterplot of outcome against a noisily measured exposure. Adding random error to the exposure flattens the fitted slope from about 1.0 (true) to about 0.24 (observed), the signature of attenuation bias.

Carry forward

What to take into the next section

Measurement is theory-laden. Which constructs get measured determines what causal stories can be told with the data.
Validity and reliability are not the same. A reliable instrument can still be validly wrong.
Random error attenuates. Noisy exposure measurement systematically pulls estimates toward the null, regardless of sample size.

Introduction and Overview

Earlier lessons surveyed the four observational designs and showed that each one is built on the same kind of 2×2 table with the same kind of measure of association. The unstated assumption running through all of those lessons is that the variables in the table actually mean what we think they mean: that “exposure” really is exposure, “disease” really is disease, and the chosen confounders are the right ones in the right relationship to the exposure and outcome. This lesson stops to interrogate that assumption. The three content sections each pick at a different layer of it. This section asks whether our instruments measure the constructs we say they measure (and whose theory of disease determined what got measured at all). A second section uses directed acyclic graphs to expose the most common causal-specification mistakes, such as conditioning on colliders and adjusting for mediators, along with the “paradoxes” both produce. A third section works through three biases that survive even careful measurement and adjustment: residual confounding, reverse causation, and simultaneity. By the time you reach the lesson on sampling and selection, the inventory of biases this lesson opens will be ready to be combined with the additional biases that arise from who ends up in the study at all.

Learning Objectives

Define construct validity, reliability, measurement non-invariance, and construct-irrelevant variance, and explain how each can bias an epidemiological study.
Explain why measurement is theory-laden, that is, how the choice of biomedical, social-determinants, fundamental-causes, or ecosocial frameworks shapes which variables are measured at all.
Apply Krieger’s and Link & Phelan’s arguments to predict what a study’s evidence base will and will not be able to see.
Treat the choice of which population subgroups to disaggregate as a substantive theoretical decision, not a reporting afterthought.
Recognise classical (random) measurement error and its tendency to attenuate associations toward the null (regression dilution bias).

Why Measurement Matters in Epidemiology

Epidemiological research depends on our ability to accurately measure the constructs we study, namely exposures, outcomes, and confounders. When measurements fail to capture the underlying phenomenon of interest, even a perfectly designed study can yield misleading results. This section examines how measurement problems introduce systematic error into epidemiological research.

Key Concept: Construct Validity

Construct validity refers to the degree to which a measurement instrument actually captures the theoretical concept it is intended to measure. A scale designed to measure “depression” has strong construct validity only if it truly reflects the underlying depressive construct rather than anxiety, fatigue, cultural distress, or social desirability.

The core measurement concepts that underpin epidemiological research:

Construct Validity
Measurement Non-Invariance
Construct-Irrelevant Variance
Reliability

Theory Before Instruments: How Frameworks Shape What We Measure

Construct validity asks whether an instrument captures the construct it is supposed to measure. A prior question is rarely asked but more consequential: where does the construct come from? Every variable in an epidemiological study is the residue of a theoretical decision, a claim about what causes disease, what counts as a relevant exposure, and where the boundary of the “cause” should be drawn. That decision is upstream of any psychometric work, and it determines what the rest of the analysis can possibly see.

Why this matters: measurement is theory-laden

What gets measured shapes what knowledge is produced and, by extension, what interventions become thinkable. If a study of cardiovascular disease measures cholesterol, blood pressure, and smoking but not neighbourhood disinvestment, occupational exposures, or experiences of discrimination, the resulting evidence base will reliably point clinicians toward statins and behavioural counselling rather than toward housing, labour, or anti-racism policy. The instruments did not pick themselves; a theory of disease causation picked them (Krieger, 2011).

Public health has historically been dominated by the biomedical model, which locates disease in individual bodies and explains population patterns as the aggregation of individual risk factors. The biomedical model is powerful for some questions (it gave us germ theory, vaccines, antibiotics), but it systematically under-measures the conditions in which bodies live, work, and age. Several theoretical frameworks have emerged to push back against this narrowness.

Social determinants of health (SDOH)

The social determinants of health framework, popularised by the WHO Commission on Social Determinants of Health (Solar & Irwin, 2010) and earlier by Marmot’s Whitehall studies (Marmot et al., 1991), holds that the conditions in which people are born, grow, live, work, and age are the dominant drivers of population health. Income, education, housing, food security, working conditions, and social inclusion explain a substantially larger share of the variance in health outcomes than medical care does (McGinnis, Williams-Russo, & Knickman, 2002).

If you accept this framework, your measurement priorities shift: a study of asthma incidence should measure mould exposure, landlord responsiveness, and traffic proximity alongside inhaler adherence.

Fundamental causes (Link & Phelan, 1995)

Link and Phelan (1995) proposed that socioeconomic status is a fundamental cause of disease because it (1) influences multiple disease outcomes, (2) operates through multiple risk-factor mechanisms, (3) involves access to resources that can be deployed to avoid risks or minimise consequences, and (4) reproduces health inequalities even as the specific intervening mechanisms change over time.

This is why educational gradients in mortality have persisted across centuries even as the leading causes of death have shifted from infectious to chronic disease. The mechanism changed; the gradient did not. Measurement implication: adjusting for downstream behavioural mediators (smoking, diet) does not “explain away” the SES–mortality association, because flexible resources will simply find a new pathway. Treating SES purely as a confounder to be statistically controlled is a theoretical commitment, and arguably a mistaken one (Phelan, Link, & Tehranifar, 2010).

Ecosocial theory (Krieger, 1994; 2001)

Nancy Krieger’s ecosocial theory asks how we “literally embody, biologically, the societal and ecological context into which we are born, develop, interact, and endeavor to live meaningful lives” (Krieger, 2001, p. 672). The theory unifies social and biological levels of analysis using the concept of embodiment: chronic exposure to discrimination, poverty, environmental hazards, and labour stress is literally inscribed in cortisol patterns, telomere length, allostatic load, and epigenetic marks.

Ecosocial theory pushes researchers toward measuring exposures across the lifecourse, at multiple spatial scales (body, neighbourhood, region, nation), and with explicit attention to power, history, and accountability for population health (Krieger, 2011).

Health equity as a measurement principle

Braveman (2014) defines health equity as the absence of unfair and avoidable differences in health among population groups defined socially, economically, demographically, or geographically. Equity is not the same as equality of average health; it is a claim about which differences are unjust.

Operationalising equity requires measuring along the axes where injustice is suspected to operate: race and income, and also Indigeneity, immigration status, gender identity, disability, and their intersections. A study that reports only an overall mean has, by omission, taken a position on which differences are worth noticing. Choosing not to disaggregate is itself a theoretical choice.

Worked example: two studies of the same outcome

Imagine two research teams each studying type 2 diabetes incidence in the same population.

Team A works from a biomedical frame. They measure BMI, fasting glucose, HbA1c, dietary intake (FFQ), self-reported physical activity, and family history. Their conclusion: incidence is driven by individual lifestyle and genetic risk; intervention should target diet and exercise counselling.

Team B works from an ecosocial frame. They measure the same biomarkers and also neighbourhood food environment, shift-work history, lifetime experiences of racial discrimination, household income trajectory since childhood, and exposure to a major recession. Their conclusion: behavioural risk factors mediate roughly half of the social gradient; the rest reflects chronic stress and structural disinvestment. Intervention should include income support, labour protections, and neighbourhood investment alongside clinical care.

Both teams are doing “valid” epidemiology in the construct-validity sense. They reach different conclusions because they measured different things, and they measured different things because they began from different theories about what causes disease.

A caution and a balance

None of this means biomedical measurement is wrong. Cholesterol, viral load, and tumour staging are real, important, and often actionable. The argument is that biomedical models are frequently insufficient: they capture proximal mechanisms but obscure the upstream conditions that produce the patterns we observe. A defensible epidemiological study makes its theoretical commitments explicit, justifies its choice of constructs, and acknowledges what its instruments cannot see.

The four frameworks above name competing theories of what causes disease. The next case studies move from theory to instrument, showing how widely-used measurement tools quietly inherit the assumptions of whatever theory built them.

Case Study: The CES-D Scale and Differential Item Functioning

The Center for Epidemiologic Studies Depression Scale (CES-D)

The CES-D is one of the most widely used screening instruments for depressive symptoms in population-based research. However, studies have documented significant differential item functioning (DIF) across racial, ethnic, and cultural groups. For example, Iwata et al. (2002) found that Japanese respondents endorsed somatic items (e.g., “my sleep was restless”) at higher rates than American respondents with equivalent levels of underlying depression, while American respondents endorsed affective items (e.g., “I felt sad”) at higher rates.

Similarly, Kim et al. (2011) demonstrated that multiple CES-D items function differently across non-Hispanic White, African American, and Hispanic adults in the United States. Items related to interpersonal difficulties showed DIF by race/ethnicity, meaning that a CES-D score of 16 (the traditional clinical cutoff) does not carry the same meaning across these groups.

Why This Matters for Epidemiological Research

When researchers compare depression prevalence across racial/ethnic groups using the CES-D, they may be comparing “apples to oranges.” Observed disparities in depression could reflect true differences in depressive symptomatology, differences in how groups express and report distress, or some combination of both. Without establishing measurement invariance, we cannot distinguish these explanations.
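One common way to screen an item for DIF is a logistic regression that matches respondents on the rest of the scale and then asks whether group membership still predicts endorsement. The sketch below uses simulated data only, not the CES-D or any of the studies cited above. The group indicator, the item difficulties, and the size of the DIF effect are all hypothetical values chosen for illustration.

```r
# Simulated illustration of uniform DIF; every number below is hypothetical.
set.seed(42)
n     <- 4000
group <- rbinom(n, 1, 0.5)   # hypothetical two-group indicator (0/1)
theta <- rnorm(n)            # latent depression, same distribution in both groups

# Four invariant "anchor" items: endorsement depends on theta only
difficulty <- c(-1, -0.5, 0, 0.5)
anchors <- sapply(difficulty, function(b) rbinom(n, 1, plogis(theta - b)))

# Studied item: at the same theta, group 1 is more likely to endorse it
item_dif <- rbinom(n, 1, plogis(theta + 1.0 * group))

# DIF check: match on the anchor score, then test the group term
anchor_score <- rowSums(anchors)
fit_dif <- glm(item_dif ~ anchor_score + group, family = binomial)
dif_est <- coef(summary(fit_dif))["group", ]

# Control: an invariant item, matched on the remaining anchors
item_ok    <- anchors[, 1]
rest_score <- rowSums(anchors[, -1])
fit_ok <- glm(item_ok ~ rest_score + group, family = binomial)
ok_est <- coef(summary(fit_ok))["group", ]

round(rbind(dif_item = dif_est, invariant_item = ok_est), 3)
```

In this setup the group coefficient should be clearly positive for the DIF item and close to zero for the invariant item, even though both groups have the same latent distribution. A raw comparison of endorsement rates on the DIF item would therefore suggest a group difference in depression that does not exist. Real invariance testing usually relies on multi-group factor or item response models; this is only the simplest version of the idea.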

The CES-D case showed how a multi-item instrument can mean different things in different groups. The next example takes the same lesson to its limit: a single-item measure that turns out to be one of the strongest predictors in epidemiology, while inheriting all the same problems.

Self-Rated Health: A Deceptively Simple Measure

Self-rated health (SRH), typically measured as a single item (“How would you rate your overall health?”), is one of the strongest predictors of mortality in epidemiological research. Yet SRH responses are shaped by comparison groups, expectations, and cultural frameworks.

Finding	Study	Implication
Lower-SES individuals report better SRH relative to their objective health indicators than higher-SES individuals	Sen (2002), Health: Perception versus Observation	SRH may underestimate health inequalities across socioeconomic strata
The predictive validity of SRH for mortality varies across racial/ethnic groups	Franks et al. (2003), Social Science & Medicine	Using SRH as a uniform outcome measure may introduce differential misclassification
Cultural differences in response styles (e.g., modesty norms) affect SRH reporting	Jylhä et al. (1998), Social Science & Medicine	Cross-national comparisons using SRH require careful calibration

Validity questions about what we are measuring naturally lead to a related question about the scale on which we record the answer. Most epidemiological surveys use ordinal Likert-type response options; almost every analysis treats them as if the gaps between categories were equal. They usually are not.

The Problem of Scale Level: Ordinal vs. Interval

Epidemiological and health behavior research frequently uses Likert-type scales, which are ordinal response options such as “Strongly Agree” to “Strongly Disagree.” Researchers routinely assign numeric values (1–5) and analyze these as if the intervals between categories are equal. But is the “distance” between “Strongly Agree” and “Agree” really the same as between “Neutral” and “Disagree”?

The intuition in miniature: think of the order in which runners finish a race. Knowing that someone came first, second, or third tells you the ranking but not the gaps. First and second might be a hundredth of a second apart while third trails by a minute. Likert responses behave the same way. The labels are ordered, but the codes 1, 2, 3, 4, 5 we attach to them do not promise that each step is the same size on the underlying thing we care about.

Why does treating ordinal data as interval matter?

Simulation studies by Liddell and Kruschke (2018) demonstrated that treating ordinal Likert data as metric (interval) in linear regression models can produce inflated Type I error rates, biased parameter estimates, and even reversals of effect direction. The bias is most severe when:

Response distributions are skewed (e.g., most respondents cluster at one end of the scale)
The spacing between response categories is unequal in the latent construct
Interactions between variables are being tested

Appropriate alternatives include ordinal logistic regression or Bayesian ordinal models that respect the rank-order nature of the data.
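As a minimal sketch of the ordinal alternative, the code below simulates a five-category Likert item whose categories are unequally spaced on the latent scale, then fits a proportional-odds model with polr() from the MASS package (assumed to be installed; it ships with standard R distributions). The latent slope of 0.8 and the cut-points are hypothetical choices, not values from Liddell and Kruschke (2018).

```r
library(MASS)  # provides polr() for proportional-odds (ordinal logistic) models

set.seed(1)
n <- 3000
x <- rnorm(n)                          # hypothetical continuous predictor
latent <- 0.8 * x + rlogis(n)          # latent agreement; true slope = 0.8

# Unequally spaced cut-points on the latent scale (hypothetical values)
cuts <- c(-Inf, -2.5, -1.5, 0.5, 1.0, Inf)
likert <- cut(latent, breaks = cuts,
              labels = c("SD", "D", "N", "A", "SA"),
              ordered_result = TRUE)

# Ordinal model: estimates the slope and the thresholds between categories
fit_ord <- polr(likert ~ x, method = "logistic", Hess = TRUE)

# Metric model: treats the codes 1-5 as equally spaced numbers
fit_lin <- lm(as.integer(likert) ~ x)

coef(fit_ord)        # slope on the latent log-odds scale
fit_ord$zeta         # estimated thresholds; note the unequal gaps
coef(fit_lin)["x"]   # slope in units of the arbitrary 1-5 codes
```

The ordinal model recovers the latent slope and estimates the thresholds directly, so the unequal spacing becomes part of the output instead of a hidden assumption. The linear slope is expressed in units of the 1–5 codes, which have no fixed meaning on the latent scale.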

Real-world example: Physical activity measurement

Many large surveys measure physical activity using ordinal categories (e.g., “inactive,” “somewhat active,” “active,” “very active”). When researchers code these as 1–4 and fit linear models, they assume that the difference in health impact between “inactive” and “somewhat active” equals that between “active” and “very active.” In reality, evidence suggests the health benefits of physical activity follow a curvilinear pattern, with the largest gains at the lower end of the activity spectrum (Arem et al., 2015).
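The equal-steps assumption can be checked directly by comparing a model that codes activity as 1–4 with one that treats it as a set of categories. The sketch below uses simulated data; the category means are hypothetical values chosen to mimic diminishing returns and are not estimates from Arem et al. (2015).

```r
set.seed(7)
n <- 2000
levels_pa <- c("inactive", "somewhat active", "active", "very active")
pa <- factor(sample(levels_pa, n, replace = TRUE), levels = levels_pa)

# Hypothetical mean risk score per category: the largest drop is at the low end
true_means <- c(0, -3, -4, -4.5)
y <- true_means[as.integer(pa)] + rnorm(n, sd = 2)

fit_linear <- lm(y ~ as.integer(pa))   # assumes every step has the same effect
fit_cat    <- lm(y ~ pa)               # lets each category have its own mean

# Step-by-step changes implied by the categorical model
cat_means <- c(0, coef(fit_cat)[-1])   # each level relative to "inactive"
steps <- diff(unname(cat_means))
round(steps, 2)

# The linear coding is nested in the categorical one, so an F-test compares them
lof <- anova(fit_linear, fit_cat)
lof
```

If the steps were truly equal, the categorical model would fit no better than the linear one. Here the first step is far larger than the last, and the F-test flags the linear coding as a poor description of the data.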

Construct validity, response-scale form, and cultural invariance are all about whether the instrument is asking the right thing. The last measurement issue this section covers is about whether it is asking it consistently.

Reliability and Attenuation Bias

Even when a measure is conceptually valid, poor reliability introduces random measurement error that systematically weakens (attenuates) observed associations. This is one of the most pervasive yet underappreciated problems in nutritional epidemiology.

Case: Dietary Intake and Disease Risk

Food frequency questionnaires (FFQs) are the most common method for measuring dietary intake in large cohort studies. However, test-retest reliability studies reveal substantial within-person variability. Willett (2013) demonstrated that single FFQ assessments can have reliability coefficients as low as 0.4–0.6 for many nutrients, meaning that only 40–60% of the observed variance reflects true between-person differences and the remainder is random error.
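To see what a reliability coefficient of this size means, the sketch below simulates two administrations of the same questionnaire under classical measurement error, where the errors are independent of true intake and of each other. The error SD is a hypothetical value chosen so that the theoretical reliability is 0.5.

```r
set.seed(11)
n <- 5000
true_intake <- rnorm(n, mean = 0, sd = 1)   # standardized true long-term intake
error_sd    <- 1                            # hypothetical: error variance equals true variance

# Two administrations of the same instrument, each with its own random error
ffq_time1 <- true_intake + rnorm(n, sd = error_sd)
ffq_time2 <- true_intake + rnorm(n, sd = error_sd)

# Reliability = true variance / (true variance + error variance)
reliability_theory <- 1 / (1 + error_sd^2)

# Under these assumptions the test-retest correlation estimates the same quantity
reliability_est <- cor(ffq_time1, ffq_time2)

round(c(theory = reliability_theory, estimated = reliability_est), 3)
```

Under these assumptions the test-retest correlation and the variance ratio are two views of the same quantity. If errors are correlated across administrations, as when a respondent over-reports in the same way both times, the test-retest correlation will overstate reliability.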

This measurement error leads to attenuation of diet-disease associations. For example, the true relative risk for the association between dietary fat intake and breast cancer risk may be 1.5, but observed relative risks in studies using FFQs might be only 1.1–1.2 due to regression dilution bias. This has contributed to decades of “null findings” in nutritional epidemiology that may reflect measurement limitations rather than true absence of effect (Kipnis et al., 2003).
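The arithmetic behind this kind of shrinkage can be sketched with a common approximation: under classical measurement error in a log-linear risk model, the observed log relative risk is roughly the true log relative risk multiplied by the reliability. This approximation is an assumption made here for illustration and is not the specific model used in the cited studies.

```r
rr_true     <- 1.5                 # true relative risk from the example above
reliability <- c(0.4, 0.5, 0.6)    # illustrative reliability coefficients

# Attenuation acts on the log scale: log(RR_obs) = reliability * log(RR_true)
rr_observed <- exp(reliability * log(rr_true))

round(data.frame(reliability, rr_observed), 2)
```

With these inputs the observed relative risks come out at about 1.18, 1.22, and 1.28. A true relative risk of 1.5 is therefore reduced to a modest association that many studies would struggle to distinguish from 1.0, and lower reliability pushes it lower still.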

Correction Methods

Researchers can use regression calibration and measurement error models to adjust for known attenuation. These methods require validation substudies with more precise measurements (e.g., biomarkers, doubly labeled water) to estimate the degree of measurement error and correct the observed associations (Carroll et al., 2006).
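A minimal sketch of regression calibration with simulated data follows. It assumes classical error, a linear outcome model, and a validation substudy in which the reference measure captures true exposure without error. The sample sizes and variable names are hypothetical.

```r
set.seed(230)
n <- 5000
X <- rnorm(n, mean = 10, sd = 2)     # true exposure (unobserved in the main study)
W <- X + rnorm(n, sd = 2)            # error-prone measurement, e.g. an FFQ
Y <- 2 + 1 * X + rnorm(n, sd = 1)    # outcome; true slope = 1

# Hypothetical validation substudy: a random subset with a reference measure of X
val <- sample(n, 1000)
lambda_hat <- unname(coef(lm(X[val] ~ W[val]))[2])   # calibration (attenuation) factor

# Naive analysis in the full cohort uses the error-prone measure
naive_slope <- unname(coef(lm(Y ~ W))[2])

# Regression calibration: divide the naive slope by the calibration factor
corrected_slope <- naive_slope / lambda_hat

round(c(naive = naive_slope, lambda = lambda_hat, corrected = corrected_slope), 3)
```

The corrected slope should sit near the true value of 1, while the naive slope sits near 0.5. The correction removes bias but not uncertainty. A full analysis would also carry the sampling error in the calibration factor into the confidence interval, for example by bootstrapping both steps together.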

R Watch attenuation bias shrink a true effect toward zero

What you'll do: simulate 2,000 people with a true linear exposure-outcome slope of 1.0, then add measurement noise to the exposure and re-estimate the slope. Vary the noise size and watch the attenuation grow.

What to take away: random error in an exposure variable does more than add noise; it systematically pulls the estimated slope toward zero. The dirtier your instrument, the smaller your estimated effect.

set.seed(230)
n <- 2000

# True exposure X, outcome Y with true slope = 1
X <- rnorm(n, mean = 10, sd = 2)
Y <- 2 + 1*X + rnorm(n, sd = 1)

# Noisy version of X (e.g., FFQ-measured dietary intake)
X_noisy <- X + rnorm(n, sd = 2)

# Compare slopes from "perfect" vs "noisy" exposure
coef(lm(Y ~ X))["X"]               # expect ~1.00 (truth)
coef(lm(Y ~ X_noisy))["X_noisy"]   # expect < 1.00 (attenuated)

# Stretch: how does noise size change the attenuation?
sds <- c(0, 0.5, 1, 2, 4)
sapply(sds, function(s) {
  Xn <- X + rnorm(n, sd = s)
  unname(coef(lm(Y ~ Xn))[2])
})

Console output (approx.)

X        1.001   # near the true slope of 1.00
X_noisy  0.498   # attenuated toward zero by ~50%
[1] 1.001 0.940 0.802 0.498 0.198
# sd = 0    0.5   1.0   2.0   4.0  (slope shrinks as error grows)

Reading the slopes. The clean regression recovers the true slope (~1.0). Adding noise with SD = 2 cuts the slope in half. Doubling noise SD to 4 shrinks it to ~0.2. This is exactly the mechanism that turns plausible diet-disease relationships into "null findings" when food-frequency questionnaires are the only exposure measure.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console before answering.

1. The true slope was 1.00. The clean lm(Y ~ X) recovered ~1.001 and the noisy lm(Y ~ X_noisy) recovered ~0.498. By what percentage was the noisy slope attenuated toward zero?

Model answer: The slope dropped from ~1.001 to ~0.498, an attenuation of about 50%. In classical (non-differential) measurement error terms, the observed slope = true slope × reliability, where reliability = σ²_X / (σ²_X + σ²_error). Here the true exposure X has variance 2² = 4 and the added noise also has variance 2² = 4, so reliability = 4/(4+4) = 0.5, almost exactly the attenuation factor you observed. This is the formal version of "noisy measurement biases regression slopes toward zero," and it is mechanical: it does not go away with sample size.

2. Read the stretch vector of slopes (sd = 0, 0.5, 1, 2, 4). Describe the pattern as measurement error SD grows. If a new biomarker cut measurement error SD from 2 down to 1, how close to the true slope of 1.0 would the new estimate get, based on your output?

Model answer: The slope shrinks monotonically as error SD grows: ~1.00 at SD=0, ~0.94 at SD=0.5, ~0.80 at SD=1, ~0.50 at SD=2, ~0.20 at SD=4 (your numbers will jitter, but they track the reliability factor 4/(4+SD²)). Halving error SD from 2 to 1 moves the recovered slope from ~0.50 up to ~0.80, so the new biomarker recovers a much larger fraction of the true effect, climbing from about half of the truth to about four-fifths of it. The take-home: investing in measurement precision can do more for effect estimation than enlarging the sample, because attenuation bias is a property of the noise-to-signal ratio, not of n.

3. Suppose X represents true dietary sodium and X_noisy represents self-reported FFQ-measured sodium. Connect your slope numbers to the Willett/Kipnis claim in this section: why might decades of "null findings" in nutritional epidemiology reflect measurement limitations rather than the absence of an effect?

Model answer: Self-reported FFQs are known to have measurement error SDs comparable to or larger than the true between-person variation in sodium, with reliability often below 0.4. From your simulation, that level of noise produces observed slopes ~0.20–0.40 of the truth. So even if dietary sodium genuinely affected blood pressure with a true slope of 1.0, the FFQ-based observational literature could easily report 0.2–0.3 with CIs spanning zero, the pattern Willett, Kipnis, and others have long argued explains apparent null findings. The remedy is not bigger studies; it is better measurement (biomarkers, recovery studies, regression calibration, repeated 24-h recalls) and analytic correction for known reliability.
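The relation used in these answers (observed slope ≈ true slope × reliability) also suggests the simplest analytic correction: divide the naive slope by an estimate of reliability. The sketch below is illustrative only. It assumes classical, non-differential error and treats the error SD as known, whereas in practice it would have to be estimated from a validation or replicate sub-study. The names err_sd, reliability, naive_slope, and corrected_slope are introduced here for illustration and are not part of the companion script.

```r
set.seed(230)
n <- 2000

# Same data-generating setup as the activity: true slope = 1
X <- rnorm(n, mean = 10, sd = 2)        # true exposure, variance 4
Y <- 2 + 1 * X + rnorm(n, sd = 1)

err_sd  <- 2                            # ASSUMPTION: error SD is known
X_noisy <- X + rnorm(n, sd = err_sd)    # classical measurement error

# Theoretical reliability: var(true X) / (var(true X) + var(error))
reliability <- 2^2 / (2^2 + err_sd^2)

naive_slope     <- unname(coef(lm(Y ~ X_noisy))[2])
corrected_slope <- naive_slope / reliability   # simple disattenuation

c(reliability = reliability, naive = naive_slope, corrected = corrected_slope)
```

The corrected slope should land near the true value of 1, but its standard error is inflated by the same factor, so the correction removes bias without recovering the lost precision.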

Section Takeaways

Every measured variable encodes a theoretical commitment about what causes disease; biomedical framings are powerful but often insufficient on their own.
Frameworks such as the social determinants of health, fundamental causes, and ecosocial theory direct measurement toward upstream conditions that purely individual-level instruments cannot see.
What gets measured shapes what becomes thinkable as an intervention, since omission is itself a theoretical choice, with implications for health equity.
Construct validity determines whether we are measuring what we think we are measuring, a prerequisite for valid epidemiological inference.
Measurement non-invariance means that identical scores may not be comparable across population subgroups.
Treating ordinal data as interval can bias effect estimates and inflate false-positive rates.
Poor reliability attenuates observed associations, potentially masking true causal effects.

Section 3 of 4

Residual Confounding, Reverse Causation & Simultaneity

⏱ Estimated reading time: 25 minutes

Three biases that survive correct measurement and correct causal structure.

Bias 1

Residual confounding

The bias that remains after adjustment, because adjustment was incomplete.

Imprecise measurement

Smoking as ever/never leaves intensity unadjusted within categories.

Coarse categorization

Broad income bands leave fine-grained SES differences untouched.

Omitted dimensions

Adjusting for education alone leaves wealth, neighbourhood, and occupational prestige uncontrolled.

The landmark case

Hormone replacement therapy and cardiovascular disease

Observational studies showed 30–50% lower cardiovascular risk with HRT. The Women's Health Initiative trial found the opposite (Rossouw et al., 2002; Hernán et al., 2008).

Healthy user bias and SES confounding persisted even after adjustment for measured covariates. The discrepancy is a canonical example of what happens when an imprecisely measured gradient is mistaken for a treatment effect.

Bias 2

Reverse causation: the arrow may point the other way

Lag analyses that exclude early follow-up are a primary diagnostic tool: if the association attenuates sharply when early events are removed, reverse causation is likely at work.

Bias 3

Simultaneity: both arrows point at once

When two variables mutually cause each other, standard regression cannot identify the causal direction.

Obesity predicts later depression (OR = 1.55). Depression predicts later obesity (OR = 1.58).

Luppino et al., 2010.

Fixes: cross-lagged panel models, structural equations, or instrumental variables.

Wrapping up

The three biases as a checklist

Residual confounding

Ask: Was the confounder measured precisely? Categorized finely enough?

Fixes: Better measurement, E-values, triangulation.

Reverse causation

Ask: Could declining health have caused the exposure?

Fixes: Lag analysis, repeated measures, Mendelian randomization.

Simultaneity

Ask: Could the outcome also cause the exposure?

Fixes: Cross-lagged models, structural equations, instrumental variables.

Introduction and Overview

The earlier sections covered measurement and the structure of the causal model. This section addresses three biases that survive both. Residual confounding remains even after correct DAG specification, because we never measure confounders perfectly. Reverse causation arises when the apparent cause is actually downstream of the apparent effect. Simultaneity is the limit case where two variables cause each other and the very framing of the analysis is wrong. All three are common enough in the published literature that you will encounter examples within the first few weeks of working through any reading list.

Learning Objectives

Define residual confounding and explain why it persists even after “adjusting for” a confounder.
Use the HRT–cardiovascular discordance to illustrate how imprecisely measured socioeconomic confounders can mimic large protective effects.
Identify reverse causation in published associations and propose study designs (longitudinal data, instrumental variables, Mendelian randomisation) that can adjudicate it.
Define simultaneity (mutual causation) and explain why standard regression adjustment cannot resolve it.
Read an observational study with all three biases (residual confounding, reverse causation, simultaneity) on a checklist before believing its causal claim.

Residual Confounding

Residual confounding occurs when adjustment for a confounder is incomplete because the confounder is measured imprecisely, categorized too coarsely, or only partially captured by the available variables. Even when researchers “adjust for” a confounder, residual confounding can persist and bias effect estimates.

Definition: Residual Confounding

The bias that remains after adjustment for a confounder, due to imperfect measurement or incomplete capture of the confounding variable. It is a form of unmeasured confounding within measured variables.

Hormone Replacement Therapy and Cardiovascular Disease

For decades, observational studies consistently showed that postmenopausal women using hormone replacement therapy (HRT) had a 30–50% lower risk of cardiovascular disease (CVD) compared with nonusers. This finding influenced clinical guidelines worldwide.

However, the Women's Health Initiative (WHI) randomized trial found that HRT actually increased cardiovascular risk (Rossouw et al., 2002). What explained the discrepancy?

Subsequent analyses by Humphrey et al. (2002) and Hernán et al. (2008) showed that the observational studies suffered from residual confounding by socioeconomic status, healthy user bias, and smoking. Women who chose HRT were healthier, wealthier, and more health-conscious than those who did not, and adjustment for the measured covariates did not fully remove these differences.

Why Adjustment Fails

Residual confounding arises through several mechanisms:

Measurement error in the confounder: If smoking is measured as ever/never rather than pack-years, considerable confounding by smoking intensity remains unadjusted.
Coarse categorization: Adjusting for income in broad categories ($0–$30K, $30K–$60K, $60K+) leaves within-category confounding by fine-grained socioeconomic differences.
Omitted dimensions: “Socioeconomic status” encompasses education, income, wealth, occupational prestige, and neighborhood context. Adjusting for education alone leaves residual confounding by the other components.

Simulation studies by Fewell et al. (2007) demonstrated that even modest measurement error in a strong confounder can leave substantial residual confounding, sufficient to create or mask associations of the magnitude commonly reported in epidemiological research.
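The mechanisms listed above can be made concrete with a small simulation. This is a minimal sketch under assumed effect sizes, not a reproduction of Fewell et al.: a confounder C drives both the exposure X and the outcome Y, and X has no causal effect on Y at all. The model is then adjusted for the true confounder, for a noisy version, and for a dichotomized ("ever/never") version. All variable names and coefficients are hypothetical.

```r
set.seed(1)
n <- 50000

C <- rnorm(n)               # true confounder (e.g., smoking intensity)
X <- 0.8 * C + rnorm(n)     # exposure depends on C
Y <- 1.0 * C + rnorm(n)     # outcome depends on C only: true effect of X is 0

C_noisy  <- C + rnorm(n, sd = 1)   # confounder measured with classical error
C_binary <- as.numeric(C > 0)      # confounder collapsed to two categories

slope_x <- function(fit) unname(coef(fit)["X"])

res <- c(
  unadjusted    = slope_x(lm(Y ~ X)),
  adjust_true   = slope_x(lm(Y ~ X + C)),
  adjust_noisy  = slope_x(lm(Y ~ X + C_noisy)),
  adjust_binary = slope_x(lm(Y ~ X + C_binary))
)
round(res, 3)
```

Only adjustment for the true confounder should return a slope near zero. The noisy and dichotomized versions reduce the bias relative to the unadjusted model but leave a clearly non-zero "effect" of X, which is what residual confounding looks like in regression output.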

Addressing Residual Confounding

Improve measurement: Use continuous rather than categorical measures of confounders when possible; use validated instruments with known measurement properties.
Sensitivity analyses: Quantitative bias analysis (e.g., E-values) can estimate how strong unmeasured or residual confounding would need to be to explain away an observed association (VanderWeele & Ding, 2017).
Negative control exposures/outcomes: Variables that should not be associated with the outcome (or exposure) can help detect residual confounding (Lipsitch et al., 2010).
Triangulation: Compare results across study designs with different confounding structures (e.g., observational vs. Mendelian randomization).
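As a worked illustration of the sensitivity-analysis item, the E-value for a risk ratio has a closed form in VanderWeele & Ding (2017): E = RR + sqrt(RR × (RR − 1)), applied to 1/RR when the estimate is protective. The helper e_value below is a hypothetical name written for this sketch, and treating a "30–50% lower risk" as risk ratios of 0.7 and 0.5 is a simplifying assumption.

```r
# E-value for a point estimate on the risk-ratio scale
# (formula from VanderWeele & Ding, 2017; function name is ours)
e_value <- function(rr) {
  stopifnot(is.numeric(rr), all(rr > 0))
  rr_star <- ifelse(rr < 1, 1 / rr, rr)   # protective estimates: invert first
  rr_star + sqrt(rr_star * (rr_star - 1))
}

# Hypothetical risk ratios for a 50% and a 30% lower risk
e_value(c(0.5, 0.7))
```

The result is read as the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully explain away the estimate. A small E-value signals that modest residual confounding would suffice.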

Residual confounding is what is left over after we have tried to adjust for the variables we know about. The next bias is what happens when our entire ordering of cause and effect is wrong.

Reverse Causation

Reverse causation occurs when the presumed outcome actually causes (or influences) the presumed exposure, rather than the other way around. This is particularly problematic in cross-sectional and case-control studies where the temporal sequence of events is unclear.

Case: Physical Activity and Chronic Illness

Numerous observational studies report that physical inactivity is associated with increased risk of chronic diseases including cardiovascular disease, diabetes, and cancer. While this association is likely at least partly causal, reverse causation is a major concern: people who are developing chronic illness may reduce their physical activity because of early symptoms, fatigue, or functional limitations.

Ding et al. (2020) examined data from the UK Biobank and found that excluding the first several years of follow-up (to allow for a “lag period”) substantially attenuated the association between physical activity and mortality, consistent with reverse causation. Individuals who died early in follow-up were more likely to have been inactive at baseline because they were already sick.

Strategies to Address Reverse Causation

Lag analyses: Exclude events occurring in the first few years of follow-up to remove individuals whose exposure was influenced by pre-existing disease.

Prospective design with repeated measures: Track changes in exposure over time to determine temporal ordering.

Instrumental variable approaches: Use genetic variants (Mendelian randomization) that influence exposure but are not affected by disease status.
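The logic of a lag analysis can be sketched with simulated data. In this hypothetical setup, undiagnosed illness at baseline both lowers activity and shortens survival, while activity itself has no causal effect on death. Every rate and effect size is an assumption chosen for illustration, and the 3-year lag window is arbitrary.

```r
set.seed(42)
n <- 20000

sick     <- rbinom(n, 1, 0.10)          # undiagnosed illness at baseline
activity <- rnorm(n) - 1 * sick         # illness lowers baseline activity

# Time to death depends on illness only; activity has NO causal effect here
death_time <- rexp(n, rate = ifelse(sick == 1, 0.5, 0.02))
died <- as.numeric(death_time <= 10)    # 10 years of follow-up

# Naive analysis: all participants, all events
full_fit <- glm(died ~ activity, family = binomial)

# Lag analysis: drop participants who died in the first 3 years
keep    <- death_time > 3
lag_fit <- glm(died[keep] ~ activity[keep], family = binomial)

b_est <- c(full = unname(coef(full_fit)[2]), lagged = unname(coef(lag_fit)[2]))
round(b_est, 3)
```

The naive log-odds ratio should be clearly negative even though activity does nothing in this simulation, and the lagged estimate should move toward zero. It need not reach zero, because some ill participants survive past the lag window, which is why attenuation under lagging is a diagnostic rather than a complete fix.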

Reverse causation flips the direction of a single arrow. The third bias goes one step further: it allows arrows in both directions at once.

Simultaneity Bias

Simultaneity bias (also called bidirectional causation) arises when two variables mutually cause each other, making it impossible to identify the causal direction from observational data alone. Standard regression models assume that the predictor causes the outcome, not vice versa; when causation runs in both directions, ordinary regression estimates are biased.

Case: Obesity and Depression

The relationship between obesity and depression has been the subject of hundreds of studies. A meta-analysis by Luppino et al. (2010) found evidence for bidirectional causation:

Obesity at baseline increased the risk of subsequent depression (OR = 1.55, 95% CI: 1.22–1.98)
Depression at baseline increased the risk of subsequent obesity (OR = 1.58, 95% CI: 1.33–1.87)

This bidirectional relationship means that a cross-sectional study finding an association between obesity and depression cannot determine whether obesity causes depression, depression causes obesity, or both. Moreover, standard regression approaches that treat one variable as the “exposure” and the other as the “outcome” will produce biased estimates because each variable is both a cause and consequence of the other.
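Why ordinary regression fails under mutual causation can be shown with a two-equation simulation. The sketch below assumes a linear system X = a·Y + g·Z + u and Y = b·X + v, in which Z is a hypothetical instrument that affects X but not Y directly. The coefficients are illustrative assumptions, not estimates for obesity and depression.

```r
set.seed(7)
n <- 50000

a <- 0.5   # effect of Y on X (assumed)
b <- 0.5   # effect of X on Y (the target of estimation)
g <- 1.0   # effect of the instrument Z on X (assumed)

Z <- rnorm(n); u <- rnorm(n); v <- rnorm(n)

# Solve the simultaneous system X = a*Y + g*Z + u, Y = b*X + v for X
X <- (g * Z + u + a * v) / (1 - a * b)
Y <- b * X + v

ols <- unname(coef(lm(Y ~ X))[2])   # biased: X is correlated with the error v
iv  <- cov(Z, Y) / cov(Z, X)        # ratio (Wald) instrumental-variable estimate

round(c(true = b, ols = ols, iv = iv), 3)
```

Because Y feeds back into X, X carries part of Y's error term, so the OLS slope overstates b here. The ratio estimator recovers b only because Z was constructed to satisfy the instrument assumptions, which real data rarely guarantee.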

Bias Type	Core Problem	Primary Study Design Vulnerability	Key Mitigation Strategy
Residual Confounding	Incomplete adjustment for measured confounders	All observational designs	Better measurement, sensitivity analysis, triangulation
Reverse Causation	Outcome influences exposure rather than vice versa	Cross-sectional, short-follow-up cohorts	Lag analyses, repeated measures, Mendelian randomization
Simultaneity	Two variables mutually cause each other	Cross-sectional, standard regression models	Longitudinal cross-lagged models, instrumental variables

Reflection

Consider the finding that people who eat more fruits and vegetables tend to have lower rates of depression. Describe how residual confounding, reverse causation, and simultaneity could each offer alternative explanations for this association. Which do you think is most plausible, and why?

Model answer: Residual confounding: people who eat more fruits and vegetables also eat less ultra-processed food, exercise more, smoke less, have higher SES, and have richer social networks, each independently protective against depression. Adjustment for self-reported correlates rarely removes the structural confounding entirely. Reverse causation: depressed people lose appetite, energy for food preparation, and motivation to shop, so depression reduces fruit-and-vegetable intake; the observed cross-sectional association can run from outcome to exposure. Simultaneity: mood and dietary choices co-vary day to day, so even prospectively measured intake could be partly a function of subclinical low mood at the time of report. Most plausible: reverse causation, because the time-scale (a depressed week reduces food preparation that same week) makes the effect visible immediately, whereas confounding by SES is real but slower-acting. A defensible answer can argue for any of the three; what matters is being specific about mechanism.

Section Takeaways

Residual confounding persists even after statistical adjustment when confounders are measured imprecisely or categorized too coarsely.
The HRT-CVD discrepancy between observational studies and the WHI trial is a landmark example of residual confounding by healthy user bias.
Reverse causation is especially problematic in studies of physical activity and chronic disease, where declining health may reduce activity.
Simultaneity bias arises when two variables are mutually causal (e.g., obesity and depression) and requires specialized analytical approaches.

Section 4 of 4

Final Assessment

⏱ Estimated time: 30 minutes

Bringing It All Together

This lesson took apart three of the deepest assumptions baked into every observational analysis: that the variables mean what we say they mean, that we have controlled for the right things in the right way (Hernán, 2004), and that the cause comes before the effect. Each section then built up the corresponding repertoire of biases: measurement, causal specification, and the residual problems that survive both. Together they form a checklist you can apply to any study you read for the rest of this course and through later courses.

The deeper move was the one Krieger and Link & Phelan have been making for decades: instruments do not pick themselves. Whether a study can “see” structural racism, neighbourhood disinvestment, occupational exposure, or chronic discrimination depends entirely on which theoretical framework chose its variables in the first place. A perfectly executed analysis of variables drawn from the wrong framework will reliably produce evidence that points away from the actual drivers of population health. That is why the conceptualisation step matters before the measurement step, and the measurement step before the causal-specification step.

A later lesson takes the next layer: who ended up in the study at all. Sampling, selection, and external validity are the biases that arise from which subjects we get to observe, biases that combine with the measurement and causal-specification problems documented here to produce the final, integrated picture of why an observational study’s causal claim might be wrong.

Key Takeaways from this lesson

Construct validity is theory-laden. The choice of which constructs to measure is a theoretical commitment that shapes which interventions ever become thinkable.
Reliability and validity are not the same. A measure can be perfectly reliable and still capture the wrong construct; classical (random) error tends to attenuate associations toward the null.
DAGs are the discipline of causal specification. Confounders should be adjusted for, mediators should not (if the total effect is the target), and colliders create bias when conditioned on.
More adjustment is not always better. The “obesity paradox” and other apparent paradoxes are typically collider-bias artefacts produced by over-adjustment, not new biology.
Residual confounding is the rule, not the exception. The HRT–cardiovascular discordance shows what happens when an imprecisely measured SES gradient is mistaken for a treatment effect.
Reverse causation and simultaneity survive even careful measurement and DAG specification; resolving them requires longitudinal data, instrumental variables, or Mendelian randomisation, not more covariate adjustment.

This lesson took apart the assumptions baked into every observational analysis: that variables mean what we say they mean, that we have controlled for the right things in the right way, and that the cause comes before the effect. Each section then built up the corresponding repertoire of biases: measurement, causal-specification, and the residual problems that survive both. The reflection below asks you to put all three layers to work on a single hypothetical study; the comprehensive assessment that follows tests the conceptual material across the three sections.

R Activity: Attenuation bias from noisy exposure measurement

The companion R script r-activities/HSCI_230_Lesson_7_Conceptualization_Measurement_and_Causal_Specification.R simulates a true linear association between an exposure and an outcome, then re-fits the model after adding classical measurement error to the exposure (e.g., a food-frequency questionnaire). You will see the regression slope shrink toward zero, the textbook signature of attenuation bias, and watch the shrinkage grow as the measurement-error SD increases.

set.seed(230)
n <- 2000

# True exposure X, outcome Y with true slope = 1
X <- rnorm(n, mean = 10, sd = 2)
Y <- 2 + 1*X + rnorm(n, sd = 1)

# Noisy version of X (e.g., FFQ-measured dietary intake)
X_noisy <- X + rnorm(n, sd = 2)

# Compare slopes from "perfect" vs "noisy" exposure
coef(lm(Y ~ X))["X"]             # expect ~1.00 (truth)
coef(lm(Y ~ X_noisy))["X_noisy"] # expect < 1.00 (attenuated)

## -----------------------------------------------------------------------------
## Stretch: how does noise size change the attenuation?
## -----------------------------------------------------------------------------
sds <- c(0, 0.5, 1, 2, 4)
sapply(sds, function(s) {
  Xn <- X + rnorm(n, sd = s)
  unname(coef(lm(Y ~ Xn))[2])
})
# Larger SD of measurement error -> larger attenuation toward zero.

Reflection

You are reviewing a study that reports a statistically significant association between a self-reported behavioral exposure (measured with a Likert scale) and a chronic disease outcome. Drawing on what you learned in this lesson, describe at least three distinct measurement or causal specification issues that could threaten the validity of this finding. For each, explain how the researchers might address it.

Model answer: Three threats to validity for a Likert-measured self-reported behavioural exposure and a chronic outcome: (1) Non-classical measurement error: Likert categories are ordinal, not continuous, so treating them as numeric introduces ceiling/floor effects and rounding; fix by analysing exposure as a factor or with monotone-spline / ordered-categorical models, and validating against a gold-standard instrument in a sub-sample. (2) Recall bias: if the chronic outcome influences how respondents report the exposure (depression, chronic pain), the misclassification is differential and inflates the effect estimate; fix by collecting exposure prospectively in a nested design, anchoring to objective records where possible, and running sensitivity analyses for differential misclassification. (3) Misspecified causal structure: the regression adjusts for a list of "available" covariates without a DAG, risking over-adjustment for mediators (e.g., adjusting for BMI when studying physical activity and CVD) or collider bias (adjusting for variables affected by both exposure and outcome); fix by drawing the DAG before specifying covariates and computing the appropriate adjustment set from it (Pearl / dagitty). All three are addressable with design and analytic choices made at study planning, not after data are in hand.

Final Knowledge Assessment

This 15-question assessment covers all topics from this lesson. You must score 100% to complete the lesson. Review the explanations for any incorrect answers and try again.

Lesson complete!

Congratulations! You have successfully completed this lesson: Conceptualization, Measurement, and Causal Specification.

A later lesson picks up the bias inventory from a different angle. Where this lesson focused on what we measure and how we model causation, that lesson, Sampling, Selection, and External Validity, turns to who ends up in the study at all. Every threat to validity covered there, including volunteer bias, loss to follow-up, and healthy-worker effects, combines with the measurement and specification problems you have just met.

HSCI 230, Lesson 7
