# Lesson 7 — Conceptualization, Measurement & Causal Specification (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5557 words • ~30.1 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 7, Conceptualization, Measurement, and Causal Specification. And I want to flag at the start that this might be the most ambitious lesson in the series to date, because it does something the previous lessons quietly avoided. It interrogates the assumptions baked into every observational analysis we've discussed so far.

**Sarah:** Yeah, let me set the stage. Lessons 3 through 6 surveyed the four observational designs. Cross-sectional, case-control, cohort, and ecological. And every one of those designs is built on the same kind of two-by-two contingency table with the same kind of measure of association.

**Kiffer:** Right. And the unstated assumption running through all of those lessons is that the variables in the table actually mean what we think they mean. That exposure really is exposure. That disease really is disease. That confounders are the right ones in the right relationships. Lesson 7 stops to interrogate that.

**Sarah:** Layer one. Section 1 asks whether our instruments measure the constructs we say they measure, and whose theory of disease determined what got measured at all.

**Kiffer:** Layer two. Section 2 uses directed acyclic graphs, called DAGs for short, to expose the most common causal-specification mistakes. Conditioning on colliders, adjusting for mediators, and the famous paradoxes that both mistakes produce.

**Sarah:** Layer three. Section 3 covers three biases that survive even careful measurement and adjustment. Residual confounding. Reverse causation. And simultaneity bias.

**Kiffer:** Okay. Let's start with Section 1. Construct validity and measurement.

**Sarah:** Define construct validity for me first. Because students hear that phrase a lot and sometimes confuse it with reliability or with external validity.

**Kiffer:** Sure. Construct validity is the degree to which a measurement instrument actually captures the theoretical concept it is intended to measure. A scale designed to measure depression has strong construct validity only if it truly reflects the underlying depressive construct. Not anxiety. Not fatigue. Not cultural distress.

**Sarah:** And the lesson surrounds construct validity with three related concepts. Walk me through each one.

**Kiffer:** Yeah. First, measurement non-invariance. The instrument doesn't measure the same thing in different groups. The same total score can carry different meanings across populations. Second, construct-irrelevant variance. The score is influenced by things that have nothing to do with the construct of interest. The reading-comprehension level required to understand items. The respondent's mood. The way the question was phrased.

**Sarah:** And third, reliability.

**Kiffer:** Reliability is consistency. Whether a measure produces stable results over time, across raters, or across items in the same scale. Reliability and validity are related but distinct. A bathroom scale that always reads five pounds light is highly reliable but not valid. Reliability is necessary for validity, but not sufficient.

**Sarah:** Okay. So before we get into the psychometrics, the lesson goes upstream. It asks where the construct comes from in the first place.

**Kiffer:** Right. Every variable in an epidemiological study is the residue of a theoretical decision. A claim about what causes disease, what counts as a relevant exposure, where the boundary of the cause should be drawn. That decision is upstream of any psychometric work. It determines what the rest of the analysis can possibly see.

**Sarah:** Can you give me a concrete version of what that means? Because it sounds abstract.

**Kiffer:** Sure. Imagine a study of cardiovascular disease that measures cholesterol, blood pressure, and smoking, but not neighborhood disinvestment, occupational exposures, or experiences of discrimination. The resulting evidence base will reliably point clinicians toward statins and behavioral counseling rather than toward housing, labor, or anti-racism policy. The instruments did not pick themselves. A theory of disease causation picked them.

**Sarah:** And the dominant theory in modern medicine is what gets called the biomedical model.

**Kiffer:** Right. The biomedical model locates disease in individual bodies and explains population patterns as the aggregation of individual risk factors. It's powerful for some questions. It gave us germ theory, vaccines, antibiotics. But it systematically under-measures the conditions in which bodies live, work, and age.

**Sarah:** And the lesson walks through four frameworks that have emerged to push back. Let's go through each.

**Kiffer:** First framework. The social determinants of health. The conditions in which people are born, grow, live, work, and age. Income, education, housing, food security, working conditions, social inclusion. The framework was popularized by the World Health Organization Commission on Social Determinants of Health, with a 2010 report led by Orielle Solar and Alec Irwin.

**Sarah:** And the empirical foundation for the social-determinants framework comes from earlier work. The Whitehall Studies.

**Kiffer:** Yeah. Michael Marmot is a British epidemiologist at University College London. Starting in the late 1960s, he and his collaborators followed British civil servants in London. The Whitehall Studies followed tens of thousands of office workers. Mortality risk varied by employment grade in a clear gradient. A graded effect even at the top of the hierarchy.

**Sarah:** And that gradient persisted even after adjusting for behavioral risk factors like smoking and diet. Which is what made the work so influential.

**Kiffer:** Exactly. If the gradient persists after adjusting for behavior, then there must be something about social position itself, not just lifestyle, that affects health. That insight launched a whole research program.

**Sarah:** And the measurement implication of accepting the social-determinants framework is that priorities shift. A study of asthma incidence shouldn't just measure inhaler adherence. It should measure mold exposure, landlord responsiveness, and traffic proximity.

**Kiffer:** Right. Second framework. Fundamental causes. Bruce Link and Jo Phelan in 1995.

**Sarah:** What's the core argument?

**Kiffer:** Socioeconomic status is a fundamental cause of disease. Four parts. First, it influences multiple disease outcomes. Second, it operates through multiple risk-factor mechanisms. Third, it involves access to flexible resources, knowledge, money, social connections, that can be deployed to avoid risks. Fourth, it reproduces health inequalities even as the specific intervening mechanisms change over time.

**Sarah:** And the empirical support for this is striking. Educational gradients in mortality have persisted across centuries even as the leading causes of death have shifted from infectious disease to chronic disease. The mechanisms changed. The gradient did not.

**Kiffer:** And there's a really important methodological implication. Adjusting for downstream behavioral mediators like smoking and diet doesn't explain away the socioeconomic-mortality association, because flexible resources just find new pathways. Treating socioeconomic status purely as a confounder to be statistically controlled is a theoretical commitment. And arguably a mistaken one.

**Sarah:** Third framework. Ecosocial theory. Nancy Krieger.

**Kiffer:** Yeah. Nancy Krieger is a social epidemiologist at the Harvard T. H. Chan School of Public Health. She articulated ecosocial theory in 1994 and 2001. The framing question is, how do we, quote, literally embody, biologically, the societal and ecological context into which we are born, end quote.

**Sarah:** Walk me through what embodiment means concretely. Because it sounds metaphorical, but Krieger means it physically.

**Kiffer:** She means it physically. Chronic exposure to discrimination, poverty, environmental hazards, and labor stress is literally inscribed in cortisol patterns, telomere length, allostatic load, and epigenetic marks. The social conditions get under the skin. Ecosocial theory pushes researchers toward measuring exposures across the lifecourse, at multiple spatial scales.

**Sarah:** Fourth framework. Health equity. Paula Braveman.

**Kiffer:** Paula Braveman is at the University of California, San Francisco. In a 2014 paper she defines health equity as the absence of unfair and avoidable differences in health among population groups defined socially, economically, demographically, or geographically. Equity isn't the same as equality of average health. It's a claim about which differences are unjust.

**Sarah:** And the operational point. Measuring along the axes where injustice is suspected to operate. Not just race and income. Indigeneity, immigration status, gender identity, disability, and their intersections.

**Kiffer:** And here's the sharpest line in the lesson. A study that reports only an overall mean has, by omission, taken a position on which differences are worth noticing. Choosing not to disaggregate is itself a theoretical choice.

**Sarah:** Okay, the lesson illustrates all of this with a worked example I find really useful. Two research teams each studying type 2 diabetes incidence in the same population.

**Kiffer:** Right. Team A works from a biomedical frame. They measure body mass index, fasting glucose, hemoglobin A1c, dietary intake from a food-frequency questionnaire, self-reported physical activity, and family history of diabetes. Their conclusion is that incidence is driven by individual lifestyle and genetic risk. Their intervention recommendation is diet and exercise counseling.

**Sarah:** Team B works from an ecosocial frame. They measure all the same biomarkers, plus neighborhood food environment, shift-work history, lifetime experiences of racial discrimination, household income trajectory since childhood, and exposure to a major recession. Their conclusion is that behavioral risk factors mediate roughly half of the social gradient. The other half reflects chronic stress and structural disinvestment.

**Kiffer:** Both teams are doing valid epidemiology in the construct-validity sense. The measurements are reliable. The analyses are sound. They reach different conclusions because they measured different things. And they measured different things because they began from different theories about what causes disease.

**Sarah:** Which is the framework that runs through the rest of Section 1. Measurement is theory-laden. The instrument inherits the theory of whoever built it.

**Kiffer:** And the lesson uses two case studies to drive the point home with widely-used measurement tools. Case study one. The Center for Epidemiologic Studies Depression scale, called the CES-D for short.

**Sarah:** What is the CES-D and where did it come from?

**Kiffer:** The CES-D is a twenty-item self-report scale developed by Lenore Radloff at the United States National Institute of Mental Health in 1977. It's one of the most widely used screening instruments for depressive symptoms. The total score ranges from zero to sixty, with a clinical cutoff of sixteen.

**Sarah:** And there's a problem with how it travels across cultural and racial groups.

**Kiffer:** Right. Iwata and colleagues in 2002 found that Japanese respondents endorsed somatic items, like, my sleep was restless, at higher rates than American respondents with equivalent levels of underlying depression. American respondents endorsed affective items, like, I felt sad, at higher rates.

**Sarah:** And Kim and colleagues in 2011 demonstrated that multiple CES-D items function differently across non-Hispanic White, African American, and Hispanic adults in the United States. Items related to interpersonal difficulties showed differential item functioning by race and ethnicity.

**Kiffer:** And let me define that term explicitly. Differential item functioning, abbreviated as DIF, is when an item operates differently for different groups even after accounting for the underlying construct. It means the same total score doesn't carry the same meaning across groups.

**Sarah:** So a CES-D score of sixteen, the clinical cutoff, doesn't carry the same meaning across these groups.

**Kiffer:** Exactly. Case study two. Self-rated health. This one is fascinating because it shows the same problem in a tool that has only one question.

**Sarah:** Walk me through it.

**Kiffer:** Self-rated health is typically measured as a single question. How would you rate your overall health? Excellent, very good, good, fair, or poor. And against all psychometric expectations, that one question is one of the strongest predictors of mortality in epidemiological research. Sometimes more predictive than objective biomarkers.

**Sarah:** But the responses are shaped by comparison groups, expectations, and cultural frameworks.

**Kiffer:** Right. Amartya Sen, the Indian economist and Nobel laureate, wrote a piece in the British Medical Journal in 2002 titled Health, Perception versus Observation. He documented that lower-socioeconomic-status individuals report better self-rated health relative to their objective health than higher-socioeconomic-status individuals.

**Sarah:** Why? What's the mechanism?

**Kiffer:** If your reference group is sicker, you rate yourself as healthier by comparison. People in resource-deprived contexts use a different comparison standard. So self-rated health may underestimate health inequalities across socioeconomic strata, exactly because the people most affected by inequality are recalibrating against a sicker peer group.

**Sarah:** And the other findings.

**Kiffer:** Franks and colleagues in 2003 in the Archives of Internal Medicine found that the predictive validity of self-rated health for mortality varies across racial and ethnic groups. And Marja Jylhä and colleagues in 1998 documented that cultural differences in response styles, like modesty norms, affect self-rated health reporting. Cross-national comparisons require careful calibration.

**Sarah:** Then the lesson moves to another foundational measurement issue. Ordinal versus interval scales.

**Kiffer:** And this one trips up an enormous amount of published research. Most epidemiological surveys use Likert-type scales. Strongly agree, agree, neutral, disagree, strongly disagree. Researchers routinely assign the numeric values one through five and analyze those values in linear regression as if the gaps between categories were equal.

**Sarah:** And the question the lesson asks is, is the distance between strongly agree and agree really the same as the distance between neutral and disagree?

**Kiffer:** Almost certainly not. Torrin Liddell and John Kruschke at Indiana University ran simulations in 2018 in the Journal of Experimental Social Psychology to quantify what goes wrong when you treat ordinal Likert data as interval-level.

**Sarah:** What did they find?

**Kiffer:** Three things. Inflated Type I error rates, meaning you reject true null hypotheses too often. Biased parameter estimates. And in the most extreme cases, reversals of effect direction. Where the simulated true effect was positive but the misspecified model returned a negative one.

**Sarah:** When is the bias worst?

**Kiffer:** Three conditions. Skewed response distributions, with respondents clustering at one end. Unequal spacing between categories in the latent construct. And tests of interactions between variables. The recommendation is ordinal logistic regression or Bayesian ordinal models.
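
To make the reversal concrete, here is a minimal sketch, assuming an ordered-probit style setup in which a normally distributed latent trait is cut into five categories at fixed thresholds. The thresholds, latent means, and standard deviations are invented for illustration and are not taken from Liddell and Kruschke's paper.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def likert_mean(latent_mean: float, latent_sd: float, thresholds: list[float]) -> float:
    """Expected value of the integer codes 1..K when a normal latent trait
    is cut into K ordered categories at the given thresholds."""
    cumulative = [normal_cdf((t - latent_mean) / latent_sd) for t in thresholds]
    edges = [0.0] + cumulative + [1.0]
    # Probability of each category is the gap between successive cumulative values.
    probs = [upper - lower for lower, upper in zip(edges, edges[1:])]
    return sum(code * p for code, p in enumerate(probs, start=1))

# Hypothetical thresholds separating the five response categories.
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]

# Group B has the HIGHER latent mean but a much wider spread of responses.
mean_a = likert_mean(latent_mean=1.0, latent_sd=1.0, thresholds=THRESHOLDS)
mean_b = likert_mean(latent_mean=1.3, latent_sd=3.0, thresholds=THRESHOLDS)

print(f"Mean of 1-5 codes, group A: {mean_a:.2f}")
print(f"Mean of 1-5 codes, group B: {mean_b:.2f}")
```

Under these assumed values the mean of the one-to-five codes comes out higher for group A (about 3.9) than for group B (about 3.6), even though group B has the higher latent mean. Treating the codes as interval data would get the direction wrong.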

**Sarah:** And then the last measurement topic in this section. Reliability and attenuation bias.

**Kiffer:** Right. Even when a measure is conceptually valid, poor reliability introduces random measurement error that systematically attenuates observed associations. And the technical term for this is regression dilution bias.

**Sarah:** Walk through what attenuation actually means.

**Kiffer:** Yeah. Imagine the true relative risk for dietary fat intake and breast cancer is one point five. But if your dietary measurement, typically a food-frequency questionnaire with reliability around zero point four to zero point six, contains a lot of random error, the observed relative risk might be only one point one or one point two. The estimate gets pulled toward the null.

**Sarah:** And there's a shorthand for this in the methodological literature. Lambda.

**Kiffer:** Right. The regression dilution ratio, called lambda, equals the variance between persons divided by the total variance, which includes measurement-error variance within persons. As lambda gets smaller, attenuation gets worse. If lambda is one, measurement is perfect. If lambda is zero point five, your observed regression coefficient is roughly half the true one. For a relative risk, that halving applies on the log scale, so a true relative risk of one point five would show up as roughly one point two.
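
The arithmetic behind lambda can be sketched in a few lines. This assumes the classical measurement-error model (random error, independent of the true exposure) and applies the attenuation to the log relative risk. The function names are illustrative, and the simple division used for the correction is a stand-in for full regression calibration, not a replacement for it.

```python
import math

def regression_dilution_ratio(between_person_var: float, within_person_var: float) -> float:
    """Lambda: between-person variance as a share of total variance."""
    return between_person_var / (between_person_var + within_person_var)

def attenuated_relative_risk(true_rr: float, lam: float) -> float:
    """Observed relative risk when the log relative risk is shrunk by lambda."""
    return math.exp(lam * math.log(true_rr))

def corrected_relative_risk(observed_rr: float, lam: float) -> float:
    """Undo the attenuation by dividing the log relative risk by lambda."""
    return math.exp(math.log(observed_rr) / lam)

# The dietary fat example: true relative risk 1.5, reliability between 0.4 and 0.6.
for lam in (0.4, 0.5, 0.6, 1.0):
    observed = attenuated_relative_risk(1.5, lam)
    print(f"lambda={lam:.1f}  observed RR={observed:.2f}")
```

With a true relative risk of 1.5, the observed values for lambda between 0.4 and 0.6 fall a little above or below 1.2, which matches the range described in the dietary fat example.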

**Sarah:** And this has had real consequences in nutritional epidemiology. Decades of null findings in diet-disease research may not reflect the absence of effects. They may reflect measurement limitations.

**Kiffer:** Yeah. Kipnis and colleagues in 2003 made that case explicitly. And the correction methods, regression calibration and measurement-error models, require validation substudies with more precise measurements like biomarkers or doubly labeled water to estimate the degree of error and correct the observed associations.

**Sarah:** Okay. Section 1 takeaway in one sentence.

**Kiffer:** Every measured variable encodes a theoretical commitment, and even when those commitments are made well, the instrument can still mean different things in different groups, and any random error in the measurement bends your estimates toward the null.

**Sarah:** Section 2. Causal specification. And this section assumes you've solved the measurement problem. Now the question is whether you've put the variables in the right relationship to one another.

**Kiffer:** Right. And the foundational paper for using DAGs in epidemiology is Greenland, Pearl, and Robins in 1999 in the journal Epidemiology. Sander Greenland is at the University of California, Los Angeles. Judea Pearl, also at U.C.L.A., won the Turing Award in 2011 for formalizing causal reasoning. James Robins is at the Harvard T. H. Chan School of Public Health. Their 1999 paper, Causal Diagrams for Epidemiologic Research, brought DAGs into mainstream epidemiology.

**Sarah:** Define a directed acyclic graph for me.

**Kiffer:** A directed acyclic graph is a visual representation of causal assumptions. Each node is a variable. Each arrow is a hypothesized direct causal effect. Directed means arrows go from cause to effect. Acyclic means no loops. No variable causes itself, even by going around through other variables.
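
One way to hold that definition in code is a plain dictionary of arrows plus a check that no cycle exists. This is a minimal sketch. The node names are placeholders, and the check is Kahn's algorithm written out by hand, not a call into any DAG library.

```python
from collections import deque

# Each key is a cause; each value lists the variables it points to.
# Node names are hypothetical placeholders.
EXAMPLE_GRAPH = {
    "confounder": ["exposure", "outcome"],
    "exposure": ["mediator"],
    "mediator": ["outcome"],
    "outcome": [],
}

def is_acyclic(graph: dict[str, list[str]]) -> bool:
    """Return True when the directed graph has no cycles (Kahn's algorithm)."""
    nodes = set(graph) | {child for children in graph.values() for child in children}
    incoming = {node: 0 for node in nodes}
    for children in graph.values():
        for child in children:
            incoming[child] += 1

    # Start from variables with no arrows pointing into them.
    queue = deque(node for node, count in incoming.items() if count == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in graph.get(node, []):
            incoming[child] -= 1
            if incoming[child] == 0:
                queue.append(child)

    # If a cycle exists, the nodes on it never reach zero incoming arrows.
    return removed == len(nodes)

print(is_acyclic(EXAMPLE_GRAPH))
print(is_acyclic({"a": ["b"], "b": ["a"]}))
```

The first graph passes the check. The second, where two variables cause each other, fails it and so is not a valid DAG.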

**Sarah:** And there are three canonical structures every DAG is built from. Confounder. Mediator. Collider.

**Kiffer:** Let's go through each one with the example the lesson uses. First, the confounder.

**Sarah:** A confounder is a variable that causes both the exposure and the outcome. The classic example is the coffee, smoking, and lung cancer triangle. In the lesson's diagram, smoking points to coffee drinking, reflecting the historical pattern that smokers were also more likely to be coffee drinkers. Smoking causes lung cancer. Coffee on its own does not cause lung cancer.

**Kiffer:** And a naive analysis of coffee and lung cancer that does not adjust for smoking will pick up the smoking effect, making coffee look like a risk factor when it isn't. Adjusting for smoking, by stratification or regression or matching, removes the bias and the spurious association disappears.
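
The coffee example can be worked through with exact probabilities, with no simulation needed. Every number below is invented for illustration. The only structural assumption is the one in the lesson's diagram: smoking affects both coffee drinking and lung cancer, and coffee has no effect of its own.

```python
# Hypothetical probabilities, chosen only to make the arithmetic visible.
P_SMOKER = 0.30
P_COFFEE_GIVEN_SMOKER = {True: 0.80, False: 0.40}
P_CANCER_GIVEN_SMOKER = {True: 0.10, False: 0.01}

def cancer_risk(coffee: bool, smoker: bool) -> float:
    """Risk within a smoking stratum. Coffee is ignored by construction."""
    return P_CANCER_GIVEN_SMOKER[smoker]

def crude_cancer_risk(coffee: bool) -> float:
    """Risk among coffee drinkers (or non-drinkers), ignoring smoking."""
    weights = {}
    for smoker in (True, False):
        p_smoker = P_SMOKER if smoker else 1 - P_SMOKER
        p_coffee = P_COFFEE_GIVEN_SMOKER[smoker]
        weights[smoker] = p_smoker * (p_coffee if coffee else 1 - p_coffee)
    total = sum(weights.values())
    return sum(weights[s] / total * cancer_risk(coffee, s) for s in (True, False))

crude_rr = crude_cancer_risk(True) / crude_cancer_risk(False)
stratified_rr = {
    smoker: cancer_risk(True, smoker) / cancer_risk(False, smoker)
    for smoker in (True, False)
}

print(f"Crude risk ratio, coffee vs no coffee: {crude_rr:.2f}")
print(f"Risk ratio among smokers: {stratified_rr[True]:.2f}")
print(f"Risk ratio among non-smokers: {stratified_rr[False]:.2f}")
```

With these assumed inputs the crude risk ratio is about 2.4, while the ratio inside each smoking stratum is exactly 1. Stratifying on the confounder removes the spurious association.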

**Sarah:** Second structure. Mediator.

**Kiffer:** A mediator is a variable on the causal pathway between exposure and outcome. The example is smoking, inflammation, and cancer. Smoking causes inflammation. Inflammation causes cancer. Inflammation is the mediator. It is part of the causal mechanism by which smoking produces cancer.

**Sarah:** And the rule for mediators is the opposite of the rule for confounders. Adjusting for a mediator blocks part of the causal effect. So if you want the total effect of smoking on cancer, leave inflammation alone. If you adjust for inflammation, you've estimated only the direct effect.

**Kiffer:** Third structure. Collider. And this one is the most counterintuitive.

**Sarah:** A collider is a variable caused by both the exposure and the outcome, or by variables associated with each. The lesson uses a Hollywood example.

**Kiffer:** Right. Talent and attractiveness. Imagine these are two truly independent traits. Among the general population, no relationship between them. Now both traits independently increase the chance of becoming a famous Hollywood actor. Talent points to fame. Attractiveness points to fame. So fame is a collider. Two arrows colliding into it from the two independent traits.

**Sarah:** And here's the spooky thing. If we restrict our analysis to famous people, that is, condition on the collider, we create a spurious negative association between talent and attractiveness.

**Kiffer:** Walk through why.

**Sarah:** Among famous people, if you're not particularly attractive, the most likely explanation for your fame is talent. And if you're not particularly talented, the explanation is attractiveness. So among the famous, those two traits look negatively correlated. But that's an artifact of the conditioning. The traits are independent in the underlying population.
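
A short simulation shows the same thing numerically. This is a sketch under stated assumptions: talent and attractiveness are drawn as independent standard normal scores, and "fame" is an invented rule (the two scores sum to more than 2). The seed and cutoff are arbitrary.

```python
import random

def pearson(xs, ys) -> float:
    """Pearson correlation coefficient, written out with the standard library."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(7)  # fixed seed so the run is repeatable

# Independent traits in the general population: (talent, attractiveness).
people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20_000)]

# Hypothetical fame rule: either trait, or a mix of both, can get you there.
famous = [(talent, looks) for talent, looks in people if talent + looks > 2.0]

r_population = pearson(*zip(*people))
r_famous = pearson(*zip(*famous))

print(f"Correlation in the population: {r_population:+.2f}")
print(f"Correlation among the famous:  {r_famous:+.2f}")
```

The population correlation should sit near zero, while the correlation among the famous should come out clearly negative. Nothing about the traits changed. Only the selection did.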

**Kiffer:** And Felix Elwert and Christopher Winship, sociologists at Wisconsin and Harvard, formalized this in a 2014 review in the Annual Review of Sociology titled Endogenous Selection Bias. The takeaway is, never adjust for a collider. Doing so opens a biasing path that wasn't there before.

**Sarah:** Okay. Now the lesson uses two famous case studies to make these issues concrete with real published research. The obesity paradox and the birth weight paradox.

**Kiffer:** First, the obesity paradox in cardiovascular disease.

**Sarah:** Set up the paradox.

**Kiffer:** Yeah. Multiple observational studies have reported that among patients with chronic diseases like heart failure, chronic kidney disease, and type 2 diabetes, overweight and obese patients appear to have better survival than normal-weight patients. The finding has been called the obesity paradox and has generated debate, with some commentators suggesting adipose tissue might be metabolically protective in chronic illness.

**Sarah:** And Hailey Banack and Jay Kaufman at McGill University demonstrated in a 2014 paper in Preventive Medicine that this paradox is largely an artifact of collider stratification bias.

**Kiffer:** Right. Walk through the structure. Obesity causes chronic disease. So do other risk factors like smoking, frailty, and genetic susceptibility. Chronic-disease status is a collider, with obesity and other risk factors both pointing into it. Obesity also points to mortality. The other risk factors also point to mortality.

**Sarah:** When researchers restrict their analysis to patients who already have the chronic disease, they're conditioning on the collider. And what falls out of that?

**Kiffer:** Among the chronically ill, normal-weight patients are more likely to have the disease due to other severe risk factors like smoking or frailty. So the normal-weight patients in the sample are a sicker subset overall. Conditioning on chronic-disease status induces a spurious protective association between obesity and mortality. The data are real. The analysis is technically correct. The mistake is the design choice.
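
The structure just described can be checked with exact arithmetic. All probabilities below are invented. Obesity raises mortality in every stratum by construction, and the single "other risk factor" stands in for things like smoking or frailty. For simplicity the sketch lets mortality depend on the two risk factors directly and not on disease status itself.

```python
# Hypothetical prevalence of the other severe risk factor, independent of obesity.
P_OTHER = 0.20

# P(chronic disease | obese, other risk factor), keyed by (obese, other).
P_DISEASE = {
    (False, False): 0.05, (True, False): 0.20,
    (False, True): 0.40, (True, True): 0.50,
}

# P(death | obese, other risk factor). Obesity is harmful in both rows.
P_DEATH = {
    (False, False): 0.05, (True, False): 0.07,
    (False, True): 0.30, (True, True): 0.35,
}

def death_risk(obese: bool, restrict_to_diseased: bool) -> float:
    """Mortality risk by obesity status, optionally among the diseased only."""
    weights = {}
    for other in (False, True):
        weight = P_OTHER if other else 1 - P_OTHER
        if restrict_to_diseased:
            # Conditioning on the collider reweights the other risk factor.
            weight *= P_DISEASE[(obese, other)]
        weights[other] = weight
    total = sum(weights.values())
    return sum(weights[o] / total * P_DEATH[(obese, o)] for o in (False, True))

rr_everyone = death_risk(True, False) / death_risk(False, False)
rr_diseased = death_risk(True, True) / death_risk(False, True)

print(f"Risk ratio for obesity, whole population: {rr_everyone:.2f}")
print(f"Risk ratio for obesity, diseased only:    {rr_diseased:.2f}")
```

With these assumed inputs the risk ratio is about 1.26 in the whole population and about 0.82 among the diseased. Restricting to patients flips a harmful exposure into an apparently protective one.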

**Sarah:** Second case study. The birth weight paradox.

**Kiffer:** Maternal smoking during pregnancy is a well-established cause of both low birth weight and infant mortality. Both. But several studies that adjusted for birth weight reported the puzzling finding that maternal smoking appeared to have a reduced or null association with infant mortality among low-birth-weight infants. Smoking somehow looked protective for the smallest babies.

**Sarah:** And Sonia Hernandez-Diaz, Enrique Schisterman, and Miguel Hernan published a paper in 2006 in the American Journal of Epidemiology titled The Birth Weight Paradox Uncovered, which used DAGs to explain what was going on.

**Kiffer:** Their argument has two parts. Part one. Birth weight is a mediator on the path from smoking to infant mortality. Smoking causes low birth weight. Low birth weight causes infant mortality. Adjusting for birth weight blocks that pathway, removing the indirect effect of smoking that operates through birth weight.

**Sarah:** And part two. Birth weight is also a collider.

**Kiffer:** Right. Birth defects also cause low birth weight. So birth weight is a collider on a path between smoking and birth defects. Conditioning on birth weight induces a spurious negative association between smoking and birth defects. Among low-birth-weight infants, if you didn't get there by way of smoking, you're more likely to have a birth defect. Which makes smoking look protective in that stratum.

**Sarah:** So adjusting for birth weight does two harmful things at once. It blocks part of the legitimate effect through the mediator. And it conditions on a collider. Both make the smoking-mortality association look weaker than it actually is.

**Kiffer:** And I want to underline the takeaway from these two case studies. The canonical mistake in observational analysis is over-adjustment. Conditioning on variables that look like helpful controls but are actually colliders or mediators. Reviewers often ask researchers to adjust for more variables. But more is not always better. Schisterman, Cole, and Platt in 2009 in Epidemiology coined the term overadjustment bias for this pattern.

**Sarah:** Section 2 takeaway in one sentence.

**Kiffer:** DAGs help you see what to adjust for and what not to. Confounders, yes. Mediators, no, if you want the total effect. Colliders, definitely no. The famous paradoxes in the literature are usually misspecified causal models hiding in plain sight.

**Sarah:** Section 3. Three biases that survive correct measurement and correct DAG specification.

**Kiffer:** First. Residual confounding.

**Sarah:** Define it carefully. Because students sometimes confuse this with unmeasured confounding.

**Kiffer:** Yeah. Residual confounding is the bias that remains after adjustment for a measured confounder, due to imperfect measurement or incomplete capture of the confounding variable. It's a form of unmeasured confounding hiding inside a measured variable. You think you've adjusted for socioeconomic status because you have an income variable, but the variable is binned coarsely and the real confounding lives in finer-grained differences.

**Sarah:** And the textbook example is hormone replacement therapy and cardiovascular disease.

**Kiffer:** Right. For decades, observational studies showed that postmenopausal women using hormone replacement therapy had a 30 to 50 percent lower risk of cardiovascular disease than non-users. The Nurses' Health Study, which started in 1976 and followed roughly one hundred thousand American nurses, was a major source. By the late 1990s, hormone replacement therapy was one of the most-prescribed drug classes in the United States.

**Sarah:** Then in 2002, the Women's Health Initiative reported its results. And let me set up what the Women's Health Initiative is, because this is one of the landmark trials in modern epidemiology.

**Kiffer:** Please.

**Sarah:** The Women's Health Initiative was launched by the United States National Institutes of Health in 1991. The hormone-therapy arm randomized over sixteen thousand postmenopausal women to combined estrogen plus progestin or placebo. Unlike the observational studies, the assignment was random. Confounding was handled by design.

**Kiffer:** And in July 2002, the principal results paper by Jacques Rossouw and colleagues was published in the Journal of the American Medical Association. The trial had been stopped early. Hormone replacement therapy increased cardiovascular disease, stroke, and breast cancer. The opposite of what the observational studies had shown.

**Sarah:** The discrepancy was massive. And subsequent analyses tried to figure out why.

**Kiffer:** Linda Humphrey and colleagues in 2002 in the Annals of Internal Medicine pointed at residual confounding. Miguel Hernan and colleagues in 2008 in Epidemiology, in a paper titled Observational Studies Analyzed Like Randomized Experiments, showed that when you analyze the observational data with the same restrictions used in the randomized trial, the apparent protective effect largely disappears.

**Sarah:** And the residual confounding was driven by what the literature calls healthy user bias.

**Kiffer:** Define that for the listener.

**Sarah:** Healthy user bias is the pattern where people who choose to take preventive medications are systematically different from those who don't. They tend to be wealthier, more educated, more health-conscious, less likely to smoke, more likely to exercise, more likely to follow medical advice. All of those characteristics also affect cardiovascular risk independent of the medication.

**Kiffer:** Right. And the observational studies tried to adjust for measured covariates. Income, education, smoking. But the adjustment was incomplete. The remaining unmeasured or imperfectly-measured differences were enough to create a 30 to 50 percent apparent benefit that didn't actually exist.

**Sarah:** The lesson lists three mechanisms for residual confounding. Walk through them.

**Kiffer:** First, measurement error in the confounder. Smoking measured as ever versus never instead of pack-years leaves a lot of confounding by intensity and duration unadjusted. A one-pack-per-day smoker for thirty years gets coded the same way as someone who smoked briefly in college.

**Sarah:** Second, coarse categorization. Adjusting for income in three big brackets leaves within-category confounding from finer differences. The person making sixty-five thousand and the person making three hundred thousand are in the same bracket, but their lives differ in ways that affect health.

**Kiffer:** Third, omitted dimensions. Socioeconomic status encompasses education, income, wealth, occupational prestige, neighborhood context, social capital. Adjusting for education alone leaves residual confounding by all the other components.

**Sarah:** Zoe Fewell, George Davey Smith, and Jonathan Sterne ran a simulation in 2007 in the American Journal of Epidemiology. They showed that even modest measurement error in a strong confounder can leave substantial residual confounding, sufficient to create or mask associations of the magnitude commonly reported.
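
A toy simulation in the same spirit, though not a reproduction of the Fewell paper, shows how coarse categorization leaves confounding behind. The assumptions are all invented: a standard normal confounder drives both exposure and outcome, the exposure has no true effect, and "coarse adjustment" means splitting the confounder at zero.

```python
import random

def slope(xs, ys) -> float:
    """Ordinary least squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def residuals(ys, xs):
    """Residuals of ys after a simple linear regression on xs."""
    b = slope(xs, ys)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return [y - mean_y - b * (x - mean_x) for x, y in zip(xs, ys)]

def stratified_slope(xs, ys, strata) -> float:
    """Pool within-stratum slopes, weighting each stratum by its sample share."""
    pooled = 0.0
    for level in set(strata):
        idx = [i for i, s in enumerate(strata) if s == level]
        pooled += len(idx) / len(xs) * slope([xs[i] for i in idx], [ys[i] for i in idx])
    return pooled

rng = random.Random(11)
n = 20_000
confounder = [rng.gauss(0, 1) for _ in range(n)]
exposure = [c + rng.gauss(0, 1) for c in confounder]
outcome = [c + rng.gauss(0, 1) for c in confounder]  # exposure has NO effect

crude = slope(exposure, outcome)
coarse = stratified_slope(exposure, outcome, [c > 0 for c in confounder])
# Adjusting for the continuous confounder: regress residuals on residuals.
fine = slope(residuals(exposure, confounder), residuals(outcome, confounder))

print(f"Unadjusted slope:               {crude:.2f}")
print(f"Adjusted for two-level version: {coarse:.2f}")
print(f"Adjusted for continuous value:  {fine:.2f}")
```

In this setup the unadjusted slope should land near 0.5, the two-category adjustment only cuts it to roughly half that, and adjusting for the continuous confounder brings it close to zero, which is the truth here.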

**Kiffer:** And the mitigations the lesson lists. Improve measurement. Use continuous rather than categorical confounders when possible. Use validated instruments. Conduct sensitivity analyses. The most popular sensitivity tool right now is the E-value, introduced by Tyler VanderWeele and Peng Ding in a 2017 paper in the Annals of Internal Medicine.

**Sarah:** What does an E-value tell you?

**Kiffer:** An E-value tells you how strong unmeasured or residual confounding would have to be to fully explain away the observed association. A large E-value means very strong unmeasured confounding would be needed, so the result is harder to explain away. A small E-value means even weak unmeasured confounding could account for it.
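
For a risk ratio, the E-value has a closed form: the risk ratio plus the square root of the risk ratio times the risk ratio minus one, with protective estimates inverted first. A minimal sketch follows. The function name is illustrative, and other effect measures (odds ratios for common outcomes, hazard ratios, mean differences) need a conversion step that is not shown here.

```python
import math

def e_value(risk_ratio: float) -> float:
    """E-value for a point estimate on the risk-ratio scale."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective estimates are inverted so the formula always sees RR >= 1.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

# Illustrative inputs, including a protective estimate similar in size to
# the 30 to 50 percent lower risk reported in the hormone therapy studies.
for rr in (1.0, 1.2, 2.0, 0.6):
    print(f"RR={rr:.1f}  E-value={e_value(rr):.2f}")
```

A risk ratio of 0.6 gives an E-value of about 2.7. Read at face value, an unmeasured confounder would need risk-ratio associations of about 2.7 with both treatment and outcome to fully account for that estimate.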

**Sarah:** And another tool. Negative controls.

**Kiffer:** Right. Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen in 2010 in Epidemiology proposed using negative control exposures or outcomes. Find a variable that should not be causally related to the outcome but shares the same confounding structure. If your analysis shows an effect on the negative control, that is a warning sign that residual confounding, or some other shared bias, is at work in the main analysis too.

**Sarah:** Second bias. Reverse causation.

**Kiffer:** Reverse causation occurs when the presumed outcome actually causes the presumed exposure rather than the other way around. Particularly problematic in cross-sectional and case-control studies, where the temporal sequence of events is unclear.

**Sarah:** The textbook case in this lesson is physical activity and chronic illness.

**Kiffer:** Right. Numerous observational studies report that physical inactivity is associated with increased risk of cardiovascular disease, diabetes, and cancer. While the association is likely at least partly causal, reverse causation is a major concern. People who are developing chronic illness may reduce their physical activity because of early symptoms, fatigue, or functional limitations.

**Sarah:** And Ding and colleagues in 2020 used data from the United Kingdom Biobank, a large prospective cohort of about half a million British adults, to address this empirically. They used what's called a lag analysis.

**Kiffer:** Define that.

**Sarah:** A lag analysis excludes events that occur in the first several years of follow-up. The logic is, if reverse causation is operating, the people who die or develop disease early are likely the ones who were already sick at baseline. So excluding them removes those for whom reverse causation is most likely.

**Kiffer:** And what Ding and colleagues found was that excluding the first several years of follow-up substantially attenuated the association between physical activity and mortality. Consistent with reverse causation. Individuals who died early in follow-up were more likely to have been inactive at baseline because they were already sick at baseline.
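
The mechanics of a lag analysis can be sketched in a few lines. The cohort below is fabricated toy data, not UK Biobank results, and the `Participant` class and `mortality_rate_ratio` function are hypothetical names. The sketch starts the follow-up clock at the end of the lag window and drops anyone whose follow-up ended inside it.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    inactive: bool        # exposure at baseline
    years_followed: float  # time until death or end of follow-up
    died: bool


def mortality_rate_ratio(cohort, lag_years=0.0):
    """Death-rate ratio, inactive vs. active, with follow-up starting at lag_years."""
    deaths = {True: 0, False: 0}
    person_years = {True: 0.0, False: 0.0}
    for p in cohort:
        if p.years_followed <= lag_years:
            continue  # died or left follow-up inside the lag window: excluded
        deaths[p.inactive] += int(p.died)
        person_years[p.inactive] += p.years_followed - lag_years
    rate_inactive = deaths[True] / person_years[True]
    rate_active = deaths[False] / person_years[False]
    return rate_inactive / rate_active


# Toy cohort (made up): three inactive people die in year 1, as if already ill.
cohort = (
    [Participant(True, 1.0, True)] * 3
    + [Participant(True, 8.0, True)] * 2
    + [Participant(True, 10.0, False)] * 5
    + [Participant(False, 8.0, True)] * 2
    + [Participant(False, 10.0, False)] * 8
)

unlagged = mortality_rate_ratio(cohort)
lagged = mortality_rate_ratio(cohort, lag_years=2.0)
print(f"rate ratio, no lag:     {unlagged:.2f}")
print(f"rate ratio, 2-year lag: {lagged:.2f}")
```

In this toy cohort the rate ratio falls from about 3.48 to about 1.46 once the first two years are excluded, which is the attenuation pattern the lag analysis is designed to reveal.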

**Sarah:** Other mitigation strategies for reverse causation. Prospective designs with repeated measures, so you can track changes in exposure over time and establish temporal ordering. And the third one is Mendelian randomization.

**Kiffer:** Right. Mendelian randomization is a genetic instrumental-variable approach. Genetic variants that influence a modifiable exposure, like genetic variants associated with body mass index, are used as instruments. Because genotype is fixed at conception and not affected by later disease status, the variants cannot be subject to reverse causation.
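
A toy simulation shows why a genetic instrument sidesteps the reverse arrow. Everything here is an assumption made for illustration: one variant with allele frequency 0.3, a true exposure-to-outcome effect of 0.3, and an outcome-to-exposure feedback of 0.5. The estimator is the simple Wald ratio, the gene-outcome covariance divided by the gene-exposure covariance, which is one basic form of Mendelian randomization rather than a full analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

TRUE_EFFECT = 0.3  # exposure -> outcome (what we want to recover)
FEEDBACK = 0.5     # outcome -> exposure (the reverse arrow)
GENE_EFFECT = 0.5  # variant -> exposure

genotype = rng.binomial(2, 0.3, size=n).astype(float)  # 0, 1, or 2 risk alleles
noise_x = rng.normal(size=n)
noise_y = rng.normal(size=n)

# Solve the two simultaneous equations
#   exposure = GENE_EFFECT * genotype + FEEDBACK * outcome + noise_x
#   outcome  = TRUE_EFFECT * exposure + noise_y
exposure = (GENE_EFFECT * genotype + FEEDBACK * noise_y + noise_x) / (
    1 - TRUE_EFFECT * FEEDBACK
)
outcome = TRUE_EFFECT * exposure + noise_y

# Ordinary regression slope of outcome on exposure: contaminated by feedback.
naive_slope = np.cov(exposure, outcome)[0, 1] / np.var(exposure, ddof=1)

# Wald ratio: genotype is unaffected by the outcome, so the feedback drops out.
wald_ratio = np.cov(genotype, outcome)[0, 1] / np.cov(genotype, exposure)[0, 1]

print(f"true effect: {TRUE_EFFECT}")
print(f"naive slope: {naive_slope:.2f}")
print(f"Wald ratio:  {wald_ratio:.2f}")
```

The naive slope comes out roughly double the true effect under these parameters, while the Wald ratio lands close to 0.3. The sketch assumes the variant affects the outcome only through the exposure, which real Mendelian randomization studies have to argue for.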

**Sarah:** Third bias. Simultaneity. Or bidirectional causation. The limit case of reverse causation.

**Kiffer:** Simultaneity arises when two variables mutually cause each other. Standard regression models assume that the predictor is not itself affected by the outcome. When causation runs in both directions, the predictor ends up correlated with the error term, and ordinary regression estimates are biased.

**Sarah:** And the classic example is obesity and depression. Floriana Luppino and colleagues in a 2010 paper in the Archives of General Psychiatry conducted a meta-analysis of longitudinal studies.

**Kiffer:** What did they find?

**Sarah:** Bidirectional causation. Obesity at baseline increased the risk of subsequent depression, with an odds ratio of one point five five. Depression at baseline increased the risk of subsequent obesity, with an odds ratio of one point five eight. Roughly equal magnitudes in both directions.

**Kiffer:** Right. Which means a cross-sectional study finding an association between obesity and depression cannot determine whether obesity causes depression, depression causes obesity, or both. Standard regression treats one variable as the exposure and the other as the outcome, which is misspecified when they're mutually causal.

**Sarah:** Mitigation strategies for simultaneity. The lesson mentions cross-lagged panel models.

**Kiffer:** Yeah. A cross-lagged panel model uses repeated measures of both variables over time. You estimate the path from obesity at time one to depression at time two, and simultaneously the path from depression at time one to obesity at time two. The two paths are estimated jointly, each one adjusted for the outcome variable's own time-one value.
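
The two cross-lagged paths can be sketched on simulated two-wave data. This is an approximation under stated assumptions: the data are simulated with made-up path values (0.25 for obesity to later depression, 0.20 for depression to later obesity), and the paths are estimated with two ordinary least-squares regressions rather than the structural equation software a full cross-lagged panel model would normally use. The function name `lagged_paths` is hypothetical.

```python
import numpy as np


def lagged_paths(outcome_t2, obesity_t1, depression_t1):
    """Regress a time-2 variable on both time-1 variables; return the two paths."""
    design = np.column_stack(
        [np.ones_like(obesity_t1), obesity_t1, depression_t1]
    )
    coefs, *_ = np.linalg.lstsq(design, outcome_t2, rcond=None)
    return {"from_obesity_t1": coefs[1], "from_depression_t1": coefs[2]}


rng = np.random.default_rng(0)
n = 100_000

# Time 1: the two variables are already correlated.
obesity_t1 = rng.normal(size=n)
depression_t1 = 0.3 * obesity_t1 + rng.normal(size=n)

# Time 2: each depends on its own past (0.6) plus a cross-lagged path.
obesity_t2 = 0.6 * obesity_t1 + 0.20 * depression_t1 + rng.normal(size=n)
depression_t2 = 0.25 * obesity_t1 + 0.6 * depression_t1 + rng.normal(size=n)

depression_model = lagged_paths(depression_t2, obesity_t1, depression_t1)
obesity_model = lagged_paths(obesity_t2, obesity_t1, depression_t1)

print(f"obesity t1 -> depression t2: {depression_model['from_obesity_t1']:.2f}")
print(f"depression t1 -> obesity t2: {obesity_model['from_depression_t1']:.2f}")
```

Because each regression includes the outcome's own time-one value, the cross-lagged coefficients describe change beyond simple stability, and both directions are recovered near their simulated values.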

**Sarah:** Other strategies. Instrumental variables. Or careful longitudinal designs that establish temporal precedence. The point is that simultaneity bias requires specialized analytic approaches that recognize the bidirectional structure.

**Kiffer:** Okay. Let me try to pull the takeaways together.

**Sarah:** Yeah, let me list them. There are six I'd want a beginning epidemiology student to leave with.

**Kiffer:** Go ahead.

**Sarah:** First. Measurement is theory-laden. Every variable encodes a theoretical commitment about what causes disease. The biomedical model is powerful but often insufficient. The social-determinants framework, fundamental causes, ecosocial theory, and health equity all push researchers to measure upstream conditions individual-level instruments cannot see.

**Kiffer:** Second. Differential item functioning means an instrument can mean different things in different groups. The CES-D and self-rated health are real published examples. Establish measurement invariance before comparing groups, or observed disparities can reflect measurement artifacts rather than true differences.

**Sarah:** Third. Random measurement error attenuates associations toward the null through regression dilution bias. Lambda equals the variance between persons divided by the total variance. As lambda gets smaller, attenuation gets worse. So a null finding doesn't necessarily mean no effect. It might mean noisy measurement. And treating ordinal Likert data as if it were interval-level can inflate Type I error rates.

**Kiffer:** Fourth. DAGs help you see what to adjust for and what not to. Confounders, yes. Mediators, only if you want the direct effect. Colliders, never. Over-adjustment is the canonical mistake. The obesity paradox is collider stratification bias. The birth weight paradox is over-adjustment for a mediator that's also a collider.

**Sarah:** Fifth. Three biases survive correct measurement and correct DAG specification. Residual confounding from imperfectly measured confounders. Reverse causation when temporal sequence is unclear. And simultaneity when variables mutually cause each other. Hormone replacement therapy is the textbook case of residual confounding. Physical activity is the textbook case of reverse causation. Obesity and depression is the textbook case of simultaneity.

**Kiffer:** Sixth. Each bias has its own toolkit. For residual confounding, improve measurement, conduct E-value sensitivity analyses, and use negative controls. For reverse causation, run lag analyses, use repeated measures, or use Mendelian randomization. For simultaneity, use cross-lagged panel models or instrumental variables.

**Sarah:** And one practical recommendation. Go play with the Causal DAG Playground in Section 2 of the lesson. Build each pattern by hand. Pick what to adjust for and see whether the backdoor paths are open or closed. That hands-on practice is the fastest way to develop intuition for the kind of causal reasoning the rest of this material will keep asking of you.

**Kiffer:** And one more thing. Lesson 1 set the foundations of epidemiology, reminding us that public health depends on networks of actors and institutions, that paradigms shape what we can see, and that the published record can mislead us. Lesson 7 turns those framing concerns into operational guidance about what you measure, how you specify the model, and what biases survive your best efforts.

**Sarah:** Right. Each lesson is building on the one before. Lesson 7 is where the philosophical concerns become methodological discipline.

**Kiffer:** Next up is Lesson 8. Sampling, Selection, and External Validity. Who got into the study, who didn't, and what that means for generalizability.

**Sarah:** Take care, everyone.

**Kiffer:** See you there.