# Lesson 11 — Validity in Observational Studies (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,366 words • ~29 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 11, Validity in Observational Studies. This is one of those consolidating lessons. We've been talking about validity in pieces all the way through the series so far. Lesson 11 is where we lay the formal taxonomy on the table and walk through it systematically.

**Sarah:** And the framing the lesson uses is the four pillars of validity. Internal validity. External validity, sometimes called transportability. Construct validity. And statistical conclusion validity. Let's just get clear definitions on the table before we go anywhere else, because each one is a different question.

**Kiffer:** Yeah, let's do that. Internal validity is the question of whether the association you observed in your study sample reflects a true causal relationship within the source population that sample came from. So you ran a study, you got a number, and you're asking, is that number telling me what's actually happening among the people I sampled? Or is it being distorted by something inside the study itself?

**Sarah:** External validity, or transportability, is a different question. It's about whether your finding generalizes to populations beyond the source population. So even if your number is a perfect description of the people you studied, will the same number show up in a different city, a different country, a different age range, a different time period?

**Kiffer:** Construct validity asks whether you actually measured the constructs you intended to measure. You set out to measure depression. You administered a questionnaire. Does that questionnaire actually capture depression as it exists in the world, or does it capture something else, something adjacent, something that varies in ways the construct itself doesn't?

**Sarah:** And statistical conclusion validity is whether the statistical inferences from your data are sound. Did you have enough statistical power. Did you handle multiple comparisons properly. Was your model specified correctly. Did you account for clustering or correlation in the data.

**Kiffer:** So those are the four pillars. And one of the central messages of the lesson, and honestly one of the central messages of the whole course, is that these four are separate questions. A study can be strong on one and weak on another. You evaluate each separately. You don't just ask, is this study valid. You ask, is this study internally valid, externally valid, construct valid, and statistically valid.

**Sarah:** And there's a particular relationship between internal and external validity that I think trips students up a lot. Internal validity is necessary but not sufficient for external validity. Let me unpack that.

**Kiffer:** Yeah, please.

**Sarah:** If your study isn't internally valid, the number you got isn't even a faithful description of your own sample. So there's nothing to transport. You can't generalize a biased estimate. Internal validity is a precondition. But internal validity alone doesn't guarantee that your finding will hold in another population. A perfectly internally valid study, with perfect measurement, perfect adjustment for confounding, perfect handling of selection, can still produce a finding that doesn't generalize. Because the populations differ in ways that change the effect.

**Kiffer:** So the takeaway is that you have to evaluate both. Internal first, because it's the precondition. But then external validity is its own separate question with its own separate methods. We'll come back to that point at the end.

**Sarah:** Okay, let's start with internal validity, because that's where most of the lesson lives. The three classical threats to internal validity are selection bias, information bias, and confounding. You'll see these three as a unit in basically every epidemiology textbook. Let's take them one at a time.

**Kiffer:** Selection bias first. The basic mechanism is that the people who end up in your study aren't a faithful representation of the source population. The composition is distorted in some way that's related to both the exposure and the outcome. And that distortion bends the association you observe.

**Sarah:** And the lesson uses directed acyclic graphs to make this concrete. The simplest selection bias scenario is one where exposure and disease both independently affect whether someone ends up in the study. When you condition on being in the study, meaning when you only analyze the people who showed up, exposure and disease look associated even if they were independent in the source population. That's the selection bias structure.
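
The selection structure Sarah describes can be checked with a few lines of arithmetic. The sketch below is only an illustration: the prevalences and the selection probabilities are made-up numbers, and `selected_odds_ratio` is a hypothetical helper, not something from the lesson. Exposure and disease are independent in the source population, and the only thing that changes between the two runs is how people get into the study.

```python
from itertools import product


def selected_odds_ratio(p_exposed, p_disease, p_select):
    """Exposure-disease odds ratio among people who end up in the study.

    Exposure and disease are independent in the source population.
    p_select[(e, d)] is the probability of entering the study for each
    combination of exposure e and disease d (0 = absent, 1 = present).
    """
    cell = {}
    for e, d in product((0, 1), repeat=2):
        p_e = p_exposed if e else 1 - p_exposed
        p_d = p_disease if d else 1 - p_disease
        # Probability of having this (e, d) combination AND being selected.
        cell[(e, d)] = p_e * p_d * p_select[(e, d)]
    return (cell[(1, 1)] * cell[(0, 0)]) / (cell[(1, 0)] * cell[(0, 1)])


# Hypothetical numbers: exposure and disease each raise the chance of selection.
biased_selection = {(0, 0): 0.05, (1, 0): 0.20, (0, 1): 0.30, (1, 1): 0.40}
# Comparison case: selection does not depend on exposure or disease.
neutral_selection = {key: 0.5 for key in biased_selection}

or_neutral = selected_odds_ratio(0.3, 0.1, neutral_selection)
or_biased = selected_odds_ratio(0.3, 0.1, biased_selection)
print(f"odds ratio, neutral selection: {or_neutral:.2f}")
print(f"odds ratio, biased selection:  {or_biased:.2f}")
```

With neutral selection the odds ratio stays at 1.00. With these particular selection probabilities it drops to about 0.33 among the people analyzed, even though nothing links exposure and disease in the source population. Other selection patterns would change the size or the direction of the distortion.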

**Kiffer:** There are several canonical forms of selection bias that come up over and over again. Let's walk through the big four. Berkson's bias. Healthy worker effect. Differential loss to follow-up. And volunteer bias.

**Sarah:** Berkson's bias is the hospital-based one. It's named after Joseph Berkson, who described it in 1946. The setup is that if you select cases and controls from a hospital population, both groups are people who were sick enough to be hospitalized. If your exposure of interest also independently increases the chance of being hospitalized, you get a spurious association just from the hospitalization filter.

**Kiffer:** The classic example is studying whether some risk factor causes a particular cancer using hospital cases and hospital controls. If the risk factor also tends to send people to the hospital for other reasons, your control group is enriched for the exposure, and your odds ratio shrinks toward the null or even flips direction.

**Sarah:** The mitigation is population-based sampling. Don't recruit cases and controls from the hospital. Recruit them from the source population that gave rise to the cases. Cancer registries, population-based surveillance systems, things like that.

**Kiffer:** Healthy worker effect is the second big one. The basic observation is that workers, as a group, are healthier than the general population. Because to be a worker you have to be healthy enough to work. Sick people drop out of the workforce. So if you compare an occupational cohort to the general population, you'll see lower disease rates in the workers, even if the work itself is harmful, because the workforce is selected for health to begin with.

**Sarah:** And it's not just at hiring. The healthy worker effect operates over time. People who develop disease often leave the workforce, so your active workforce gets selected even further for health. This is sometimes called the healthy worker survivor effect.

**Kiffer:** The mitigation is internal comparisons. Instead of comparing workers to the general population, compare highly exposed workers to less exposed workers within the same occupational cohort. Both groups are selected for the ability to work, so that part of the selection cancels out.

**Sarah:** Differential loss to follow-up is the cohort study version of selection bias. You enrolled a group, you intended to follow them, but some people dropped out. If the people who dropped out differ systematically from the people who stayed, and the dropping out is related to both exposure and outcome, you get bias.

**Kiffer:** And the mitigation is careful tracking of follow-up. Investigators put real effort into reaching participants who've moved, who've stopped responding, who've changed phone numbers. The fewer people you lose, the less room there is for differential loss to bias your estimates. Beyond that, you do sensitivity analyses for nonresponse. You assume different scenarios about what the missing people would have looked like and see how much your estimate changes.
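
One very simple version of the nonresponse sensitivity analysis Kiffer mentions is to bracket the estimate with extreme scenarios for the people who were lost. The sketch below assumes a cohort with a binary outcome; the counts and the helper name `risk_ratio_bounds` are hypothetical, and real analyses usually use less extreme, more plausible scenarios.

```python
def risk_ratio_bounds(exposed, unexposed):
    """Bracket a risk ratio under extreme assumptions about dropouts.

    Each group is a tuple (events_observed, n_followed, n_lost).
    Low bound: no lost exposed person had the outcome, every lost
    unexposed person did. High bound: the reverse.
    """
    def risk(events, followed, lost, lost_all_events):
        extra = lost if lost_all_events else 0
        return (events + extra) / (followed + lost)

    low = risk(*exposed, False) / risk(*unexposed, True)
    high = risk(*exposed, True) / risk(*unexposed, False)
    return low, high


# Hypothetical cohort: 1,000 enrolled per group, different dropout counts.
exposed = (30, 900, 100)    # 30 events among 900 followed, 100 lost
unexposed = (20, 950, 50)   # 20 events among 950 followed, 50 lost

observed_rr = (exposed[0] / exposed[1]) / (unexposed[0] / unexposed[1])
low, high = risk_ratio_bounds(exposed, unexposed)
print(f"complete-case risk ratio: {observed_rr:.2f}")
print(f"extreme-scenario range:   {low:.2f} to {high:.2f}")
```

In this made-up cohort the complete-case risk ratio is about 1.58, but the extreme scenarios span roughly 0.43 to 6.50. That width is the point: the more people you lose, the less the data alone can pin down the estimate.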

**Sarah:** Volunteer bias is the fourth one. The people who volunteer for a study tend to differ systematically from the people who don't. Volunteers are typically healthier, more educated, more health-conscious. So if you're estimating prevalence or incidence from volunteer samples, you're going to be biased relative to the population the volunteers came from.

**Kiffer:** And the mitigation here is again sensitivity analysis. You think carefully about who the volunteers are, you compare them on measurable characteristics to the source population, and you reason about what direction and magnitude of bias the differences imply. Population-based sampling, when it's feasible, sidesteps the problem entirely.

**Sarah:** Okay, that's selection bias. Information bias is the second classical internal validity threat. And it's about how the data were measured, not who ended up in the study.

**Kiffer:** Information bias is what happens when exposure or outcome is measured with error in a way that distorts the association. The four big patterns are recall bias, social desirability bias, observer bias, and detection bias. Let's walk through each.

**Sarah:** Recall bias is the case-control classic. You're asking people about past exposures. People who got the disease tend to remember and report exposures differently from people who didn't. They've been searching their memory for explanations. They've talked to doctors and family. They have a stronger motivation to scrutinize their own past.

**Kiffer:** And the result is that cases recall exposures with higher sensitivity, and possibly different specificity, than controls. This is differential misclassification. Bias can be in either direction. Usually it inflates the association, but not always.
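
To see how differential recall can manufacture an association, here is a minimal sketch under assumed numbers. The function name, the true exposure prevalence of 30 percent, and the sensitivities of 0.9 for cases and 0.6 for controls are all hypothetical choices for illustration. The true odds ratio is 1 because cases and controls have the same true exposure prevalence.

```python
def observed_odds_ratio(true_exp_cases, true_exp_controls,
                        sens_cases, sens_controls,
                        spec_cases=1.0, spec_controls=1.0):
    """Odds ratio computed from reported (possibly misclassified) exposure.

    true_exp_*: true exposure prevalence among cases / controls.
    sens_*: probability that a truly exposed person reports exposure.
    spec_*: probability that a truly unexposed person reports none.
    """
    def reported(prevalence, sens, spec):
        # Reported exposure = true positives + false positives.
        return prevalence * sens + (1 - prevalence) * (1 - spec)

    p_cases = reported(true_exp_cases, sens_cases, spec_cases)
    p_controls = reported(true_exp_controls, sens_controls, spec_controls)
    return (p_cases / (1 - p_cases)) / (p_controls / (1 - p_controls))


# Same true exposure in both groups, so the true odds ratio is 1.
perfect_recall = observed_odds_ratio(0.3, 0.3, 1.0, 1.0)
differential_recall = observed_odds_ratio(0.3, 0.3, 0.9, 0.6)
print(f"perfect recall:      {perfect_recall:.2f}")
print(f"differential recall: {differential_recall:.2f}")
```

With these inputs the reported odds ratio rises from 1.00 to about 1.68 purely because cases remember their exposure better than controls do. Changing the specificity arguments shows how the bias can also run in the other direction.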

**Sarah:** Social desirability bias is another information bias pattern. People underreport behaviors that are socially stigmatized and overreport behaviors that are socially valued. Smoking, alcohol use, sexual behavior, drug use, eating habits. All vulnerable. And the underreporting can be differential. Different groups feel different levels of stigma.

**Kiffer:** Observer bias is when the data collector knows the participant's group status and that knowledge influences how they record the data. They probe more thoroughly for symptoms in one group. They round measurements differently. Even unconsciously, the observer's expectations shape the data.

**Sarah:** And detection bias is when the surveillance for the outcome differs across exposure groups. If exposed people are seen by their doctors more often, more outcomes get detected in that group. Even if the underlying rate is the same, you'll see a higher apparent rate in the exposed group, just because they were watched more closely.

**Kiffer:** The mitigations for information bias are mostly about design. Blinding, when possible. Blinding the outcome assessor to exposure status. Blinding the exposure assessor to outcome status. Validated instruments rather than ad hoc questions. Prospective designs rather than retrospective ones. Biomarker validation against self-report. And standardized protocols so that data collection is the same across groups.

**Sarah:** Confounding is the third classical threat. And it's the one we'll spend the most time on next lesson, in Lesson 12, Confounding and Causal Inference. But the lesson here gives a tour.

**Kiffer:** Confounding is when some third variable is associated with both the exposure and the outcome, and is not on the causal pathway between them. The classic example is coffee drinking and lung cancer, with smoking as the confounder. Coffee drinkers tend to smoke more. Smoking causes lung cancer. So if you don't adjust for smoking, you'll see an apparent association between coffee and lung cancer that's really just smoking peeking through.
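
The coffee, smoking, and lung cancer story can be made concrete with a stratified table. The counts below are invented so that coffee has no effect within either smoking stratum; the helper names are illustrative. The Mantel-Haenszel estimator is used here as one standard way to pool stratum-specific odds ratios.

```python
def odds_ratio(a, b, c, d):
    """a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata of the confounder."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts: (coffee cases, coffee non-cases,
#                       no-coffee cases, no-coffee non-cases)
strata = {
    "smokers": (40, 360, 10, 90),      # most smokers drink coffee
    "non_smokers": (1, 99, 4, 396),    # most non-smokers do not
}

# Collapse over smoking to get the crude (unadjusted) table.
crude_table = tuple(sum(column) for column in zip(*strata.values()))
crude_or = odds_ratio(*crude_table)
adjusted_or = mantel_haenszel_or(strata.values())

for name, table in strata.items():
    print(f"{name}: OR = {odds_ratio(*table):.2f}")
print(f"crude OR:    {crude_or:.2f}")
print(f"adjusted OR: {adjusted_or:.2f}")
```

Each stratum has an odds ratio of 1.00, yet the crude table gives about 3.10. Stratifying on smoking and pooling brings the estimate back to 1.00, which is the "smoking peeking through" pattern in numbers.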

**Sarah:** There are three forms of confounding worth naming separately. Classical confounding, which is what Kiffer just described. Confounding by indication, which is specific to pharmacoepidemiology. And time-varying confounding, which shows up in longitudinal studies.

**Kiffer:** Confounding by indication is really important in drug safety research. When you study whether a drug causes some adverse outcome, you're comparing people who took the drug to people who didn't. But people who took the drug took it for a reason. They had the indication. The condition the drug was meant to treat. And that indication often itself increases the risk of the outcome.

**Sarah:** So if you observe that people on antidepressants have higher suicide rates, it might be that the antidepressant causes suicidality. Or it might be that depression itself causes suicidality, and the antidepressant is just a marker for having depression severe enough to need treatment. The indication is confounding the drug-outcome association.

**Kiffer:** Time-varying confounding shows up when you have repeated measures of exposure and confounder over time, and the confounder both affects future exposure and is affected by past exposure. The classic example is HIV treatment, where CD4 count both predicts whether you start treatment and is affected by whether you've been on treatment in the past. Standard regression methods can give biased answers in that situation.

**Sarah:** The mitigations for confounding form a long list, and we'll go through them in detail next lesson. But let's introduce them here. Restriction. Matching. Statistical adjustment via multivariable models. Propensity score methods. Instrumental variables. And marginal structural models.

**Kiffer:** Restriction is the simplest. You restrict your study to a subgroup that's homogeneous on the confounder. If you only enroll non-smokers, smoking can't confound. The cost is smaller sample and less generalizability.

**Sarah:** Matching is when you select your unexposed group to look like your exposed group on the confounder. Or in case-control studies, when you select controls to look like cases on the confounder. You're constructing the comparison so the confounder is balanced by design.

**Kiffer:** Statistical adjustment via multivariable models is what most observational studies do. You include the confounders in a regression model and let the model estimate the exposure effect controlling for them. It works well when you've measured the confounders, when the model is specified correctly, and when there's enough overlap in covariates between groups.

**Sarah:** Propensity score methods are an alternative. The propensity score is the probability of being exposed given measured covariates. Once you have the propensity score, you can match, weight, or stratify on it, which is sometimes more robust than including covariates as regressors directly.

**Kiffer:** Instrumental variables are for unmeasured confounding. If you can find a variable that affects exposure but doesn't directly affect the outcome except through exposure, you can use it as an instrument to back out the exposure effect even when you can't measure the confounder. Mendelian randomization is the most famous instrumental variable strategy in epidemiology, using genetic variants as instruments.

**Sarah:** And marginal structural models, or MSMs, are designed specifically for time-varying confounding. They use inverse probability of treatment weighting, or IPTW, to construct a pseudo-population in which the confounder no longer predicts treatment, and then estimate the effect in that pseudo-population.
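
The pseudo-population idea is easier to see in a stripped-down case. The sketch below is a simplification, not a full marginal structural model: it uses a single time point and one binary confounder, with invented counts in which treatment has no effect but sicker people are treated more often. The weight for each person is one over the probability of the treatment they actually received, given the confounder.

```python
# (confounder level, treated?) -> (number of people, number with the outcome)
# Hypothetical data: outcome risk is 0.5 when the confounder is present and
# 0.1 when it is absent, regardless of treatment.
cells = {
    (1, 1): (80, 40), (1, 0): (20, 10),
    (0, 1): (20, 2),  (0, 0): (80, 8),
}


def risk_difference(cells, weighted):
    """Treated-minus-untreated risk, with or without IPTW weights."""
    totals = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # treatment -> [n, events]
    for (level, treated), (n, events) in cells.items():
        n_level = cells[(level, 1)][0] + cells[(level, 0)][0]
        # Probability of the treatment actually received, given the confounder.
        p_received = n / n_level
        w = 1 / p_received if weighted else 1.0
        totals[treated][0] += w * n
        totals[treated][1] += w * events
    risk = {a: ev / n for a, (n, ev) in totals.items()}
    return risk[1] - risk[0]


crude_rd = risk_difference(cells, weighted=False)
iptw_rd = risk_difference(cells, weighted=True)
print(f"crude risk difference: {crude_rd:.2f}")
print(f"IPTW risk difference:  {iptw_rd:.2f}")
```

The crude risk difference is 0.24, while the weighted one is 0.00, matching the null effect built into the data. In the weighted pseudo-population each treatment arm has the same mix of confounder levels, which is what "the confounder no longer predicts treatment" means. The time-varying case repeats this weighting at every time point and multiplies the weights.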

**Kiffer:** And again, we'll go deep on each of these next lesson. Today, the goal is to put them on the map alongside their threat. Confounding is the threat. Restriction, matching, adjustment, propensity scores, instrumental variables, and marginal structural models are the tools.

**Sarah:** Okay, that's internal validity. Let's move to external validity, or transportability. This is the second pillar.

**Kiffer:** External validity is the question of whether your finding generalizes beyond the source population. And the central mechanism that threatens external validity is effect modification.

**Sarah:** Effect modification means the size or direction of the exposure-outcome association varies across subgroups. So in younger people the effect might be one number, and in older people it might be a different number. In one ethnic group the effect might be small, and in another it might be large. The average effect that you computed in your sample is some weighted average of those subgroup effects.

**Kiffer:** And the issue for transportability is that if you transport that average to a new population with a different mix of subgroups, the average will be different. Even if every subgroup-specific effect is exactly the same in the new population, the weighted average shifts when the weights shift.

**Sarah:** The textbook example, which we covered earlier in this series, is the Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial, or ALLHAT. Let's recall the outline briefly.

**Kiffer:** Yeah. ALLHAT was a large randomized trial comparing several first-line antihypertensive drugs in a population that was enriched for racial and ethnic minorities, particularly Black participants. And what the trial found was that the drug effectiveness varied by race. Some drug classes worked better in Black patients. Other drug classes worked better in non-Black patients. The average effect across the whole trial was a meaningful number, but applying that average to a clinical population with a different racial composition would have given misleading guidance.

**Sarah:** And ALLHAT is sometimes pointed to as a textbook case because it shows that a methodologically excellent randomized trial, the gold standard for internal validity, can still produce findings that need to be interpreted carefully when transported. The internal validity was strong. The external validity required attention to effect modification by race.

**Kiffer:** The conditions for transportability are that the effect modifiers, whatever they are, have the same distribution in the target population as in the study population. If they don't, you have to do something to account for the difference.

**Sarah:** And that's where the formal methods come in. There are four big ones the lesson highlights. Inverse probability of selection weighting. Standardization across effect modifier subgroups. Sensitivity analysis for unmeasured effect modification. And target trial emulation.

**Kiffer:** Inverse probability of selection weighting is conceptually similar to inverse probability of treatment weighting, but for transportability. You weight the study sample so that, in the weighted version, the distribution of effect modifiers matches the target population. Then you compute your effect in the weighted sample, and that gives you the estimate that should apply in the target.

**Sarah:** Standardization is the older method. You compute the exposure effect within each subgroup defined by the effect modifier. Then you take a weighted average using the target population's subgroup distribution as the weights. Same logic, different implementation.
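
Here is a small sketch of both approaches side by side, under invented numbers. It assumes a single categorical effect modifier (age group), subgroup-specific risk differences that carry over unchanged to the target, and known subgroup shares in both populations. With one categorical modifier, the selection weights reduce to the target share divided by the study share, which is the simplification used here.

```python
def standardize(effects, shares):
    """Weighted average of subgroup effects using the given subgroup shares."""
    return sum(effects[g] * shares[g] for g in effects)


def ipsw_estimate(effects, study_shares, target_shares):
    """Reweight the study sample so its subgroup mix matches the target."""
    weights = {g: target_shares[g] / study_shares[g] for g in effects}
    num = sum(weights[g] * study_shares[g] * effects[g] for g in effects)
    den = sum(weights[g] * study_shares[g] for g in effects)
    return num / den


# Hypothetical subgroup-specific risk differences and population mixes.
effects = {"younger": -0.02, "older": -0.10}
study_shares = {"younger": 0.7, "older": 0.3}
target_shares = {"younger": 0.3, "older": 0.7}

study_average = standardize(effects, study_shares)
target_standardized = standardize(effects, target_shares)
target_ipsw = ipsw_estimate(effects, study_shares, target_shares)
print(f"study average:         {study_average:.3f}")
print(f"standardized (target): {target_standardized:.3f}")
print(f"IPSW (target):         {target_ipsw:.3f}")
```

The study average is -0.044, while both transported estimates come out at -0.076. The subgroup effects never changed; only the weights did. In practice the selection weights usually come from a model with several covariates, which is where weighting becomes more convenient than standardization.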

**Kiffer:** Sensitivity analysis for unmeasured effect modification is what you do when you suspect there are effect modifiers you didn't measure. You posit plausible scenarios for those unmeasured modifiers and see how much your transported estimate would change. If the answer is, not much, you can be more confident. If the answer is, a lot, you have to be cautious.

**Sarah:** And target trial emulation is a broader framework that's gotten a lot of attention recently. The idea is that whenever you're doing observational research, you should specify, explicitly, the hypothetical randomized trial that your observational study is trying to emulate. That includes the eligibility criteria, the treatment strategies, the outcome, the follow-up. Specifying the target trial forces clarity about what population you're really making inferences about, which directly informs transportability.

**Kiffer:** Okay, that's external validity. Let's move to construct validity, the third pillar.

**Sarah:** Construct validity is the bridge between abstract theoretical concepts and concrete operational measurements. We met this idea earlier in this series, in the lesson on conceptualization and measurement.

**Kiffer:** And the basic question is, did the thing you measured actually capture the construct you wanted to capture. You wanted to measure depression. You used a questionnaire. Does the questionnaire measure depression, or does it measure something adjacent, like distress, or sadness, or trouble sleeping. Or does it measure depression unevenly across different groups.

**Sarah:** Two case studies the lesson points back to are the Center for Epidemiologic Studies Depression scale, or CES-D, and self-rated health.

**Kiffer:** The CES-D is a twenty-item self-report scale developed in the 1970s. It's been used in tens of thousands of studies. And there's a substantial literature showing that the scale measures depression somewhat differently across different cultural and linguistic groups. Some items work the way the developers intended in one population and don't work the same way in another.

**Sarah:** Self-rated health is even more interesting. The classic single-item question is, how would you rate your health. Excellent, very good, good, fair, or poor. It's been shown to predict mortality remarkably well across many populations. But the meaning of, say, very good versus good shifts across age groups, across cultures, across socioeconomic strata. People are using different reference points to answer.

**Kiffer:** And this leads us to two technical concepts that are central to construct validity. Differential item functioning, or DIF. And measurement non-invariance.

**Sarah:** Differential item functioning is when an individual item on a scale behaves differently in different groups, even after controlling for the underlying construct. So two people with the same true level of depression have different probabilities of endorsing a given item, just because they belong to different groups.
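
One common way to formalize this is a logistic item response model in which group membership shifts an item's difficulty. The sketch below assumes that model and uses made-up parameter values; `endorse_probability` and `dif_shift` are illustrative names, and this covers only the simplest (uniform) kind of differential item functioning.

```python
import math


def endorse_probability(theta, difficulty, discrimination=1.0, dif_shift=0.0):
    """Probability of endorsing an item under a two-parameter logistic model.

    theta: the person's true level of the construct (e.g. depression).
    dif_shift: group-specific shift in item difficulty; 0 means no DIF.
    """
    logit = discrimination * (theta - (difficulty + dif_shift))
    return 1 / (1 + math.exp(-logit))


# Two people with the SAME true level of the construct, in different groups.
theta = 0.5
p_reference = endorse_probability(theta, difficulty=0.0)
p_focal = endorse_probability(theta, difficulty=0.0, dif_shift=1.0)
print(f"reference group: {p_reference:.3f}")
print(f"focal group:     {p_focal:.3f}")
```

Both people have the same underlying level, yet the reference-group member endorses the item with probability of about 0.62 and the focal-group member with about 0.38. Any group comparison built on this item mixes a measurement difference into what looks like a difference in the construct.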

**Kiffer:** Measurement non-invariance is the broader phenomenon. The full measurement model, the relationship between the latent construct and the observed indicators, differs across groups. If you have measurement non-invariance, comparing scores across groups doesn't really compare the construct. It compares the construct entangled with the measurement.

**Sarah:** And the implication for research is serious. If your construct is measured non-equivalently across the groups you're comparing, your findings about group differences are partly reflecting measurement, not the construct. You can't tell which is which without doing the formal psychometric work.

**Kiffer:** The mitigations are formal psychometric assessment of measurement invariance, use of validated instruments that have been tested for invariance in the populations you're studying, and explicit caution when comparing groups for whom invariance hasn't been established.

**Sarah:** Okay, that's construct validity. Statistical conclusion validity is the fourth pillar, and it's about whether the statistical inferences you've drawn from the data are sound.

**Kiffer:** There are four big sub-questions here. Statistical power. Multiple comparisons. Model misspecification. And handling of clustering or correlation. Let's walk through each.

**Sarah:** Statistical power is the probability that you'll detect a real effect if it's there. If your study is underpowered, you might fail to detect a real association just because the sample isn't big enough. And if you don't acknowledge that, you might wrongly conclude there's no effect when there really is one.
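
A rough power calculation for comparing two proportions needs only the standard library. The sketch below uses the normal approximation with equal group sizes and ignores the negligible probability of rejecting in the wrong direction; the function name and the example risks of 10 and 15 percent are assumptions for illustration.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in proportions (unpooled).
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    return z.cdf(abs(p1 - p2) / se - z_alpha)


power_small = power_two_proportions(0.10, 0.15, n_per_group=100)
power_large = power_two_proportions(0.10, 0.15, n_per_group=700)
print(f"n = 100 per group: power = {power_small:.2f}")
print(f"n = 700 per group: power = {power_large:.2f}")
```

With 100 people per group the study would detect this difference less than one time in five. It takes roughly 700 per group to reach the conventional 80 percent, which is why a null result from a small study says little about whether the effect exists.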

**Kiffer:** Multiple comparisons are the problem of testing many hypotheses simultaneously. If you test twenty independent hypotheses at the five percent level, you'd expect about one of them, on average, to come up significant by chance even if none of the twenty is real. Without some correction for multiple testing, the literature ends up populated with false positives.
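
The arithmetic behind the twenty-tests example is short enough to write out. This sketch assumes the tests are independent and that every null hypothesis is true; Bonferroni is shown as one simple correction among several, and the function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent true nulls."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n_tests = 20
expected_false_positives = n_tests * 0.05
fwer_uncorrected = familywise_error_rate(n_tests)
per_test_alpha = bonferroni_alpha(n_tests)
fwer_corrected = familywise_error_rate(n_tests, per_test_alpha)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one), uncorrected: {fwer_uncorrected:.2f}")
print(f"Bonferroni per-test alpha:    {per_test_alpha:.4f}")
print(f"P(at least one), corrected:   {fwer_corrected:.3f}")
```

The expected count of false positives is 1.0, and the chance of seeing at least one is about 0.64. Testing each hypothesis at 0.0025 instead brings that chance back under 0.05, at the cost of power for each individual test.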

**Sarah:** Model misspecification is when the statistical model you fit doesn't match the structure of the data. If you assume a linear relationship and the truth is curved, you'll get biased estimates. If you assume normal errors and they're not, your standard errors will be wrong. Misspecification can produce both biased point estimates and misleading uncertainty estimates.

**Kiffer:** And clustering or correlation is when your observations aren't independent. Students within schools. Patients within clinics. Repeated measures within people. If you treat correlated observations as independent, your standard errors will be too small, and your confidence intervals and p-values will be misleading.
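
A quick way to gauge how much clustering matters is the design effect, one plus the cluster size minus one, times the intraclass correlation. The sketch below assumes equal-sized clusters and uses invented numbers (2,000 students in schools of 50, with an intraclass correlation of 0.05); the function names are illustrative.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


deff = design_effect(cluster_size=50, icc=0.05)
n_effective = effective_sample_size(n_total=2000, cluster_size=50, icc=0.05)
se_inflation = deff ** 0.5

print(f"design effect:           {deff:.2f}")
print(f"effective sample size:   {n_effective:.0f}")
print(f"standard error inflation: {se_inflation:.2f}x")
```

Even a modest within-school correlation gives a design effect of 3.45, so 2,000 students carry about as much information as 580 independent ones. Standard errors computed as if the observations were independent would be too small by a factor of about 1.86.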

**Sarah:** The mitigations are pretty standard. Adequate power calculations done before the study runs. Pre-registration of primary analyses to limit post-hoc comparisons, with appropriate corrections when multiple comparisons can't be avoided. Diagnostic plots and model checking to catch misspecification. And mixed models, generalized estimating equations, or cluster-robust standard errors to handle clustering.

**Kiffer:** Okay. Those are the four pillars. Internal, external, construct, statistical conclusion. Each with its own threats. Each with its own mitigations. Now the lesson moves into a relatively new toolkit that's been gaining traction in the last decade or so. Quantitative bias analysis.

**Sarah:** Quantitative bias analysis is the project of putting numbers on the question, how robust is this finding to plausible biases. Instead of just listing limitations and saying our finding might be biased, you actually compute how much the bias would have to be to overturn the finding.

**Kiffer:** And one of the cleanest concrete examples is the E-value. Developed by Tyler VanderWeele and Peng Ding in 2017. The E-value is a single number that quantifies how strong an unmeasured confounder would have to be, in association with both the exposure and the outcome, to fully explain away your observed association.

**Sarah:** Let me say that in plain words because the formal definition is a mouthful. Suppose your study found a relative risk of 2.0 between an exposure and an outcome, after adjusting for the confounders you measured. The E-value answers the question, how strongly associated, on the relative risk scale, would an unmeasured confounder have to be with both the exposure and the outcome, beyond what's already explained by the measured confounders, to fully account for that 2.0 association.

**Kiffer:** And for a relative risk of 2.0, the E-value works out to about 3.4. That means an unmeasured confounder would have to be associated with both the exposure and the outcome by a factor of about 3.4 or more, on the risk ratio scale, in order to explain away the observed association.
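
For a risk ratio, the E-value has a simple closed form in VanderWeele and Ding's paper: the risk ratio plus the square root of the risk ratio times the risk ratio minus one, taking the reciprocal first when the estimate is below 1. A minimal sketch of that formula follows; the function name `e_value` is just an illustrative choice.

```python
import math


def e_value(risk_ratio):
    """E-value for an observed risk ratio point estimate."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    if risk_ratio < 1:
        # Protective estimates: work with the reciprocal.
        risk_ratio = 1 / risk_ratio
    return risk_ratio + math.sqrt(risk_ratio * (risk_ratio - 1))


for rr in (1.0, 1.8, 2.0, 0.5):
    print(f"observed RR = {rr}: E-value = {e_value(rr):.2f}")
```

With this formula an observed risk ratio of 2.0 gives an E-value of about 3.41, and an E-value of exactly 3.0 corresponds to an observed risk ratio of 1.8. A null estimate of 1.0 gives an E-value of 1, meaning no confounding is needed to explain it.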

**Sarah:** And then you ask, is that plausible. Is there an unmeasured confounder out there that's that strongly associated with both. If yes, you have to take seriously that your finding might be explained by it. If no, your finding is more robust to unmeasured confounding. Higher E-values mean more robust findings, because the bar for an explaining confounder is higher.

**Kiffer:** And the nice thing about E-values is they're relatively easy to compute and they're interpretable on a scale that practicing researchers can reason about. They've been incorporated into reporting guidelines and many journals now request E-values for observational studies of causal effects.

**Sarah:** A second concrete tool is negative control exposures. The classic paper is Lipsitch and colleagues in 2010.

**Kiffer:** The idea of a negative control exposure is that you find a different exposure that has no plausible biological effect on the outcome, but that you'd expect to be subject to the same kinds of confounding and selection biases as your real exposure of interest. If your real exposure shows an apparent association with the outcome, but the negative control exposure also shows an association of similar magnitude, that's a red flag. It suggests that the apparent association is being driven by bias rather than a true effect.

**Sarah:** The simplest version is studying a vitamin and some health outcome, and using the prior decade's vitamin use as the negative control. You wouldn't expect a vitamin you stopped taking ten years ago to still be affecting the outcome through a biological pathway. So if it shows an association, that's evidence that something else, probably a residual confounder like overall health-conscious behavior, is producing the apparent effect.

**Kiffer:** Negative controls are powerful because they exploit the same comparison structure as the real analysis, and any apparent effect of the negative control is, by design, not biological. So it has to be bias. The size of that apparent effect tells you something about the magnitude of bias affecting your real estimate.

**Sarah:** A third strategy is triangulation across study designs. The idea is that different study designs are subject to different biases. If you study the same exposure-outcome relationship using a randomized trial, an observational cohort, a Mendelian randomization analysis, and a natural experiment, and they all give similar answers, you can be more confident the answer reflects a true effect. Because each design is biased in different directions, and consistent answers across designs are unlikely to all be artifacts.

**Kiffer:** Triangulation is a really attractive principle because it acknowledges that no single study design is perfect. Instead of trying to solve all the problems within one design, you accept the imperfections and look for convergent evidence across multiple imperfect designs.

**Sarah:** And then sensitivity analysis as a general strategy is the umbrella for all of this. You re-run your analysis under different assumptions about the missing data. Under different model specifications. Under different inclusion and exclusion criteria. Under different choices about how to handle outliers. Under different priors if you're doing Bayesian analysis.

**Kiffer:** And the principle is that if your finding is robust across all those plausible alternative analytic choices, you can be more confident that the finding isn't just an artifact of one particular set of choices. If the finding is fragile, meaning it shifts substantially when you make defensible alternative choices, you have to be more cautious in how strongly you state the conclusion.

**Sarah:** Sensitivity analysis is also important because it serves as a check on the garden of forking paths problem we talked about back in the earlier lesson on research integrity. Researchers face many small choices in any analysis. Without sensitivity analyses, the choice that produces the most striking result tends to win. With sensitivity analyses, you're forced to show that your result holds across the choices that didn't make the final cut.

**Kiffer:** Okay. Let me try to pull all of this together into the takeaways.

**Sarah:** Yeah, let me list them. There are seven main ones I'd want a student to walk away with.

**Kiffer:** Go ahead.

**Sarah:** First takeaway. Validity is multi-dimensional. There are four pillars. Internal, external, construct, and statistical conclusion. Each one is asking a different question. You have to evaluate each one separately. A study can be strong on one and weak on another.

**Kiffer:** Second. Internal validity is necessary but not sufficient. A study has to be internally valid for the finding to be a faithful description of the source population. But internal validity alone doesn't guarantee that the finding will transport. External validity is its own separate question.

**Sarah:** Third. Selection bias, information bias, and confounding are the three classical internal validity threats. Each has canonical forms. Berkson's, healthy worker, differential loss to follow-up, and volunteer bias for selection. Recall, social desirability, observer, and detection for information. Classical, by indication, and time-varying for confounding. Knowing the canonical forms is what lets you spot them in the wild.

**Kiffer:** Fourth. External validity is fundamentally about effect modification. If the exposure-outcome association varies across subgroups, transporting the average to a population with different subgroup composition gives a misleading estimate. The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial, ALLHAT, is the textbook case. The formal methods for transportability include inverse probability of selection weighting, standardization across effect modifier subgroups, sensitivity analysis for unmeasured effect modification, and target trial emulation.
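
Standardization across effect modifier subgroups can be sketched in a few lines. The stratum labels, risk differences, and population shares below are hypothetical numbers chosen only to show the arithmetic, and `standardized_effect` is an illustrative helper name. The sketch also assumes the stratum-specific effects themselves carry over to the target population.

```python
def standardized_effect(stratum_effects, stratum_shares):
    """Weight stratum-specific effects by a population's stratum shares."""
    if set(stratum_effects) != set(stratum_shares):
        raise ValueError("effects and shares must cover the same strata")
    total = sum(stratum_shares.values())
    return sum(
        stratum_effects[s] * stratum_shares[s] / total
        for s in stratum_effects
    )


# Hypothetical stratum-specific risk differences (the effect differs by age).
effects = {"under_65": 0.02, "65_plus": 0.10}

# Hypothetical age composition of the study sample and of the target population.
source_shares = {"under_65": 0.8, "65_plus": 0.2}
target_shares = {"under_65": 0.3, "65_plus": 0.7}

source_average = standardized_effect(effects, source_shares)
target_average = standardized_effect(effects, target_shares)

print(f"Average effect in source population: {source_average:.3f}")
print(f"Average effect in target population: {target_average:.3f}")
```

With these made-up inputs the average effect is 0.036 in the source population and 0.076 in the target, even though nothing about the stratum-specific effects changed. Only the subgroup composition moved, which is the sense in which transporting a single average can mislead.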

**Sarah:** Fifth. Construct validity asks whether you measured what you intended to measure. The Center for Epidemiologic Studies Depression scale and self-rated health are case studies in why this matters. Differential item functioning, DIF, and measurement non-invariance are real threats, especially when you're comparing across cultural, linguistic, or socioeconomic groups.

**Kiffer:** Sixth. Statistical conclusion validity requires adequate power, appropriate handling of multiple comparisons, correct model specification, and proper accounting for clustering. The mitigations are pretty standard. Power calculations before the study runs. Pre-registration. Diagnostic plots. Mixed models or cluster-robust standard errors when you have correlated observations.
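
One way to see why clustering matters for power is the standard design effect approximation for equal-sized clusters, DEFF = 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The sketch below uses hypothetical inputs (2,000 participants, clusters of 20, ICC of 0.05) and illustrative function names. It is a rough planning calculation, not a substitute for a mixed model or cluster-robust standard errors.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical study: 2,000 participants in clusters of 20, ICC of 0.05.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(n_total=2000, cluster_size=20, icc=0.05)

print(f"Design effect: {deff:.2f}")
print(f"Effective sample size: {n_eff:.0f}")
```

Under these assumptions the design effect is 1.95, so the 2,000 clustered participants carry about as much information as roughly 1,026 independent ones. Ignoring that would overstate power and understate standard errors.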

**Sarah:** And seventh. Quantitative bias analysis turns validity from yes-no questions into graded ones. The E-value, from VanderWeele and Ding 2017, is the cleanest example. An E-value of 3.0 means an unmeasured confounder would have to be associated with both the exposure and the outcome by a risk ratio of at least 3.0 each, over and above the measured covariates, to fully explain away the observed association. Negative control exposures, from Lipsitch and colleagues 2010, are another tool. So is triangulation across study designs. And sensitivity analysis is the umbrella strategy that ties them all together.
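
For reference, the E-value for a point estimate on the risk ratio scale is RR + sqrt(RR × (RR - 1)), with protective estimates (RR below 1) inverted first. The sketch below implements only that point-estimate formula. The function name `e_value` and the example risk ratio of 1.8 are illustrative choices, not figures from the lesson.

```python
import math


def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate only)."""
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    # Protective associations are inverted so the formula applies to RR >= 1.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# Illustrative input: an observed risk ratio of 1.8.
observed_rr = 1.8
print(f"E-value for RR={observed_rr}: {e_value(observed_rr):.2f}")
```

An observed risk ratio of 1.8 gives an E-value of 3.0, which is how a fairly modest association can correspond to the E-value of 3.0 described above. A null estimate (RR of 1.0) gives an E-value of 1.0, meaning no unmeasured confounding is needed to explain it.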

**Kiffer:** And one more thing worth mentioning before we close: the connection back to the earlier study-design lessons. Every concept in this lesson has been threaded through what you've already learned. Earlier in this series we covered sampling and external validity, conceptualization and measurement, information bias, design-specific biases, and confounding and statistical inference. The four pillars are not new content. They're a framework that organizes content you already have.

**Sarah:** And the value of the framework is that it gives you a systematic way to read a study. When you pick up a paper, you ask: Is this internally valid? Is selection bias plausible? Is information bias plausible? Is confounding plausible? If yes to any of those, what's the likely direction and magnitude of the bias? Is this externally valid? Are the populations it's being applied to similar enough on effect modifiers? Is this construct valid? Are the measures appropriate for the populations being compared? Is this statistically valid? Is the power adequate, are the multiple comparisons handled, is the model right?

**Kiffer:** And by the time you've answered those questions, you have a much sharper view of what the study tells you, what it doesn't, and how cautiously to take the conclusions. Validity isn't a property a study has or doesn't have. It's a graded assessment along multiple dimensions.

**Sarah:** And the practical recommendation: when you read your next observational paper, try going through the four pillars explicitly. Write a sentence about each. What's the internal validity story? What's the external validity story? What's the construct validity story? What's the statistical conclusion validity story? It feels mechanical at first. It becomes second nature with practice. And it's how working epidemiologists actually appraise the literature.

**Kiffer:** Next up is Lesson 12, Confounding and Causal Inference, where we go even deeper on the most-studied threat to causal inference. We'll formalize the definition, walk through directed acyclic graphs as a tool for identifying confounders, and unpack each of the control strategies we previewed today. Restriction, matching, statistical adjustment, propensity scores, instrumental variables, and marginal structural models.

**Sarah:** Take care, everyone.

**Kiffer:** See you there.
