# Lesson 9 — Information Bias & Data Quality (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5681 words • ~30.7 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 9, Information Bias and Data Quality. This is the third leg of what this material calls the bias triad in observational research.

**Sarah:** Let me set this up for anyone joining us mid-course. Lesson 7 was about what we measure and how we specify a causal model. Lesson 8 was about who ends up in the study, that's selection bias. Today's lesson is about something different again. Even with the right people in the study, and the right variables on paper, the data we record about them can be systematically wrong.

**Kiffer:** And the lesson opens with a really honest framing. Some degree of measurement error is present in virtually every study. The question is never whether there is information bias. The question is how it's shaped, how it's distributed, and whether the design and analysis can detect or correct it.

**Sarah:** Three sections. Section 1 is misclassification, the framework that organizes all of information bias. Section 2 is observer and detection biases, where the error originates with the people doing the measuring or with the surveillance system. And Section 3 is the more technical artifacts, regression dilution and digit preference, which sneak in even when nobody is misclassified at all.

**Kiffer:** Okay. Section 1. Misclassification. This is sometimes called measurement bias. The framework that organizes everything is whether the errors are differential or non-differential. That distinction does most of the conceptual work in this section.

**Sarah:** Hold on, let me slow that down for a beginning student. What does it mean for misclassification to be differential or non-differential? Differential with respect to what?

**Kiffer:** Good question. The word is shorthand. Differential means the measurement error differs across the groups you're comparing. Non-differential means the error is the same across groups. So if you're comparing cases of disease to controls, non-differential would mean both groups have the same probability of being misclassified on the exposure. Differential would mean the cases are misclassified at a different rate from the controls.

**Sarah:** Got it. And that distinction matters because the two kinds of error do different things to your effect estimate.

**Kiffer:** Exactly. Let's start with non-differential. The probability of being misclassified on exposure is the same among cases and controls. Or, the probability of being misclassified on outcome is the same among exposed and unexposed individuals. Either way, the errors are unrelated to the other variable in the association you're estimating.

**Sarah:** And there's a famous rule of thumb that goes with this. For a binary exposure, non-differential misclassification typically biases the observed association toward the null. Toward no effect. Toward a relative risk of one.

**Kiffer:** Right, and the textbook example is Blair and colleagues in 1996. Let me tell you a bit about who they were and what they did, because this study is foundational. Aaron Blair was a senior researcher at the U.S. National Cancer Institute who spent decades studying occupational exposures, particularly in agricultural workers. In 1996 he and his team compared two methods of measuring pesticide exposure.

**Sarah:** Two methods. First, they asked workers to fill out a self-report questionnaire about their pesticide use. Second, they took a urine sample and tested it for urinary metabolites, which are the chemical breakdown products of pesticides that show up in urine when you've actually been exposed.

**Kiffer:** And the discordance was striking. Among workers who said they had no pesticide exposure, about 30 percent had detectable urinary metabolites. Conversely, some workers reporting heavy exposure had no biological evidence.

**Sarah:** So the misclassification was bidirectional. Self-report missed real exposures, and self-report also added phantom exposures.

**Kiffer:** Right. And when they used self-reported exposure to estimate associations with health outcomes, the odds ratios were substantially attenuated compared to estimates using the biomarker classification. The noise in the exposure measurement dragged the estimate toward the null.

**Sarah:** Okay, can you give me the intuition for why that happens? Not the math, the picture.

**Kiffer:** Sure. Imagine you start with two clean groups. Truly exposed people on one side, truly unexposed people on the other. The exposed group has more disease, the unexposed group has less. Now imagine you randomly take some people from the exposed group and put them in the unexposed bucket because the questionnaire missed their exposure. And you randomly take some people from the unexposed group and put them in the exposed bucket. What happens to the difference between buckets?

**Sarah:** It shrinks. Because the exposed bucket now has some unexposed people pulling its disease rate down, and the unexposed bucket now has some exposed people pulling its disease rate up. The contrast between the two buckets is diluted.

**Kiffer:** Exactly. That's the dilution mechanism. And that's why non-differential misclassification of a binary exposure attenuates the association toward the null.
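A minimal sketch of the bucket picture above, in Python. The counts, the sensitivity of 0.7, and the specificity of 0.9 are hypothetical values chosen for illustration; none of them come from Blair's study. The same error rates are applied to cases and controls, which is what makes the misclassification non-differential.

```python
def misclassify(exposed, unexposed, sensitivity, specificity):
    """Return expected (recorded exposed, recorded unexposed) counts."""
    recorded_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    recorded_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return recorded_exposed, recorded_unexposed


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Exposure odds ratio from a 2x2 case-control table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical true counts, written as (exposed, unexposed)
true_cases = (200, 300)
true_controls = (100, 400)

# Same error rates in both groups: non-differential misclassification
SENSITIVITY, SPECIFICITY = 0.7, 0.9

recorded_cases = misclassify(*true_cases, SENSITIVITY, SPECIFICITY)
recorded_controls = misclassify(*true_controls, SENSITIVITY, SPECIFICITY)

true_or = odds_ratio(*true_cases, *true_controls)
recorded_or = odds_ratio(*recorded_cases, *recorded_controls)

print(f"true OR     = {true_or:.2f}")
print(f"recorded OR = {recorded_or:.2f}")
```

With these made-up inputs the odds ratio drops from about 2.67 to about 1.83. The association is still there, but it has moved toward one.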

**Sarah:** But there's a caveat the lesson is careful about.

**Kiffer:** Yes. The toward-the-null rule holds reliably for binary exposures. For exposures with more than two categories, polytomous exposures, non-differential misclassification can actually bias in either direction. People can shift across category boundaries in ways that don't simply average out. So if you're working with a multi-category dietary exposure, for example, you can't assume the bias is conservative.
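One way to see the polytomous caveat is a three-level toy example. Everything here is an assumption made for illustration: the cohort sizes, the risks, and the rule that 40 percent of truly high-exposure people get recorded as low, with the same probability whether or not they develop disease.

```python
# Hypothetical cohort: three true exposure levels, 1000 people each.
true_n = {"none": 1000, "low": 1000, "high": 1000}
# Assumed true risks: "low" carries no excess risk, "high" triples it.
true_risk = {"none": 0.10, "low": 0.10, "high": 0.30}

# Assumed error: 40% of truly "high" people are recorded as "low",
# regardless of disease status (so the error is non-differential).
SHIFT_HIGH_TO_LOW = 0.40

true_cases = {level: true_n[level] * true_risk[level] for level in true_n}

moved_n = true_n["high"] * SHIFT_HIGH_TO_LOW
moved_cases = true_cases["high"] * SHIFT_HIGH_TO_LOW

# The recorded "low" bucket now mixes true-low and true-high people.
recorded_low_n = true_n["low"] + moved_n
recorded_low_cases = true_cases["low"] + moved_cases
recorded_low_risk = recorded_low_cases / recorded_low_n

true_rr_low = true_risk["low"] / true_risk["none"]
recorded_rr_low = recorded_low_risk / true_risk["none"]

print(f"true RR, low vs none     = {true_rr_low:.2f}")
print(f"recorded RR, low vs none = {recorded_rr_low:.2f}")
```

In this sketch the low-versus-none risk ratio is truly 1.00 but is recorded as about 1.57. The error is non-differential, yet the middle category is biased away from the null.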

**Sarah:** Good caveat to flag. Now let's get to the harder case, differential misclassification.

**Kiffer:** Differential misclassification means the accuracy of measurement differs between the groups you're comparing. And this is where the bias becomes unpredictable. It can pull the effect estimate toward the null. It can push it away from the null. It can flip the sign altogether. Anything is possible.

**Sarah:** Two big mechanisms drive differential misclassification. Recall bias and social desirability bias. Let's take recall bias first.

**Kiffer:** The textbook case is the INTERPHONE study from 2010. Let me set the context. INTERPHONE was a massive multinational case-control study coordinated by the International Agency for Research on Cancer, which is the cancer research arm of the World Health Organization, headquartered in Lyon, France. The study ran across 13 countries and was designed to investigate whether mobile phone use was associated with brain tumors.

**Sarah:** And quick definitions. A case-control study takes people with the disease, the cases, and people without it, the controls, and then looks backward to see who was exposed. Glioma and meningioma are the two kinds of brain tumors INTERPHONE studied. Glioma starts from the supportive cells in the brain. Meningioma starts from the membranes that wrap around the brain.

**Kiffer:** Right. Cases reported their historical mobile phone use after their diagnosis. So they're being asked to remember which side of their head they typically held the phone against, going back years.

**Sarah:** And here's the result that gave the game away. Cases with tumors on the same side of the head as their reported phone use had an odds ratio of 1.8 for glioma. So an elevated risk on the side where they remembered using the phone. But cases with tumors on the opposite side showed an odds ratio of 0.7. A protective association.

**Kiffer:** Now, that pattern is biologically nonsensical. There is no plausible mechanism by which a phone on your right ear would protect you from a tumor on your left side of the brain. So if both of those numbers are real associations, we have a logical problem.

**Sarah:** And the most plausible explanation is that the cases differentially recalled or reported phone use on the side of their tumor. Once you've been diagnosed with a brain tumor, your memory of which ear you held the phone against gets reshaped by your concern about the cause. You start to remember more phone use on the tumor side. And less on the other side.
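A rough sketch of how differential recall can manufacture an association out of nothing. The counts and the recall accuracies below are hypothetical and are not INTERPHONE estimates. The only assumption that matters is that cases and controls report exposure with different accuracy.

```python
def misclassify(exposed, unexposed, sensitivity, specificity):
    """Return expected (recorded exposed, recorded unexposed) counts."""
    recorded_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    recorded_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return recorded_exposed, recorded_unexposed


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Exposure odds ratio from a 2x2 case-control table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical truth: exposure is equally common in cases and controls.
true_cases = (100, 400)      # (exposed, unexposed)
true_controls = (100, 400)

# Assumed recall: cases search their memory harder and over-report a little;
# controls forget more real exposure but rarely report exposure they never had.
recorded_cases = misclassify(*true_cases, sensitivity=0.90, specificity=0.90)
recorded_controls = misclassify(*true_controls, sensitivity=0.80, specificity=0.95)

true_or = odds_ratio(*true_cases, *true_controls)
recorded_or = odds_ratio(*recorded_cases, *recorded_controls)

print(f"true OR     = {true_or:.2f}")
print(f"recorded OR = {recorded_or:.2f}")
```

Here the true odds ratio is 1.00 and the recorded one is about 1.41. Swapping the two sets of accuracies would push the estimate below one instead, which is the sense in which the direction of differential bias is unpredictable.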

**Kiffer:** And the lesson points out that reproductive epidemiology has the same problem in spades. Werler and colleagues in 1989 demonstrated this in a study of mothers of children with birth defects. Martha Werler is a reproductive epidemiologist at Boston University who has spent decades on birth-defects research. She and her team showed that mothers of affected infants recalled and reported medication use, dietary exposures, and environmental contacts more completely than mothers of healthy children.

**Sarah:** Mothers of children with birth defects were more likely to remember minor illnesses during pregnancy, prescription drug use, chemical exposures. Their memory was more complete because they had been searching it for an explanation for months before the interview.

**Kiffer:** And Swan and colleagues in 1992 quantified this beautifully. Shanna Swan is an environmental and reproductive epidemiologist now at Mount Sinai in New York. Her team found that mothers of malformed infants reported 40 percent more occupational chemical exposures compared to what was actually documented in their employment records, while mothers of healthy infants showed no such reporting excess.

**Sarah:** So the comparison was self-report versus the actual employer's records, and the discrepancy was concentrated entirely on the affected-baby side. That's a textbook differential misclassification.

**Kiffer:** The lesson then walks through the mechanisms by which recall differs between cases and controls. It's a useful taxonomy. First, rumination. People who experience an adverse outcome spend more time thinking about potential causes, rehearsing memories more thoroughly. Second, effort after meaning, which is psychologists' phrase for the human tendency to search for explanations of significant events. That search drives more intensive memory retrieval among cases.

**Sarah:** Third, prompted recall. Cases may have received information from clinicians or news media about risk factors, which then triggers more detailed retrospective exposure assessment. The case has been told that pesticides cause lymphoma, so when they sit down for the interview they bring up every encounter with pesticides they can think of.

**Kiffer:** And fourth, telescoping. Significant events get recalled as having occurred closer in time to the outcome than they actually did. Memory compresses time around an emotionally salient event.

**Sarah:** And the mitigation strategies are basically design choices. First, prospective designs. Collect exposure data before the outcome occurs. Cohort studies and exposure registries are largely immune to recall bias because the exposure information is gathered before anyone is sick.

**Kiffer:** Second, structured instruments. Standardized validated questionnaires with specific prompts. Open-ended questions amplify recall bias. Third, record-based exposure assessment. Use medical records, pharmacy databases, employment records, anything not filtered through participant memory. Fourth, blinding. Keep participants unaware of specific study hypotheses, so motivated recall is reduced. And fifth, validation sub-studies. Compare self-reported data with objective records in a subset of participants, so you can quantify the size of the bias.
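A minimal sketch of what a validation sub-study lets you do, assuming the sensitivity and specificity estimated in the subset apply equally to cases and controls in the full study. All counts are hypothetical. The back-correction is the standard algebra for a binary classification: recorded exposed equals sensitivity times true exposed plus one minus specificity times true unexposed, solved for the true count.

```python
def estimate_accuracy(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity and specificity of self-report, judged against records."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


def back_correct(recorded_exposed, recorded_unexposed, sensitivity, specificity):
    """Solve for expected true counts given recorded counts and error rates."""
    total = recorded_exposed + recorded_unexposed
    true_exposed = (recorded_exposed - (1 - specificity) * total) / (
        sensitivity + specificity - 1
    )
    return true_exposed, total - true_exposed


# Hypothetical validation subset: self-report compared with employment records.
se, sp = estimate_accuracy(true_pos=70, false_neg=30, true_neg=180, false_pos=20)

# Hypothetical self-reported counts in the full study: (exposed, unexposed)
recorded_cases = (170, 330)
recorded_controls = (110, 390)

case_exp, case_unexp = back_correct(*recorded_cases, se, sp)
ctrl_exp, ctrl_unexp = back_correct(*recorded_controls, se, sp)

recorded_or = (recorded_cases[0] * recorded_controls[1]) / (
    recorded_cases[1] * recorded_controls[0]
)
corrected_or = (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)

print(f"sensitivity = {se:.2f}, specificity = {sp:.2f}")
print(f"recorded OR = {recorded_or:.2f}, corrected OR = {corrected_or:.2f}")
```

The correction is only as good as the validation data. If recall accuracy differs between cases and controls, each group needs its own sensitivity and specificity.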

**Sarah:** Okay, on to the second mechanism for differential misclassification, social desirability bias.

**Kiffer:** This is the classic case where people answer survey questions based on what's socially acceptable rather than what's true. The textbook example is alcohol consumption.

**Sarah:** Lorraine Midanik in 1982 demonstrated that self-reported alcohol consumption in population surveys systematically accounts for only 40 to 60 percent of known alcohol sales in the same population.

**Kiffer:** Quick context. Midanik was a survey researcher at the Alcohol Research Group in Berkeley who spent decades on the question of whether you can trust self-reported drinking. Her approach was to take an entire jurisdiction, add up the total amount of alcohol sold, and compare it to the total amount of drinking people said they did in surveys. The two numbers should match. They don't.

**Sarah:** People are reporting consumption of about half what they're actually drinking, on average.

**Kiffer:** And more recent biomarker studies have confirmed this. Kilian and colleagues in 2020 used phosphatidylethanol, sometimes shortened to PEth, which is a substance that builds up in red blood cells when alcohol is metabolized. It only forms in the presence of alcohol and it persists for several weeks, so it's a relatively objective biomarker of recent drinking.

**Sarah:** And what did they find?

**Kiffer:** They found that biomarker-based estimates of heavy drinking prevalence were approximately twice as high as self-reported estimates. The biomarker said one in five people in the population was drinking heavily. The survey said one in ten.

**Sarah:** And the underreporting is not random. It's concentrated among the heaviest drinkers and in populations where drinking carries more stigma. So this is differential. Heavy drinkers underreport more than light drinkers, which compresses the exposure distribution and makes it harder to see real effects of heavy drinking on outcomes.

**Kiffer:** Now here's where the lesson takes a turn that I think is genuinely important. It moves from the technical framework of misclassification to what it calls the equity dimensions of data quality.

**Sarah:** Right, and this section is doing real conceptual work. The technical framework treats measurement error as a thing to be quantified and corrected. Validation studies, sensitivity analyses, regression calibration. But the lesson points out that errors are not distributed at random across the population. They cluster along the same lines that structure inequality.

**Kiffer:** And the way we choose to measure, or not measure, particular groups encodes a theory about whose health matters and whose suffering counts. That's a paraphrase of the lesson, not a direct quote.

**Sarah:** Let me walk through the specific examples, because they're concrete. First, cause-of-death misclassification by socioeconomic position. Death certificates are the bedrock of mortality surveillance. Most countries build their entire population health picture from them.

**Kiffer:** Mohsen Naghavi and colleagues in 2010 documented that what epidemiologists call garbage codes, ill-defined causes like cardiac arrest unspecified, or ill-defined heart disease, are more common for decedents who are older, lower-income, racialized, or rural.

**Sarah:** Naghavi is a senior researcher at the Institute for Health Metrics and Evaluation at the University of Washington in Seattle, which produces the Global Burden of Disease estimates that are widely used for policy. So when he documents this pattern, it's flowing into the numbers that countries use to decide where to spend their health budgets.

**Kiffer:** And because cause-specific mortality drives both research priorities and resource allocation, differential misclassification at the certificate stage propagates inequities through every downstream analysis. If we don't know precisely what poor older rural people are dying of, we can't fund research on it, and we can't intervene.

**Sarah:** Second example. Race and ethnicity as administrative categories. Recorded inconsistently across health systems. Self-report on some forms, clinician observation on others, next-of-kin attribution on death certificates. And often a single Other bucket that collapses dozens of distinct communities.

**Kiffer:** Indigenous identity in particular is systematically under-recorded. Janet Smylie and Michelle Firestone in 2015 documented substantial mismatches between First Nations, Métis, and Inuit self-identification and the way these populations appear in administrative health data in Canada. Smylie is a Métis physician and researcher at the University of Toronto. Her work specifically focuses on the gap between how Indigenous people identify themselves and how data systems see them.

**Sarah:** And the consequence has two layers. The methodological consequence is differential misclassification of group membership, which can deflate or inflate observed disparities depending on direction. The political consequence is that populations rendered statistically invisible struggle to make claims on a public health system that does not see them.

**Kiffer:** Third example. Erasure of gender and sexual minorities. Most large health surveys, until very recently, collected only binary sex and no measure of gender identity or sexual orientation.

**Sarah:** Trans, non-binary, and Two-Spirit individuals have therefore been either invisible or actively miscoded. Sometimes against their will, when they're assigned to a category that doesn't match their lived identity. Bauer and colleagues in 2009 documented this pattern in Canadian health data.

**Kiffer:** And the methodological point is that you can't fix this with post-hoc adjustment. If a survey collapses a hundred Indigenous nations into a single checkbox, no amount of statistical sophistication will recover the differences that were never captured.

**Sarah:** Fourth example. Underrepresentation as a form of data quality. Even when measurement instruments work well, populations who are systematically under-sampled cannot benefit from the resulting evidence. Clinical trials have historically over-represented White men of working age. Genome-wide association studies have over-represented people of European ancestry.

**Kiffer:** Quick definition. A genome-wide association study, sometimes shortened to GWAS, is a study that scans the entire genome of thousands of people to find genetic variants associated with disease. If those thousands of people are mostly of European descent, the genetic risk scores produced by the study transfer poorly to people of other ancestries. The instrument doesn't work as well outside the population it was trained on.

**Sarah:** And the most striking case in the lesson, the one that pulled all of this together during the pandemic, is pulse oximetry. Let me unpack it.

**Kiffer:** Please. Pulse oximetry. The small clip on the fingertip that estimates blood oxygen saturation. It works by shining red and infrared light through the finger and inferring how much oxygen is bound to hemoglobin in your blood. It's used everywhere in medicine because it's cheap, fast, and non-invasive.

**Sarah:** The reference standard for actually measuring oxygen saturation is arterial blood gas. That's a needle into the artery in your wrist and a chemistry analysis of the blood. It's accurate but invasive and slow. Pulse oximetry is the convenient surrogate.

**Kiffer:** Sjoding and colleagues in 2020 compared paired pulse oximetry and arterial blood gas measurements in over 10,000 patients. Michael Sjoding is a critical care physician at the University of Michigan. The study was published right at the height of the COVID-19 pandemic, and it caught a wave of attention because the implications were immediate and enormous.

**Sarah:** What's COVID-19, just to make sure we're on the same page for any listener new to all this?

**Kiffer:** Coronavirus disease 2019, the respiratory illness caused by the virus SARS-CoV-2, which spread globally starting in early 2020 and produced the first sustained pandemic in a century. One of its hallmark complications is hypoxemia, dangerously low blood oxygen, sometimes silent until very late. Clinicians were leaning on pulse oximetry constantly to triage patients.

**Sarah:** Right. And here's what Sjoding found. Among Black patients whose pulse oximeter read 92 to 96 percent, 11.7 percent actually had a true arterial saturation below 88 percent. So the device was reporting a comfortable range while the patient was actually critically low. That's occult hypoxemia, hidden low oxygen, and it occurred roughly three times as often as in White patients, where the rate was 3.6 percent.

**Kiffer:** And during the pandemic, this calibration error meant that Black patients were systematically less likely to be flagged for supplemental oxygen, hospital admission, or therapy thresholds keyed to oximetry readings.

**Sarah:** The reason is essentially that pulse oximeters were calibrated and validated on majority-White cohorts when the technology was developed in the 1970s and 1980s. Skin pigment changes how light passes through tissue, and the calibration didn't account for that adequately.

**Kiffer:** And this isn't a problem of human reporting bias or missing data. It's a problem of an instrument whose training conditions encoded a theory about who the relevant patient population was. And whose deployment in a more diverse population produced systematic, racially patterned misclassification.

**Sarah:** That's a really important reframing. Information bias isn't just a methodological annoyance. The choice of what gets measured, and on whom the instrument was calibrated, is itself a political and ethical decision.

**Kiffer:** Okay. Section 2. Observer and detection biases. Errors that emerge from the data collector or from the surveillance system itself, rather than from the participant.

**Sarah:** Observer bias first. This occurs when the data collector's knowledge of the participant's group status influences their measurement. The classic version is an interviewer who knows whether the person they're interviewing is a case or a control, and unconsciously probes harder for exposures among cases.

**Kiffer:** And the gold-standard prevention is blinding. The person measuring the outcome shouldn't know whether the participant was in the exposed or unexposed group. Or, in a clinical trial, the outcome assessor doesn't know which arm the patient was randomized into.

**Sarah:** Training and standardization help, of course, but they don't prevent the unconscious influence of knowing exposure status. Blinding is what removes the source of bias structurally. And let me make a point that students sometimes miss. Increasing the sample size doesn't fix observer bias. A bigger sample of biased measurements just gives you a more precise estimate of the wrong number.

**Kiffer:** Right. Sample size addresses random error. It doesn't address systematic bias.

**Sarah:** Detection bias. This is the screening problem. The textbook example is PSA screening for prostate cancer.

**Kiffer:** Quick definition. PSA stands for prostate-specific antigen. It's a protein produced by prostate cells, including cancer cells. A blood test can measure prostate-specific antigen levels, and high levels can signal possible prostate cancer.

**Sarah:** And the historical context. After widespread PSA screening was introduced in the United States in the late 1980s, prostate cancer incidence more than doubled. From about 100 cases per 100,000 men in 1986 to over 230 per 100,000 by 1992.

**Kiffer:** But mortality from prostate cancer changed very little over the same period. So we were finding a lot more prostate cancer, but not preventing many more deaths.

**Sarah:** The interpretation is that the apparent epidemic was largely a detection artifact. Intensive screening identified a reservoir of slow-growing, clinically insignificant cancers that would never have caused symptoms or death. This is overdiagnosis.

**Kiffer:** And Ruth Etzioni and colleagues in 2002 documented this carefully. Etzioni is a biostatistician at the Fred Hutchinson Cancer Center in Seattle who specializes in cancer screening models. They estimated that lead-time bias accounted for the majority of the apparent survival improvement that screening seemed to produce.

**Sarah:** Lead-time bias. Define it for me carefully.

**Kiffer:** Sure. Lead time is the interval between when screening detects the cancer and when it would have presented clinically with symptoms. Lead-time bias is the appearance of longer survival after screening just because diagnosis happens earlier on the timeline. The patient still dies on the same calendar date. But the time-from-diagnosis-to-death looks longer because the diagnosis was moved earlier. You haven't actually saved any lives. You've just stretched the metric.
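The arithmetic behind lead-time bias fits in a few lines. The ages and the four-year lead time are hypothetical, and the sketch assumes the worst case for screening: earlier diagnosis does not change the date of death at all.

```python
def survival_from_diagnosis(age_at_diagnosis, age_at_death):
    """Years lived after diagnosis."""
    return age_at_death - age_at_diagnosis


# Hypothetical patient whose date of death is unaffected by screening.
AGE_AT_DEATH = 70
AGE_AT_SYMPTOMS = 67      # diagnosis without screening
LEAD_TIME_YEARS = 4       # how much earlier screening finds the cancer

age_at_screen_detection = AGE_AT_SYMPTOMS - LEAD_TIME_YEARS

without_screening = survival_from_diagnosis(AGE_AT_SYMPTOMS, AGE_AT_DEATH)
with_screening = survival_from_diagnosis(age_at_screen_detection, AGE_AT_DEATH)

print(f"survival after symptomatic diagnosis: {without_screening} years")
print(f"survival after screen detection:      {with_screening} years")
print(f"counts as a 5-year survivor? {without_screening >= 5} vs {with_screening >= 5}")
```

The same person goes from three years of survival to seven, and from a five-year-survival failure to a success, without living a day longer.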

**Sarah:** And length-time bias is the related artifact. Screening preferentially detects slower-growing tumors with longer preclinical phases, because those tumors are present for a longer window during which they can be detected. Fast-growing tumors might pass through the detectable window between screening visits and never get caught by screening at all. So screened cases appear to have better prognosis than symptomatic cases, but it's because you're sampling tumors with naturally better prognosis.
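Length-time bias can be sketched the same way. The two tumor types, their equal incidence, and the lengths of their detectable preclinical phases are all assumptions for illustration. The sketch also assumes a steady state and a single screen at a random moment, so that the chance of catching a tumor is proportional to how long it stays detectable.

```python
# Hypothetical tumor types with equal incidence but different lengths of the
# detectable preclinical phase (years during which a screen could find them).
preclinical_years = {"slow": 8.0, "fast": 1.0}
incidence_share = {"slow": 0.5, "fast": 0.5}

# Chance that a one-off screen lands inside the preclinical window is taken
# to be proportional to the length of that window.
weights = {
    kind: incidence_share[kind] * preclinical_years[kind]
    for kind in preclinical_years
}
total_weight = sum(weights.values())
screen_detected_share = {kind: w / total_weight for kind, w in weights.items()}

for kind in preclinical_years:
    print(
        f"{kind}: {incidence_share[kind]:.0%} of all tumors, "
        f"{screen_detected_share[kind]:.0%} of screen-detected tumors"
    )
```

Slow tumors are half of all tumors here but about 89 percent of the screen-detected ones, so screen-detected cases look like they have a better prognosis before any treatment effect is considered.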

**Kiffer:** Both biases inflate the apparent benefit of screening. And the evidence on PSA screening specifically is genuinely mixed.

**Sarah:** Two big randomized trials produced conflicting results. The European Randomized Study of Screening for Prostate Cancer, sometimes shortened to ERSPC, found a 20 percent relative reduction in prostate cancer mortality over 13 years. But for every prostate cancer death prevented, approximately 27 men were overdiagnosed and treated for cancers that would never have hurt them.

**Kiffer:** And the U.S. Prostate, Lung, Colorectal, and Ovarian trial, sometimes shortened to PLCO, found no mortality benefit from organized screening. Partly because of high rates of contamination, where the so-called control group members were getting prostate-specific antigen tests anyway through their regular doctors. So the trial was effectively comparing screening to slightly less screening, not screening to no screening.

**Sarah:** Twenty-seven men overdiagnosed per death prevented. And those men weren't just diagnosed. Many of them had biopsies, surgeries, radiation, hormonal therapy. With real side effects. Incontinence, sexual dysfunction. So the harms are not abstract.

**Kiffer:** Right. And the discrepancy between the European and the U.S. trials illustrates how detection bias complicates interpretation. The true effect of screening is hard to isolate from the artifacts created by differential detection intensity.

**Sarah:** Then there's surveillance bias, which is the cohort version of detection bias. The textbook case is hormone replacement therapy and breast cancer.

**Kiffer:** Quick definition. Hormone replacement therapy, sometimes shortened to HRT, is the use of estrogen, sometimes combined with progestin, to relieve menopausal symptoms. It was widely prescribed in the 1990s and early 2000s and was the subject of a long debate about whether it caused breast cancer.

**Sarah:** Haut and colleagues in 2012 demonstrated that women on hormone replacement therapy in observational studies had more frequent physician visits and mammographic screening than non-users. Differential surveillance explained a substantial portion of the apparent association between hormone replacement therapy and breast cancer in early observational studies.

**Kiffer:** Hormone replacement therapy users were more likely to have breast cancer detected at earlier stages, not necessarily more likely to actually develop it. When analyses accounted for screening frequency, the apparent increased risk was substantially attenuated.

**Sarah:** And the lesson gives a really useful diagnostic checklist for distinguishing detection-driven from biology-driven associations. When you see an association between an exposure and an outcome, ask whether it could be explained by differential detection rather than a true biological effect.

**Kiffer:** Four indicators of detection bias. First, the exposed group has more healthcare contacts, diagnostic tests, or screening procedures than the unexposed group. Second, the association is stronger for less severe or earlier-stage disease. That's the signature of more eyes finding more subclinical cases. Third, incidence increases without a proportional change in mortality. That's the signature of overdiagnosis.

**Sarah:** And fourth, the association diminishes when analyses control for healthcare utilization. If adjusting for the number of physician visits cuts your effect estimate in half, that's a pretty good sign that surveillance, not biology, was driving a lot of the original association.
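The fourth indicator can be illustrated with a toy calculation. The true risk, the detection probabilities, and the screening shares are hypothetical and are not drawn from any hormone replacement therapy study. The sketch assumes the exposure has no biological effect and that exposed people simply get screened more often.

```python
TRUE_RISK = 0.02  # assumed identical in users and non-users: no biological effect

# Assumed probability that an existing cancer is found during follow-up
detection = {"screened": 0.9, "unscreened": 0.5}

# Assumed share of each exposure group that gets regular screening
screened_share = {"users": 0.8, "non_users": 0.4}


def detected_risk(group):
    """Risk of a *diagnosed* cancer, mixing screened and unscreened people."""
    share = screened_share[group]
    p_detect = share * detection["screened"] + (1 - share) * detection["unscreened"]
    return TRUE_RISK * p_detect


crude_rr = detected_risk("users") / detected_risk("non_users")

# Within a screening stratum both groups share one detection probability,
# so the stratum-specific risk ratio returns to the true value of one.
stratum_rr = {
    stratum: (TRUE_RISK * p) / (TRUE_RISK * p) for stratum, p in detection.items()
}

print(f"crude RR = {crude_rr:.2f}")
print(f"RR within screening strata = {stratum_rr}")
```

The crude risk ratio comes out at about 1.24 even though the true effect is null, and it disappears once the comparison is made within levels of screening.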

**Kiffer:** Section 3. Regression dilution and digit preference. Two artifacts that show up even when nobody is misclassifying anyone.

**Sarah:** These are subtler. The misclassification framework was about errors in which category a person ends up in. These are errors in the recorded value itself. The wrong number written down for the right person.

**Kiffer:** Regression dilution bias first. Sometimes called regression attenuation bias. The bias arises when a single measurement of an exposure is used to represent a participant's long-term or usual level.

**Sarah:** Walk me through the logic carefully.

**Kiffer:** Imagine you measure someone's blood pressure on Monday. They had three coffees that morning, they're stressed about a meeting, they slept badly. Their reading is 145 over 95. On Wednesday it's 130 over 80. On Friday it's 138 over 88. Their true long-term average might be around 137 over 87, but any single reading bounces around that average. The bouncing around is called within-person variation.

**Sarah:** And if you only have one reading per person, the distribution of recorded values is wider than the distribution of true averages. Some people's recorded values are higher than their true average, some are lower, and the spread of recorded values is bigger than the spread of true averages.

**Kiffer:** Exactly. And that inflated variance dilutes the apparent exposure-outcome association. The extra noise pushes the regression line flatter than it should be.

**Sarah:** Stephen MacMahon and colleagues in 1990 demonstrated this in cardiovascular epidemiology. MacMahon is an Australian epidemiologist who has worked on blood pressure and cardiovascular disease for decades. He's at the George Institute for Global Health. They showed that studies using a single baseline blood pressure measurement substantially underestimated the association between usual blood pressure and stroke risk.

**Kiffer:** And then the Prospective Studies Collaboration, which is a large pooled analysis combining data from many cohort studies, showed that correcting for regression dilution approximately doubled the estimated effect. A usual systolic blood pressure that was 10 millimeters of mercury lower was associated with a 40 percent lower stroke risk after correction. Without correction, the apparent reduction was only about 20 percent.

**Sarah:** So the public-health message is much stronger than the uncorrected analyses suggested. A usual blood pressure that's 10 millimeters of mercury lower goes with about 40 percent lower stroke risk, not 20.

**Kiffer:** Right. And the correction factor here has a name. It's called the regression dilution ratio, or lambda. Lambda is the variance between persons divided by the total variance, which is variance between persons plus variance within persons. Lambda is always between zero and one. And if the within-person noise is purely random, the observed slope is, on average, the true slope multiplied by lambda. So the observed slope is the smaller of the two.
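A small simulation can make lambda concrete. This is a sketch under the classical error model, where a single reading equals the usual level plus independent random noise. The sample size, the true slope, and the standard deviations are arbitrary choices, picked so that lambda works out to 0.5.

```python
import random

random.seed(42)

N_PEOPLE = 20_000
TRUE_SLOPE = 2.0
SD_BETWEEN = 10.0   # spread of usual levels across people
SD_WITHIN = 10.0    # day-to-day noise around each person's usual level

# Regression dilution ratio implied by the two variances above
lam = SD_BETWEEN**2 / (SD_BETWEEN**2 + SD_WITHIN**2)

usual = [random.gauss(0, SD_BETWEEN) for _ in range(N_PEOPLE)]
single_reading = [u + random.gauss(0, SD_WITHIN) for u in usual]
outcome = [TRUE_SLOPE * u + random.gauss(0, 5.0) for u in usual]


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


slope_usual = slope(usual, outcome)
slope_single = slope(single_reading, outcome)

print(f"lambda                      = {lam:.2f}")
print(f"slope using usual level     = {slope_usual:.2f}")
print(f"slope using one reading     = {slope_single:.2f}")
print(f"one-reading slope / lambda  = {slope_single / lam:.2f}")
```

The slope from single readings should land close to lambda times the true slope, and dividing it by lambda should recover something close to the true slope, up to simulation noise.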

**Sarah:** And that means exposures with a lot of within-person variability have low lambdas and severe regression dilution. Walter Willett in 2013 showed that regression dilution ratios for single 24-hour dietary recalls range from 0.1 to 0.3 for many nutrients.

**Kiffer:** Willett is a nutritional epidemiologist at the Harvard T.H. Chan School of Public Health, and probably the world's most cited nutrition researcher. His point is brutal. A lambda of 0.1 to 0.3 means observed diet-disease associations from a single recall may represent only 10 to 30 percent of the true effect. The real effect of, say, sodium on blood pressure could be three to ten times larger than what the single-recall study estimated.

**Sarah:** Which partly explains why nutritional epidemiology often produces weaker and more inconsistent findings than the biology would predict. The measurement is so noisy that even real effects get attenuated to invisibility.

**Kiffer:** Correction methods. Four of them. First, repeat measurements. Get multiple measurements per individual and use the mean. The mean has less within-person noise than any single reading.
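Why averaging helps can be written down directly. A small sketch, assuming the within-person errors are independent from one reading to the next: averaging k readings divides the within-person variance by k, so lambda for the mean moves toward one. The function name and the variances of 100 are hypothetical.

```python
def lambda_for_mean(var_between, var_within, n_readings):
    """Regression dilution ratio when the exposure is the mean of n readings.

    Assumes independent within-person errors, so the error variance of the
    mean is var_within / n_readings.
    """
    if n_readings < 1:
        raise ValueError("n_readings must be at least 1")
    return var_between / (var_between + var_within / n_readings)


# Equal between- and within-person variance, as a worst-case-ish toy example.
for k in (1, 2, 4, 8):
    print(k, round(lambda_for_mean(100.0, 100.0, k), 3))
```

With these toy inputs the loop prints 0.5, 0.667, 0.8, and 0.889 for one, two, four, and eight readings. The gains shrink with each added reading, so a handful of repeats captures most of the benefit.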

**Sarah:** Second, calibration sub-studies. Measure a subsample twice, use the repeat correlation to estimate lambda, then divide your observed slope by lambda to get the corrected estimate.
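The calibration recipe can be sketched in a few lines. This is a toy simulation under assumptions that are not in the lesson: purely random, independent errors on the two readings (in which case the repeat correlation estimates lambda), a continuous outcome, and hypothetical sample sizes and parameters. The first 4000 simulated participants play the role of the calibration subsample.

```python
import random
import statistics


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(7)
n_main, n_calib, true_slope = 20000, 4000, 0.05

usual = [rng.gauss(130.0, 10.0) for _ in range(n_main)]
reading1 = [u + rng.gauss(0.0, 10.0) for u in usual]          # everyone
outcome = [true_slope * u + rng.gauss(0.0, 1.0) for u in usual]
# Only the calibration subsample gets a second, independent reading.
reading2 = [u + rng.gauss(0.0, 10.0) for u in usual[:n_calib]]

lam_hat = pearson_r(reading1[:n_calib], reading2)   # estimate of lambda
observed_slope = ols_slope(reading1, outcome)       # attenuated slope
corrected_slope = observed_slope / lam_hat          # divide by lambda

print(f"estimated lambda = {lam_hat:.3f}")
print(f"observed slope   = {observed_slope:.4f}")
print(f"corrected slope  = {corrected_slope:.4f}  (true value {true_slope})")
```

The corrected slope inherits the uncertainty in the lambda estimate, so a small calibration subsample widens the confidence interval even as it removes the attenuation.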

**Kiffer:** Third, simulation-based correction methods like simulation extrapolation, sometimes shortened to SIMEX. It's a computational method that adds known levels of extra noise to the data, observes how the slope changes, and extrapolates back to what the slope would be with zero noise.
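A bare-bones sketch of the simulation extrapolation idea follows. It is a simplified illustration, not a production implementation: it assumes the measurement-error standard deviation is known, uses only three noise levels, and extrapolates with the quadratic that passes through them. Established SIMEX software uses more noise levels and fitted extrapolation curves. The function name `simex_slope` and all data parameters are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simex_slope(measured, outcome, error_sd, n_reps=10, seed=0):
    """Quadratic SIMEX-style estimate of the slope at zero measurement error.

    zeta is the amount of EXTRA error variance, in units of error_sd ** 2.
    zeta = 0 is the data as observed; zeta = -1 would be error-free data.
    """
    rng = random.Random(seed)
    slopes = {0: ols_slope(measured, outcome)}
    for zeta in (1, 2):
        extra_sd = (zeta ** 0.5) * error_sd
        reps = []
        for _ in range(n_reps):
            noisier = [m + rng.gauss(0.0, extra_sd) for m in measured]
            reps.append(ols_slope(noisier, outcome))
        slopes[zeta] = statistics.fmean(reps)
    # Value at zeta = -1 of the quadratic through zeta = 0, 1, 2.
    return 3 * slopes[0] - 3 * slopes[1] + slopes[2]


# Toy data: lambda = 100 / (100 + 25) = 0.8, true slope 0.05.
rng = random.Random(11)
usual = [rng.gauss(130.0, 10.0) for _ in range(20000)]
measured = [u + rng.gauss(0.0, 5.0) for u in usual]
outcome = [0.05 * u + rng.gauss(0.0, 0.5) for u in usual]

naive_slope = ols_slope(measured, outcome)
simex_estimate = simex_slope(measured, outcome, error_sd=5.0)
print(f"naive slope    = {naive_slope:.4f}")
print(f"SIMEX estimate = {simex_estimate:.4f}  (true value 0.05)")
```

The quadratic extrapolation tends to under-correct, and the shortfall grows as the measurement error gets larger, so the estimate usually lands between the naive slope and the truth rather than exactly on the truth.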

**Sarah:** And fourth, instrumental variables, like Mendelian randomization, which uses genetic variants as proxies for the exposure. Genetic variants are fixed at conception and not affected by within-person measurement noise. So they give you a much cleaner estimate of the long-term exposure-outcome association.
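The instrumental-variable logic can be shown with a toy simulation. This sketch assumes an idealized instrument: a hypothetical genetic variant that affects the outcome only through the exposure and is independent of the measurement error. The variant's effect on the exposure is set unrealistically large so the example stays stable, and all numbers are invented. The ratio of covariances used here is the simple Wald-type instrumental-variable estimator.

```python
import random
import statistics


def covariance(x, y):
    """Sample covariance between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


rng = random.Random(5)
n, true_slope = 20000, 0.05

# Allele count (0, 1, or 2) with allele frequency 0.3.
variant = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
# Usual exposure: shifted by the variant, plus other between-person variation.
usual = [120.0 + 8.0 * g + rng.gauss(0.0, 10.0) for g in variant]
# Single noisy reading of the exposure.
measured = [u + rng.gauss(0.0, 10.0) for u in usual]
outcome = [true_slope * u + rng.gauss(0.0, 0.5) for u in usual]

# Ordinary regression on the noisy reading: attenuated by regression dilution.
naive_slope = covariance(measured, outcome) / covariance(measured, measured)
# Instrumental-variable ratio: the measurement noise is unrelated to the
# variant, so it drops out of both covariances on average.
iv_slope = covariance(variant, outcome) / covariance(variant, measured)

print(f"naive slope = {naive_slope:.4f}")
print(f"IV slope    = {iv_slope:.4f}  (true value {true_slope})")
```

The price is precision: real genetic instruments explain little of the exposure variance, so instrumental-variable estimates typically need very large samples.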

**Kiffer:** Then the second artifact in this section. Digit preference and heaping.

**Sarah:** Digit preference is when recorded values cluster at certain numbers, typically those ending in zero or five, due to rounding by observers or self-reporters. The original demonstration goes way back. George Whipple in 1919 developed an index for measuring age heaping in census data.

**Kiffer:** Whipple's index. A value of 100 means no heaping. A value of 500 means everyone's reported age ends in zero or five. In some populations and historical periods, the index gets really high, and you can literally see the saw-tooth pattern in the age pyramid. Reported ages of 30, 35, 40, 45, 50 have excess counts. Reported ages of 29, 31, 34, 36 are depleted. People are rounding their own ages to the nearest five.
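The index itself is a short calculation. A minimal sketch, assuming the commonly used 23 to 62 age window, in which exactly one fifth of single years of age end in zero or five: take the count at those ages, divide by one fifth of the total count in the window, and multiply by 100. The function name and the toy inputs are hypothetical.

```python
from collections import Counter


def whipple_index(ages, low=23, high=62):
    """Whipple's index: 100 means no heaping, 500 means total heaping.

    ages is an iterable of reported ages in whole years. Only ages in
    [low, high] are used; the default window is the conventional 23-62.
    """
    in_range = [a for a in ages if low <= a <= high]
    if not in_range:
        raise ValueError("no ages fall inside the window")
    counts = Counter(in_range)
    # Reported ages ending in 0 or 5 are the multiples of five.
    heaped = sum(n for age, n in counts.items() if age % 5 == 0)
    return 100.0 * heaped / (len(in_range) / 5.0)


# Ten people at every single year of age: no heaping.
even_ages = [age for age in range(23, 63) for _ in range(10)]
# Everyone reports an age ending in 0 or 5: total heaping.
heaped_ages = [age for age in range(25, 61, 5) for _ in range(10)]

print(whipple_index(even_ages))    # 100.0
print(whipple_index(heaped_ages))  # 500.0
```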

**Sarah:** And the modern, more clinical version is blood pressure heaping. Forty to 60 percent of manually recorded blood pressures end in zero, when you'd expect about 10 percent if every final digit were equally likely. Observers round mentally to the nearest ten. Or sometimes to the nearest five.

**Kiffer:** Even in an infinitely large sample, this would still be there. Sample size doesn't fix systematic measurement artifacts. You'd just have a very precise estimate of a heaped distribution.

**Sarah:** And the analytic implications get serious at clinical thresholds. Imagine a population where 52 percent of recorded birth weights end in two zeros, like 2500 grams or 3000 grams. The 2500 gram cutoff is the standard clinical threshold for low birth weight.

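**Kiffer:** And what happens at that threshold? Some 2480 gram infants get rounded up to 2500. Some 2520 gram infants get rounded down. Both are now recorded as exactly 2500, and which side of the line that falls on depends on whether the cutoff is applied as below 2500 or as 2500 and below. So the misclassification at the threshold can be bidirectional. Where a recorded 2500 counts as normal weight, some truly low-birth-weight infants are mislabeled as normal weight. Where it counts as low birth weight, some truly normal-weight infants are mislabeled as low birth weight. Prevalence estimates are biased. Threshold-based associations are biased.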
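A small simulation shows how heaping moves infants across the 2500 gram line. This is a sketch with invented inputs: birth weights drawn from a normal distribution with mean 3300 and standard deviation 500 grams, and 52 percent of records rounded to the nearest 100 grams (a figure borrowed loosely from the example above; the share of records ending in two zeros comes out slightly higher, because a few unrounded weights end in two zeros by chance). Because a recorded value of exactly 2500 can be classified either way, the sketch reports both conventions.

```python
import random


def simulate_heaping(n=50000, share_rounding=0.52, seed=3):
    """Compare low-birth-weight prevalence in true versus recorded weights."""
    rng = random.Random(seed)
    counts = {"true_below": 0, "recorded_below": 0,
              "true_at_or_below": 0, "recorded_at_or_below": 0}
    for _ in range(n):
        true_w = round(rng.gauss(3300.0, 500.0))       # true weight, grams
        if rng.random() < share_rounding:
            recorded = (true_w + 50) // 100 * 100      # nearest 100 g
        else:
            recorded = true_w                          # recorded exactly
        # Convention A: low birth weight means strictly below 2500 g.
        counts["true_below"] += true_w < 2500
        counts["recorded_below"] += recorded < 2500
        # Convention B: low birth weight means 2500 g or less.
        counts["true_at_or_below"] += true_w <= 2500
        counts["recorded_at_or_below"] += recorded <= 2500
    return {name: count / n for name, count in counts.items()}


prev = simulate_heaping()
for name, value in prev.items():
    print(f"{name:>22}: {value:.4f}")
```

Under the strictly-below convention, rounding pulls weights of 2450 to 2499 grams up to 2500 and recorded prevalence falls. Under the at-or-below convention, weights of 2501 to 2549 grams are pulled down to 2500 and recorded prevalence rises. Running the same analysis under both conventions is one concrete form of the threshold sensitivity analysis.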

**Sarah:** Solutions. Automated oscillometric devices that display exact values without observer interpretation. Standardized protocols that prompt for non-rounded recording. Sensitivity analyses that test multiple thresholds to see how much the conclusions depend on the cutoff. Statistical correction methods that smooth heaped distributions.

**Kiffer:** And the broader lesson here is that data quality is multidimensional. It's not just about whether the right person is in the right group. It's about whether the recorded value is a faithful representation of the underlying construct. Regression dilution and digit preference are about that fidelity.

**Sarah:** Okay. Let's pull this all together. I think there are several substantive takeaways from this lesson.

**Kiffer:** First. Information bias is the third bias category, alongside selection bias from Lesson 8 and confounding, which is coming in Lesson 11. Selection bias is about who's in the study. Confounding is about which third variables are mixed up with the exposure-outcome relationship. Information bias is about what gets recorded once everyone is in.

**Sarah:** Second. Misclassification is the central concept. Non-differential misclassification of a binary exposure typically biases toward the null. Differential misclassification can bias either direction, and recall bias and social desirability bias are the two big mechanisms. INTERPHONE for recall, Midanik and Kilian for social desirability.

**Kiffer:** Third. Observer bias is prevented by blinding, structurally. Detection bias is the screening problem, where intensive screening surfaces a reservoir of subclinical disease and creates an apparent epidemic without changing mortality. Surveillance bias is the cohort version, where the exposed group simply gets more medical attention and so accumulates more diagnoses regardless of biology.

**Sarah:** Fourth. Regression dilution is the artifact where a single noisy measurement attenuates the slope of the exposure-outcome association. Lambda is the correction factor. Repeat measurements, calibration sub-studies, simulation extrapolation, and Mendelian randomization are the main tools.

**Kiffer:** Fifth. Digit preference and heaping are real and large. Forty to 60 percent of manual blood pressures end in zero. Heaping at clinical thresholds creates bidirectional misclassification at exactly the boundary where it does the most damage. Automated devices are the cleanest fix.

**Sarah:** And sixth. Data quality is not neutral. Misclassification is patterned by structural inequality. Pulse oximetry on Black patients, garbage cause-of-death codes for older lower-income rural decedents, race and ethnicity recorded inconsistently, Indigenous identity systematically under-recorded, gender and sexual minorities erased, clinical trials over-representing White men. Each is a data-quality problem with equity stakes.

**Kiffer:** And the implication for appraising research. When you read a study and ask, is this measurement valid, also ask which populations were the instruments developed and validated in. Which categories are present and which are missing. Which differences are the analyses able, or unable, to detect. A null finding produced by a blunt instrument is not the same as evidence of no effect. It's evidence that this particular measurement system could not see one.

**Sarah:** I'd encourage anyone working through this lesson to play with the misclassification bias simulator in the interactive module. Watching the odds ratio drift toward the null under non-differential errors, and then watching it inflate past the truth under recall bias, is the fastest way to feel why the differential versus non-differential distinction matters.

**Kiffer:** Next up is Lesson 10. Design-Specific and Temporal Biases. That's where the three bias categories combine in characteristic ways across different study designs. Things like immortal time bias, healthy-worker effects, and the specific ways case-control versus cohort designs are each vulnerable.

**Sarah:** Take care, everyone.

**Kiffer:** See you there.