# Lesson 6 — Screening & Diagnostic Tests (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,416 words • ~29 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 6, Screening and Diagnostic Tests. This is the lesson where the language of probability you've been building gets applied at the level of a single test, given to a single person, on a single day.

**Sarah:** And I want to set up why that move matters. Last week, in Lesson 5, we were thinking about disease frequency in populations. Prevalence, incidence, risk, rate. All of those are population-level numbers. Lesson 6 narrows the lens. Now we're standing in front of one patient with one test result, and the question is, what should this person believe?

**Kiffer:** And the surprise of the lesson, at least for most students, is that those two questions are tightly coupled. The probability that this individual person has the disease, given a positive test, depends on how common the disease is in the population they came from. Population-level prevalence and individual-level interpretation are linked through the same arithmetic.

**Sarah:** And that linkage is what predictive values capture. The most clinically important idea in this lesson, in Section 3. But we have to build up to it carefully.

**Kiffer:** Four sections. Section 1 is the basic vocabulary of test attributes. Section 2 introduces sensitivity and specificity, the two properties of the test itself. Section 3 introduces predictive values, which is what the patient cares about. And Section 4 deals with continuous tests through receiver operating characteristic curves and likelihood ratios. The whole lesson hinges on one image, a two-by-two contingency table, with test result on one side and true disease state on the other. Four cells. Different combinations give you every measure in the lesson.

**Sarah:** Okay. Section 1. Test attributes. The first thing the lesson does is define what it means by a test, because the word is broader than students usually expect.

**Kiffer:** Yeah, when you hear the word test in a clinical setting, you probably picture a blood draw or a swab going off to the lab. The lesson uses the word more broadly. A test is any device or procedure designed to detect or quantify a sign, substance, tissue change, or body response in an individual. It can be applied at the household level too, like a water-quality test for a whole home.

**Sarah:** And in epidemiology, the term test extends to clinical signs, history-taking questions, items on a survey, even findings at autopsy. So when a clinician asks, do you have any chest pain, that question is a test. When a survey item asks how often you have felt down or hopeless in the last two weeks, that item is a test. Each one classifies a person as positive or negative for some underlying state.

**Kiffer:** Which means everything we're about to cover, sensitivity, specificity, predictive value, the whole apparatus, applies just as much to questionnaire items as it does to laboratory assays.

**Sarah:** Then the lesson distinguishes screening tests from diagnostic tests.

**Kiffer:** Screening tests are applied to apparently healthy populations to detect disease early, before someone has symptoms. Mammography for breast cancer. Colonoscopy for colorectal cancer. Newborn metabolic screening, where you take a heel prick from a baby in the first days of life. The point is to catch disease at a stage where intervention is more effective, in a population where most people don't have the disease.

**Sarah:** Diagnostic tests are applied to people who are already suspected of disease. Someone presents to the emergency department with chest pain. The doctor orders an electrocardiogram, a troponin level, maybe a stress test. The patient is already in the diagnostic workup.

**Kiffer:** And the key point is, despite the different uses, the principles of evaluation are the same for both. The arithmetic is identical. Same two-by-two table. Same sensitivity and specificity. What changes is the prevalence of disease in the tested population, and that changes the meaning of a positive result.

**Sarah:** There's also a quick note about analytic versus diagnostic sensitivity and specificity, and the lesson wants you to keep them separate.

**Kiffer:** The analytic sensitivity of an assay is the lowest concentration of a chemical compound it can detect. So if a polymerase chain reaction assay can detect down to fifty viral copies per millilitre, that fifty is the analytic sensitivity. Analytic specificity is the capacity of the test to react only to its target compound, not to similar compounds. Those are bench-chemistry properties. Diagnostic sensitivity and specificity, sometimes called epidemiologic, are population-level properties. That's what Section 2 is about.

**Sarah:** Now we get to a distinction that matters for almost every measurement you will ever make. Accuracy versus precision.

**Kiffer:** And the lesson illustrates this with the classic four-quadrant target diagram. Imagine a bullseye. The bullseye represents the true value. You shoot a series of shots. The pattern tells you about the test.

**Sarah:** Accuracy is closeness to the true value. To be accurate, a test does not need every result to land at the truth. But the average of repeated tests should be close to the true value. So accuracy is about the centroid of your shots.

**Kiffer:** Precision is consistency. A test that always gives the same result, regardless of whether it's correct, is precise. Precision is about how tightly the shots cluster, not where they cluster.

**Sarah:** Quadrant one. Accurate and precise. Tight cluster, dead-centred on the bullseye. This is what you want.

**Kiffer:** Quadrant two. Inaccurate but precise. Tight cluster, but off-centre. Like a sniper rifle that's been miscalibrated. Every shot lands in the same place, just not where you wanted. This is bad in a particularly insidious way, because the consistency can fool you into trusting it.

**Sarah:** Quadrant three. Accurate but imprecise. Centred on the bullseye on average, but the shots are scattered widely. Like a shotgun aimed correctly. Any single shot might be way off, but the average is right.

**Kiffer:** Quadrant four. Inaccurate and imprecise. Scattered shots, off-centre centroid. Useless.

**Sarah:** And the four cases tell you why we measure both accuracy and precision separately. They are independent properties. A test can be precisely wrong, accurately imprecise, both accurate and precise, or neither.

**Kiffer:** Then three related vocabulary terms. Repeatability, reproducibility, and agreement. They sound like synonyms in everyday English but they have specific meanings.

**Sarah:** Repeatability is variability obtained from repeated testing of the same sample within the same laboratory, by the same equipment. Reproducibility is variability when the same sample is tested in different laboratories. That's a tougher standard, because now you're including differences in calibration, technicians, and reagent lots.

**Kiffer:** And agreement refers to how well two different tests, or two different raters, agree when applied to the same sample. So if I have one test that uses enzyme immunoassay and another that uses polymerase chain reaction, and I run both on the same set of samples, how often do they agree?

**Sarah:** Now the lesson gets quantitative. How do we actually measure these things? It depends on whether the test result is quantitative, meaning a number on a continuous scale, or categorical, meaning positive versus negative or some ordered set of categories.

**Kiffer:** For quantitative tests, three measures. The coefficient of variation, the concordance correlation coefficient, and the Bland-Altman plot.

**Sarah:** Coefficient of variation, abbreviated CV, is the simplest. The coefficient of variation is the standard deviation of the test results on the same sample, divided by the mean. So a relative measure of variability, expressed as a fraction of the mean. Lower coefficient of variation means greater precision. The advantage over the raw standard deviation is that it normalises for scale.
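
A minimal Python sketch of that calculation. The replicate readings are made up for illustration (they are not data from the lesson), and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of repeated results."""
    return stdev(values) / mean(values)

# Hypothetical repeat measurements of the same sample on two assays
# that report on very different scales.
assay_high_scale = [98.0, 100.0, 102.0]   # SD 2, mean 100
assay_low_scale = [9.0, 10.0, 11.0]       # SD 1, mean 10

cv_high = coefficient_of_variation(assay_high_scale)   # 0.02
cv_low = coefficient_of_variation(assay_low_scale)     # 0.10
```

The second assay has the smaller raw standard deviation but the larger coefficient of variation, which is the scale-normalising point made above.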

**Kiffer:** Concordance correlation coefficient, abbreviated CCC. Compares two sets of test results, like results from test A and test B run on the same samples. It's a better measure of agreement than the more familiar Pearson correlation coefficient, because Pearson only measures linear association, not agreement.

**Sarah:** Two tests can have a Pearson correlation of one but disagree systematically. If test B always reads exactly twice what test A reads, the correlation is one but the tests don't agree. Concordance correlation coefficient catches this by combining three components. The location-shift, how far data are from the equality line. The scale-shift, the difference in slopes. And the Pearson correlation. A concordance correlation coefficient of one indicates perfect agreement, not just perfect correlation.
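
The "reads exactly twice" case can be checked numerically. The sketch below assumes Lin's formula for the concordance correlation coefficient: twice the covariance, divided by the sum of the two variances plus the squared difference in means. The five readings are invented for illustration.

```python
from statistics import fmean

def pearson_and_ccc(x, y):
    """Return (Pearson r, concordance correlation coefficient) for paired results."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Divide-by-n moments, as in Lin's definition of the coefficient.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    pearson = cov_xy / (var_x * var_y) ** 0.5
    ccc = 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
    return pearson, ccc

# Hypothetical paired readings: assay B always reads exactly twice assay A.
assay_a = [1.0, 2.0, 3.0, 4.0, 5.0]
assay_b = [2 * value for value in assay_a]

pearson, ccc = pearson_and_ccc(assay_a, assay_b)   # pearson is 1.0, ccc is 8/19
```

Pearson reports a perfect linear relationship, while the concordance coefficient of roughly 0.42 reflects how far the pairs sit from the line of equality.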

**Kiffer:** And then the Bland-Altman plot, also called the limits-of-agreement plot. You plot the difference between paired test results on the y-axis against the mean of the two paired results on the x-axis.

**Sarah:** What you see is informative. The horizontal line at the mean of the differences tells you whether there's systematic bias. The spread of points tells you how much disagreement there is. Limits of agreement are defined as the mean difference plus or minus 1.96 times the standard deviation of the differences, the range within which about 95 percent of differences are expected to fall if the differences are roughly normally distributed.

**Kiffer:** And if the points fan outward across the x-axis, the two tests agree at low values but diverge at high values. A common pattern, and you'd never see it from a single correlation coefficient.
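
The numbers behind a Bland-Altman plot are simple to compute. This is a sketch with hypothetical paired readings. It returns the plotting coordinates, the bias, and the limits of agreement, and leaves the actual drawing to whatever plotting tool you prefer.

```python
from statistics import mean, stdev

def bland_altman_summary(x, y):
    """Return per-pair means, per-pair differences, bias, and limits of agreement."""
    differences = [a - b for a, b in zip(x, y)]      # y-axis of the plot
    pair_means = [(a + b) / 2 for a, b in zip(x, y)]  # x-axis of the plot
    bias = mean(differences)                          # systematic difference
    spread = stdev(differences)                       # sample SD of differences
    limits = (bias - 1.96 * spread, bias + 1.96 * spread)
    return pair_means, differences, bias, limits

# Hypothetical paired results from two assays on the same five samples.
assay_a = [10.0, 12.0, 15.0, 20.0, 25.0]
assay_b = [9.0, 12.0, 13.0, 19.0, 22.0]

pair_means, differences, bias, limits = bland_altman_summary(assay_a, assay_b)
```

Here the bias is 1.4 units, meaning assay A reads higher on average, and the limits of agreement run from about -0.83 to about 3.63.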

**Sarah:** Then for categorical tests, the standard measure of agreement is Cohen's kappa, named after Jacob Cohen, the American statistician who introduced it in 1960. Kappa is a chance-corrected agreement statistic. The reason you need correction is that two raters can agree by random luck.

**Kiffer:** The formula in plain words. Kappa equals observed agreement minus expected agreement, divided by one minus expected agreement. The numerator measures how much better than chance the two tests agree. The denominator standardises that improvement. Kappa of zero means agreement no better than chance. Kappa of one means perfect agreement.
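
Here is that formula as a short Python sketch for two tests with positive or negative results. The function name and the counts are hypothetical. Expected agreement is computed from each test's own proportion of positives.

```python
def cohens_kappa(both_pos, a_only_pos, b_only_pos, both_neg):
    """Chance-corrected agreement between two binary tests on the same samples."""
    n = both_pos + a_only_pos + b_only_pos + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_only_pos) / n      # share test A calls positive
    b_pos = (both_pos + b_only_pos) / n      # share test B calls positive
    # Agreement expected by chance: both positive or both negative.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (observed - expected) / (1 - expected)

# Hypothetical counts: 100 samples, 85 of them classified the same way.
kappa = cohens_kappa(both_pos=40, a_only_pos=10, b_only_pos=5, both_neg=45)
```

With these counts, observed agreement is 0.85 and expected agreement is 0.50, so kappa is 0.70.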

**Sarah:** And the interpretation scale comes from a 1977 paper by Landis and Koch, two American biostatisticians.

**Kiffer:** Kappa less than or equal to zero is poor. Below zero means tests agree less than chance, which can happen if they pull in opposite directions. Kappa from 0.01 to 0.20 is slight. From 0.21 to 0.40 is fair. From 0.41 to 0.60 is moderate. From 0.61 to 0.80 is substantial. And from 0.81 to 1.00 is almost perfect.

**Sarah:** But there are two important caveats, because kappa can mislead you.

**Kiffer:** First caveat. Bias. If one test consistently produces more positives than the other, kappa is affected. The recommendation is to first run McNemar's chi-squared test, named after the American psychologist Quinn McNemar, who introduced it in 1947. McNemar's test asks whether the two tests classify the same proportion as positive. If it rejects, the kappa value may not reflect the kind of agreement you think.
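
McNemar's statistic uses only the discordant pairs, the samples where the two tests disagree. A sketch of the uncorrected chi-squared form with hypothetical counts follows. The value 3.84 is the usual 5 percent critical value for one degree of freedom.

```python
def mcnemar_chi2(a_only_pos, b_only_pos):
    """Uncorrected McNemar chi-squared statistic from the two discordant counts."""
    return (a_only_pos - b_only_pos) ** 2 / (a_only_pos + b_only_pos)

CRITICAL_VALUE_5_PERCENT = 3.84   # chi-squared, 1 degree of freedom

# Hypothetical discordant pairs: A positive / B negative, and the reverse.
statistic = mcnemar_chi2(a_only_pos=25, b_only_pos=10)
tests_differ = statistic > CRITICAL_VALUE_5_PERCENT
```

With 25 versus 10 discordant pairs the statistic is about 6.43, so the two tests appear to call different proportions positive and kappa should be read with caution.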

**Sarah:** Second caveat. Prevalence. Kappa depends on the underlying prevalence. Two tests will tend to have a higher kappa when prevalence is moderate, around 0.5, compared to very high or very low prevalence. The kappa paradox. Two tests can show 95 percent raw agreement and still produce a kappa near zero, when prevalence is so low that almost all the agreement is on the negative cases.
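
The paradox is easy to reproduce. In this sketch both tables have 95 percent raw agreement. The counts are invented, and the only thing that changes between them is how common a positive result is.

```python
def kappa_and_agreement(both_pos, a_only_pos, b_only_pos, both_neg):
    """Return (raw agreement, Cohen's kappa) for two binary tests."""
    n = both_pos + a_only_pos + b_only_pos + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_only_pos) / n
    b_pos = (both_pos + b_only_pos) / n
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return observed, (observed - expected) / (1 - expected)

# Hypothetical table 1: positives are common (about half the samples).
agree_common, kappa_common = kappa_and_agreement(475, 25, 25, 475)

# Hypothetical table 2: positives are rare and the tests never agree on one.
agree_rare, kappa_rare = kappa_and_agreement(0, 25, 25, 950)
```

Both tables agree on 950 of 1,000 samples, yet kappa is 0.90 in the first and slightly below zero in the second, because nearly all of the agreement in the rare-outcome table is what chance alone would produce.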

**Kiffer:** And finally, weighted kappa for ordinal data, where the categories have a natural order, like grades of severity. Weighted kappa accounts for partial agreement. Pairs of test results that are close, like scores of four and five, get more credit than pairs that are far apart, like one and five. Near-misses are not as bad as far-misses.
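
A sketch of weighted kappa on a three-category ordinal scale. The lesson does not say which weighting scheme it has in mind, so linear weights are an assumption here, and the table of counts is hypothetical.

```python
def kappa_from_table(table, weighted=True):
    """Kappa for a square table of counts; rows are rater A, columns rater B."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_share = [sum(row) / n for row in table]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i, j):
        if weighted:
            return 1 - abs(i - j) / (k - 1)   # linear: near-misses get partial credit
        return 1.0 if i == j else 0.0         # unweighted: exact matches only

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * table[i][j] / n for i, j in cells)
    expected = sum(weight(i, j) * row_share[i] * col_share[j] for i, j in cells)
    return (observed - expected) / (1 - expected)

# Hypothetical severity grades (mild, moderate, severe) from two raters.
grades = [
    [20, 5, 0],
    [5, 20, 5],
    [0, 5, 20],
]
kappa_weighted = kappa_from_table(grades, weighted=True)
kappa_unweighted = kappa_from_table(grades, weighted=False)
```

Every disagreement in this table is off by only one grade, so the weighted value (about 0.71) comes out higher than the unweighted one (about 0.62).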

**Sarah:** Okay, that closes Section 1. Now Section 2. Sensitivity and specificity, the two quantitative properties that capture most of what we care about for any diagnostic or screening test.

**Kiffer:** Section 2 starts with the gold standard. A gold standard is a test or procedure that is treated as absolutely accurate. By definition, it diagnoses every case and misdiagnoses none. The reference against which we evaluate other tests.

**Sarah:** And the lesson is honest that in reality, very few true gold standards exist. Take cancer diagnosis. The conventional gold standard is histopathology, where a pathologist looks at tissue under a microscope. But pathologists disagree. They miss things. They over-call things. The histopathologic diagnosis isn't actually error-free, even though we treat it that way for evaluating other tests.

**Kiffer:** And much of the error is biological variability. People do not immediately become diseased upon exposure. The timescale for crossing a detectable threshold varies from person to person. So a negative test two days after exposure to hepatitis C doesn't necessarily mean the test failed. The person may not yet have seroconverted.

**Sarah:** When no true gold standard exists, the lesson notes alternative approaches. Use results from several tests in combination. Repeated testing of selected samples. Or latent class models, statistical models that estimate sensitivity and specificity of multiple imperfect tests simultaneously, by treating the true disease state as an unobserved latent variable.

**Kiffer:** And then we get to the two-by-two table. This is the image you have to have in your head for the rest of the lesson.

**Sarah:** Rows are disease status. Disease positive on top. Disease negative on bottom. Columns are test result. Test positive on the left. Test negative on the right. Four cells, conventionally labelled a, b, c, d, reading left to right, top to bottom.

**Kiffer:** Cell a is disease positive and test positive. True positives. Cell b is disease positive and test negative. False negatives. Cell c is disease negative and test positive. False positives. Cell d is disease negative and test negative. True negatives.

**Sarah:** Sensitivity is a divided by the sum of a and b. The proportion of truly diseased individuals that the test correctly flags as positive. The probability of testing positive given that disease is present.

**Kiffer:** Specificity is d divided by the sum of c and d. The proportion of truly disease-free individuals that the test correctly clears. The probability of testing negative given that disease is absent.

**Sarah:** And the two complementary fractions. False negative fraction is one minus sensitivity. False positive fraction is one minus specificity.

**Kiffer:** The lesson then walks through a worked example using norovirus enzyme immunoassay data. Norovirus is a highly contagious gastrointestinal virus, the most common cause of acute gastroenteritis outbreaks. Cruise ships, long-term care facilities. The enzyme immunoassay, abbreviated EIA, is a common laboratory test that uses antibodies to detect viral antigens in stool. Polymerase chain reaction, abbreviated PCR, would be the more sensitive reference test.

**Sarah:** The study evaluated 188 stool samples against a gold standard. Of the 82 truly positive samples, 71 were caught by the enzyme immunoassay and 11 were missed. Of the 106 truly negative samples, 103 were correctly cleared and 3 were falsely flagged.

**Kiffer:** So sensitivity is 71 over 82, which equals 86.6 percent. Specificity is 103 over 106, which equals 97.2 percent. False negative fraction is 13.4 percent. So the test misses about one in seven truly infected people. False positive fraction is 2.8 percent. The test incorrectly flags about three in every hundred uninfected.
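
The same arithmetic as a short Python sketch, using the lesson's cell labels and the norovirus counts quoted above. The function name is just for illustration.

```python
def diagnostic_measures(a, b, c, d):
    """Cell layout from the lesson: a = true positives, b = false negatives,
    c = false positives, d = true negatives."""
    sensitivity = a / (a + b)
    specificity = d / (c + d)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_fraction": 1 - sensitivity,
        "false_positive_fraction": 1 - specificity,
    }

# Norovirus enzyme immunoassay counts from the worked example.
norovirus = diagnostic_measures(a=71, b=11, c=3, d=103)
as_percent = {name: round(value * 100, 1) for name, value in norovirus.items()}
```

Rounded to one decimal place, `as_percent` reproduces the 86.6, 97.2, 13.4, and 2.8 percent figures.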

**Sarah:** Then the lesson introduces two memory aids. SnNOut and SpPIn.

**Kiffer:** SnNOut. High Sensitivity, Negative test, rules disease Out. A sensitive test rarely misses true cases. So if you run a highly sensitive test and it comes back negative, the probability that disease is actually present is low.

**Sarah:** SpPIn. High Specificity, Positive test, rules disease In. A specific test rarely flags non-disease as positive. So if you run a highly specific test and it comes back positive, the probability that disease is actually present is high.

**Kiffer:** Section 2 closes with true prevalence versus apparent prevalence. True prevalence is the actual proportion of the population that has the disease. In the norovirus study, true prevalence is 82 over 188, which equals 43.6 percent. That's high because the study deliberately enrolled symptomatic patients.

**Sarah:** Apparent prevalence is the proportion that tests positive, true positives plus false positives. In the norovirus example, apparent prevalence is 74 over 188, which equals 39.4 percent.

**Kiffer:** And the relationship is straightforward arithmetic. Apparent prevalence equals true prevalence times sensitivity, plus the quantity one minus true prevalence, times the quantity one minus specificity. The first term is true positives. The second term is false positives. Add them, you get everyone who tests positive.

**Sarah:** The Rogan-Gladen formula goes the other direction. If you know sensitivity, specificity, and apparent prevalence, you can estimate the true prevalence. Named after Walter Rogan, an American epidemiologist at the National Institute of Environmental Health Sciences, and Beth Gladen, who introduced it together in a 1978 paper.

**Kiffer:** In plain words. True prevalence equals the quantity apparent prevalence plus specificity minus one, divided by the quantity sensitivity plus specificity minus one.

**Sarah:** Worked example from the lesson. Apparent prevalence 0.150. Sensitivity 0.363. Specificity 0.876. Numerator. 0.150 plus 0.876 minus 1, which equals 0.026. Denominator. 0.363 plus 0.876 minus 1, which equals 0.239. So true prevalence equals 0.026 divided by 0.239, which equals 0.109. About 11 percent.

**Kiffer:** And one important caveat. Some combinations of sensitivity, specificity, and apparent prevalence will produce estimates outside the range from zero to one. That's a sign that the sensitivity and specificity estimates probably don't apply to the population you're studying. The Rogan-Gladen output going negative or above 100 percent is the formula's way of telling you something is wrong with your assumptions.
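
Both directions of that relationship fit in a few lines. The sketch below reruns the lesson's worked numbers. The final call uses a hypothetical apparent prevalence of 0.10 to show what an out-of-range estimate looks like.

```python
def apparent_prevalence(true_prev, sensitivity, specificity):
    """Share testing positive: true positives plus false positives."""
    return true_prev * sensitivity + (1 - true_prev) * (1 - specificity)

def rogan_gladen(apparent_prev, sensitivity, specificity):
    """Estimate true prevalence from apparent prevalence."""
    return (apparent_prev + specificity - 1) / (sensitivity + specificity - 1)

# Worked example from the lesson.
estimate = rogan_gladen(0.150, sensitivity=0.363, specificity=0.876)   # about 0.109

# Feeding the estimate back through the forward formula recovers 0.150.
round_trip = apparent_prevalence(estimate, sensitivity=0.363, specificity=0.876)

# Hypothetical input: an apparent prevalence below the false positive
# fraction (1 - 0.876 = 0.124) pushes the estimate below zero.
impossible = rogan_gladen(0.10, sensitivity=0.363, specificity=0.876)
```

A negative `impossible` is the warning sign described above. Fewer people tested positive than the assumed specificity alone would predict, so the assumed test attributes probably do not fit this population.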

**Sarah:** Okay. Section 3. Predictive values. The most clinically important section in the lesson.

**Kiffer:** And the conceptual move at the heart of Section 3 is the move from properties of the test to properties of the test result for a particular person.

**Sarah:** Sensitivity and specificity are properties of the test itself. They tell you how well the test classifies people who have or don't have the disease, given that you already know their true status. Both condition on the disease state.

**Kiffer:** But in the real world, you don't know the disease state. That's why you ran the test. So what you actually want to know is the reverse. Given the test result, what's the probability of disease? That's a predictive value.

**Sarah:** Predictive value positive, abbreviated PV plus. The probability of disease given a positive test result. Predictive value positive equals a divided by the column total of test positives, which is a plus c. Among everyone who tests positive, what fraction actually has the disease?

**Kiffer:** Predictive value negative, abbreviated PV minus. The probability of no disease given a negative test result. Predictive value negative equals d divided by b plus d. Among everyone who tests negative, what fraction is actually disease-free?

**Sarah:** And the calculation can be written using Bayes-style formulas that incorporate prevalence directly. Predictive value positive equals prevalence times sensitivity, divided by the quantity prevalence times sensitivity, plus the quantity one minus prevalence times one minus specificity.

**Kiffer:** And the punchline of Section 3 is that those formulas explicitly contain prevalence. So predictive values depend on the prevalence of disease in the population being tested. Same test, different population, different predictive values.

**Sarah:** And the lesson dramatises this with the most important table in the lesson. Using the norovirus enzyme immunoassay sensitivity of 86.6 percent and specificity of 97.2 percent, watch what happens to predictive value positive as prevalence drops.

**Kiffer:** At 50 percent prevalence, predictive value positive is 96.9 percent. So a positive test almost certainly means the person has the disease. Because half the population has it, false positives are a small share of all positives.

**Sarah:** At 5 percent prevalence, predictive value positive drops to 61.9 percent. Still useful, but now about four out of ten positive results are false positives. Same test. Same sensitivity. Same specificity. Just applied to a population where the disease is less common.

**Kiffer:** At 0.1 percent prevalence, predictive value positive falls to 3 percent. So when one in a thousand people has the disease, ninety-seven percent of all positive test results are false positives. The test that looked excellent at 50 percent prevalence is essentially useless at 0.1 percent prevalence, in the sense that a positive result tells you almost nothing.

**Sarah:** And meanwhile, predictive value negative goes the other way. As prevalence drops, predictive value negative climbs toward 100 percent. Because most people don't have the disease, a negative test result is very likely to be correct.
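
The prevalence table can be regenerated with a short sketch. It uses the rounded sensitivity and specificity quoted above (0.866 and 0.972). The function name is illustrative.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (predictive value positive, predictive value negative)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    pv_pos = true_pos / (true_pos + false_pos)
    pv_neg = true_neg / (true_neg + false_neg)
    return pv_pos, pv_neg

SENSITIVITY, SPECIFICITY = 0.866, 0.972   # norovirus enzyme immunoassay

results = {
    prevalence: predictive_values(prevalence, SENSITIVITY, SPECIFICITY)
    for prevalence in (0.50, 0.05, 0.001)
}
for prevalence, (pv_pos, pv_neg) in results.items():
    print(f"prevalence {prevalence:.1%}: PV+ {pv_pos:.1%}, PV- {pv_neg:.1%}")
```

Predictive value positive comes out at 96.9, 61.9, and 3.0 percent for the three prevalences, while predictive value negative rises as prevalence falls.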

**Kiffer:** So in low-prevalence settings, a negative test is reassuring but a positive test is mostly noise. That's the fundamental challenge of screening rare conditions in unselected populations.

**Sarah:** And the lesson drives this home with a scenario. Universal HIV screening. HIV is human immunodeficiency virus, the virus that causes AIDS when untreated. Imagine a country considering universal screening. The rapid test under consideration has sensitivity 99.5 percent and specificity 99.8 percent. Both excellent. National HIV prevalence is 0.3 percent.

**Kiffer:** Plug into the predictive value formula. Numerator. 0.003 times 0.995, which equals 0.002985. Denominator. 0.003 times 0.995 plus 0.997 times 0.002, which equals 0.004979. Predictive value positive equals 0.002985 divided by 0.004979, which equals 0.5995, or about 60 percent.

**Sarah:** Read what that means. Even with a test that's 99.5 percent sensitive and 99.8 percent specific, in a population where HIV prevalence is 0.3 percent, only 60 percent of positive test results actually correspond to people who have HIV. The other 40 percent are false positives.
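
It can help to see the same result as head counts rather than probabilities. The sketch below assumes a hypothetical screened population of one million. The prevalence, sensitivity, and specificity are the ones from the scenario.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Expected true positives, false positives, and predictive value positive."""
    infected = population * prevalence
    uninfected = population - infected
    true_pos = infected * sensitivity
    false_pos = uninfected * (1 - specificity)
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

# Hypothetical population size; test attributes from the HIV scenario.
true_pos, false_pos, pv_pos = screening_counts(
    population=1_000_000, prevalence=0.003, sensitivity=0.995, specificity=0.998
)
```

Out of a million people screened, about 2,985 infected people test positive alongside about 1,994 uninfected people, which is where the roughly 60 percent comes from.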

**Kiffer:** Which is exactly why confirmatory testing is essential for HIV screening, and for any low-prevalence condition. You don't act on a single rapid test result. You retest with a different, more specific assay.

**Sarah:** And the broader implication. Predictive values are not good measures of a test's intrinsic performance, because they vary from population to population. When you're evaluating a test, look at sensitivity and specificity. When you're advising a patient, look at predictive values, computed for the population that will actually be tested.

**Kiffer:** The lesson then offers three strategies to increase predictive value positive.

**Sarah:** Strategy one. Target high-risk groups. Instead of universal screening, screen people who already have a higher pre-test probability. People with symptoms, known exposures, relevant family history. Higher prevalence in the screened population pushes predictive value positive up directly.

**Kiffer:** Strategy two. Increase specificity. The false-positive fraction is one minus specificity. Lowering the false-positive rate, even by a little, dramatically reduces false positives in absolute terms when you're testing many people.

**Sarah:** Strategy three. Use multiple tests in series. The standard approach for low-prevalence screening. First, run a sensitive test. People who screen negative are released. People who screen positive go to a second, more specific confirmatory test. Only those positive on both are treated as positive. The combined specificity in series is much higher than either alone.

**Kiffer:** And the trade-off is that overall sensitivity drops a bit. Anyone missed by either test is missed overall. But for screening of low-prevalence conditions, that trade is usually worth it. Better to miss a few cases than to flood the population with false positives.
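
A rough sketch of the series arithmetic. Two things here are assumptions rather than lesson content: the two tests are treated as conditionally independent given true disease status, and the confirmatory test's sensitivity and specificity are made-up values.

```python
def series_attributes(se_1, sp_1, se_2, sp_2):
    """Combined sensitivity and specificity when a result counts as positive
    only if BOTH tests are positive (assumes conditional independence)."""
    combined_se = se_1 * se_2                     # must be caught by both
    combined_sp = 1 - (1 - sp_1) * (1 - sp_2)     # false alarm needs both to err
    return combined_se, combined_sp

def predictive_value_positive(prevalence, sensitivity, specificity):
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Screening test from the HIV scenario, plus a hypothetical confirmatory test.
se_series, sp_series = series_attributes(0.995, 0.998, 0.990, 0.999)

pv_pos_single = predictive_value_positive(0.003, 0.995, 0.998)
pv_pos_series = predictive_value_positive(0.003, se_series, sp_series)
```

Under these assumptions sensitivity slips from 0.995 to about 0.985, while predictive value positive at 0.3 percent prevalence climbs from about 60 percent to above 99 percent. Real tests are often correlated, so the actual gain is usually smaller.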

**Sarah:** Okay. Section 4. Cutpoints, receiver operating characteristic curves, and likelihood ratios.

**Kiffer:** Sections one through three treated tests as if they were strictly binary. Positive or negative. In practice, most tests produce a continuous result. Blood urea nitrogen levels. Optical density values from an enzyme assay. Antibody titres. The threshold at which we call a result positive is itself a design choice.

**Sarah:** The lesson uses the term cutpoint, sometimes called cut-off or threshold. The cutpoint is the value above which a test is called positive and below which it's called negative.

**Kiffer:** And the overlap problem is fundamental. The distribution of test values for healthy individuals overlaps with the distribution for diseased individuals. There's almost no test where every healthy person scores below every diseased person. Whatever cutpoint you choose, you will produce both false positives and false negatives.

**Sarah:** Raising the cutpoint, demanding a higher value before calling someone positive, increases specificity, because fewer healthy people get flagged. But it decreases sensitivity, because more diseased people get cleared. Lowering the cutpoint does the opposite.

**Kiffer:** Which is exactly the situation a receiver operating characteristic curve, abbreviated ROC, is designed to make visible. Origin note. The name comes from World War II radar engineering. Operators were choosing thresholds for detecting enemy aircraft on radar screens, balancing the cost of missing a real threat against the cost of false alarms. The methodology migrated into medicine in the 1970s.

**Sarah:** A receiver operating characteristic curve plots sensitivity on the y-axis against the false positive fraction, which is one minus specificity, on the x-axis. You compute one point on the curve at every possible cutpoint, and the resulting curve traces out the trade-off.

**Kiffer:** The 45-degree diagonal from bottom-left to top-right represents a test with no discriminating ability. Any cutpoint produces sensitivity equal to false positive fraction, what you'd get from flipping a coin. The closer your curve is to that diagonal, the worse your test.

**Sarah:** The top-left corner represents a perfect test. Sensitivity 100 percent. False positive fraction zero, meaning specificity 100 percent. The closer the curve gets to the top-left corner, the better the test discriminates.

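**Kiffer:** And the optimal cutpoint, assuming equal costs of false positives and false negatives, is the cutpoint where sensitivity plus specificity is at a maximum. That sum minus one is called the Youden index, and geometrically it's the vertical distance from the chance diagonal up to the curve, so the best cutpoint is the point on the curve farthest above the diagonal. That's usually near the point closest to the top-left corner, though the two criteria don't always pick exactly the same cutpoint.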
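
A sketch of sweeping the cutpoint and picking the one that maximises sensitivity plus specificity, which is the same as maximising Youden's J (sensitivity plus specificity minus one). The scores are hypothetical, and a result at or above the cutpoint is assumed to count as positive.

```python
def roc_points(diseased_scores, healthy_scores):
    """One (cutpoint, sensitivity, specificity) triple per candidate cutpoint."""
    points = []
    for cutpoint in sorted(set(diseased_scores) | set(healthy_scores)):
        sensitivity = sum(s >= cutpoint for s in diseased_scores) / len(diseased_scores)
        specificity = sum(s < cutpoint for s in healthy_scores) / len(healthy_scores)
        points.append((cutpoint, sensitivity, specificity))
    return points

def youden_j(point):
    _, sensitivity, specificity = point
    return sensitivity + specificity - 1

# Hypothetical test values for five diseased and five healthy people.
diseased_scores = [5, 6, 7, 8, 9]
healthy_scores = [1, 2, 3, 4, 6]

points = roc_points(diseased_scores, healthy_scores)
best_cutpoint, best_sensitivity, best_specificity = max(points, key=youden_j)
```

For these scores the best cutpoint is 5, with sensitivity 1.0 and specificity 0.8. Plotting sensitivity against one minus specificity for every entry in `points` would give the receiver operating characteristic curve itself.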

**Sarah:** If costs are not equal, you pick a different point. In cancer screening, false negatives are catastrophic, so you want a cutpoint with very high sensitivity, even if specificity suffers. The receiver operating characteristic curve gives you the menu of options.

**Kiffer:** Then the lesson summarises the curve with a single number. The Area Under the Curve, abbreviated AUC.

**Sarah:** Area under the curve has a clean interpretation. It equals the probability that a randomly chosen diseased person scores higher on the test than a randomly chosen non-diseased person.

**Kiffer:** Which means area under the curve of 0.5 is no discrimination. Diseased and non-diseased are equally likely to score higher than each other. That's chance. That's the diagonal. Area under the curve of 1.0 is perfect discrimination. Every diseased person scores higher than every non-diseased person.
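
That probabilistic reading gives a direct way to compute the area without drawing anything: compare every diseased person with every non-diseased person. In this sketch the scores are hypothetical, and ties are counted as half a win, which is a common convention rather than something the lesson specifies.

```python
def auc_by_pairs(diseased_scores, healthy_scores):
    """Share of (diseased, healthy) pairs in which the diseased score is higher."""
    wins = 0.0
    for diseased in diseased_scores:
        for healthy in healthy_scores:
            if diseased > healthy:
                wins += 1.0
            elif diseased == healthy:
                wins += 0.5   # tie: counted as half (assumed convention)
    return wins / (len(diseased_scores) * len(healthy_scores))

# Hypothetical test values.
auc = auc_by_pairs([5, 6, 7, 8, 9], [1, 2, 3, 4, 6])        # 23.5 of 25 pairs
auc_perfect = auc_by_pairs([3, 4], [1, 2])                  # complete separation
auc_chance = auc_by_pairs([1, 2], [1, 2])                   # identical distributions
```

The three calls return 0.94, 1.0, and 0.5, matching the interpretation of overlapping, perfectly separated, and indistinguishable groups.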

**Sarah:** And the rough scale. Area under the curve of exactly 0.5 is no discrimination, chance alone. From 0.5 to 0.7 is poor. From 0.7 to 0.8 is acceptable. From 0.8 to 0.9 is excellent. Above 0.9 is outstanding.

**Kiffer:** Then the lesson moves to likelihood ratios. The cleanest tool for moving from a test result to an updated probability of disease for a specific patient.

**Sarah:** A likelihood ratio is the ratio of two probabilities. The probability of obtaining a particular test result among diseased individuals, divided by the probability of obtaining that same result among non-diseased individuals.

**Kiffer:** Likelihood Ratio Positive, abbreviated LR plus. The likelihood ratio for a positive test result. Likelihood ratio positive equals sensitivity divided by the quantity one minus specificity. The numerator is the probability of a positive test among the diseased. The denominator is the probability of a positive test among the non-diseased, which is the false positive fraction.

**Sarah:** So a likelihood ratio positive of one means a positive test is equally likely among diseased and non-diseased. The result is uninformative. A likelihood ratio positive of ten means a positive test is ten times more likely among diseased people. Strong evidence for disease. A likelihood ratio positive of one hundred is very strong evidence.

**Kiffer:** Likelihood Ratio Negative, abbreviated LR minus. Likelihood ratio negative equals the quantity one minus sensitivity divided by specificity. Lower values mean a negative test is more informative for ruling out disease. A likelihood ratio negative close to zero is ideal.
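
Applying both definitions to the norovirus enzyme immunoassay from Section 2, using its rounded sensitivity of 0.866 and specificity of 0.972. This is a sketch, and the function name is illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (likelihood ratio positive, likelihood ratio negative)."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

lr_positive, lr_negative = likelihood_ratios(sensitivity=0.866, specificity=0.972)
```

That gives a likelihood ratio positive of about 31 and a likelihood ratio negative of about 0.14. A positive result is roughly thirty times more likely in an infected person than in an uninfected one.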

**Sarah:** The lesson also notes category-specific likelihood ratios. Instead of dichotomising into positive and negative, you stratify the test into multiple result categories and compute a likelihood ratio for each. Particularly useful for diagnostic settings. A troponin level slightly elevated is different from one massively elevated, and the likelihood ratios for those categories are different.

**Kiffer:** Now the magic. Likelihood ratios let you update probability of disease without recomputing the predictive value formula every time. The trick is to work in odds rather than probabilities.

**Sarah:** The three-step process. Step one. Convert pre-test probability to pre-test odds. Step two. Multiply pre-test odds by the likelihood ratio to get post-test odds. Step three. Convert post-test odds back to post-test probability.

**Kiffer:** Step one. Pre-test probability is your best estimate of probability of disease before the test. To convert probability to odds, you divide probability by one minus probability. So a probability of 50 percent corresponds to odds of one. A probability of 25 percent corresponds to odds of one over three.

**Sarah:** Step two. Multiply by the likelihood ratio. Post-test odds equals pre-test odds times the relevant likelihood ratio. If positive, multiply by likelihood ratio positive. If negative, multiply by likelihood ratio negative. If category-specific, multiply by the likelihood ratio for the actual result category.

**Kiffer:** Step three. Convert back from odds to probability. Probability equals odds divided by one plus odds. Odds of one give a probability of 50 percent. Odds of three give 75 percent. Odds of nine give 90 percent.

**Sarah:** Worked example from the lesson. Pre-test probability is 2 percent. Test result corresponds to a category-specific likelihood ratio of 25.95.

**Kiffer:** Step one. 0.02 divided by 0.98 equals 0.0204. Pre-test odds of 0.0204.

**Sarah:** Step two. 0.0204 times 25.95 equals 0.5294. Post-test odds of 0.5294.

**Kiffer:** Step three. 0.5294 divided by 1.5294 equals 0.346. About 35 percent post-test probability.

**Sarah:** So the test result moved this patient from 2 percent pre-test probability of disease to 35 percent post-test probability. A meaningful update from a single test result with a strong likelihood ratio.
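
The three-step update can be sketched in a few lines of Python. The helper names are hypothetical. The inputs are the worked example's own numbers, a pre-test probability of 2 percent and a category-specific likelihood ratio of 25.95.

```python
def prob_to_odds(probability: float) -> float:
    """Step one: convert a probability to odds."""
    return probability / (1 - probability)


def odds_to_prob(odds: float) -> float:
    """Step three: convert odds back to a probability."""
    return odds / (1 + odds)


def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Run all three steps: probability -> odds, multiply by the LR, odds -> probability."""
    pre_test_odds = prob_to_odds(pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio  # step two
    return odds_to_prob(post_test_odds)


p_post = post_test_probability(0.02, 25.95)
print(round(p_post, 3))  # 0.346
```

The same function handles a negative result or any category-specific result. Only the likelihood ratio passed in changes.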

**Kiffer:** Note what's happening conceptually. The test didn't tell you whether the patient has the disease. It updated your belief. It moved you from one probability to another. That's the right way to think about almost every test result in medicine.

**Sarah:** And likelihood ratios let you do that update without going back to the population two-by-two table. They're portable. The same likelihood ratio applies regardless of pre-test probability. The likelihood ratio is a property of the test, but the post-test probability is personal to the patient.

**Kiffer:** Okay. Let's pull this together into the big takeaways.

**Sarah:** First takeaway. A test in epidemiology is broader than students usually think. Lab assays, imaging, clinical signs, history-taking questions, survey items, even autopsy findings. The arithmetic of sensitivity, specificity, and predictive value applies to all of them equally.

**Kiffer:** Second. Accuracy is closeness to the true value. Precision is consistency. They're independent. A test can be precise but wrong, accurate on average but imprecise, both accurate and precise, or neither. Cohen's kappa is the standard chance-corrected agreement measure for categorical tests.

**Sarah:** Third. Sensitivity is the probability of testing positive given disease. Specificity is the probability of testing negative given no disease. These are properties of the test, not of the population. SnNOut and SpPIn are the clinical mnemonics. High sensitivity rules disease out on a negative test. High specificity rules disease in on a positive test.

**Kiffer:** Fourth. Predictive values depend on prevalence. In a low-prevalence population, most of the positive results from the same excellent test are false positives. The norovirus example showed predictive value positive falling from 96.9 percent at 50 percent prevalence to 3 percent at 0.1 percent prevalence. The HIV scenario showed predictive value positive at 60 percent even with sensitivity over 99 percent and specificity over 99 percent, simply because population prevalence was 0.3 percent.

**Sarah:** And the strategies to increase predictive value positive. Target high-risk groups. Increase specificity. Use multiple tests in series. The standard playbook for screening programmes that don't drown the population in false alarms.

**Kiffer:** Fifth. The receiver operating characteristic curve makes the cutpoint trade-off visible. Plot sensitivity against one minus specificity at every cutpoint. The diagonal is no discrimination. The top-left corner is perfect. Area under the curve summarises overall discriminatory ability. 0.5 is chance. Above 0.9 is outstanding.
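
To make the construction concrete, here is a minimal sketch that builds an empirical curve by treating every observed value as a cutpoint and then computes the area with the trapezoid rule. The four scores and labels are made-up toy data, the function names are hypothetical, and the sketch assumes higher scores point toward disease.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per cutpoint.

    labels: 1 = diseased, 0 = healthy. A result counts as positive
    when score >= cutpoint. Cutpoints run from strictest to most lenient,
    so the points move from (0, 0) toward (1, 1).
    """
    n_diseased = sum(labels)
    n_healthy = len(labels) - n_diseased
    points = [(0.0, 0.0)]  # a cutpoint above every score calls nobody positive
    for cut in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((false_pos / n_healthy, true_pos / n_diseased))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(area_under_curve(roc_points(scores, labels)))  # 0.75
```

In this toy data one healthy score (0.4) sits above one diseased score (0.35), which is why the area falls short of one.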

**Sarah:** Sixth. Likelihood ratios combine sensitivity and specificity into a single number that updates probability of disease. Likelihood ratio positive equals sensitivity divided by the quantity one minus specificity. Likelihood ratio negative equals the quantity one minus sensitivity, divided by specificity. The three-step process moves from pre-test probability to post-test probability through the odds form.

**Kiffer:** And one practical note. The lesson has two embedded simulators worth playing with. The first is the Sensitivity, Specificity, Predictive Value, and Cutoff simulator in Section 3. Drop prevalence to 0.1 percent while keeping sensitivity and specificity at 95 percent, and watch predictive value positive plummet to about 2 percent. That's the rare-disease screening problem in one screen.
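
For readers without the simulator to hand, the same experiment can be approximated with a short Python sketch. This is not the simulator's code. The function name is hypothetical, and the intermediate prevalence values in the loop are chosen only for illustration.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (predictive value positive, predictive value negative).

    Works with the expected proportion of the population in each cell
    of the two-by-two table.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity held at 95% while prevalence drops.
for prevalence in (0.5, 0.1, 0.01, 0.001):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.1%}, NPV {npv:.2%}")
```

At 0.1 percent prevalence the predictive value positive comes out a little under 2 percent, while the predictive value negative stays above 99.9 percent.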

**Sarah:** And the receiver operating characteristic curve builder in Section 4. Set separation between healthy and diseased distributions to zero, and the curve collapses onto the diagonal. Set it to six standard deviations apart, and area under the curve approaches one.
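
The two extremes can be checked with a small sketch. It assumes the healthy and diseased test values are both normally distributed with the same standard deviation, which may or may not be how the lesson's builder is implemented. Under that assumption the area under the curve has a closed form, the standard normal cumulative distribution evaluated at the separation divided by the square root of two. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def binormal_auc(separation_sd: float) -> float:
    """Area under the ROC curve for two equal-variance normal distributions.

    separation_sd: distance between the diseased and healthy means,
    measured in units of their common standard deviation.
    """
    return NormalDist().cdf(separation_sd / sqrt(2))


for separation in (0, 1, 2, 6):
    print(f"separation {separation} SD: AUC = {binormal_auc(separation):.5f}")
```

A separation of zero gives exactly 0.5, the diagonal, and six standard deviations gives a value within about one part in 100,000 of one.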

**Kiffer:** Big-picture takeaway. The single most important idea in this lesson is that sensitivity and specificity are properties of the test, but predictive values are properties of the test result for an individual person, and they depend on prevalence. The same test that's brilliant in a high-prevalence clinical setting can be useless for population screening of a rare condition. That's not a flaw in the test. That's a feature of probability.

**Sarah:** Next up is Lesson 7, Measures of Association. That's where we put together everything we've built about disease frequency in Lesson 5 and the two-by-two contingency logic from Lesson 6 to compare groups and quantify how much an exposure changes the probability of an outcome.

**Kiffer:** Take care, everyone.

**Sarah:** See you there.
