Screening & Diagnostic Tests

Fundamental Epidemiological Concepts and Approaches

Learning objectives for this lesson:

Define accuracy and precision as they relate to test characteristics
Interpret measures of precision for quantitative tests and calculate kappa for categorical tests
Define sensitivity and specificity, and calculate their estimates and confidence intervals
Define predictive values and explain the factors that influence them
Choose appropriate cutpoints using ROC curves and likelihood ratios
Use multiple tests and interpret results in series or parallel

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Test Performance Concepts

Screening Test A test applied to asymptomatic individuals to identify those at higher risk of disease so they can undergo further diagnostic evaluation. Goal is early detection in apparently healthy people.

Diagnostic Test A test used in symptomatic individuals or those with positive screens to confirm or rule out disease. Generally more invasive, expensive, and accurate than screening tests.

Gold (Reference) Standard The best available test or set of criteria used to define true disease status when evaluating a new test. The benchmark against which sensitivity and specificity are measured.

Accuracy How close a measurement is to the true value. In test evaluation, the proportion of all results (positive and negative) that are correct.

Precision The reproducibility or repeatability of measurements: how close repeated measurements are to one another, regardless of accuracy.

Sensitivity (True Positive Rate) The probability that a test correctly identifies a person with the disease: P(test+ | disease+). High sensitivity is needed to rule out disease (SnNout).

Specificity (True Negative Rate) The probability that a test correctly identifies a person without the disease: P(test− | disease−). High specificity is needed to rule in disease (SpPin).

Positive Predictive Value (PPV) Among those who test positive, the proportion who actually have the disease: P(disease+ | test+). Strongly dependent on the prevalence in the tested population.

Negative Predictive Value (NPV) Among those who test negative, the proportion who truly do not have the disease: P(disease− | test−).

Prevalence Threshold The point along the prevalence axis below which the predictive value of a test deteriorates rapidly. A reminder that a test’s usefulness depends on the population it is applied to.

False Positive A test result that incorrectly indicates disease in someone who is disease-free. Tied to specificity (1 − specificity = false positive rate).

False Negative A test result that incorrectly indicates absence of disease in someone who actually has it. Tied to sensitivity (1 − sensitivity = false negative rate).

Cutpoint (Threshold) The numeric value of a continuous test that separates “positive” from “negative.” Lowering the cutpoint typically raises sensitivity at the expense of specificity.

Methods & Measures

ROC Curve Receiver Operating Characteristic curve: a plot of sensitivity (true positive rate) vs. 1 − specificity (false positive rate) across all possible cutpoints, used to compare tests and choose thresholds.

Area Under the Curve (AUC) A summary of overall test discrimination from the ROC curve, ranging from 0.5 (no better than chance) to 1.0 (perfect). Equivalent to the c-statistic.

Positive Likelihood Ratio (LR+) Sensitivity / (1 − specificity). Indicates how much a positive test increases the odds of disease. Values > 10 strongly rule in disease (Deeks & Altman, 2004).

Negative Likelihood Ratio (LR−) (1 − sensitivity) / specificity. Indicates how much a negative test decreases the odds of disease. Values < 0.1 strongly rule out disease.

Cohen’s Kappa A chance-corrected measure of agreement between two raters or tests on categorical outcomes. Ranges from −1 to 1: 0 indicates agreement no better than chance, 1 indicates perfect agreement, and negative values indicate agreement worse than chance.

Series Testing Sequential testing in which a positive on one test triggers a second test; disease is declared only if both are positive. Increases overall specificity, decreases overall sensitivity.

Parallel Testing Two tests done simultaneously; disease is declared if either is positive. Increases overall sensitivity, decreases overall specificity.

Screening Programme Concepts

Lead-Time Bias Apparent improvement in survival that arises only because screening detects disease earlier in its course, even when actual time of death is unchanged.

Length Bias Tendency for screening to preferentially detect slow-progressing (longer pre-clinical phase) cases, making screened cases appear to have better outcomes than non-screened cases.

Overdiagnosis Detection of disease that would never have caused symptoms or harm in the patient’s lifetime. Inflates apparent screening benefit and exposes patients to unnecessary treatment (Welch & Black, 2010; Brodersen et al., 2018).

Wilson & Jungner Criteria Ten classic criteria (Wilson & Jungner, 1968, WHO) for evaluating whether a screening programme is appropriate, covering the disease, the test, treatment availability, costs, and ethics. See also Wikipedia: Screening (medicine) and the modern revisit by Andermann et al. (2008).

Section 1

Introduction & Test Attributes

⏱ Estimated reading time: 12 minutes

Lesson 6 · HSCI 341

From Populations to Individual Results

Given a single test result, what does it actually tell you about this person?

Section 1 of 4

What a test is, the difference between analytic and diagnostic properties, and how we quantify precision and agreement.

What is a test?

Screening vs. diagnostic tests

Screening test

Applied to apparently healthy populations. Goal: early detection before symptoms appear.

Diagnostic test

Applied to individuals already suspected of disease. Goal: confirm or rule out a condition.

Evaluation principles are the same for both. The clinical context differs.

A critical distinction

Analytic vs. diagnostic sensitivity and specificity

Analytic sensitivity

The lowest concentration of a compound the test can detect. A laboratory property.

Diagnostic sensitivity

The probability that a truly diseased person tests positive. An epidemiological property, covered next.

Analytic specificity: cross-reactivity with other compounds. Diagnostic specificity: correct negative rate. Different quantities entirely.

Test quality

Accuracy and precision

Accurate & precise

Mean near truth; low scatter. The ideal.

Precise, not accurate

Consistent results, but all biased from the truth.

Accurate, not precise

Mean near truth; high variability.

Neither

High scatter and a biased mean.

Coefficient of variation = σ / μ. Lower values indicate greater precision.

Categorical agreement

Cohen's kappa (κ)

Kappa (Cohen, 1960)

\[ \color{#0B7B6B}{\kappa} = \frac{\color{#C2410C}{p_o} - \color{#6D28D9}{p_e}}{1 - \color{#6D28D9}{p_e}} \]

κ: chance-corrected agreement; pₒ: observed agreement; pₑ: agreement expected by chance

κ ≤ 0: Poor | 0.01–0.20: Slight | 0.21–0.40: Fair

0.41–0.60: Moderate | 0.61–0.80: Substantial | 0.81–1.00: Almost perfect (Landis & Koch, 1977)

Weighted kappa

For ordinal scales: near-misses (4 vs. 5) penalised less than large discrepancies (1 vs. 5).

Carry forward

The thread into the next section

Screening applies to healthy populations; diagnostic testing to suspected cases. Evaluation principles are shared.
Analytic and diagnostic sensitivity are different quantities. Keep them separate.
Accuracy, precision, and agreement each ask a version of: how much can you trust the result?

Introduction and Overview

An earlier lesson covered measures of disease frequency in populations. This lesson takes the same probabilistic vocabulary and applies it at the level of a single test administered to a single person. Whether you're evaluating a new screening assay, interpreting a clinical result, or designing a surveillance algorithm, the same four-cell 2×2 logic appears: a test result that's either positive or negative, against a true disease state that's either present or absent (Sackett & Haynes, 2002). The four content sections build up from the basic attributes of a test (Section 1, this section), through sensitivity and specificity (Section 2), to the predictive values that depend on disease prevalence (Section 3), and finally to ROC curves and likelihood ratios for tests with continuous output (Section 4).

Learning Objectives

Distinguish between screening tests and diagnostic tests.
Define analytic sensitivity and specificity of a test.
Explain the difference between accuracy and precision.
Describe measures of agreement, including Cohen’s kappa and weighted kappa.

What Is a Test?

A test is any device or procedure designed to detect or quantify a sign, substance, tissue change, or body response in an individual. Tests can also be applied at the household or other levels of aggregation. In epidemiology, the term “test” extends broadly to include clinical signs, history-taking questions, survey items, and post-mortem findings.

Why Evaluate Tests?

In a decision-making context (e.g., clinical diagnosis), the selection of an appropriate test should alter your assessment of the probability that a disease exists, and guide subsequent actions (further testing, treatment, quarantine). In a research context, understanding test characteristics is essential for knowing how they affect data quality.

Screening vs. Diagnostic Tests

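Screening Tests: applied to apparently healthy populations, with the goal of early detection before symptoms appear.
Diagnostic Tests: applied to individuals already suspected of disease, with the goal of confirming or ruling out a condition.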

Despite their different uses, the principles of evaluation and interpretation are the same for both screening and diagnostic tests.

Attributes of the Test Per Se

Analytic Sensitivity and Specificity

The analytic sensitivity of an assay refers to the lowest concentration of a chemical compound the test can detect. The analytic specificity refers to the capacity of a test to react to only one chemical compound. These are distinctly different from diagnostic (epidemiologic) sensitivity and specificity, which are discussed in a later section.

Accuracy and Precision

The laboratory accuracy of a test relates to its ability to give a true measure of the substance of interest. To be accurate, a test need not always be close to the true value, but if repeat tests are run, the resulting average should be close to the true value.

The precision of a test relates to how consistent the results are. If a test always gives the same value for a sample (regardless of whether it is the correct value), it is said to be precise.

Figure 5.1. Laboratory accuracy and precision. The bullseye represents the true value.

Precision and Agreement

Repeatability refers to variability obtained from repeated testing of the same sample within the same laboratory. Reproducibility refers to variability from testing the same sample in different laboratories. Agreement refers to how well two different tests (or raters) agree when applied to the same sample.

Measuring Precision: Quantitative Tests

Common measures for quantifying variability between pairs of test results include:

Coefficient of Variation (CV)

The CV is computed as CV = σ / μ, where σ is the standard deviation among test results on the same sample and μ is the mean. A lower CV indicates greater precision.

Concordance Correlation Coefficient (CCC)

The CCC (Lin, 1989) compares two sets of test results and better reflects agreement than a Pearson correlation. It is computed from three parameters: the location-shift (how far data are from the equality line), the scale-shift (difference in slopes), and the Pearson r. A CCC of 1 indicates perfect agreement.

Limits of Agreement (Bland-Altman Plot)

A Bland-Altman plot (Bland & Altman, 1986) plots the differences between paired test results against their mean value. The mean difference (μ_d) and limits of agreement (μ_d ± 1.96σ_d) are shown. This reveals systematic bias and whether disagreement varies with the magnitude of the measurement.
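The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the paired readings are hypothetical, and the population standard deviation (`pstdev`) is an assumption made to match the σ notation used here; many analyses use the sample standard deviation instead.

```python
from statistics import fmean, pstdev

def coefficient_of_variation(values):
    """CV = sigma / mu for repeated results on one sample."""
    return pstdev(values) / fmean(values)

def limits_of_agreement(x, y):
    """Bland-Altman mean difference and limits (mean_d -/+ 1.96 * sd_d)."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = fmean(diffs)
    sd_d = pstdev(diffs)
    return mean_d, mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d

def concordance_correlation(x, y):
    """Lin's CCC: 2 * cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    mx, my = fmean(x), fmean(y)
    n = len(x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)

# Hypothetical paired readings from two assays on the same five samples.
assay_a = [10.0, 12.0, 14.0, 16.0, 18.0]
assay_b = [11.0, 13.0, 15.0, 17.0, 19.0]  # same pattern, shifted up by 1

print(concordance_correlation(assay_a, assay_b))
print(limits_of_agreement(assay_a, assay_b))
```

In this made-up data the two assays are perfectly correlated (Pearson r = 1), yet the CCC is below 1 because of the constant one-unit shift, which is the location-shift component described above.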

Measuring Agreement for Categorical Tests: Kappa (κ)

When test results are categorical (dichotomous or ordinal), Cohen’s kappa (κ) measures agreement beyond what would be expected by chance alone (Cohen, 1960).

Cohen’s kappa (Eq 5.2)

\[ \color{#0B7B6B}{\kappa} = \dfrac{\color{#C2410C}{p_o} - \color{#6D28D9}{p_e}}{1 - \color{#6D28D9}{p_e}} \]

In words: kappa equals the observed agreement minus the agreement expected by chance, divided by one minus the agreement expected by chance.

The benchmark categories below follow Landis & Koch (1977).

κ Value	Interpretation
≤ 0	Poor agreement
0.01 – 0.20	Slight agreement
0.21 – 0.40	Fair agreement
0.41 – 0.60	Moderate agreement
0.61 – 0.80	Substantial agreement
0.81 – 1.00	Almost perfect agreement
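As a rough sketch of how Eq 5.2 is applied, the function below computes κ from a square table of counts. The function name and the example counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def cohen_kappa(table):
    """Cohen's kappa; table[i][j] = samples rated i by test 1 and j by test 2."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n                  # observed agreement
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 agreement table for two tests on 100 samples.
counts = [[40, 10],
          [5, 45]]
kappa = cohen_kappa(counts)
print(round(kappa, 2))
```

Here p_o = 0.85 and p_e = 0.50, so κ = 0.35 / 0.50 = 0.70, which falls in the "substantial" band of the table above.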

Factors Affecting Kappa

Bias: If one test consistently produces more positive results than the other, κ will be affected. Use McNemar’s χ² test to check whether the two tests classify the same proportion as positive before evaluating agreement.

Prevalence: The prevalence of the underlying condition affects κ. Two tests will have a higher κ when prevalence is moderate (~0.5) compared to very high or very low prevalence.

Weighted Kappa

For tests measured on an ordinal scale, a weighted kappa accounts for partial agreement. Pairs of test results that are close (e.g., scores of 4 and 5) receive more credit than pairs that are far apart (e.g., scores of 1 and 5). This provides a better reflection of agreement for ordinal data.
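One common way to give partial credit is with linear disagreement weights, where the penalty grows with the distance between the two scores. The sketch below assumes that weighting scheme (other schemes, such as quadratic weights, exist), and the function name and tables are hypothetical.

```python
def weighted_kappa(table):
    """Linearly weighted kappa for a k x k table of counts on an ordinal scale."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 for the largest gap
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / n**2
    return 1 - observed / expected

# Hypothetical 3-category tables that differ only in where one disagreement falls.
near_miss = [[5, 1, 0], [0, 5, 0], [0, 0, 5]]  # one pair off by one category
far_miss = [[5, 0, 1], [0, 5, 0], [0, 0, 5]]   # one pair off by two categories
print(weighted_kappa(near_miss), weighted_kappa(far_miss))
```

The near-miss table scores higher than the far-miss table, and for a 2×2 table the function reduces to the unweighted kappa.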

Key Takeaways

A test is any procedure designed to detect or quantify a sign, substance, or response.
Screening tests are applied to healthy populations; diagnostic tests are applied to individuals suspected of disease.
Accuracy measures closeness to the true value; precision measures consistency of results.
Cohen’s kappa quantifies agreement beyond chance for categorical tests; weighted kappa extends this to ordinal scales.
Prevalence and bias both affect kappa values.

Section 2

Sensitivity & Specificity

⏱ Estimated reading time: 15 minutes

Section 2 of 4

The 2×2 table, gold standards, and the test properties that travel wherever the test goes.

Reference standard

The gold standard

A gold standard is the reference procedure assumed to be perfectly accurate: all cases classified correctly, no misclassification.

In practice few true gold standards exist. Biological variability and measurement limits mean test evaluation often relies on composite or imperfect references.

The core tool

The 2×2 contingency table

	Disease + (D+)	Disease − (D−)
Test + (T+)	a true positive	c false positive
Test − (T−)	b false negative	d true negative

The key measures

Sensitivity and specificity

Sensitivity

\[ \color{#0B7B6B}{Se} = \frac{\color{#C2410C}{a}}{\color{#C2410C}{a} + \color{#6D28D9}{b}} = \frac{\text{TP}}{\text{all D+}} \]

Se: sensitivity (true positive rate); a: true positives; b: false negatives

Specificity

\[ \color{#0B7B6B}{Sp} = \frac{\color{#C2410C}{d}}{\color{#6D28D9}{c} + \color{#C2410C}{d}} = \frac{\text{TN}}{\text{all D}-} \]

Sp: specificity (true negative rate); d: true negatives; c: false positives

SnNOut

High Se: a Negative result rules Out disease.

SpPIn

High Sp: a Positive result rules In disease.

Prevalence effects

True vs. apparent prevalence

Apparent prevalence (Eq 5.6)

\[ \color{#0B7B6B}{AP} = \color{#C2410C}{P} \cdot \color{#6D28D9}{Se} + (1 - \color{#C2410C}{P})(1 - \color{#1D4ED8}{Sp}) \]

AP: apparent (test-measured) prevalence; P: true prevalence; Se: sensitivity; Sp: specificity

Rogan-Gladen correction (1978)

\[ \color{#0B7B6B}{\hat{P}} = \frac{\color{#C2410C}{AP} + \color{#1D4ED8}{Sp} - 1}{\color{#6D28D9}{Se} + \color{#1D4ED8}{Sp} - 1} \]

P̂: corrected true prevalence; AP: apparent prevalence; Se: sensitivity; Sp: specificity

Carry forward

What sensitivity and specificity cannot tell you

Se and Sp are test properties: stable across populations with different prevalence.
They answer the question from the test's viewpoint, not the patient's.
To answer “does this positive result mean disease?” you also need disease prevalence. A later section does exactly that.

Introduction and Overview

An earlier section named the attributes a test should have in the abstract. This section turns to the two quantitative properties that capture most of what we care about: sensitivity (the test's ability to find disease that is truly present) and specificity (its ability to correctly say “no” when disease is truly absent). Both are properties of the test itself, not of the population to which it is applied; that distinction becomes essential when we reach the predictive values in a later section.

The 2×2 test table. Sensitivity and specificity are computed down the disease-status columns (so they are properties of the test); predictive values are computed across the test-result rows (so they depend on prevalence).

Learning Objectives

Explain the concept of a gold standard and its role in test evaluation.
Calculate sensitivity, specificity, false positive fraction, and false negative fraction from a 2×2 table.
Distinguish between true prevalence and apparent prevalence.
Estimate true prevalence from apparent prevalence using the Rogan-Gladen formula.

The Gold Standard

A gold standard (GS) is a test or procedure that is absolutely accurate: it diagnoses all cases of a specific disease and misdiagnoses none. In reality, very few true gold standards exist. Much of the error in test evaluation is due to biological variability: people do not immediately become “diseased” upon exposure, and the timescale for crossing a detectable threshold varies from person to person.

Important Caveat

When no true gold standard exists, alternative approaches for estimating sensitivity and specificity are needed, including the use of results from several different tests, repeated testing of selected samples, and latent class models (discussed in Section 5.7 of the textbook).

The 2×2 Contingency Table

The concepts of sensitivity and specificity are most easily understood through a 2×2 contingency table comparing disease status to test results:

	Test Positive (T+)	Test Negative (T−)	Total
Disease Positive (D+)	a (true positive)	b (false negative)	m₁
Disease Negative (D−)	c (false positive)	d (true negative)	m₀
Total	n₁	n₀	n

Key Measures from the 2×2 Table

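Sensitivity (Se) = a / (a + b): the proportion of diseased individuals who test positive.
Specificity (Sp) = d / (c + d): the proportion of non-diseased individuals who test negative.
False Positive Fraction (FPF) = c / (c + d) = 1 − Sp.
False Negative Fraction (FNF) = b / (a + b) = 1 − Se.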

Worked Example (Norovirus EIA Data)

From a study of 188 stool samples tested with an EIA against a gold standard:

	GS+ (D+)	GS− (D−)	Total
T+	71	3	74
T−	11	103	114
Total	82	106	188

Se = 71/82 = 86.6% (95% CI: 77.3%, 93.1%)
Sp = 103/106 = 97.2% (95% CI: 92.0%, 99.4%)
FNF = 1 − 0.866 = 13.4%
FPF = 1 − 0.972 = 2.8%
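The point estimates above follow directly from the table counts. The sketch below also computes exact binomial (Clopper-Pearson) intervals by bisection; that the reported intervals were obtained with this method is an assumption, and the helper names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root(f, iters=60):
    """Bisection on [0, 1] for a function that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for the proportion k / n."""
    lower = 0.0 if k == 0 else _root(lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2)
    upper = 1.0 if k == n else _root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

a, b, c, d = 71, 11, 3, 103  # TP, FN, FP, TN from the norovirus table
se, sp = a / (a + b), d / (c + d)
se_lo, se_hi = exact_ci(a, a + b)
sp_lo, sp_hi = exact_ci(d, c + d)
print(f"Se = {se:.3f} ({se_lo:.3f}, {se_hi:.3f})")
print(f"Sp = {sp:.3f} ({sp_lo:.3f}, {sp_hi:.3f})")
```

Other interval methods (for example Wilson or normal-approximation intervals) give slightly different limits, especially for proportions close to 0 or 1.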

True and Apparent Prevalence

The true prevalence (P) is the actual proportion of the population that has the disease. In the norovirus example above (Example 5.4), P = 82/188 = 43.6%.

The apparent prevalence (AP) is the proportion that tests positive, which includes both true positives and false positives. In the same example, AP = 74/188 = 39.4%.

Apparent prevalence (Eq 5.6)

\[ \color{#0B7B6B}{AP} = \color{#C2410C}{P}\,\color{#6D28D9}{Se} + (1 - \color{#C2410C}{P})(1 - \color{#1D4ED8}{Sp}) \]

In words: apparent prevalence equals true prevalence times sensitivity, plus the non-diseased fraction times one minus specificity (the false positives).
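As a check, substituting the norovirus values (P = 82/188, Se = 71/82, Sp = 103/106) into the apparent-prevalence formula recovers the observed proportion testing positive:

\[ AP = \frac{82}{188}\cdot\frac{71}{82} + \frac{106}{188}\cdot\frac{3}{106} = \frac{71 + 3}{188} = \frac{74}{188} \approx 0.394 \]

The first term counts the true positives (71) and the second the false positives (3), both as fractions of all 188 samples.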

Estimating True Prevalence from Apparent Prevalence

If the Se and Sp of a test are known, the true prevalence can be estimated from the apparent prevalence using the Rogan-Gladen formula (Rogan & Gladen, 1978):

True prevalence from apparent prevalence (Eq 5.7)

\[ \color{#0B7B6B}{P} = \dfrac{\color{#C2410C}{AP} + \color{#1D4ED8}{Sp} - 1}{\color{#6D28D9}{Se} + \color{#1D4ED8}{Sp} - 1} \]

In words: true prevalence equals apparent prevalence plus specificity minus one, divided by sensitivity plus specificity minus one.

Example Calculation

If AP = 0.150, Se = 0.363, and Sp = 0.876, then:

P = (0.150 + 0.876 − 1) / (0.363 + 0.876 − 1) = 0.026 / 0.239 = 0.109 (10.9%)

Note: Some combinations of Se, Sp, and AP can produce estimates of P outside the range 0–1, indicating that the Se and Sp estimates may not be applicable to the population being studied.
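Both directions of the calculation can be sketched as a pair of small functions. The names are hypothetical, and the zero-denominator guard is an addition for illustration rather than part of the formula.

```python
def apparent_prevalence(p, se, sp):
    """Eq 5.6: proportion expected to test positive."""
    return p * se + (1 - p) * (1 - sp)

def rogan_gladen(ap, se, sp):
    """Eq 5.7: estimate true prevalence from apparent prevalence."""
    denom = se + sp - 1
    if denom == 0:
        raise ValueError("Se + Sp = 1: the test carries no information")
    return (ap + sp - 1) / denom

# Values from the example calculation above.
p_hat = rogan_gladen(0.150, 0.363, 0.876)
print(round(p_hat, 3))

# An apparent prevalence below the false positive rate gives an estimate below 0.
out_of_range = rogan_gladen(0.100, 0.363, 0.876)
print(round(out_of_range, 3))
```

The second call illustrates the note above: with AP = 0.100 and a false positive rate of 0.124, the estimate is negative, a sign that the assumed Se and Sp do not fit the population.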

Reflection

A new rapid test for influenza has a sensitivity of 75% and a specificity of 98%. In a population where the true prevalence of influenza is 5%, calculate the apparent prevalence using the formula AP = P × Se + (1 − P) × (1 − Sp). What does this tell you about relying solely on test results to estimate disease burden?

Model answer

AP = 0.05 × 0.75 + 0.95 × (1 − 0.98) = 0.0375 + 0.019 = 0.0565 ≈ 0.057 (5.7%). The apparent prevalence (5.7%) is close to but biased upward from the true prevalence (5%): the false positives (2% of the 95% non-diseased, or 1.9% of the population) outnumber the false negatives (25% of the 5% diseased, or 1.25% of the population). The implication: using raw test results without correction systematically misestimates disease burden, with the direction of bias depending on whether false positives, (1 − P)(1 − Sp), outnumber false negatives, P(1 − Se). For surveillance reporting you must correct for known test performance: P = (AP − (1 − Sp)) / (Se + Sp − 1). Routine surveillance dashboards that report ‘positivity rate’ as if it were prevalence are misleading whenever Se and Sp are imperfect.

Key Takeaways

A gold standard is the reference test assumed to be perfectly accurate; in practice, few truly exist.
Sensitivity = probability of testing positive given disease; specificity = probability of testing negative given no disease.
High Se is important for ruling out disease (SnNOut); high Sp is important for confirming disease (SpPIn).
Apparent prevalence differs from true prevalence due to test imperfections.
The Rogan-Gladen formula estimates true prevalence from apparent prevalence when Se and Sp are known.

Section 3

Predictive Values

⏱ Estimated reading time: 12 minutes

Section 3 of 4

From the test’s perspective to the patient’s: what a result actually means in a given population.

The patient-centred measures

Positive and negative predictive value

PV+ (Eq 5.8)

\[ \color{#0B7B6B}{PV^+} = \frac{\color{#C2410C}{P} \cdot \color{#6D28D9}{Se}}{\color{#C2410C}{P} \cdot \color{#6D28D9}{Se} + (1-\color{#C2410C}{P})(1-\color{#1D4ED8}{Sp})} \]

PV+: positive predictive value; P: prevalence (pre-test probability); Se: sensitivity; Sp: specificity

PV− (Eq 5.9)

\[ \color{#0B7B6B}{PV^-} = \frac{(1-\color{#C2410C}{P}) \cdot \color{#1D4ED8}{Sp}}{(1-\color{#C2410C}{P}) \cdot \color{#1D4ED8}{Sp} + \color{#C2410C}{P}(1-\color{#6D28D9}{Se})} \]

PV−: negative predictive value; P: prevalence (pre-test probability); Se: sensitivity; Sp: specificity

The prevalence effect

The same test, very different answers

Prevalence	PV+	PV−
50%	96.9%	87.9%
5%	61.9%	99.3%
0.1%	3.0%	~100%

Se = 86.6%, Sp = 97.2% held constant. Only prevalence changes.

A public health scenario

HIV universal screening: the arithmetic

PV+ calculation at P = 0.3%

\[ \color{#0B7B6B}{PV^+} = \frac{\color{#C2410C}{0.003} \times \color{#6D28D9}{0.995}}{\color{#C2410C}{0.003} \times \color{#6D28D9}{0.995} + \color{#1D4ED8}{0.997} \times \color{#BE185D}{0.002}} = 60\% \]

0.003: prevalence P; 0.995: sensitivity Se; 0.997: 1 − P (non-diseased); 0.002: 1 − Sp (false positive rate)

Se = 99.5%, Sp = 99.8%: an excellent test. Yet 40% of positives are false alarms at low prevalence. Confirmatory testing is arithmetically necessary.

Practical strategies

Increasing PV+ in low-prevalence settings

Target high-risk groups

Higher local prevalence raises PV+ without changing the test itself.

More specific confirmation

Apply a high-Sp test to initial positives. Overall false-positive rate falls sharply.

Series vs. parallel

Series: higher overall Sp, lower Se. Parallel: higher Se, lower Sp. Match strategy to stakes.

Carry forward

What to take into the next section

PV+ and PV− are not portable across populations with different prevalence.
Sensitivity and specificity are the transferable test properties.
A later section asks what happens when we stop forcing continuous results into a binary yes/no.

Introduction and Overview

An earlier section covered sensitivity and specificity, which are properties of the test itself. This section introduces the predictive values: what an individual person should believe about their disease status given the test result. Importantly, predictive values depend on disease prevalence in the population being tested, which is why the same test can be useful in one setting and useless in another. This is the most clinically important section in the lesson.

Learning Objectives

Define predictive value positive (PV+) and predictive value negative (PV−).
Calculate PV+ and PV− from a 2×2 table and using Bayesian formulas.
Explain how prevalence affects predictive values.
Describe strategies for increasing the predictive value of a positive test.

What Are Predictive Values?

While Se and Sp are characteristics of the test, predictive values tell us how useful the test is for individuals of unknown disease status. Once we decide to use a test, we want to know the probability that the individual has or does not have the disease, given the test result.

▸ Interactive story: 1,000 pixel people

A 6-scene Bayesian-reasoning visualization: a population of 1,000 with 1% prevalence is scanned by a test with 95% sensitivity and 95% specificity, the four buckets (TP/FP/FN/TN) fill in, and the surprising PPV follows from the counts.

Predictive Value Positive (PV+)

The PV+ is the probability that an individual who tests positive actually has the disease: p(D+|T+) = a / n₁.

Positive predictive value (Eq 5.8)

\[ \color{#0B7B6B}{PV^+} = \dfrac{\color{#C2410C}{p(D^+)}\,\color{#6D28D9}{Se}}{\color{#C2410C}{p(D^+)}\,\color{#6D28D9}{Se} + \color{#1D4ED8}{p(D^-)}(1 - \color{#BE185D}{Sp})} \]

In words: the positive predictive value is the true positives (prevalence times sensitivity) divided by all positives (true positives plus false positives, the non-diseased fraction times one minus specificity).

In the norovirus example: PV+ = 71/74 = 95.9% (95% CI: 88.6%, 99.2%)

Predictive Value Negative (PV−)

The PV− is the probability that an individual who tests negative truly does not have the disease: p(D−|T−) = d / n₀.

Negative predictive value (Eq 5.9)

\[ \color{#0B7B6B}{PV^-} = \dfrac{\color{#C2410C}{p(D^-)}\,\color{#6D28D9}{Sp}}{\color{#C2410C}{p(D^-)}\,\color{#6D28D9}{Sp} + \color{#1D4ED8}{p(D^+)}(1 - \color{#BE185D}{Se})} \]

In words: the negative predictive value is the true negatives (the non-diseased fraction times specificity) divided by all negatives (true negatives plus false negatives, the diseased fraction times one minus sensitivity).

In the norovirus example: PV− = 103/114 = 90.4% (95% CI: 83.4%, 95.1%)

Effect of Prevalence on Predictive Values

Predictive values depend heavily on the prevalence of disease in the population being tested (an application of Bayes's theorem). This is why PV+ and PV− are not good measures of a test’s intrinsic performance; they vary from population to population.

Dramatic Impact of Prevalence

Using Se = 86.6% and Sp = 97.2% from the norovirus example, observe how PV+ and PV− change as prevalence drops:

Prevalence (%)	PV+ (%)	PV− (%)
50	96.9	87.9
5	61.9	99.3
0.1	3.0	100.0

As you can see, when prevalence drops to 0.1%, the PV+ falls to just 3%, meaning 97% of positive results are false positives. Meanwhile, the PV− approaches 100%. This is a fundamental challenge in screening low-prevalence populations.
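The table can be reproduced with a short function that applies Eq 5.8 and Eq 5.9 at each prevalence. The function name is hypothetical, and the rounded values Se = 0.866 and Sp = 0.972 are used as inputs.

```python
def predictive_values(prevalence, se, sp):
    """Return (PV+, PV-) from prevalence, sensitivity, and specificity."""
    tp = prevalence * se              # true positives, as a population fraction
    fp = (1 - prevalence) * (1 - sp)  # false positives
    tn = (1 - prevalence) * sp        # true negatives
    fn = prevalence * (1 - se)        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

for prev in (0.50, 0.05, 0.001):
    ppv, npv = predictive_values(prev, se=0.866, sp=0.972)
    print(f"prevalence {prev:.1%}: PV+ = {ppv:.1%}, PV- = {npv:.1%}")
```

Only the prevalence argument changes between rows; Se and Sp stay fixed, which is why the swing in PV+ is attributable to the population rather than to the test.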

A Worked Example in Natural Frequencies

The formula can feel abstract, so it helps to walk a whole group of people through the test and simply count. Imagine screening 10,000 people for a disease with a prevalence of 1%, using a strong test with Se = 99% and Sp = 95%.

Group	People
Have the disease (1% of 10,000)	100
Diseased and test positive (true positives, 99% of 100)	99
Diseased and test negative (false negatives)	1
Do not have the disease (99% of 10,000)	9,900
Healthy and test positive (false positives, 5% of 9,900)	495
Healthy and test negative (true negatives)	9,405

Now read the positive predictive value straight off the counts. A total of 99 + 495 = 594 people test positive, but only 99 of them truly have the disease, so PV+ = 99 / 594 = 16.7%. About five of every six positive results are false alarms, even though the test is correct 99% of the time in the sick and 95% of the time in the healthy. The reason is arithmetic, not a flaw in the test: the 9,900 healthy people are so numerous that their small 5% error rate yields more false positives (495) than there are true cases in the whole group (100).
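The same counting exercise can be written as a small function, which makes it easy to repeat with other inputs. The function name is hypothetical, and rounding to whole people is a simplification for illustration.

```python
def natural_frequencies(n, prevalence, se, sp):
    """Split n screened people into the four cells of the 2x2 table."""
    diseased = round(n * prevalence)
    healthy = n - diseased
    tp = round(diseased * se)  # diseased and test positive
    fn = diseased - tp         # diseased and test negative
    tn = round(healthy * sp)   # healthy and test negative
    fp = healthy - tn          # healthy and test positive
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

counts = natural_frequencies(10_000, prevalence=0.01, se=0.99, sp=0.95)
ppv = counts["TP"] / (counts["TP"] + counts["FP"])
print(counts, f"PV+ = {ppv:.1%}")
```

With the values from the worked example this returns 99 true positives and 495 false positives, so PV+ = 99 / 594.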

🧪 Interactive: Sensitivity, Specificity, PPV & the Cutoff

Two overlapping populations of test scores: diseased and healthy. Scores to the right of the cutoff count as test positive. Moving the cutoff trades sensitivity against specificity; moving prevalence swings PPV and NPV while sensitivity and specificity stay fixed, so PPV collapses on rare-disease screening even with a strong test.

Example settings: cutoff 5.00; disease prevalence 0.200; healthy mean score 4.00; diseased mean score 8.00; SD (both groups) 1.50.

2×2 confusion matrix (per 10,000 tested)

	D+	D−	Total
T+	1954	2020	3974
T−	46	5980	6026
Total	2000	8000	10,000

Sensitivity 97.7%; Specificity 74.8%; PPV 49.2%; NPV 99.2%.
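The numbers shown for the interactive can be approximated with a short script. It assumes both score distributions are normal with the stated means and a common SD (an assumption; the underlying model of the interactive is not stated), and the function name is hypothetical.

```python
from statistics import NormalDist

def cutoff_metrics(cutoff, prevalence, healthy_mean, diseased_mean, sd):
    """Se, Sp, PPV, NPV when scores above the cutoff count as test positive."""
    se = 1 - NormalDist(diseased_mean, sd).cdf(cutoff)  # diseased above the cutoff
    sp = NormalDist(healthy_mean, sd).cdf(cutoff)       # healthy at or below the cutoff
    tp, fn = prevalence * se, prevalence * (1 - se)
    fp, tn = (1 - prevalence) * (1 - sp), (1 - prevalence) * sp
    return {"Se": se, "Sp": sp, "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}

base = cutoff_metrics(5.0, 0.2, healthy_mean=4.0, diseased_mean=8.0, sd=1.5)
higher_cutoff = cutoff_metrics(6.0, 0.2, healthy_mean=4.0, diseased_mean=8.0, sd=1.5)
rare_disease = cutoff_metrics(5.0, 0.01, healthy_mean=4.0, diseased_mean=8.0, sd=1.5)
for label, m in (("base", base), ("cutoff 6", higher_cutoff), ("prevalence 1%", rare_disease)):
    print(label, {name: round(value, 3) for name, value in m.items()})
```

Raising the cutoff lowers Se and raises Sp, while lowering prevalence leaves Se and Sp unchanged and pulls PPV down.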

Strategies to Increase PV+

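Target High-Risk Groups: higher prevalence in the tested group raises PV+ without changing the test itself.
Increase Specificity: apply a more specific test to initial positives, so the overall false-positive rate falls sharply.
Use Multiple Tests: interpreting tests in series raises overall Sp at the cost of Se; interpreting them in parallel does the reverse.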

Scenario: Universal HIV Screening

A country considers implementing universal HIV screening using a rapid test with Se = 99.5% and Sp = 99.8%. The national HIV prevalence is 0.3%.

PV+ = (0.003 × 0.995) / [(0.003 × 0.995) + (0.997 × 0.002)] = 0.002985 / (0.002985 + 0.001994) = 60.0%

Even with an excellent test (99.5% Se, 99.8% Sp), 40% of positive results in this low-prevalence population would be false positives. This is why confirmatory testing is essential.
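The effect of confirmatory testing can be sketched with the usual combination rules for two tests. The sketch assumes the two tests are conditionally independent given disease status, and it uses a hypothetical confirmatory test with the same Se and Sp as the screening test; neither assumption comes from the scenario itself.

```python
def series(se1, sp1, se2, sp2):
    """Combined (Se, Sp) when a case needs both tests positive."""
    return se1 * se2, 1 - (1 - sp1) * (1 - sp2)

def parallel(se1, sp1, se2, sp2):
    """Combined (Se, Sp) when a case needs either test positive."""
    return 1 - (1 - se1) * (1 - se2), sp1 * sp2

def ppv(prevalence, se, sp):
    tp = prevalence * se
    fp = (1 - prevalence) * (1 - sp)
    return tp / (tp + fp)

single = ppv(0.003, 0.995, 0.998)                 # screening test alone
se_s, sp_s = series(0.995, 0.998, 0.995, 0.998)   # screen, then confirm
confirmed = ppv(0.003, se_s, sp_s)
se_p, sp_p = parallel(0.995, 0.998, 0.995, 0.998)
print(f"single: {single:.1%}, in series: {confirmed:.1%}")
```

Under these assumptions, requiring two positives lifts PV+ from about 60% to above 99%, at the price of a slightly lower combined sensitivity. Real tests that share a failure mechanism are correlated, so the gain in practice is smaller.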

Reflection

Consider a screening programme for a rare genetic condition affecting 1 in 10,000 newborns. The test has Se = 99% and Sp = 99.9%. Calculate the PV+ and discuss the implications of the result for clinical decision-making. What strategies would you recommend to improve the programme?

Model answer

At P = 1/10,000 = 0.0001 with Se = 0.99 and Sp = 0.999: PPV = (0.0001 × 0.99) / (0.0001 × 0.99 + 0.9999 × 0.001) = 0.000099 / 0.001099 ≈ 0.09 (9%). Even with an extraordinarily specific test, 91% of positive screens are false alarms. Implications: every positive screen must be followed by a confirmatory test (a different assay, or a repeat under different conditions), genetic counselling, and family-history workup; never act on the first positive alone. Programme-improvement strategies: (a) tighten screening criteria, restricting to higher-prevalence subgroups (family history) when feasible; (b) add a second-stage confirmatory test (e.g., DNA sequencing after the rapid immunoassay) before any treatment decision; (c) combine multiple markers in a panel, interpreted in series, to raise overall specificity; (d) improve specificity at the cost of sensitivity if the disease is treatable late as well as early.


Key Takeaways

PV+ is the probability of disease given a positive test; PV− is the probability of no disease given a negative test.
Predictive values are driven by both test characteristics (Se, Sp) and the prevalence of disease.
In low-prevalence populations, even highly specific tests can produce mostly false positive results.
Strategies to increase PV+ include targeting high-risk groups, increasing Sp, and using multiple tests in series.
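To illustrate the last takeaway, the sketch below combines two tests in series (a person is called positive only if both tests are positive). It assumes the two tests are conditionally independent given disease status, which real tests often are not, and it reuses the HIV scenario's Se and Sp for a hypothetical confirmatory test purely for illustration.

```r
ppv <- function(P, Se, Sp) (P * Se) / (P * Se + (1 - P) * (1 - Sp))

# Series interpretation under ASSUMED conditional independence:
# positive only if BOTH tests are positive
series_se <- function(Se1, Se2) Se1 * Se2
series_sp <- function(Sp1, Sp2) 1 - (1 - Sp1) * (1 - Sp2)

P   <- 0.003
Se1 <- 0.995; Sp1 <- 0.998   # screening test
Se2 <- 0.995; Sp2 <- 0.998   # hypothetical confirmatory test

ppv_single <- ppv(P, Se1, Sp1)
ppv_series <- ppv(P, series_se(Se1, Se2), series_sp(Sp1, Sp2))
round(c(single = ppv_single, series = ppv_series), 3)
```

Sensitivity falls slightly in series (0.995 × 0.995 ≈ 0.990); that is the price paid for the gain in specificity and PV+.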


Section 4

Cutpoints, ROC Curves & Likelihood Ratios

⏱ Estimated reading time: 15 minutes

Section 4 of 4


What to do when the test gives a number, and how to use it to update the probability of disease.

The overlap problem

Every cutpoint is a trade-off

Shift the cutpoint right: more false negatives, fewer false positives. Shift left: the reverse.

ROC curves

Plotting performance across all cutpoints

Area under the curve

Summarising discriminatory ability

AUC interpretation (Hanley & McNeil, 1982)

\[ \color{#0B7B6B}{AUC} = P(\color{#C2410C}{\text{score}_{D+}} > \color{#6D28D9}{\text{score}_{D-}}) \]

where AUC is the area under the ROC curve, score (D+) is the test value of a diseased person, and score (D−) is the test value of a healthy person.

0.50: Chance | 0.50–0.70: Poor | 0.70–0.80: Acceptable

0.80–0.90: Excellent | >0.90: Outstanding

Equivalent to the Mann–Whitney U statistic. Non-parametric confidence intervals available.

Likelihood ratios

Combining Se and Sp into one update factor

LR+ (Eq 5.10)

\[ \color{#0B7B6B}{LR^+} = \frac{\color{#C2410C}{Se}}{1 - \color{#6D28D9}{Sp}} \]

where LR+ is the positive likelihood ratio, Se is sensitivity (true positive rate), and Sp is specificity.

LR− (Eq 5.11)

\[ \color{#0B7B6B}{LR^-} = \frac{1 - \color{#C2410C}{Se}}{\color{#6D28D9}{Sp}} \]

where LR− is the negative likelihood ratio, Se is sensitivity, and Sp is specificity (true negative rate).

Category-specific LR: \( LR_{cat} = P(\text{result} \mid D+) \;/\; P(\text{result} \mid D-) \). Grades evidence by result magnitude, not just whether it crossed a threshold.

Updating probability

Pre-test to post-test: three steps

Step 1

Pre-test odds = P / (1 − P)

Step 2

Post-test odds = pre-test odds × LR

Step 3

Post-test P = odds / (1 + odds)

Example: pre-test P = 2%, LR = 25.95 → post-test P = 35%. Multiplying the pre-test odds by the LR raises them about 26-fold.

Carry forward

The arc of the lesson

ROC curves visualise the Se/Sp trade-off; AUC summarises it.
LR+ and LR− combine Se and Sp into a single update factor for each test result.
Category-specific LRs grade evidence continuously, not just above or below a threshold.
The final review and assessment are just below.

Introduction and Overview

Earlier sections treated tests as if they were strictly binary, either positive or negative. In practice, most tests produce a continuous result (a blood pressure reading, an antibody titre, a probability score) that gets dichotomized at a chosen cutpoint. This section makes the cutpoint visible and shows how to choose it well: ROC curves display the trade-off between sensitivity and specificity at every possible cutpoint, and likelihood ratios let a clinician update the probability of disease directly from a test result.

Learning Objectives

Explain the trade-off between sensitivity and specificity when choosing a cutpoint.
Describe receiver operating characteristic (ROC) curves and the area under the curve (AUC).
Define and calculate likelihood ratios for positive and negative test results.
Apply likelihood ratios to update pre-test probability to post-test probability.

Interpreting Continuous Test Results

Many tests produce results on a continuous or semi-quantitative scale (e.g., blood urea nitrogen levels, optical density values, enzyme activity). To classify individuals as positive or negative, we select a cutpoint (also called a cut-off or threshold) to determine what level indicates a positive test result.

The Overlap Problem

In reality, the distributions of test values for healthy and diseased individuals often overlap. Whatever cutpoint we choose will result in both false positive and false negative results. Raising the cutpoint increases Sp (fewer false positives) but decreases Se (more false negatives). Lowering the cutpoint has the opposite effect.

Figure 5.4. Overlap between healthy and diseased distributions. Moving the cutpoint left or right trades off sensitivity for specificity.
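The trade-off can be made concrete with two normal distributions. The sketch below assumes the settings shown in this lesson's interactive tools (healthy mean 4, diseased mean 8, common SD 1.5) and that scores above the cutpoint are called positive; the function name `se_sp_at` is illustrative.

```r
# Se and Sp at a cutpoint when both groups are normally distributed
se_sp_at <- function(cut, mean_h = 4, mean_d = 8, sd = 1.5) {
  c(cutpoint = cut,
    Se = 1 - pnorm(cut, mean_d, sd),   # diseased scoring above the cutpoint
    Sp = pnorm(cut, mean_h, sd))       # healthy scoring at or below the cutpoint
}

# One row per cutpoint: raising the cutpoint lowers Se and raises Sp
tab <- t(sapply(c(4, 5, 6, 7, 8), se_sp_at))
round(tab, 3)
```

At a cutpoint of 5 this gives Se ≈ 0.977 and Sp ≈ 0.748; no cutpoint achieves 100% on both because the two distributions overlap.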

Receiver Operating Characteristic (ROC) Curves

A ROC curve plots the Se (y-axis) against the false positive fraction (1 − Sp) (x-axis) computed at a number of different cutpoints (Hanley & McNeil, 1982; see also Wikipedia: ROC curve). This graphical tool helps select the optimum cutpoint and evaluate overall test performance.

Interpreting the ROC Curve

The 45° diagonal line represents a test with no discriminating ability (no better than chance). The closer the ROC curve gets to the top-left corner, the better the test discriminates between D+ and D− individuals. The top-left corner represents a test with Se = 100% and Sp = 100%.

Choosing the Optimal Cutpoint

Assuming equal costs of false negative and false positive results, the optimal cutpoint occurs where Se + Sp is at a maximum, which corresponds to the point on the ROC curve farthest (vertically) from the 45° line; this is usually at or near the point closest to the top-left corner. The maximised value of Se + Sp − 1 is known as Youden’s J index, and Se + Sp − 1 at the current cutoff is the quantity the interactive ROC tool below reports as you drag the cutoff. However, if the costs are unequal, you might emphasise Se or Sp depending on the clinical context.
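As a sketch of the equal-cost rule, the code below scans a grid of cutpoints for an assumed binormal setting (healthy mean 4, diseased mean 8, common SD 1.5, matching the interactive tool's settings) and picks the cutpoint that maximises Se + Sp − 1. The grid and variable names are illustrative.

```r
cuts <- seq(0, 12, by = 0.01)                 # candidate cutpoints
Se   <- 1 - pnorm(cuts, mean = 8, sd = 1.5)   # diseased above the cutpoint
Sp   <- pnorm(cuts, mean = 4, sd = 1.5)       # healthy at or below the cutpoint
J    <- Se + Sp - 1                           # Youden's J at each cutpoint

best <- which.max(J)
round(c(cutpoint = cuts[best], Se = Se[best], Sp = Sp[best], J = J[best]), 3)
```

With equal SDs the optimum falls midway between the two means (cutpoint 6), where Se = Sp. If false negatives cost more than false positives, a cutpoint lower than this would be preferred.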

Parametric vs. Non-Parametric ROC Curves

A non-parametric ROC curve simply plots Se and (1 − Sp) using each observed test value as a cutpoint. A parametric ROC curve provides a smoothed estimate by assuming that the latent variables follow a specified distribution (usually binormal). Both approaches can generate 95% confidence intervals.
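A non-parametric ROC curve needs nothing more than the observed scores. The base-R sketch below uses simulated data (an assumption made only so the example is self-contained) and treats every distinct observed value as a cutpoint, calling scores at or above the cutpoint positive.

```r
set.seed(1)
disease <- rep(c(1, 0), times = c(50, 100))
score   <- rnorm(150, mean = ifelse(disease == 1, 8, 4), sd = 1.5)

# Every distinct observed value, from highest to lowest, serves as a cutpoint
cuts <- sort(unique(score), decreasing = TRUE)
Se   <- sapply(cuts, function(k) mean(score[disease == 1] >= k))
FPF  <- sapply(cuts, function(k) mean(score[disease == 0] >= k))  # 1 - Sp

# Add the (0, 0) corner and draw the step curve with the chance diagonal
plot(c(0, FPF), c(0, Se), type = "s",
     xlab = "1 - Specificity", ylab = "Sensitivity",
     main = "Empirical ROC curve (simulated data)")
abline(a = 0, b = 1, lty = 3)
```

The curve is a staircase because Se and 1 − Sp can only change at observed values; a parametric (binormal) fit would replace the steps with a smooth curve.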

Area Under the Curve (AUC)

The AUC summarises the overall discriminatory ability of the test across all cutpoints. It can be interpreted as the probability that a randomly selected D+ individual has a greater test value than a randomly selected D− individual, equivalent to the Mann–Whitney U statistic (Hanley & McNeil, 1982).
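This probabilistic interpretation can be verified directly. The sketch below, on simulated data, computes the AUC as the proportion of all D+/D− pairs in which the diseased individual has the higher score (ties counted as one half) and compares it with the Mann–Whitney U statistic from `wilcox.test()` divided by the number of pairs. The data and variable names are illustrative.

```r
set.seed(2)
d_pos <- rnorm(60, mean = 8, sd = 1.5)   # simulated scores, diseased
d_neg <- rnorm(90, mean = 4, sd = 1.5)   # simulated scores, healthy

# Proportion of D+/D- pairs in which the D+ individual scores higher
diffs    <- outer(d_pos, d_neg, FUN = "-")
auc_pair <- mean(diffs > 0) + 0.5 * mean(diffs == 0)

# Mann-Whitney U divided by the number of pairs
U     <- unname(wilcox.test(d_pos, d_neg)$statistic)
auc_u <- U / (length(d_pos) * length(d_neg))

c(pairwise = auc_pair, mann_whitney = auc_u)
```

The two numbers agree. For two normal distributions with a common SD, the theoretical value is `pnorm(separation / (SD * sqrt(2)))`, which gives about 0.970 for a separation of 4 and an SD of 1.5.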

AUC Value	Interpretation
0.50	No discrimination (chance alone)
0.50 – 0.70	Poor discrimination
0.70 – 0.80	Acceptable discrimination
0.80 – 0.90	Excellent discrimination
> 0.90	Outstanding discrimination

📊 Interactive: ROC Curve Builder

Same diseased/healthy distributions as the previous tool. As you slide the cutoff, the point traces out the ROC curve. AUC = the probability that a random D+ scores higher than a random D−. Increase the separation between the two means and watch AUC climb toward 1.

Panels: test score distributions (drag the dashed cutoff line) and ROC curve (yellow dot = current cutoff; diagonal = random-chance reference).

Settings shown: cutoff 5.00; distribution separation 4.00; SD (both groups) 1.50.

Readouts: Sensitivity 97.7%; 1 − Specificity 25.2%; Youden J 0.725; AUC 0.970.

Outstanding discrimination (AUC = 0.970). The curve hugs the upper-left; almost any cutoff is good.

Likelihood Ratios

A likelihood ratio (LR) is the ratio of the probability of a given test result among D+ individuals to the probability of that same result among D− individuals (Deeks & Altman, 2004). LRs combine information from both Se and Sp, and allow the determination of post-test odds from pre-test odds via Bayes's theorem.

Likelihood Ratio for a Positive Test (LR+)

Positive likelihood ratio

\[ \color{#0B7B6B}{LR^+} = \dfrac{\color{#C2410C}{Se}}{1 - \color{#6D28D9}{Sp}} \] (Eq 5.10)

The positive likelihood ratio is sensitivity divided by one minus specificity (the true positive rate over the false positive rate).

Equivalently, the LR+ is the post-test odds of disease given a positive result divided by the pre-test odds. Higher LR+ values mean a positive test result is more informative for confirming disease.

Likelihood Ratio for a Negative Test (LR−)

Negative likelihood ratio

\[ \color{#0B7B6B}{LR^-} = \dfrac{1 - \color{#C2410C}{Se}}{\color{#6D28D9}{Sp}} \] (Eq 5.11)

The negative likelihood ratio is one minus sensitivity divided by specificity (the false negative rate over the true negative rate).

Lower LR− values mean a negative test result is more informative for ruling out disease. An LR− close to 0 is ideal.
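As a worked illustration of Eq 5.10 and Eq 5.11, the sketch below computes both ratios from the 2×2 counts used in this lesson's R activity (71/3/11/103). The function name `lr_from_table` is illustrative.

```r
# a = true positives, b = false positives, c = false negatives, d = true negatives
lr_from_table <- function(a, b, c, d) {
  Se <- a / (a + c)
  Sp <- d / (b + d)
  list(Se = Se, Sp = Sp,
       LR_pos = Se / (1 - Sp),     # Eq 5.10
       LR_neg = (1 - Se) / Sp)     # Eq 5.11
}

lr <- lr_from_table(71, 3, 11, 103)
round(unlist(lr), 3)
```

With these counts a positive result multiplies the pre-test odds by roughly 31, while a negative result multiplies them by about 0.14.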

Category-Specific LR

Instead of simply classifying results as positive or negative, researchers in diagnostic settings often calculate category-specific LRs for ranges of the actual test value, so that the strength of evidence is graded by how extreme the value is.

Category-specific likelihood ratio

\[ \color{#0B7B6B}{LR_{\text{cat}}} = \dfrac{\color{#C2410C}{P(\text{result}\mid D^+)}}{\color{#6D28D9}{P(\text{result}\mid D^-)}} \] (Eq 5.12)

A category-specific likelihood ratio is the probability of that result among diseased people divided by its probability among non-diseased people.
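The sketch below shows the calculation for a test reported in four ordered categories. The category labels and counts are hypothetical, invented only to illustrate Eq 5.12.

```r
# HYPOTHETICAL counts of results by category and true disease status
category <- c("<2", "2-4", "4-6", ">6")
n_dpos   <- c(5, 15, 30, 50)    # diseased individuals in each category
n_dneg   <- c(120, 50, 25, 5)   # non-diseased individuals in each category

p_dpos <- n_dpos / sum(n_dpos)  # P(result | D+)
p_dneg <- n_dneg / sum(n_dneg)  # P(result | D-)
lr_cat <- p_dpos / p_dneg       # category-specific likelihood ratio

data.frame(category, p_dpos, p_dneg, lr_cat = round(lr_cat, 2))
```

In these made-up data the LR rises from about 0.08 in the lowest category to 20 in the highest, so a very high value is much stronger evidence of disease than a value only slightly above a conventional cutpoint.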

From Pre-Test to Post-Test Probability

Likelihood ratios allow you to update your assessment of disease probability after receiving a test result:

Three-Step Process

Convert pre-test probability to pre-test odds: odds = P / (1 − P)
Multiply by the likelihood ratio: post-test odds = pre-test odds × LR
Convert post-test odds back to probability: P = odds / (1 + odds)

Example: Pre-test probability = 2%, and the test result falls in a category where LR_cat = 25.95.

Pre-test odds = 0.02/0.98 = 0.0204
Post-test odds = 0.0204 × 25.95 = 0.5294
Post-test probability = 0.5294 / (1 + 0.5294) = 35%

After obtaining the test result, the estimated probability of disease rises from 2% to 35%.
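The three steps translate directly into a small R function. This is a minimal sketch; the name `post_test_prob` is illustrative and not taken from the companion script.

```r
post_test_prob <- function(pre_prob, LR) {
  pre_odds  <- pre_prob / (1 - pre_prob)   # Step 1: probability to odds
  post_odds <- pre_odds * LR               # Step 2: multiply by the LR
  post_odds / (1 + post_odds)              # Step 3: odds back to probability
}

post_test_prob(0.02, 25.95)   # about 0.35, matching the worked example
```

Because the function is vectorised, passing several LRs at once (for example, one per result category) returns the post-test probability for each.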

R Activity: 2x2 metrics, PPV vs. prevalence, ROC + Youden cutpoint

The companion R script r-activities/HSCI_341_Lesson_6_Screening_and_Diagnostic_Tests.R walks through two examples: (A) compute Se, Sp, PPV, NPV from the 71/3/11/103 contingency table and plot PPV vs. true prevalence for a 95/95 test, and (B) build an ROC curve, compute AUC with a 95% CI, and find the Youden-optimal cutpoint using pROC.

# PART A -- diagnostic metrics from a 2x2 table
a <- 71; b <- 3; c <- 11; d <- 103

diag_metrics <- function(a, b, c, d) {
  Se  <- a / (a + c);  Sp  <- d / (b + d)
  PPV <- a / (a + b);  NPV <- d / (c + d)
  round(c(Se = Se, Sp = Sp, PPV = PPV, NPV = NPV), 3)
}
diag_metrics(a, b, c, d)

# How PPV depends on prevalence (Bayes)
ppv_from_prev <- function(P, Se, Sp) (P*Se) / (P*Se + (1-P)*(1-Sp))
prev <- seq(0.001, 0.5, length.out = 100)
plot(prev, ppv_from_prev(prev, 0.95, 0.95),
     type = "l", lwd = 2, ylim = c(0, 1),
     xlab = "True prevalence", ylab = "PPV",
     main = "A 95/95 test - PPV depends entirely on prevalence")

# PART B -- ROC curve, AUC, Youden cutpoint
library(pROC)
set.seed(341)
n       <- 400
disease <- rbinom(n, 1, 0.30)
score   <- rnorm(n, mean = ifelse(disease == 1, 10, 7), sd = 2)

roc_obj <- roc(disease, score, levels = c(0, 1), direction = "<")
plot(roc_obj, col = "firebrick", lwd = 2)
abline(a = 0, b = 1, lty = 3, col = "grey")

auc(roc_obj)
ci.auc(roc_obj)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))

What you should be able to do after this activity: compute Se/Sp/PPV/NPV by hand and in R, explain why PPV collapses at low prevalence even for excellent tests, and find the Youden-optimal threshold from an ROC curve along with its AUC and 95% CI.

R Reflect on what you just ran

Use the questions below to interpret the actual numbers and plots. Look at your console and plot output before answering.

1. From diag_metrics(71, 3, 11, 103), what are Se, Sp, PPV, and NPV (to three decimals)? Which is highest and which is lowest, and what does each tell a clinician about this particular test?

Model answer: From the norovirus table (TP=71, FP=3, FN=11, TN=103, which is exactly the a, b, c, d passed to diag_metrics(71, 3, 11, 103)): Se = 71/(71+11) = 0.866, Sp = 103/(103+3) = 0.972, PPV = 71/(71+3) = 0.959, NPV = 103/(103+11) = 0.904. Specificity is highest (0.972: the test rarely flags a healthy person as positive), and sensitivity is lowest (0.866: it still misses about 13% of people who truly have the disease). For a clinician: a positive result is fairly trustworthy here (PPV 0.959), but chiefly because the disease is common in this sample; a negative result is reassuring yet not airtight (NPV 0.904, so roughly one negative in ten is a missed case).

2. Look at the PPV-vs-prevalence plot (Se = Sp = 0.95). At what approximate prevalence does PPV first exceed 0.50? Why does PPV drop so sharply at low prevalence even when both Se and Sp are 95%? What does that imply for population-wide screening of a rare disease?

Model answer: PPV crosses 0.50 at roughly prevalence = 5% with Se = Sp = 0.95. Below that, false positives dominate even with a near-perfect test; the math is PPV = (Se×P) / (Se×P + (1−Sp)×(1−P)), so at P = 0.01, PPV ≈ 0.16. The implication: population-wide screening of rare diseases is hard. Most positive results will be false; the program must include confirmatory testing, and the harms of unnecessary follow-up (anxiety, biopsies, treatment) often outweigh the screen's marginal benefit. This is exactly why prostate-specific antigen screening and many cancer screening programmes have been re-evaluated.

3. From auc(roc_obj) and coords(roc_obj, "best", ...), report the AUC and the Youden-optimal threshold along with its Se and Sp. Would you move the threshold higher or lower than Youden's optimum if you cared more about ruling OUT disease than ruling it in, and what happens to Se and Sp when you do?

Model answer: With scores simulated from normal distributions whose means differ by 3 (10 vs. 7) and whose SD is 2, the expected AUC is Φ(3 / (2√2)) ≈ 0.86, so your estimate should be around 0.85, with a 95% CI extending a few points either side. The Youden-optimal threshold maximises Se + Sp − 1 and should fall near 8.5 (midway between the two group means), typically giving roughly balanced Se ≈ Sp ≈ 0.75–0.80. To prioritise ruling OUT disease, move the threshold lower (more positive calls); this raises Se (fewer false negatives) at the cost of Sp (more false positives). The trade-off is symmetric: ruling IN disease (confirmatory test) wants the threshold higher, raising Sp at the cost of Se. The choice depends on the relative costs of FN vs. FP, which is a clinical and policy decision the AUC alone cannot make for you.


Reflection

A disease screening programme uses a test with Se = 92.7% and Sp = 77.4% at a particular cutpoint. Calculate LR+ for this cutpoint. If the pre-test probability of disease is 10%, what is the post-test probability after a positive result? Discuss whether this cutpoint is appropriate for a screening programme where false negatives are very costly.

Model answer: LR+ = Se / (1 − Sp) = 0.927 / 0.226 = 4.10. Pre-test probability 10% → pre-test odds = 0.10/0.90 = 0.111. Post-test odds = 0.111 × 4.10 = 0.456. Post-test probability = 0.456/(1+0.456) = 0.313 (31%). For a screening test where false negatives are very costly, this cutpoint is questionable: an LR+ of 4.1 only about quadruples the disease odds (raising the probability from 10% to 31%), and the 92.7% Se still misses 7% of true cases. The 77.4% Sp also produces many false positives (each requiring follow-up). Better strategies: (a) move the threshold down to raise Se (accept more FP); (b) use this cutpoint as a first-stage triage with mandatory confirmatory testing on all positives; (c) re-screen at intervals to catch FN at the next round; (d) supplement with a second independent test for parallel screening, which raises overall Se.


Key Takeaways

The choice of cutpoint involves a trade-off between sensitivity and specificity.
ROC curves plot Se vs. (1 − Sp) across cutpoints; the AUC summarises overall test performance.
An AUC of 0.5 represents chance; values closer to 1.0 indicate better discrimination.
LR+ = Se/(1 − Sp); LR− = (1 − Se)/Sp. LRs combine both Se and Sp into a single metric.
LRs allow conversion of pre-test probability to post-test probability using a three-step odds-based calculation.


HSCI 341 · Lesson 6
