# Lesson 5 — Measures of Disease Frequency (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,780 words • ~31 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 5, Measures of Disease Frequency. This is the lesson where you build the toolkit. Every measure of association you'll meet in Lesson 7 starts from one of the frequency measures we'll cover today.

**Sarah:** Let me set the stage. Lessons 1 through 4 in this course were the conceptual scaffolding. Causal concepts. Surveillance and outbreak investigation. Sampling. And questionnaire design. The why, the who, and the how-to-collect.

**Kiffer:** Lesson 5 is the bridge. It takes the data those instruments produce and turns it into the standard quantitative outputs of epidemiology. Counts, proportions, odds, rates, risks, prevalences, mortality rates, burden of disease estimates. By the end of this lesson, when somebody says the age-standardized prevalence of diabetes among Canadians is nine point four percent, you should know exactly what that sentence claims and what it doesn't.

**Sarah:** Four sections. Section one is foundational concepts. Section two is incidence in two flavors, risk and rate. Section three is prevalence, mortality, and burden of disease, including DALYs, or Disability-Adjusted Life Years. Section four is standardization and confidence intervals.

**Kiffer:** Let's start with Section one. The lesson opens by asking why we measure disease frequency. There are three big reasons. Surveillance, observational research, and outbreak investigation.

**Sarah:** Quick definitions. Surveillance is the ongoing, systematic collection of health data so public health authorities can spot changes in disease patterns. The weekly flu reports every winter, the chronic disease dashboards run by Statistics Canada. It's a permanent monitoring system, not a one-off study.

**Kiffer:** Observational research is studies where the researcher doesn't control who gets exposed. They watch nature unfold and measure what happens. Cohort studies, case-control studies, cross-sectional studies.

**Sarah:** And outbreak investigation is what happens when an unusual cluster of disease shows up. People in one neighborhood are getting sick after eating at the same restaurant. The health unit shows up, counts cases, traces exposures, calculates attack rates.

**Kiffer:** In all three contexts the same idea sits underneath. You count how often disease occurs, you compare across groups or across time, you draw conclusions. The conclusions are only as good as the measures you chose.

**Sarah:** Morbidity and mortality are the two big categories of events you're usually counting. Morbidity is illness, mortality is death. But the same machinery applies to vaccinations, hospital admissions, births, workplace injuries. Anything you can count over a defined population in a defined period.

**Kiffer:** Then the lesson hammers a point that comes up again and again. Stratification matters. Because morbidity and mortality are tied to individual attributes, we usually calculate frequency measures stratified by age, by sex, by race or ethnicity, by geography. Sometimes by socioeconomic position or occupation.

**Sarah:** And the reason isn't just bookkeeping. Pooled, population-level numbers can hide important patterns. If you report a single national mortality rate, you'll miss that Indigenous communities in Canada experience lower life expectancy than non-Indigenous Canadians. You'll miss that young men die at very different rates from young women. Stratification reveals patterns that pooled data hide.

**Kiffer:** Right, and then the lesson introduces two time-related concepts that decide which measure you should use. The study period and the risk period.

**Sarah:** The study period is the window over which you actually observe your subjects. Often calendar time, like January twenty twenty through December twenty twenty-four. Sometimes defined by an event, like all babies born in two thousand eight followed for the first year of life. The study period is about you, the researcher.

**Kiffer:** The risk period is different. It's the window during which an individual could plausibly develop the disease. For some conditions the risk period is very short. The lesson uses post-partum eclampsia, a serious blood-pressure complication after childbirth. The risk window is usually under two days.

**Sarah:** For other conditions the risk period is essentially the lifetime. Migraine headaches. Type two diabetes. Hypertension. As long as the person is alive, they could develop the condition.

**Kiffer:** And the rule is this. When the risk period is short relative to the study period, risk measures work well. You count people. When the risk period is long relative to the study period, rate-based measures are more appropriate. You count person-time.

**Sarah:** Then the lesson lays out the four mathematical forms a measure of disease frequency can take. Counts. Proportions. Odds. Rates. Each has different units, a different range, and different uses.

**Kiffer:** A count is just a number. The number of cases observed. No denominator. Range zero to infinity. No units. If you say there were one hundred forty-two cases of measles in the province last year, that's a count. Counts on their own are limited because fifty cases in a population of a hundred is a public health emergency, while fifty cases in a population of a million is a blip.

**Sarah:** A proportion is a count divided by a total population. The numerator is a subset of the denominator. Twenty-eight cases over a hundred people. Range zero to one. Dimensionless.

**Kiffer:** Odds is the count of cases divided by the count of non-cases. Crucially, the numerator is not a subset of the denominator. Twenty-eight cases over seventy-two non-cases is about zero point three nine. Range zero to infinity. Also dimensionless. We'll come back to odds in Lesson 7 when we talk about case-control studies and odds ratios.

**Sarah:** And a rate is a count of events divided by person-time at risk. It has units of one over time. Cases per person-year, deaths per thousand person-months. Range zero to infinity, unbounded above.

**Kiffer:** And here's where the lesson gets sharp on terminology. The word rate is used loosely all the time. People say the rate of breast cancer recurrence is fourteen percent. That's not a rate. That's a risk. Or they'll say the case fatality rate of Ebola is fifty percent. That's not a rate either, that's a proportion.

**Sarah:** And being precise matters because risks and rates have genuinely different mathematical properties. They're related but not the same. Strict definition. A rate has person-time in the denominator. A proportion or risk has people in the denominator.

**Kiffer:** And the lesson distinguishes proportion from odds with a simple test. Is the numerator inside the denominator? If yes, proportion. If no, odds.
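
The subset test can be made concrete in a few lines of Python. This is only an illustrative sketch using the numbers from the conversation (twenty-eight cases among one hundred people); the function names are made up for this example and do not come from the lesson.

```python
def proportion(cases: int, total: int) -> float:
    """Cases divided by everyone; the cases are part of the denominator."""
    return cases / total


def odds(cases: int, total: int) -> float:
    """Cases divided by non-cases; the cases are NOT part of the denominator."""
    return cases / (total - cases)


cases, total = 28, 100
p = proportion(cases, total)  # 28 / 100
o = odds(cases, total)        # 28 / 72

print(f"proportion = {p:.2f}")  # proportion = 0.28
print(f"odds       = {o:.2f}")  # odds       = 0.39

# The two forms carry the same information and convert into each other.
assert abs(o - p / (1 - p)) < 1e-12
assert abs(p - o / (1 + o)) < 1e-12
```

The conversion at the end is why the two are easy to confuse: each can be recovered from the other, but they are different numbers on different scales.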

**Sarah:** That brings us to Section two. Incidence in its two flavors. Risk and rate.

**Kiffer:** Definition first. Incidence relates to the number of new events in a defined population within a specific period. The key word is new. Existing cases don't count toward incidence.

**Sarah:** And because incidence is about new cases, it's the right measure for identifying factors that cause people to become ill. If you want to ask whether smoking causes lung cancer, you compare the rate at which new cases arise in smokers versus non-smokers. That's an incidence comparison.

**Kiffer:** To count incident cases honestly, two things have to be in place. A clear case definition. And a surveillance system capable of identifying all cases.

**Sarah:** Quick definition there. A case definition is the explicit set of criteria that say whether a person counts as a case. For diabetes you might require a fasting blood glucose of seven millimoles per liter or higher on two separate occasions, or a hemoglobin A1c of six point five percent or higher. The case definition has to be specified before you start counting, otherwise different people will count differently and your numbers will be useless.

**Kiffer:** Then the lesson distinguishes first cases from all cases. For some diseases, an individual can be a case only once. Type one diabetes, for example. For other conditions, multiple episodes can occur. Migraine headaches. Recurrent urinary tract infections. The researcher has to decide which to count, and state the choice explicitly.

**Sarah:** Then the four ways to express incidence. Incident times. Incidence count. Incidence risk, denoted R. And incidence rate, denoted I.

**Kiffer:** Incident times are the actual times at which cases occur, measured from a reference event. Cases on day three, day seven, day fifteen, day twenty-eight after exposure. These are the building blocks of survival analysis.

**Sarah:** Incidence count is just the number of new cases. Used when a disease is brand new or very rare and even the count is informative.

**Kiffer:** Incidence risk, R, is the probability that an individual in the defined population will develop the disease over a specified time period. It's a proportion. Dimensionless. Range zero to one. Sometimes called cumulative incidence.

**Sarah:** And the intuition for risk is this. When an oncologist says, your probability of breast cancer recurrence in the next year is fourteen percent, that's a risk. It's an individual-level prediction in a defined window.

**Kiffer:** Incidence rate, I, is the number of new cases per unit of person-time. It has units of one over time. Sometimes called incidence density.

**Sarah:** And rate is the right measure when individuals enter and leave the population over time. Open populations. Rate is also typically what you use when the goal is to identify factors related to disease, because rates handle variable follow-up gracefully.

**Kiffer:** Let's walk through the calculation for risk. R equals the number of newly affected individuals in the defined time period, divided by the population at risk.

**Sarah:** And the population at risk piece depends on whether the population is closed or open.

**Kiffer:** A closed population has no additions and few or no losses during the study period. The lesson uses two examples. Residents of a nursing home followed for a year. Or women followed for one week post-partum. Membership is essentially fixed at the start.

**Sarah:** In a closed population, only disease-free individuals at the start count toward the population at risk. People who already have the disease are excluded, because they can't develop a new case of something they already have.

**Kiffer:** What about people who leave during the study? Those are called withdrawals. The simplest correction is to subtract half their number from the population at risk. The assumption is they leave, on average, halfway through. So they contributed about half their potential observation time.
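
As a rough sketch of the closed-population risk calculation with the half-withdrawal correction, here is a small Python function. The nursing-home numbers are hypothetical, chosen only so the arithmetic is easy to follow; they are not from the lesson.

```python
def closed_population_risk(new_cases: int, population_at_start: int,
                           already_diseased: int, withdrawals: int) -> float:
    """Incidence risk in a closed population.

    Population at risk = disease-free people at the start,
    minus half of the withdrawals (assumed to leave halfway through).
    """
    at_risk = population_at_start - already_diseased - withdrawals / 2
    return new_cases / at_risk


# HYPOTHETICAL nursing home followed for one year (illustrative numbers only).
risk = closed_population_risk(
    new_cases=14,
    population_at_start=200,
    already_diseased=20,   # excluded: they cannot become a new case
    withdrawals=10,        # counted as 5 people's worth of follow-up lost
)
print(f"risk over the year = {risk:.3f}")  # 14 / 175 = 0.080
```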

**Sarah:** An open population, by contrast, has individuals entering and leaving throughout. Women served by a cancer treatment center who have had mastectomies, for example. As new patients arrive and existing patients finish treatment or are lost to follow-up, the population is in flux.

**Kiffer:** A stable, or stationary, open population is one where the rates of additions and withdrawals stay approximately constant over time.

**Sarah:** And here's the key technical point. You cannot compute risk directly from an open population. So you compute the rate, and if you really need a risk you derive it from the rate.

**Kiffer:** On to the calculation for rate. I equals the number of new cases in the time period divided by the number of person-time units at risk. A person-time unit is one person observed for one defined period. One person-month is one person watched for one month. If you watch ten people for one year each, you have ten person-years.

**Sarah:** And the rule, when you're counting first cases, is this. After an individual contracts the disease, they're no longer at risk and no longer contribute person-time to the denominator. They've already had the event. They can't be a new case anymore.

**Kiffer:** The lesson walks through a clean worked example. Four people, followed for one month, thirty days. Person one is healthy the whole month. Contributes one full person-month at risk.

**Sarah:** Person two gets sick on day ten. Contributed ten days at risk. Ten over thirty is one third, or zero point three three person-months.

**Kiffer:** Person three gets sick on day twenty. Contributed twenty days. Twenty over thirty is two thirds, or zero point six seven person-months.

**Sarah:** Person four moves away on day fifteen. Lost to follow-up. Fifteen days at risk, which is zero point five zero person-months.

**Kiffer:** Total person-time is one plus zero point three three plus zero point six seven plus zero point five zero, which equals two point five zero person-months. Total new cases is two. So the incidence rate is two over two point five zero, which equals zero point eight zero cases per person-month.

**Sarah:** And notice the conceptual move there. We treated the four people very differently in the denominator depending on what happened to them. Each contributed exactly the amount of time they were genuinely susceptible. That's what distinguishes a rate from a risk. A risk asks what fraction of people developed disease. A rate asks, given the time at risk people contributed, how often did the disease arise per unit of that time.
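
The four-person example is small enough to reproduce directly. The sketch below assumes a thirty-day month, as in the example; the variable names are illustrative.

```python
DAYS_IN_MONTH = 30

# (days at risk, became a case?) for the four people in the example
follow_up = [
    (30, False),  # person 1: healthy all month
    (10, True),   # person 2: sick on day 10
    (20, True),   # person 3: sick on day 20
    (15, False),  # person 4: moved away on day 15 (lost to follow-up)
]

# Each person contributes only the time they were actually at risk.
person_months = sum(days / DAYS_IN_MONTH for days, _ in follow_up)
new_cases = sum(1 for _, became_case in follow_up if became_case)
incidence_rate = new_cases / person_months

print(f"person-months at risk = {person_months:.2f}")   # 2.50
print(f"incidence rate = {incidence_rate:.2f} cases per person-month")  # 0.80
```

Dividing the two cases by four people instead would give 0.5, a unitless fraction that silently treats person four as if they had been watched for the whole month.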

**Kiffer:** Then the lesson lays out the relationship between risk and rate. They're related but not the same. For a closed population, the precise formula is this. Risk equals one minus e to the negative incidence rate times the time interval. Where e is the base of natural logarithms, approximately two point seven one eight.

**Sarah:** The exponential captures the fact that as people become diseased, they leave the at-risk pool. Risk is bounded above by one. The exponential keeps the math honest.

**Kiffer:** And there's a useful approximation. When the product of incidence rate and time interval is small, less than about zero point one, the simpler linear formula works. Risk is approximately incidence rate times the time interval. Because for small x, e to the negative x is approximately one minus x.

**Sarah:** So for rare diseases over short windows, you just multiply the rate by the time and get the risk. For common diseases or long windows, you need the exponential.
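
To see where the linear shortcut starts to drift, the following sketch compares the exponential formula with the approximation. The rates in the loop are made-up illustrative values, each applied over a window of one time unit.

```python
import math


def risk_exact(rate: float, time: float) -> float:
    """Closed-population formula: risk = 1 - exp(-rate * time)."""
    return 1 - math.exp(-rate * time)


def risk_linear(rate: float, time: float) -> float:
    """Small-product shortcut: risk is roughly rate * time."""
    return rate * time


# Illustrative rates (not from the lesson), each over a window of length 1.
for rate in (0.01, 0.05, 0.10, 0.30, 0.80):
    exact = risk_exact(rate, 1.0)
    approx = risk_linear(rate, 1.0)
    print(f"rate*time={rate:.2f}  exact={exact:.4f}  "
          f"linear={approx:.4f}  gap={approx - exact:.4f}")
```

At a product of 0.10 the two answers differ by about half a percentage point. At 0.80 the linear shortcut overshoots by roughly twenty-five percentage points, which is why the exponential is needed for common diseases or long windows.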

**Kiffer:** Okay, on to Section three. Prevalence, mortality, and burden of disease.

**Sarah:** Prevalence is fundamentally different from incidence. Incidence counts new cases. Prevalence counts existing cases. The number of individuals in a population who have the disease right now, at a particular moment.

**Kiffer:** The formula is, P equals cases of disease at a point in time, divided by individuals in the population at the same point in time.

**Sarah:** The lesson uses a clean example. Seventy-five athletes are tested for performance-enhancing drug use, and three test positive. Prevalence is three over seventy-five, which equals zero point zero four, or four percent.

**Kiffer:** Notice prevalence doesn't tell you when those athletes started using. It just counts who has the attribute right now.

**Sarah:** Then comes the relationship that connects everything. In a stable population where the incidence rate is constant, prevalence, incidence, and mean duration are linked by a single formula.

**Kiffer:** Prevalence equals the product of incidence rate and average duration, divided by one plus that product. Multiply incidence rate by duration. Divide that product by one plus the same product. The result is the prevalence.

**Sarah:** The intuition is this. Each new case adds a person to the prevalent pool. Each recovery or death removes a person. In steady state, the inflow equals the outflow, and the size of the pool is determined by the inflow rate times how long each person stays in. The plus-one in the denominator handles the fact that prevalence cannot exceed one.

**Kiffer:** Let's do a worked example. Influenza in an urban population. Incidence rate is zero point three per person-year. Mean duration of an infection is three weeks, which in years is about zero point zero five eight.

**Sarah:** Plug in. Incidence times duration is zero point three times zero point zero five eight, which equals zero point zero one seven four. Divide that by one plus zero point zero one seven four, which is one point zero one seven four. So prevalence equals zero point zero one seven, or about one point seven percent.

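**Kiffer:** Translation. On any given day, you'd expect about one point seven percent of this population to currently have flu. Even though over a year, roughly a quarter of the population will get it. About twenty-six percent, using the exponential formula from Section two, because a rate of zero point three per person-year is too large for the linear shortcut. Prevalence is small because duration is short.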
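
The influenza calculation can be checked with a short sketch. It assumes the steady-state formula as stated in the conversation and converts three weeks to years as 3/52; the function name is illustrative.

```python
def steady_state_prevalence(incidence_rate: float, mean_duration: float) -> float:
    """P = (I * D) / (1 + I * D), valid for a stable population with constant I."""
    product = incidence_rate * mean_duration
    return product / (1 + product)


incidence_rate = 0.3     # cases per person-year
mean_duration = 3 / 52   # three weeks expressed in years, about 0.058

prevalence = steady_state_prevalence(incidence_rate, mean_duration)
print(f"prevalence = {prevalence:.3f}")  # prevalence = 0.017
```

Incidence rate and duration must be in the same time unit before they are multiplied. Mixing a per-year rate with a duration in weeks is the easiest way to get this formula wrong.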

**Sarah:** And the lesson has an interactive simulator that I think is the best teaching tool in the whole course. The sink analogy.

**Kiffer:** Picture a kitchen sink. A tap pouring water in. Two drains at the bottom. The water level depends on how fast water comes in versus how fast it leaves.

**Sarah:** The tap is incidence. New cases flowing in. The two drains are recovery and mortality. The two ways people exit the diseased state. The water level is prevalence. The simulator lets you move the sliders and watch the water level settle to a new equilibrium.
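
The sink can be imitated with a very small simulation. This is not the lesson's simulator, only a sketch under stated assumptions: a fixed-size population, a constant incidence rate applied to the healthy fraction, and constant recovery and mortality rates applied to the diseased fraction. The 99-to-1 split between the two drains is hypothetical.

```python
def simulate_sink(incidence_rate: float, recovery_rate: float,
                  mortality_rate: float, years: float = 2.0,
                  dt: float = 0.001, start_prevalence: float = 0.0) -> float:
    """Step the water level (prevalence) forward with simple Euler updates.

    Inflow : incidence_rate * (1 - prevalence), only healthy people can fall ill.
    Outflow: (recovery_rate + mortality_rate) * prevalence, the two drains.
    """
    prevalence = start_prevalence
    for _ in range(round(years / dt)):
        inflow = incidence_rate * (1 - prevalence)
        outflow = (recovery_rate + mortality_rate) * prevalence
        prevalence += dt * (inflow - outflow)
    return prevalence


incidence_rate = 0.3            # per person-year, from the flu example
mean_duration = 3 / 52          # years
exit_rate = 1 / mean_duration   # total drain rate per year
recovery_rate = 0.99 * exit_rate   # HYPOTHETICAL split between the drains
mortality_rate = 0.01 * exit_rate

simulated = simulate_sink(incidence_rate, recovery_rate, mortality_rate)
product = incidence_rate * mean_duration
formula = product / (1 + product)

print(f"simulated water level = {simulated:.4f}")
print(f"steady-state formula  = {formula:.4f}")
```

Setting inflow equal to outflow in this model gives the same expression as the steady-state formula, so the simulated level settles on the formula's value. Only the combined drain rate affects the level; how it splits between recovery and death does not.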

**Kiffer:** And the conceptual punchline is this. Two diseases can have very different prevalences for completely different reasons.

**Sarah:** Take diabetes. Low incidence relative to many infectious diseases, but very long duration. People live with diabetes for decades. The sink fills slowly but barely drains. Prevalence ends up high. Approaching ten percent of the adult population in many countries.

**Kiffer:** Now contrast that with pancreatic cancer. Low incidence and very short duration, because mortality is high. People diagnosed often die within months. The drain is wide open. Prevalence stays near zero, even though the disease is one of the deadliest cancers.

**Sarah:** Which is why prevalence alone is a poor measure of disease risk. It mixes together occurrence and survival. For risk-factor research, use incidence.

**Kiffer:** Now mortality. Mortality statistics use exactly the same formulas as P, R, and I. The only difference is the outcome of interest is death, not disease.

**Sarah:** Strictly speaking, the mortality rate is the incidence rate of mortality. Cases per person-time, where the cases are deaths. There's an overall mortality rate, which counts deaths from all causes. And there's the cause-specific mortality rate, which restricts to deaths attributed to a specific disease.

**Kiffer:** And the lesson is honest about how hard cause attribution can be. The example is a recumbent patient who regurgitates, contracts aspiration pneumonia from inhaling stomach contents, and then dies. Did they die from the original condition that caused the recumbency? From the pneumonia? With the pneumonia? The cause is usually deemed the proximate cause, the final trigger. But this is a judgment call.

**Sarah:** Cause of death coding follows international rules from the World Health Organization, the International Classification of Diseases. But certifying physicians still have to make judgment calls. That introduces variability into cause-specific mortality numbers.

**Kiffer:** Then comes the burden of disease framework. The motivation is that mortality, incidence, and prevalence each tell you something different. But none on its own gives a complete picture of how much a disease weighs on a population.

**Sarah:** Pancreatic cancer is deadly but rare. Mortality captures it, prevalence misses it. Low-back pain is rarely fatal but enormously common and disabling. Mortality misses it. Vision loss, depression, untreated chronic pain. Same problem.

**Kiffer:** To compare these on the same scale, the World Health Organization and the Global Burden of Disease Study express disease impact in healthy years of life lost. A common currency that combines premature death and time spent in less than perfect health.

**Sarah:** Quick context. The Global Burden of Disease Study, abbreviated GBD, is a massive collaborative research effort that estimates the burden of every major disease in every country. It started in the nineteen nineties, sponsored by the World Bank and World Health Organization, now coordinated by the Institute for Health Metrics and Evaluation in Seattle.

**Kiffer:** The currency has three components. Years of Life Lost, abbreviated YLL. Years Lived with Disability, abbreviated YLD. And Disability-Adjusted Life Years, abbreviated DALYs.

**Sarah:** Let's take them one at a time. Years of Life Lost. Quantifies the burden from premature mortality. Each death is weighted by the additional years that person would have been expected to live had they survived to a standard life expectancy.

**Kiffer:** So a death at age thirty contributes far more YLLs than a death at age ninety. Because a thirty-year-old would have been expected to live another fifty-something years. A ninety-year-old, only a few. Crude mortality counts treat both deaths equally. YLLs do not.

**Sarah:** The formula sums over ages, with the number of deaths at each age multiplied by the standard life expectancy remaining at that age. The Global Burden of Disease standard life table sets life expectancy at birth at about eighty-six years.

**Kiffer:** Next, Years Lived with Disability. Quantifies the burden from non-fatal health loss. Each year spent with a condition is multiplied by a disability weight, between zero, perfect health, and one, a health state as bad as death.

**Sarah:** Disability weights are derived from large population surveys. Respondents are asked to compare paired health states. Would you rather live with mild back pain or with major depression. The pattern of comparisons is fed into a statistical model that assigns each state a weight on the zero-to-one scale.

**Kiffer:** In the incidence-based formula, YLD equals incident cases times average duration times the disability weight. In the prevalence-based version the Global Burden of Disease has used since twenty ten, YLD equals prevalent cases times the disability weight.

**Sarah:** And Disability-Adjusted Life Years, the DALY. Just the simple sum. DALY equals YLL plus YLD. One DALY equals one healthy year of life lost.

**Kiffer:** Let's do a worked example. Breast cancer in a population of one hundred thousand women. Suppose breast cancer causes twenty deaths in one year. Average age at death is sixty. Standard life expectancy at age sixty is twenty-five more years.

**Sarah:** YLL equals twenty deaths times twenty-five years per death equals five hundred years.

**Kiffer:** Now the YLD piece. Suppose five hundred women in this population are living with breast cancer at any given time, average duration four years before remission or death, disability weight zero point three zero.

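**Sarah:** Because five hundred is a prevalent count, we use the prevalence-based formula. YLD equals five hundred prevalent cases times zero point three zero, which equals one hundred fifty years. We don't multiply by the four-year duration, because duration is already reflected in how many women are living with the disease at once. As a cross-check with the incidence-based formula, five hundred prevalent cases at four years each implies about one hundred twenty-five new cases a year, and one hundred twenty-five times four times zero point three zero is also one hundred fifty.

**Kiffer:** DALYs equal YLL plus YLD, which equals five hundred plus one hundred fifty, which equals six hundred fifty healthy years lost from breast cancer in this population in this year.

**Sarah:** And here's what's powerful about that. If we'd looked only at mortality, we'd have seen twenty deaths and stopped. The DALY makes visible that nearly a quarter of the breast cancer burden in this population is non-fatal. That has policy implications. It should shape investment in screening, treatment, survivorship support, mental health services for survivors, not just end-of-life care.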
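
A small sketch of the YLL, YLD, and DALY formulas with this example's inputs. It applies the prevalence-based YLD formula as stated earlier (prevalent cases times disability weight, with no separate duration factor) and cross-checks it against the incidence-based form. The cross-check assumes a steady state in which prevalent cases are roughly incident cases times duration; the function names are illustrative.

```python
def years_of_life_lost(deaths_by_age: dict, remaining_life_expectancy: dict) -> float:
    """Sum over ages of deaths * standard remaining life expectancy at that age."""
    return sum(deaths * remaining_life_expectancy[age]
               for age, deaths in deaths_by_age.items())


def yld_prevalence_based(prevalent_cases: float, disability_weight: float) -> float:
    return prevalent_cases * disability_weight


def yld_incidence_based(incident_cases: float, mean_duration_years: float,
                        disability_weight: float) -> float:
    return incident_cases * mean_duration_years * disability_weight


# Inputs from the example (all 20 deaths placed at the average age of 60).
deaths_by_age = {60: 20}
remaining_life_expectancy = {60: 25}
prevalent_cases = 500
mean_duration_years = 4
disability_weight = 0.30

yll = years_of_life_lost(deaths_by_age, remaining_life_expectancy)  # 20 * 25
yld = yld_prevalence_based(prevalent_cases, disability_weight)      # 500 * 0.30

# Cross-check, ASSUMING steady state: prevalent = incident * duration.
implied_incident_cases = prevalent_cases / mean_duration_years      # 125 per year
yld_check = yld_incidence_based(implied_incident_cases,
                                mean_duration_years, disability_weight)

dalys = yll + yld
print(f"YLL = {yll:.0f}, YLD = {yld:.0f} (check {yld_check:.0f}), DALYs = {dalys:.0f}")
print(f"non-fatal share of burden = {yld / dalys:.0%}")
```

Multiplying a prevalent count by duration as well would count each woman's time with the disease twice, which is why the two formulas take different inputs.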

**Kiffer:** Let's talk about why DALYs add value beyond mortality, incidence, and prevalence on their own. There are five reasons.

**Sarah:** One. Common currency. They let you compare diseases that primarily kill against diseases that primarily disable. Ischemic heart disease against major depression. No single mortality or morbidity measure can do that.

**Kiffer:** Two. Premature death is weighted. A young death contributes more than an old death. Crude mortality rates ignore that.

**Sarah:** Three. Non-fatal burden becomes visible. Low-back pain, anxiety, hearing loss, migraine, untreated vision loss. Major drivers of population health loss, nearly invisible in mortality statistics.

**Kiffer:** Four. Priority-setting. Ministries of health and global funders use DALYs to compare cost-effectiveness across very different interventions. Dollars per DALY averted is a standard metric.

**Sarah:** Five. Tracking change over time. DALYs capture epidemiological transitions. The shift from infectious, fatal diseases toward chronic, disabling ones. The dominant pattern in global health over the past century.

**Kiffer:** But DALYs are not value-neutral, and the lesson is careful to flag three caveats.

**Sarah:** One. Disability weights reflect aggregate survey responses. They may not match a specific patient's lived experience. Someone with severe depression may rate their state very differently from how the population average rates it.

**Kiffer:** Two. Disability weights have shifted across Global Burden of Disease revisions. The weight for a given state in two thousand may not match the weight assigned in twenty twenty. So if you're comparing burden across years, check that the weights are comparable.

**Sarah:** Three. Methodological choices change rankings. The standard life expectancy you use, age-weighting, time-discounting. All explicit value choices that can substantially change which diseases come out on top of the burden ranking.

**Kiffer:** So when you read a DALY estimate, always check which version of the methodology produced it, and whether comparisons across years use the same conventions.

**Sarah:** Then the lesson runs through a quick survey of other measures.

**Kiffer:** First, attack rate. Despite the name, it's a risk, not a rate. The proportion of an exposed population that develops disease during an outbreak. Cases divided by exposed people.

**Sarah:** Secondary attack rate. The proportion of contacts of a primary case who develop the disease. Useful for estimating transmissibility.

**Kiffer:** Case fatality rate. The proportion of people with a disease who die from it. Technically a risk. Untreated rabies in humans is essentially a hundred percent. Seasonal influenza is well under one percent.

**Sarah:** Proportional morbidity or proportional mortality. The proportion of all morbidity or all deaths attributable to a specific cause. Useful when you don't have a good population-at-risk denominator but you have data on total events.
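
All four of these measures are proportions, so one helper covers them. The outbreak numbers below are entirely hypothetical and exist only to show which count goes in which denominator.

```python
def proportion(numerator: int, denominator: int) -> float:
    """Numerator must be a subset of the denominator."""
    return numerator / denominator


# HYPOTHETICAL restaurant outbreak (illustrative numbers only).
exposed_diners, ill_diners = 120, 30
household_contacts, ill_contacts = 80, 12
total_cases, deaths_among_cases = 42, 2          # 30 primary + 12 secondary
deaths_from_cause, deaths_all_causes = 50, 1000  # separate hypothetical registry

attack_rate = proportion(ill_diners, exposed_diners)                     # a risk
secondary_attack_rate = proportion(ill_contacts, household_contacts)
case_fatality = proportion(deaths_among_cases, total_cases)
proportional_mortality = proportion(deaths_from_cause, deaths_all_causes)

print(f"attack rate            = {attack_rate:.1%}")            # 25.0%
print(f"secondary attack rate  = {secondary_attack_rate:.1%}")  # 15.0%
print(f"case fatality          = {case_fatality:.1%}")          # 4.8%
print(f"proportional mortality = {proportional_mortality:.1%}") # 5.0%
```

None of these has person-time in the denominator, so despite the names none is a rate in the strict sense used in this lesson.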

**Kiffer:** And the lesson grounds all of this in real Canadian surveillance systems. Because the numbers in textbooks and news reports come from named systems with specific definitions, denominators, and reporting lags.

**Sarah:** The big ones to know. The Canadian Chronic Disease Surveillance System, abbreviated CCDSS. Run by the Public Health Agency of Canada. Links provincial and territorial physician billing and hospital records. Produces national prevalence and incidence estimates for diabetes, hypertension, mental illness, chronic obstructive pulmonary disease, asthma, ischemic heart disease, and cancer.

**Kiffer:** The Canadian Notifiable Disease Surveillance System, abbreviated CNDSS. Aggregates reports from provinces and territories on legally notifiable communicable diseases. Measles, pertussis, syphilis, HIV. Diseases where each case is reported to public health authorities.

**Sarah:** FluWatch. The Public Health Agency of Canada's surveillance program for influenza. Combines sentinel physician reports, laboratory data, and severe outcome surveillance to track influenza activity in near real time during flu season.

**Kiffer:** The Respiratory Virus Detection Surveillance System, abbreviated RVDSS. The Public Health Agency of Canada's laboratory-based surveillance for influenza and other respiratory viruses, including respiratory syncytial virus and seasonal coronaviruses.

**Sarah:** The Canadian Cancer Registry. Run by Statistics Canada with provincial cancer registries. Population-based cancer incidence and survival data. The source for the annual Canadian Cancer Statistics report.

**Kiffer:** Canadian Vital Statistics, the births and deaths database. Statistics Canada. Source for crude and age-standardized mortality, life expectancy, infant mortality, and cause-specific death rates.

**Sarah:** The BC Centre for Disease Control, abbreviated BCCDC. Provincial reportable-disease, sexually transmitted infection, and overdose surveillance. Dashboards online for things like the BC Overdose Cohort and weekly respiratory virus reports.

**Kiffer:** The BC Coroners Service. The primary source for unregulated drug deaths and suicide statistics in British Columbia. Coroners investigate any death not clearly due to natural causes, so they capture things vital statistics systems miss or report with long delays.

**Sarah:** The Canadian Institute for Health Information, abbreviated CIHI. Two key administrative datasets. The Discharge Abstract Database, abbreviated DAD, captures inpatient hospitalizations across Canada. The National Ambulatory Care Reporting System, abbreviated NACRS, captures emergency department and ambulatory care visits.

**Kiffer:** And the punchline. When you cite a Canadian rate, always state which surveillance system produced it. Each system has its own case definition, denominator, and reporting lag.

**Sarah:** The CCDSS diabetes example shows what reading a single published number actually requires. The system reports age-standardized prevalence of diagnosed diabetes among Canadians aged one and older was about nine point four percent in fiscal year twenty seventeen to twenty eighteen.

**Kiffer:** To interpret nine point four percent honestly, you need to know three things. First, the case definition. A CCDSS diabetes case is a person with one hospital discharge or two physician claims for diabetes within a two-year window. A validated administrative algorithm with sensitivity around eighty-six percent and specificity around ninety-nine percent.

**Sarah:** Second, the denominator. All individuals registered with provincial health insurance during the fiscal year. Because Canada has near-universal public health insurance, the denominator is essentially the entire population. Not a sample.

**Kiffer:** Third, the standardization. Rates are direct-standardized to the twenty eleven Canadian population. So comparisons across years and provinces aren't distorted by the fact that some places or some years have older populations.

**Sarah:** And all three concepts from this lesson, case definition, denominator, and standardization, are required just to read one published number correctly.

**Kiffer:** Which brings us to Section four. Standardization and confidence intervals.

**Sarah:** Why standardize? Because when you compare disease frequency between populations, differences in host characteristics, like age, sex, geographic location, can confound the comparison.

**Kiffer:** Quick definition. Confounding is when a third variable is associated with both the populations you're comparing and the outcome you're measuring, so it distorts the comparison. Population A may have a higher crude rate than Population B, but if Population A is older and the disease is age-related, the comparison reflects the age difference, not a true difference in age-specific risk.

**Sarah:** Standardization is the technique for adjusting for these confounders. It makes the comparison fair by holding the demographic structure constant.

**Kiffer:** There are two main types. Direct standardization and indirect standardization.

**Sarah:** Let's take direct standardization first. The recipe is, you take your study population's stratum-specific rates, the actual rates in each age band, and you apply them to a reference population's age structure. So if you measured age-specific rates in your study, you multiply each by the proportion of the reference population in that band, sum across bands, and the result is the directly standardized rate. The rate your study population would have if it had the reference age structure.

**Kiffer:** Direct standardization gives you a fair comparison. Every population standardized to the same reference is now expressed in the same age structure, so you can compare without worrying about age confounding.
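
A minimal sketch of the direct-standardization recipe just described. The age bands, rates, and reference proportions are invented for illustration, and the function name is hypothetical rather than anything from the lesson.

```python
# Hypothetical age-specific rates (cases per person-year) in the study population.
study_rates = {"0-39": 0.001, "40-64": 0.005, "65+": 0.020}

# Hypothetical reference population: proportion of people in each age band.
reference_weights = {"0-39": 0.50, "40-64": 0.35, "65+": 0.15}


def direct_standardized_rate(rates, weights):
    """Weight each stratum-specific rate by the reference proportion, then sum."""
    if set(rates) != set(weights):
        raise ValueError("rates and weights must cover the same age bands")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("reference weights must sum to 1")
    return sum(rates[band] * weights[band] for band in rates)


dsr = direct_standardized_rate(study_rates, reference_weights)
print(f"Directly standardized rate: {dsr * 1000:.2f} per 1,000 person-years")
```

Any second population standardized with the same `reference_weights` would be directly comparable with this figure.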

**Sarah:** Indirect standardization is the inverse. You take a reference population's stratum-specific rates and apply them to your study population's structure. This produces an expected number of cases. Then you compare observed cases in your study to that expected number.

**Kiffer:** The ratio of observed to expected is the Standardized Mortality Ratio, abbreviated SMR, when the outcome is death. Or the Standardized Incidence Ratio, abbreviated SIR, when the outcome is incident disease. Greater than one means your population has more events than expected. Less than one, fewer.
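
A rough sketch of the indirect method and the observed-to-expected ratio. All numbers are invented, and describing the study population's structure as person-years per age band is an assumption made for this example.

```python
# Hypothetical reference death rates (deaths per person-year) by age band.
reference_rates = {"0-39": 0.001, "40-64": 0.004, "65+": 0.030}

# Hypothetical study population: person-years at risk by age band.
study_person_years = {"0-39": 2000, "40-64": 1500, "65+": 500}

# Hypothetical total deaths observed in the study population.
observed_deaths = 30


def expected_events(ref_rates, person_years):
    """Events expected if the study population experienced the reference rates."""
    return sum(ref_rates[band] * person_years[band] for band in person_years)


expected = expected_events(reference_rates, study_person_years)
smr = observed_deaths / expected
print(f"Expected deaths: {expected:.1f}")
print(f"SMR: {smr:.2f}")
```

Only the total observed count is needed from the study population, which is why the method still works when its own stratum-specific rates are unstable.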

**Sarah:** When do you choose one over the other? Direct works when you have stable, reliable stratum-specific rates in your study population. It's used heavily in cancer registry reporting and chronic disease surveillance.

**Kiffer:** Indirect is preferred when stratum-specific rates in your study population are unavailable or based on small samples. It borrows reliable rates from a large reference population, so it produces stable estimates even for small study populations or rare outcomes.

**Sarah:** Then the lesson turns to standard errors and confidence intervals. The standard error is a measure of how much your estimate would vary if you repeated the study many times. Small standard error, precise estimate. Large, unstable.

**Kiffer:** For a proportion, the standard error is the square root of p times one minus p, all over n. Here p is the estimated proportion and n is the sample size.

**Sarah:** And the formula has nice intuitions. Standard error is largest when p is around one-half, because p times one minus p peaks there, so that's where the binomial variance is greatest. Standard error shrinks as sample size grows. Both are exactly what you'd want.
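
A small sketch of the proportion formula and its two intuitions. The function name and the example values are illustrative assumptions.

```python
import math


def se_proportion(p, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


# For a fixed sample size, the standard error is largest near p = 0.5.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"n=100, p={p:.1f}: SE={se_proportion(p, 100):.4f}")

# For a fixed p, quadrupling the sample size halves the standard error.
for n in (100, 400, 1600):
    print(f"p=0.5, n={n}: SE={se_proportion(0.5, n):.4f}")
```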

**Kiffer:** For an incidence rate, the formula is based on the Poisson distribution, which is the standard model for counts of rare events over person-time. The standard error is the square root of the number of cases, divided by total person-time. To be clear, only the case count sits under the square root; the person-time does not.

**Sarah:** Once you have a standard error, the approximate ninety-five percent confidence interval is the estimate plus or minus one point nine six times the standard error. The one point nine six comes from the standard normal distribution. It's the Z-score that captures ninety-five percent of the probability mass in the middle.
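
A minimal sketch of the approximate interval for both a proportion and a rate, using the standard errors described in this section. The function names and the example counts are hypothetical.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def approx_ci_proportion(cases, n):
    """Estimate and approximate 95% CI for a proportion."""
    p = cases / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - Z_95 * se, p + Z_95 * se


def approx_ci_rate(cases, person_time):
    """Estimate and approximate 95% CI for an incidence rate (Poisson-based SE)."""
    rate = cases / person_time
    se = math.sqrt(cases) / person_time  # only the case count is under the root
    return rate, rate - Z_95 * se, rate + Z_95 * se


# Hypothetical data: 94 cases among 1,000 people; 25 cases over 5,000 person-years.
p, p_lo, p_hi = approx_ci_proportion(94, 1000)
r, r_lo, r_hi = approx_ci_rate(25, 5000)
print(f"Proportion: {p:.3f} (95% CI {p_lo:.3f} to {p_hi:.3f})")
print(f"Rate: {r * 1000:.2f} per 1,000 person-years "
      f"(95% CI {r_lo * 1000:.2f} to {r_hi * 1000:.2f})")
```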

**Kiffer:** And the lesson is honest about a limitation. In small samples, or when disease frequency is very low or very high, the approximate confidence interval can produce nonsense. A lower bound below zero for a proportion. An upper bound above one. In those cases you switch to exact confidence intervals based on the binomial distribution for proportions, or the Poisson distribution for rates.

**Sarah:** Exact methods are computationally a little more involved, but every modern statistical package can produce them. Particularly important for rare conditions or small subpopulations.
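
A sketch of how the approximate interval can fail and what an exact alternative looks like. It assumes the Clopper-Pearson construction as the exact binomial method, which the lesson does not name, and solves for the bounds by simple bisection using only the standard library. The data (1 case in 50 people) and all function names are invented for illustration.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, lo=0.0, hi=1.0, iterations=100):
    """Bisection: find p in [lo, hi] with f(p) = target, for f increasing in p."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci_proportion(x, n, alpha=0.05):
    """Clopper-Pearson interval for x cases out of n."""
    # Lower bound: the p at which P(X >= x) equals alpha/2 (0 if there are no cases).
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha/2 (1 if everyone is a case).
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


x, n = 1, 50
p_hat = x / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
wald_lower, wald_upper = p_hat - 1.96 * se, p_hat + 1.96 * se
exact_lower, exact_upper = exact_ci_proportion(x, n)

print(f"Approximate: {wald_lower:.4f} to {wald_upper:.4f}")  # lower bound is negative
print(f"Exact:       {exact_lower:.4f} to {exact_upper:.4f}")
```

The same bisection idea carries over to rates by swapping the binomial tail probabilities for Poisson ones. In practice a statistical package would normally be used instead of hand-rolled code like this.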

**Kiffer:** Okay. Let's pull everything together. Big takeaways.

**Sarah:** First. There are four mathematical forms. Counts, proportions, odds, and rates. Each has different units and properties. Counts have no denominator. Proportions have the numerator inside the denominator and range zero to one. Odds have the numerator outside the denominator and range zero to infinity. Rates have person-time in the denominator and units of one over time.

**Kiffer:** And the practical move. When somebody hands you a number labeled rate in a news report, ask whether the denominator is people or person-time. If it's people, it's actually a risk or proportion, regardless of what they call it.
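
A quick sketch of the four forms computed from one hypothetical cohort. The numbers are invented; the point is where the numerator sits relative to the denominator in each case.

```python
# Hypothetical closed cohort followed over time (all numbers invented).
people_at_risk = 100    # disease-free people at the start of follow-up
cases = 20              # new cases during follow-up
person_years = 380.0    # total time at risk contributed by the cohort

count = cases                                # no denominator
proportion = cases / people_at_risk          # numerator inside denominator, 0 to 1
odds = cases / (people_at_risk - cases)      # numerator outside denominator, 0 to infinity
rate = cases / person_years                  # person-time denominator, units of 1/time

print(f"Count:      {count}")
print(f"Proportion: {proportion:.2f}")
print(f"Odds:       {odds:.2f}")
print(f"Rate:       {rate:.4f} per person-year")
```

Note that a "rate" of 20 per 100 people here would really be the proportion, since its denominator is people rather than person-time.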

**Sarah:** Second. Incidence has two faces. Risk and rate. Risk is for closed populations and short risk periods. Rate is for open populations and long risk periods. The relationship is, risk equals one minus e to the negative incidence rate times the time interval. For small values of that product, the linear approximation, risk equals incidence rate times time, works well.
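
A short sketch comparing the exponential formula with the linear approximation. It assumes the incidence rate is constant over the interval, and the rate and time values are made up.

```python
import math


def risk_from_rate(incidence_rate, time):
    """Risk over an interval, assuming a constant incidence rate."""
    return 1 - math.exp(-incidence_rate * time)


# Hypothetical (rate per person-year, years of follow-up) pairs.
for incidence_rate, years in [(0.01, 1), (0.05, 2), (0.2, 5)]:
    exact = risk_from_rate(incidence_rate, years)
    linear = incidence_rate * years
    print(f"rate x time = {linear:.2f}: exact risk {exact:.4f}, linear approx {linear:.4f}")
```

The two agree closely when the product is small and drift apart as it grows; once the product reaches one, the linear version claims a risk of 100 percent while the exponential formula gives about 63 percent.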

**Kiffer:** Third. Prevalence is determined by both incidence and duration. Prevalence equals the product of incidence rate and average duration, divided by one plus that product. Two diseases with the same prevalence can have wildly different incidence and duration.
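
A minimal sketch of the steady-state formula, showing two hypothetical conditions with very different incidence and duration landing on the same prevalence. The function name and numbers are illustrative assumptions.

```python
def steady_state_prevalence(incidence_rate, mean_duration):
    """Prevalence = (I x D) / (1 + I x D), assuming a steady-state population."""
    product = incidence_rate * mean_duration
    return product / (1 + product)


# Hypothetical chronic condition: rare onset, long duration.
chronic = steady_state_prevalence(incidence_rate=0.002, mean_duration=25.0)

# Hypothetical acute condition: frequent onset, short duration.
acute = steady_state_prevalence(incidence_rate=0.1, mean_duration=0.5)

print(f"Chronic: {chronic:.4f}")
print(f"Acute:   {acute:.4f}")
```

Both products equal 0.05, so both prevalences come out the same, a little under 5 percent. When the product is that small, the prevalence is close to the product itself.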

**Sarah:** And the conceptual punchline from the sink analogy. Incidence is the tap, recovery and mortality are the drains, prevalence is the water level. Diabetes has a slow tap and barely any drain, so high prevalence. Pancreatic cancer has a slow tap and a wide-open drain, so prevalence stays near zero even though the disease is deadly.

**Kiffer:** Which is why prevalence alone is a poor measure of disease risk. It conflates occurrence and survival. For risk-factor research, use incidence.

**Sarah:** Fourth. Mortality statistics use the same formulas as morbidity, with death as the outcome. Cause-specific mortality requires judgment, and the proximate cause convention introduces variability. Always read mortality statistics as the product of a coding system and a certifying physician's judgment.

**Kiffer:** Fifth. Burden of disease. Years of Life Lost capture premature mortality. Years Lived with Disability capture non-fatal health loss. Disability-Adjusted Life Years are their sum. DALYs put fatal and non-fatal burden on a single scale of healthy years lost.

**Sarah:** DALYs add value because they're a common currency, they weight premature death, they make non-fatal burden visible, they enable cost-effectiveness comparisons, and they capture epidemiological transitions over time.

**Kiffer:** But DALYs are not value-neutral. Disability weights are aggregates that may not match individual experience. Weights have shifted across Global Burden of Disease revisions. Methodological choices change rankings.

**Sarah:** Sixth. Specialized measures. Attack rate, secondary attack rate, case fatality rate, proportional morbidity and mortality. Most are technically risks, despite being called rates. Be precise.

**Kiffer:** Seventh. The Canadian numbers in your textbooks come from named surveillance systems. CCDSS for chronic disease prevalence and incidence. CNDSS for legally reportable communicable diseases. FluWatch and RVDSS for influenza and respiratory viruses. Canadian Cancer Registry for cancer. Canadian Vital Statistics for mortality. BCCDC dashboards for provincial reportable disease and overdose surveillance. BC Coroners Service for unregulated drug deaths. CIHI's Discharge Abstract Database and National Ambulatory Care Reporting System for hospital and emergency department data.

**Sarah:** And the CCDSS diabetes example shows that reading one published number, like nine point four percent age-standardized prevalence, requires understanding case definition, denominator, and standardization simultaneously.

**Kiffer:** Eighth. Standardization removes confounding by demographic structure. Direct applies your population's rates to a reference structure. Indirect applies a reference's rates to your population's structure and produces a Standardized Mortality Ratio or Standardized Incidence Ratio. Use direct when you have stable rates in your study. Use indirect when you don't.

**Sarah:** Ninth. Confidence intervals quantify precision. For a proportion, the approximate ninety-five percent confidence interval is the estimate plus or minus one point nine six times the standard error, where the standard error is the square root of p times one minus p divided by n. For a rate, the formula uses person-time and the Poisson distribution. When samples are small or frequencies extreme, switch to exact methods.

**Kiffer:** And one practical note before we wrap. Don't skip the sink analogy simulator. Watching the water level settle to a new equilibrium as you change the tap and drains is the fastest way to internalize that, in steady state, prevalence is driven by the product of incidence and duration, and for a rare condition is approximately equal to it.

**Sarah:** Next up is Lesson 6. Screening and Diagnostic Tests. Where we go deep on sensitivity, specificity, and predictive values. The properties of tests rather than the properties of populations.

**Kiffer:** Take care, everyone. We'll see you next time.

**Sarah:** See you in Lesson 6.