# Lesson 7 — Measures of Association (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5874 words • ~32 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 7, Measures of Association. And this is the capstone of the first half of the course. Seven lessons in, this is where everything we've built so far finally clicks together into the central output of analytic epidemiology.

**Sarah:** That's a big claim. Let's set it up. Lesson 5 gave us measures of disease frequency. Prevalence, incidence, risk, rate. Then Lesson 6 took the same probabilistic vocabulary and applied it at the level of a single test. Sensitivity, specificity, predictive values. And now Lesson 7 brings those strands together. The two by two contingency logic, plus the disease-frequency vocabulary, gives us measures of association.

**Kiffer:** And that phrase, measures of association, has a very specific meaning. It's the quantitative comparison between exposed and unexposed groups. The number you put on the relationship between an exposure, which is some potential cause, and a disease, which is the outcome. Without that number, you don't have analytic epidemiology. You have a description.

**Sarah:** There are four sections in the lesson. Section one. Three ratio measures. Risk ratio, incidence rate ratio, odds ratio. Section two. Difference measures and the exposed-group attributable fraction. Section three. Population-level measures and how each measure ties to study design. Section four. The hypothesis-testing and confidence-interval machinery that turns each point estimate into a defensible inference.

**Kiffer:** Before we dive into Section one, I want to put a flag in the ground that we're going to keep coming back to. Strength of association is not the same thing as statistical significance. They are answering different questions. The lesson opens with this distinction and closes with this distinction, and we should treat that as a signal.

**Sarah:** Walk us through what each of those phrases means.

**Kiffer:** A measure of association tells you how strongly an exposure is linked to disease. How much more likely, or less likely, disease is in the exposed group compared to the unexposed group. That's a substantive question about magnitude of effect. A p-value tells you how compatible the observed data are with a null hypothesis of no association. That's a sampling question about whether the result could plausibly be due to chance.

**Sarah:** And the key is that those two things can come apart. A strong association can be statistically non-significant. A weak association can be highly statistically significant.

**Kiffer:** Right. If you have a tiny sample, even a very large risk ratio can fail to clear the conventional significance threshold, because the standard error is so large and the confidence interval so wide. And conversely, in a very large sample, even a tiny risk ratio of, say, one point zero five can produce a p-value below zero point zero one, just because the sample is so big that the precision is enormous. Neither number on its own tells you whether the effect is real and important. The combination does.

**Sarah:** And the lesson's prescription is direct. Always report effect sizes alongside p-values. Two numbers, two questions, both required. We will keep returning to this.
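
To make the two-numbers point concrete, here is a rough Python sketch. Both studies are hypothetical (the counts are invented for illustration, not taken from the lesson), and the p-value comes from a large-sample normal approximation on the log risk ratio, which is crude for groups of ten.

```python
import math

def wald_p_value(a1, n1, a0, n0):
    """Risk ratio and two-sided p-value for H0: RR = 1 (normal approximation on log RR)."""
    rr = (a1 / n1) / (a0 / n0)
    se_log_rr = math.sqrt(1 / a1 - 1 / n1 + 1 / a0 - 1 / n0)
    z = math.log(rr) / se_log_rr
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return rr, p

# Hypothetical small study: 6 of 10 exposed vs 2 of 10 unexposed develop disease.
rr_small, p_small = wald_p_value(6, 10, 2, 10)

# Hypothetical very large study: 21,000 of 200,000 vs 20,000 of 200,000.
rr_large, p_large = wald_p_value(21_000, 200_000, 20_000, 200_000)

print(f"small study: RR = {rr_small:.2f}, p = {p_small:.3f}")
print(f"large study: RR = {rr_large:.2f}, p = {p_large:.2e}")
```

The small study has a risk ratio of three and misses the zero point zero five threshold. The large study has a risk ratio of one point zero five and clears it easily.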

**Kiffer:** Okay. Section one. Ratio measures of association. This is where we set up the two by two table that everything else lives inside.

**Sarah:** And there are two table layouts depending on whether your data are risk-based or rate-based. For risk-based data, columns are exposed and unexposed. Rows are diseased and non-diseased. The four interior cells are labeled. Cell a-one is exposed and diseased. Cell a-zero is unexposed and diseased. Cell b-one is exposed and non-diseased. Cell b-zero is unexposed and non-diseased.

**Kiffer:** And for rate-based data, the structure is similar but the denominator changes. Instead of counting non-diseased people, you count person-time at risk. So the columns are still exposed and unexposed, but the bottom row is now the total person-time accumulated in each group. Cases on top, person-time on bottom.

**Sarah:** Why two layouts? Because these correspond to the two flavors of incidence we covered in Lesson 5. Incidence risk, where the denominator is the population at the start of follow-up, and incidence rate, where the denominator is person-time. Closed populations versus open populations. Same conceptual structure, different specific contents.

**Kiffer:** Good. Now from those tables, three ratio measures of association fall out. Risk ratio, which we abbreviate R-R. Incidence rate ratio, which we abbreviate I-R-R. And odds ratio, which we abbreviate O-R.

**Sarah:** Risk ratio first. The risk ratio is exactly what it says. The risk in the exposed group divided by the risk in the unexposed group. Each risk is a probability, a number between zero and one. So in our two by two notation, the risk ratio is the count in cell a-one divided by the column total n-one, all over the count in cell a-zero divided by the column total n-zero.

**Kiffer:** And the interpretation is direct. A risk ratio of two means the exposed group has twice the risk of the unexposed group. A risk ratio of zero point five means the exposed group has half the risk. A risk ratio of one is the null value, meaning no association. Above one, harmful exposure. Below one, protective.
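
Written out in the cell notation from the risk-based table, with the column totals taken to be n-one equals a-one plus b-one and n-zero equals a-zero plus b-zero, the spoken formula reads:

$$
\mathrm{RR} = \frac{a_1 / n_1}{a_0 / n_0}, \qquad n_1 = a_1 + b_1, \quad n_0 = a_0 + b_0
$$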

**Sarah:** Risk ratios come from cohort studies with closed populations and short follow-up. The kind of design where everyone enters the study at the same time, you follow them for a defined window, and you tally up who got disease.

**Kiffer:** Incidence rate ratio is the parallel measure when you have person-time data. The rate in the exposed divided by the rate in the unexposed. Cases per person-month, or cases per person-year, in each group, divided through. Used in cohort studies with open populations and variable follow-up. It's the same risk-based versus rate-based distinction we saw earlier, when we covered cohort studies.

**Sarah:** And odds ratio. The third ratio measure. The ratio of odds rather than risks.

**Kiffer:** Quick refresher on what odds are, because students sometimes blur this with probability. Odds are the probability of an event divided by the probability of its complement. So if the risk of disease in a group is twenty percent, the odds of disease are zero point two divided by zero point eight, which is zero point two five. Odds and probabilities track each other, but they're not the same number, especially as the probability climbs above ten or twenty percent.
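
A small Python sketch of the odds and probability conversion described above. The function names are illustrative, not from the lesson.

```python
def odds_from_risk(p):
    """Odds = probability of the event divided by probability of its complement."""
    return p / (1 - p)

def risk_from_odds(odds):
    """Inverse conversion: recover the probability from the odds."""
    return odds / (1 + odds)

# Odds and risk are close when the risk is small and drift apart as it grows.
for p in (0.01, 0.05, 0.20, 0.50):
    print(f"risk {p:.2f} -> odds {odds_from_risk(p):.3f}")
```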

**Sarah:** And the odds ratio in a two by two table has a really clean computational form. The cross-product. The count in cell a-one times the count in cell b-zero, divided by the count in cell a-zero times the count in cell b-one. That's it. Multiply the diagonal, divide by the other diagonal.

**Kiffer:** And here's the property that makes the odds ratio special. Symmetry. You get the exact same value whether you compute it as the odds of disease given exposure, or the odds of exposure given disease. The cross-product formula doesn't care which variable is on the rows and which is on the columns. Flip them and you get the same number.
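
The symmetry can be seen directly in the cell notation. The left-hand ratio is the odds of disease in the exposed over the odds of disease in the unexposed. The right-hand ratio is the odds of exposure in the diseased over the odds of exposure in the non-diseased. Both reduce to the same cross-product:

$$
\frac{a_1 / b_1}{a_0 / b_0} \;=\; \frac{a_1 b_0}{a_0 b_1} \;=\; \frac{a_1 / a_0}{b_1 / b_0}
$$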

**Sarah:** And that symmetry is the entire reason the odds ratio is the only valid ratio measure for a case-control study. Walk us through why.

**Kiffer:** In a case-control study, the investigator picks the cases and picks the controls. They decide how many of each. Maybe they enroll one hundred cases and one hundred controls, or one hundred cases and four hundred controls. Whatever ratio they want. Because the investigator is fixing those marginal totals, you can't compute a risk in the exposed group from the data, and you can't compute a risk in the unexposed group. The denominators you'd need are not the natural denominators in the population. They're whatever the investigator decided.

**Sarah:** So risk ratio is incalculable. Rate ratio is incalculable. But the odds ratio somehow survives that? Walk us through why it does.

**Kiffer:** It survives, exactly because of the symmetry. The odds of exposure among cases, divided by the odds of exposure among controls, gives you the same number as the odds of disease among the exposed divided by the odds of disease among the unexposed. So even though the design fixed the case-to-control ratio, the cross-product still produces a meaningful estimate of the underlying relationship between exposure and disease. That's why case-control studies report odds ratios. They have no other choice.

**Sarah:** Let's ground this with the worked example from the lesson. The Brazil water cistern study.

**Kiffer:** Three thousand three hundred ninety-nine households in Brazil. The exposure is having a water cistern. The outcome is diarrhea. The two by two table looks like this. Among households with a cistern, one hundred ninety-four had diarrhea and one thousand five hundred eighty-eight did not. Among households without a cistern, three hundred three had diarrhea and one thousand three hundred fourteen did not.

**Sarah:** Risk in the exposed is one ninety-four divided by seventeen eighty-two, which is about ten point nine percent. Risk in the unexposed is three oh three divided by sixteen seventeen, about eighteen point seven percent. Risk ratio is zero point one zero nine over zero point one eight seven, which equals zero point five eight.

**Kiffer:** And because the risk ratio is below one, the cistern is protective. Specifically, risk is forty-two percent lower among households with a cistern. One minus zero point five eight equals zero point four two.

**Sarah:** Now compute the odds ratio. Cross-product. The count in cell a-one times the count in cell b-zero, divided by the count in cell a-zero times the count in cell b-one. One ninety-four times thirteen fourteen, divided by three oh three times fifteen eighty-eight. That equals zero point five three.

**Kiffer:** So the risk ratio is zero point five eight and the odds ratio is zero point five three. Both protective. Both telling the same story. But notice the odds ratio is further from one than the risk ratio. That's not random. That's a systematic property, and we'll get to it in a minute.
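
The cistern arithmetic above can be reproduced in a few lines of Python. The counts are the ones quoted in the episode; the variable names are illustrative.

```python
# Brazil cistern study, counts as quoted in the episode.
a1, b1 = 194, 1588   # households with a cistern: diarrhea, no diarrhea
a0, b0 = 303, 1314   # households without a cistern: diarrhea, no diarrhea

n1, n0 = a1 + b1, a0 + b0          # column totals
risk_exposed = a1 / n1
risk_unexposed = a0 / n0

risk_ratio = risk_exposed / risk_unexposed
odds_ratio = (a1 * b0) / (a0 * b1)  # cross-product

print(f"risk exposed   = {risk_exposed:.3f}")
print(f"risk unexposed = {risk_unexposed:.3f}")
print(f"RR = {risk_ratio:.2f}, OR = {odds_ratio:.2f}")
```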

**Sarah:** Second worked example. Migraine incidence rates. This is the rate-based version.

**Kiffer:** Female versus male, ages thirty to forty. One hundred thirty-one cases of migraine across two hundred fifty person-months for women. Forty-four cases across two hundred thirty-six person-months for men. So rate in women is one thirty-one over two fifty, which is zero point five two four cases per person-month. Rate in men is forty-four over two thirty-six, which is zero point one eight six. Incidence rate ratio is two point eight one.

**Sarah:** Migraine rate is two point eight one times higher in women than in men in this age range. The interpretation is the same shape as the risk ratio interpretation. The number two point eight one tells you how many times more often the event occurs in the exposed group, just measured per unit of person-time rather than per person.
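
The same calculation for the rate-based example, as a short Python sketch using the migraine counts and person-months quoted above. Variable names are illustrative.

```python
# Migraine example: cases and person-months as quoted in the episode.
cases_women, person_months_women = 131, 250
cases_men, person_months_men = 44, 236

rate_women = cases_women / person_months_women   # cases per person-month
rate_men = cases_men / person_months_men

incidence_rate_ratio = rate_women / rate_men

print(f"rate (women) = {rate_women:.3f} per person-month")
print(f"rate (men)   = {rate_men:.3f} per person-month")
print(f"IRR = {incidence_rate_ratio:.2f}")
```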

**Kiffer:** Okay. Now back to the curious observation we noted in the cistern example. The odds ratio was further from the null than the risk ratio. The lesson generalizes this into a really useful conceptual picture.

**Sarah:** Imagine a number line with the null value of one in the middle. The risk ratio sits closest to one. The incidence rate ratio sits a little further from one. The odds ratio sits furthest from one. This is true whether the exposure is harmful, where all three are above one, or protective, where all three are below one. Odds ratio is always the most extreme. Always.

**Kiffer:** And the gap between the three measures grows as the disease becomes more common. When the disease is rare, all three measures are nearly identical. When the disease is common, they diverge dramatically.

**Sarah:** Which sets up the three rules of thumb that you really want to internalize. First. The rare disease assumption. When the disease prevalence or incidence risk is below five percent, the odds ratio is approximately equal to the risk ratio. They're close enough that you can usually treat them as interchangeable.

**Kiffer:** And the intuition is that when the disease is rare, the count of cases in each group, a-one and a-zero, is small relative to the total in that group, n-one and n-zero. So the count of non-cases, b-one and b-zero, is approximately the total. Which means the odds, a over b, is approximately the same as the risk, a over n. Same number. Same ratio. Odds ratio approximates risk ratio.
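
The intuition can be written as an exact identity. Here R-one and R-zero are shorthand, introduced for this note, for the risks in the exposed and unexposed groups:

$$
\mathrm{OR} = \frac{R_1 / (1 - R_1)}{R_0 / (1 - R_0)} = \mathrm{RR} \times \frac{1 - R_0}{1 - R_1}, \qquad R_1 = \frac{a_1}{n_1}, \quad R_0 = \frac{a_0}{n_0}
$$

When both risks are small, the correction factor is close to one and the odds ratio approximates the risk ratio. When the risk ratio is above one, the factor is above one, and when the risk ratio is below one, the factor is below one, which is why the odds ratio lands further from the null in both directions.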

**Sarah:** Second rule. Risk ratio approximates incidence rate ratio when the exposure has negligible impact on the total time at risk in the study population. Which usually happens when the disease is rare or when the rate ratio is close to the null.

**Kiffer:** And third rule. The odds ratio estimates the incidence rate ratio directly, with no rare disease assumption needed, when controls in a case-control study are selected using incidence density sampling. Which is a fancy term for picking controls from the at-risk population each time a case occurs, rather than picking them once at the end.

**Sarah:** And there's an interactive simulator built into Section one of the lesson page that I want to flag. You edit the cells of a two by two table, or you slide the outcome prevalence up and down, and you watch the risk ratio and the odds ratio diverge in real time.

**Kiffer:** Set the common outcome preset, which has an outcome prevalence around forty percent, and hold the true risk ratio at two. The odds ratio balloons to about three or even higher. So if you misreport that odds ratio as if it were a risk ratio, you've overstated the harm by fifty percent or more. The rare disease assumption is exactly the gap that simulator visualizes. Ignore it at your peril.
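
A rough Python stand-in for what the simulator shows. This is not the lesson's simulator code. It assumes half the population is exposed (the `frac_exposed` parameter is an assumption introduced here) so that an overall outcome prevalence can be split into group-specific risks.

```python
def odds_ratio_for(rr, overall_risk, frac_exposed=0.5):
    """Odds ratio implied by a true risk ratio at a given overall outcome risk."""
    # Overall risk = frac_exposed * r1 + (1 - frac_exposed) * r0, with r1 = rr * r0.
    r0 = overall_risk / (frac_exposed * rr + (1 - frac_exposed))
    r1 = rr * r0
    if not (0 < r0 < 1 and 0 < r1 < 1):
        raise ValueError("group risks must lie strictly between 0 and 1")
    return (r1 / (1 - r1)) / (r0 / (1 - r0))

# Hold the true risk ratio at 2 and slide the outcome prevalence upward.
for prevalence in (0.01, 0.05, 0.20, 0.40):
    print(f"prevalence {prevalence:.2f}: OR = {odds_ratio_for(2, prevalence):.2f}")
```

Under the equal-split assumption, the forty percent setting gives an odds ratio a little above three, consistent with the figure quoted above.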

**Sarah:** Okay. Section two. Difference measures and the exposed-group attributable fraction.

**Kiffer:** And the conceptual move from Section one to Section two is important. Ratio measures tell you how many times more likely disease is. Difference measures tell you how many extra cases occur because of the exposure. They are answering related but distinct questions, and both matter for different purposes.

**Sarah:** Risk difference, which we abbreviate R-D, is sometimes also called attributable risk. It is the risk in the exposed minus the risk in the unexposed. Just a subtraction. Risk minus risk. The units are still probabilities, so the answer falls between minus one and one, or you can express it in percentage points.

**Kiffer:** Incidence rate difference is the parallel for rate data. The rate in the exposed minus the rate in the unexposed. Units are cases per unit person-time.

**Sarah:** And the null value for difference measures is zero. Not one. Because subtraction. If risks are equal, the difference is zero. Above zero, harmful. Below zero, protective.

**Kiffer:** Worked example. Smoking and low birth weight. From a cohort of five thousand women followed through pregnancy.

**Sarah:** Three hundred fifty-one of those women smoked during the second trimester. Of those smokers, forty had a low-birth-weight baby. So risk in the exposed is forty over three fifty-one, which is zero point one one four. Eleven point four percent.

**Kiffer:** And of the four thousand six hundred forty-nine non-smokers, three hundred thirty-one had a low-birth-weight baby. Risk in the unexposed is three thirty-one over four six four nine, which is zero point zero seven one. Seven point one percent.

**Sarah:** Risk difference is zero point one one four minus zero point zero seven one, which equals zero point zero four three.

**Kiffer:** And the interpretation is concrete and powerful. For every one hundred women who smoked, approximately four point three additional low-birth-weight babies occurred above what would have happened without smoking. Assuming the relationship is causal, those four point three excess cases are attributable to the smoking itself.

**Sarah:** And this is why difference measures are different from ratio measures. The risk ratio in this same example is one point six. Which sounds modest. The risk difference, four point three additional babies per one hundred women, is the absolute count. Both numbers are about the same exposure-disease pair. But they emphasize different things.
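
The smoking and low-birth-weight numbers above, worked in Python so the ratio and the difference sit side by side. Counts are as quoted in the episode; variable names are illustrative.

```python
# Smoking and low birth weight (LBW), counts as quoted in the episode.
smokers, lbw_smokers = 351, 40
nonsmokers, lbw_nonsmokers = 4649, 331

risk_exposed = lbw_smokers / smokers
risk_unexposed = lbw_nonsmokers / nonsmokers

risk_difference = risk_exposed - risk_unexposed
risk_ratio = risk_exposed / risk_unexposed

print(f"risk exposed   = {risk_exposed:.3f}")
print(f"risk unexposed = {risk_unexposed:.3f}")
print(f"RD = {risk_difference:.3f} (about {100 * risk_difference:.1f} extra cases per 100)")
print(f"RR = {risk_ratio:.1f}")
```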

**Kiffer:** Now from the risk difference we get to the attributable fraction in the exposed. The lesson abbreviates this as A-F-e. The proportion of disease in the exposed group that is due to the exposure, assuming the relationship is causal.

**Sarah:** There are two equivalent formulas. First. Attributable fraction in the exposed equals risk difference divided by risk in the exposed. Second. Attributable fraction in the exposed equals risk ratio minus one, divided by risk ratio. Let's apply both to the smoking example to make sure they give the same answer.

**Kiffer:** First formula. Risk difference is zero point zero four three. Risk in the exposed is zero point one one four. Zero point zero four three divided by zero point one one four equals about zero point three seven seven. Thirty-seven point seven percent.

**Sarah:** Second formula. Risk ratio is one point six. One point six minus one is zero point six. Zero point six divided by one point six is zero point three seven five. Thirty-seven point five percent.

**Kiffer:** Same answer up to rounding. So among women who smoked, thirty-seven point five percent of the low-birth-weight cases are attributable to smoking. The other sixty-two point five percent would have happened anyway, because of baseline risk that was present whether or not these women smoked.
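
A quick Python check that the two formulas agree when the inputs are not rounded first. The small gap between thirty-seven point seven and thirty-seven point five percent comes from plugging in rounded intermediate values. Variable names are illustrative.

```python
# Unrounded risks from the smoking cohort.
risk_exposed = 40 / 351
risk_unexposed = 331 / 4649

risk_difference = risk_exposed - risk_unexposed
risk_ratio = risk_exposed / risk_unexposed

afe_from_rd = risk_difference / risk_exposed   # first formula
afe_from_rr = (risk_ratio - 1) / risk_ratio    # second formula

print(f"AFe via RD: {afe_from_rd:.4f}")
print(f"AFe via RR: {afe_from_rr:.4f}")
```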

**Sarah:** Vaccine efficacy is an important special case of the attributable fraction in the exposed. And the trick to get your head around it is that the exposure is being unvaccinated. Unvaccinated is the factor positive. Vaccinated is the comparison.

**Kiffer:** Let's run through it with simple numbers. Suppose twenty percent of unvaccinated individuals develop the disease, and five percent of vaccinated individuals develop the disease.

**Sarah:** Risk difference is zero point two zero minus zero point zero five, which is zero point one five. Attributable fraction in the exposed is zero point one five divided by zero point two zero, which is zero point seven five. Seventy-five percent.

**Kiffer:** And the interpretation is, the vaccine prevented seventy-five percent of the cases of disease that would have occurred in the vaccinated group if they had not been vaccinated. So a vaccine efficacy of seventy-five percent is a statement about how many cases the vaccine prevented, not about how many vaccinated people stayed disease-free.

**Sarah:** And this matters because students sometimes misread vaccine efficacy. Seventy-five percent does not mean seventy-five percent of vaccinated people are immune. It means seventy-five percent of the cases that would have happened were prevented. Different statement.
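
The vaccine efficacy arithmetic as a one-function Python sketch, treating unvaccinated as the exposed group as described above. The function name is illustrative.

```python
def vaccine_efficacy(risk_unvaccinated, risk_vaccinated):
    """Attributable fraction in the exposed, with 'unvaccinated' as the exposure."""
    return (risk_unvaccinated - risk_vaccinated) / risk_unvaccinated

efficacy = vaccine_efficacy(0.20, 0.05)
print(f"vaccine efficacy = {efficacy:.0%}")

# Note what the number is not: 95 percent of vaccinated people stayed
# disease-free in this example, which is a different quantity from efficacy.
share_vaccinated_disease_free = 1 - 0.05
```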

**Kiffer:** One more conceptual nuance the lesson wants you to know. The attributable fraction in the exposed, A-F-e, is technically a lower bound for what's called the etiologic fraction. The etiologic fraction is the proportion of cases in the exposed where the exposure was a component of the sufficient cause, in the Rothman pies sense.

**Sarah:** And the reason A-F-e is a lower bound is that exposure may contribute to cases that the baseline risk would have produced anyway, just earlier or differently. So the etiologic fraction can be higher than A-F-e. The distinction matters in detailed mechanistic interpretations of causation, but it rarely changes a policy conclusion.

**Kiffer:** Section three. Population-level measures and study design.

**Sarah:** And the move here is to zoom out from the exposed group to the entire population. Even if an exposure powerfully causes disease in those who are exposed, its public-health importance also depends on how common the exposure is in the population overall.

**Kiffer:** Population attributable risk, which we abbreviate P-A-R. The increase in overall population risk attributable to the exposure. There's a clean way to compute it. Population attributable risk equals risk difference times the prevalence of exposure in the population.

**Sarah:** And it makes intuitive sense. If exposure has a risk difference of zero point one but only one percent of the population is exposed, the population-level burden is small. If the same risk difference applies but fifty percent of the population is exposed, the population-level burden is much larger.

**Kiffer:** Population attributable fraction, abbreviated A-F-p. The proportion of disease in the entire population that is attributable to the exposure, and that would be avoided if the exposure were removed.

**Sarah:** The formula is sometimes called the Levin formula. In words. Attributable fraction in the population equals the prevalence of exposure times the quantity risk ratio minus one, all divided by the prevalence of exposure times the quantity risk ratio minus one, plus one.
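
In symbols, writing P-e for the prevalence of exposure in the population (a symbol chosen for this note), the formula as read out is:

$$
\mathrm{AF}_p = \frac{P_e\,(\mathrm{RR} - 1)}{P_e\,(\mathrm{RR} - 1) + 1}
$$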

**Kiffer:** And the principle is the same as it was for population attributable risk. Both the strength of the association and the prevalence of the exposure matter. A strong risk factor that is rare in the population can have a small population-level impact. A modest risk factor that is common can have a large population-level impact.

**Sarah:** The lesson uses two contrasting examples to make this concrete. Intravenous drug use and human immunodeficiency virus, abbreviated H-I-V. Versus poor diet and chronic disease.

**Kiffer:** Intravenous drug use has a very high risk ratio for H-I-V transmission. Maybe twenty, maybe thirty, depending on the setting. So if you are an intravenous drug user, your individual risk of acquiring H-I-V is dramatically elevated. But intravenous drug use is rare in the general population, perhaps under one percent in most settings. So the population attributable fraction, even with that very high risk ratio, ends up being small. Eliminating intravenous drug use would prevent relatively few cases at the population level, even though the per-person impact is enormous.

**Sarah:** Poor diet has a much more modest risk ratio for chronic disease. Maybe one point five for cardiovascular disease. But it affects a huge proportion of the population. Maybe forty or fifty percent depending on how you define it. So the population attributable fraction is large, even with the modest risk ratio. Improving population diet would prevent many cases.

**Kiffer:** And this is the central insight that A-F-p captures. Public health prioritization is not just about the strength of risk factors. It is about the strength of risk factors weighted by how common they are. We talked about this conceptually earlier in the course, when we first met the idea behind the Levin formula. Now we have the math to back it up.

**Sarah:** There's also a worked example for the smoking and low-birth-weight cohort that drives the point home from the other direction.

**Kiffer:** From the same cohort of five thousand women, three hundred fifty-one smoked. So prevalence of exposure is about seven percent. Population attributable risk is risk difference times prevalence of exposure, which is zero point zero four three times zero point zero seven, equals about zero point zero zero three. Population attributable fraction is zero point zero zero three divided by overall population risk, which is zero point zero seven four. That works out to about four percent.

**Sarah:** So even though smoking has a meaningful risk ratio of one point six and an attributable fraction in the exposed of thirty-seven point five percent, only four percent of all low-birth-weight babies in this population are attributable to smoking. Why? Because most pregnant women in this cohort did not smoke. The exposure was not common enough.
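
A Python sketch of the population-level numbers for the smoking cohort. It computes the population attributable fraction two ways, directly from the population attributable risk and through the Levin formula, and the two routes agree. Variable names are illustrative.

```python
# Smoking and low birth weight cohort, counts as quoted in the episode.
total, smokers = 5000, 351
lbw_smokers, lbw_nonsmokers = 40, 331

risk_exposed = lbw_smokers / smokers
risk_unexposed = lbw_nonsmokers / (total - smokers)
prevalence_exposure = smokers / total
overall_risk = (lbw_smokers + lbw_nonsmokers) / total

# Population attributable risk: risk difference weighted by exposure prevalence.
par = (risk_exposed - risk_unexposed) * prevalence_exposure

# Population attributable fraction, directly and via the Levin formula.
afp_direct = par / overall_risk
rr = risk_exposed / risk_unexposed
afp_levin = prevalence_exposure * (rr - 1) / (prevalence_exposure * (rr - 1) + 1)

print(f"PAR = {par:.3f}, overall risk = {overall_risk:.3f}")
print(f"AFp direct = {afp_direct:.3f}, AFp Levin = {afp_levin:.3f}")
```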

**Kiffer:** Same exposure. Same association. But two very different fractions, depending on whether you ask about the exposed group or about the population as a whole. Both are correct. Both matter. Use whichever one fits the question you are asking.

**Sarah:** One quick technical note before we move on. The lesson page also flags that when confounding is present, you swap the crude risk ratio in the Levin formula for an adjusted risk ratio. The shape of the calculation is the same, but the inputs come from a model that controls for confounders. We will go deeper on that machinery in Lesson 12.

**Kiffer:** Right. Keep that in your back pocket. For now, the takeaway is just that A-F-p is only as honest as the risk ratio you feed it.

**Sarah:** And then the section closes by mapping each measure of association onto the study design that produces it. This is where the design rules from Lessons four through six all come back.

**Kiffer:** Cross-sectional studies. You measure exposure and outcome at the same point in time. The natural measures are prevalence ratios and prevalence odds ratios. You can compute both. Be careful interpreting them as risk ratios, because temporality is unclear.

**Sarah:** Closed-cohort studies, where everyone enters at the start and is followed for a fixed window. The natural measure is the risk ratio. You also get the risk difference. And you can compute the odds ratio if you want.

**Kiffer:** Open-cohort studies, where people enter and leave the at-risk population at different times. The natural measure is the incidence rate ratio. You also get the incidence rate difference.

**Sarah:** Case-control studies. The investigator samples on disease status. The only ratio measure available is the odds ratio. With the rare disease assumption, you can use the odds ratio to approximate the risk ratio. With incidence density sampling, the odds ratio estimates the incidence rate ratio directly, no rare disease assumption needed.

**Kiffer:** And the choice of measure is not aesthetic. It is dictated by the kind of sampling you used and the kind of population you have. You don't pick the measure first. You pick the design first, and the design picks the measure.

**Sarah:** Section four. Hypothesis testing and confidence intervals.

**Kiffer:** And this is where we turn each point estimate, the single number we computed, into a defensible inference about the underlying population. Because a point estimate by itself is just a number. The number means more once we attach a measure of uncertainty to it.

**Sarah:** The basic logic of hypothesis testing has four pieces. First. Form a null hypothesis. The default assumption is that there is no association. So for ratio measures, the null is that the true ratio equals one. For difference measures, the null is that the true difference equals zero.

**Kiffer:** Second. Compute a test statistic from the data. Third. Compare that test statistic to a reference distribution under the null. Fourth. Get a p-value, which is the probability of observing data at least as extreme as what you observed, if the null hypothesis were true.

**Sarah:** And the conventional threshold is alpha equals zero point zero five. Below alpha, you call the result statistically significant. We covered Type one and Type two errors, and statistical power, earlier in the course, when we worked through sampling.

**Kiffer:** For a two by two table, the standard test of independence is the chi-squared test. You compare observed cell counts to the cell counts you would expect under the null of no association, and the chi-squared statistic measures how far the observed counts are from the expected counts.
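
A minimal Python sketch of the observed-versus-expected calculation, applied to the cistern table from Section one. The episode does not report a chi-squared result for that table, so this is an illustration of the mechanics only. It uses the plain Pearson statistic with no continuity correction, and the p-value relies on the fact that a chi-squared variable with one degree of freedom is a squared standard normal.

```python
import math

def chi_squared_2x2(a1, a0, b1, b0):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 df, uncorrected)."""
    observed = [[a1, a0], [b1, b0]]          # rows: diseased, non-diseased
    row_totals = [a1 + a0, b1 + b0]
    col_totals = [a1 + b1, a0 + b0]          # columns: exposed, unexposed
    n = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n   # count under no association
            statistic += (observed[i][j] - expected) ** 2 / expected

    p_value = math.erfc(math.sqrt(statistic / 2))  # upper tail, chi-squared with 1 df
    return statistic, p_value

statistic, p_value = chi_squared_2x2(194, 303, 1588, 1314)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.2e}")
```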

**Sarah:** And when the cell counts are small, the chi-squared approximation breaks down. The reference distribution stops fitting well. The fix is Fisher's exact test, which uses the exact hypergeometric probability rather than a continuous approximation.
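
For small tables, the hypergeometric calculation can be sketched in Python as below. The four-by-four example counts are hypothetical, and the two-sided rule used here (sum every table that is no more probable than the observed one) is one common convention, assumed rather than taken from the lesson.

```python
from math import comb

def fisher_exact_two_sided(a1, a0, b1, b0):
    """Two-sided Fisher's exact p-value for a 2x2 table with all margins fixed."""
    n1, n0 = a1 + b1, a0 + b0     # exposed and unexposed column totals
    m = a1 + a0                   # total number diseased
    n = n1 + n0

    def prob(k):
        # Hypergeometric probability of k diseased among the exposed.
        return comb(n1, k) * comb(n0, m - k) / comb(n, m)

    p_observed = prob(a1)
    k_min, k_max = max(0, m - n0), min(m, n1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(k_min, k_max + 1))
               if p <= p_observed * (1 + 1e-9))

# Hypothetical tiny study: 3 of 4 exposed and 1 of 4 unexposed develop disease.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Fisher exact p = {p_value:.4f}")
```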

**Kiffer:** For comparing two proportions directly, you can use the z-test for two proportions. For comparing two rates, you can use a Poisson-based test. The lesson also points to two more advanced workhorses you will meet again later in this series. The Wald statistic and the likelihood ratio test, both of which generalize cleanly to regression settings.

**Sarah:** Now confidence intervals. The ninety-five percent confidence interval, abbreviated C-I, is the range of plausible values for the true parameter, given the data. The interpretation is procedural. If you repeated the study many times under identical conditions, ninety-five percent of the computed intervals would contain the true parameter value.

**Kiffer:** And the width of a confidence interval depends on the standard error, abbreviated S-E. The standard error is a measure of the precision of the point estimate. Larger samples give smaller standard errors and tighter intervals.

**Sarah:** For difference measures, like the risk difference and the rate difference, the standard error is computed directly. Take the variance of each risk, which is the risk times one minus the risk, divide by that group's sample size, add the two pieces up, and take the square root.
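
For the risk difference, that recipe corresponds to the usual large-sample formula, written here with R-one and R-zero as shorthand for the two group risks:

$$
\mathrm{SE}(\mathrm{RD}) = \sqrt{\frac{R_1 (1 - R_1)}{n_1} + \frac{R_0 (1 - R_0)}{n_0}}, \qquad \text{95\% CI: } \mathrm{RD} \pm 1.96 \times \mathrm{SE}(\mathrm{RD})
$$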

**Kiffer:** For ratio measures like the risk ratio, the rate ratio, and the odds ratio, you do not compute the confidence interval directly on the original scale. You work on the log scale. Why? Because the sampling distribution of a ratio is right-skewed. It cannot go below zero, and it has a long upper tail. If you tried to put a symmetric interval directly on the ratio, you would get a lower bound that could go negative, which is meaningless.

**Sarah:** So the recipe is. Take the natural logarithm of the ratio. Compute the standard error on the log scale, using the standard formula for variance of the log ratio. Add and subtract one point nine six times the standard error to get the lower and upper bounds on the log scale. Then exponentiate both bounds back to the original scale.
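
The recipe can be sketched in Python for the cistern risk ratio. The episode does not spell out the variance formula, so the one used here is the common large-sample expression for the variance of the log risk ratio, an assumption on my part. The resulting interval is an illustration and is not quoted in the lesson.

```python
import math

def risk_ratio_ci(a1, n1, a0, n0, z=1.96):
    """Risk ratio with a confidence interval built on the log scale."""
    rr = (a1 / n1) / (a0 / n0)
    # Large-sample standard error of log(RR); assumed, not stated in the episode.
    se_log_rr = math.sqrt(1 / a1 - 1 / n1 + 1 / a0 - 1 / n0)
    log_rr = math.log(rr)
    lower = math.exp(log_rr - z * se_log_rr)   # exponentiate back to the ratio scale
    upper = math.exp(log_rr + z * se_log_rr)
    return rr, lower, upper

rr, lower, upper = risk_ratio_ci(194, 1782, 303, 1617)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print(f"distance below: {rr - lower:.3f}, distance above: {upper - rr:.3f}")
```

The two printed distances differ, which shows that the interval is not symmetric around the point estimate on the ratio scale.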

**Kiffer:** And the practical consequence is that confidence intervals for ratio measures are asymmetric around the point estimate. The upper bound is further from the point estimate than the lower bound is, on the original ratio scale. That asymmetry is real. It is not a bug. It reflects the fact that the variance is symmetric on the log scale, not the original scale.

**Sarah:** There is a clean interpretation rule for using confidence intervals as a surrogate significance test. For a risk ratio, an incidence rate ratio, or an odds ratio, if the ninety-five percent confidence interval includes the value one, the result is not statistically significant at alpha zero point zero five. If it excludes one, the result is significant.

**Kiffer:** For a risk difference or an incidence rate difference, the rule is the same but with zero. If the ninety-five percent confidence interval includes zero, not significant. Excludes zero, significant.

**Sarah:** But the lesson is sharp on a really important point here. Using a confidence interval as a surrogate significance test underutilizes what the confidence interval is telling you. The interval also shows the range of plausible effect sizes, which is far more informative than a binary significant-versus-non-significant classification.

**Kiffer:** Two studies might have similar point estimates but very different confidence intervals. A confidence interval of one point one to one point three is a precise estimate of a small effect. A confidence interval of zero point eight to two point five is an imprecise estimate that is compatible with a small protective effect, no effect, or a large harmful effect. Similar point estimates, very different conclusions about what we can defensibly say.

**Sarah:** Which lands us back where we started. Statistical significance is not strength of association. The lesson opens with this point and it closes with this point, and the repetition is intentional.

**Kiffer:** A statistically significant result with a tiny effect size, like a risk ratio of one point zero five, may have very limited public health importance, even if the p-value is below zero point zero zero one. And a non-significant finding with a large point estimate, like a risk ratio of three with a wide confidence interval, may just mean the study was underpowered. The numbers tell you different things and you need both to draw a defensible conclusion.

**Sarah:** Always report effect sizes alongside p-values. Always report confidence intervals alongside point estimates. Two numbers, two questions, both required. That's the discipline this lesson is trying to instill.

**Kiffer:** Okay. Let's pull the takeaways together. This is the capstone of the first half of this material, so we should treat it with appropriate gravity. The first seven lessons of the course took you somewhere specific, and Lesson seven is where you can finally see what they were all building toward.

**Sarah:** First takeaway. Three ratio measures of association. Risk ratio, incidence rate ratio, odds ratio. Each tied to a specific study design. The risk ratio comes from closed cohorts. The incidence rate ratio comes from open cohorts. The odds ratio comes from case-control studies, where it is the only valid measure because of its symmetry property.

**Kiffer:** Second. The rare disease assumption. When the disease is rare, below five percent prevalence or incidence risk, the odds ratio approximates the risk ratio closely enough to use as a substitute. When the disease is common, the odds ratio overstates the magnitude of the association, sometimes by a lot. The interactive simulator in Section one is the cleanest demonstration of that gap. Play with it.
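
For anyone without the simulator to hand, the same gap can be sketched in a few lines of Python. The risks below are hypothetical. Both scenarios hold the risk ratio at two and change only how common the disease is.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """Ratio of the two risks."""
    return risk_exposed / risk_unexposed

def odds_ratio(risk_exposed, risk_unexposed):
    """Ratio of the two odds, where odds = risk / (1 - risk)."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed

# Hypothetical (exposed risk, unexposed risk) pairs with the same risk ratio.
scenarios = {"rare disease": (0.02, 0.01), "common disease": (0.60, 0.30)}

for label, (r1, r0) in scenarios.items():
    print(f"{label}: RR = {risk_ratio(r1, r0):.2f}, OR = {odds_ratio(r1, r0):.2f}")
```

In the rare scenario the odds ratio sits within a few hundredths of the risk ratio. In the common scenario it is far above it, even though the risk ratio has not changed.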

**Sarah:** Third. Difference measures. Risk difference and incidence rate difference. The absolute excess risk or rate attributable to exposure. The null is zero, not one. And the attributable fraction in the exposed equals risk ratio minus one, all divided by risk ratio. The proportion of disease in the exposed group that is due to the exposure, assuming causation.

**Kiffer:** Fourth. Vaccine efficacy is the attributable fraction in the exposed where the exposure is being unvaccinated. A vaccine efficacy of seventy-five percent means the vaccine prevented seventy-five percent of the cases that would otherwise have occurred among vaccinated people. It does not mean seventy-five percent of vaccinated people are immune. The wording matters.
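
A short Python sketch of that equivalence, using hypothetical risks chosen so the efficacy comes out at seventy-five percent. The function names are illustrative, not taken from the lesson.

```python
def attributable_fraction_exposed(rr):
    """(RR - 1) / RR, the attributable fraction among the exposed."""
    return (rr - 1) / rr

def vaccine_efficacy(risk_vaccinated, risk_unvaccinated):
    """One minus the risk ratio comparing vaccinated with unvaccinated."""
    return 1 - risk_vaccinated / risk_unvaccinated

# Hypothetical risks of disease over the follow-up period.
risk_unvaccinated = 0.08
risk_vaccinated = 0.02

# Treat "being unvaccinated" as the exposure, so the RR is above one.
rr_unvaccinated = risk_unvaccinated / risk_vaccinated

af_exposed = attributable_fraction_exposed(rr_unvaccinated)
ve = vaccine_efficacy(risk_vaccinated, risk_unvaccinated)
print(f"AF in the exposed = {af_exposed:.2f}, vaccine efficacy = {ve:.2f}")
```

Both calculations give the same fraction, because one minus one over the risk ratio is the same expression as risk ratio minus one, all divided by risk ratio.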

**Sarah:** Fifth. Population-level measures. Population attributable risk and population attributable fraction. These combine the strength of the association with the prevalence of the exposure in the population. A weak risk factor that is common can have a much larger population impact than a strong risk factor that is rare. Compare poor diet and chronic disease, weak but common, with intravenous drug use and H-I-V, strong but rare.
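
One way to see the weak-but-common versus strong-but-rare contrast is a small Python sketch using Levin's formula for the population attributable fraction. This assumes Levin's formula is an acceptable stand-in for however the lesson writes it. The prevalences and risk ratios are hypothetical and are not estimates for any real exposure.

```python
def population_attributable_fraction(exposure_prevalence, rr):
    """Levin's formula: Pe(RR - 1) / (1 + Pe(RR - 1))."""
    excess = exposure_prevalence * (rr - 1)
    return excess / (1 + excess)

# Hypothetical strong but rare exposure: 1% exposed, risk ratio of 10.
strong_rare = population_attributable_fraction(0.01, 10.0)

# Hypothetical weak but common exposure: 50% exposed, risk ratio of 1.5.
weak_common = population_attributable_fraction(0.50, 1.5)

print(f"strong, rare exposure: PAF = {strong_rare:.3f}")
print(f"weak, common exposure: PAF = {weak_common:.3f}")
```

With these made-up inputs the weak, common exposure accounts for a larger share of cases in the population, despite its much smaller risk ratio.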

**Kiffer:** Sixth. Each measure of association maps onto a specific study design. Cross-sectional gives you prevalence ratios and prevalence odds ratios. Closed cohort gives you risk ratio. Open cohort gives you incidence rate ratio. Case-control gives you odds ratio, which approximates the risk ratio under rare disease, or estimates the incidence rate ratio directly under incidence density sampling.

**Sarah:** Seventh. The hypothesis testing and confidence interval machinery. For ratio measures, work on the log scale. Take the natural log, add and subtract one point nine six times the standard error on the log scale, and exponentiate back. The resulting interval is asymmetric on the original scale. If a confidence interval for a ratio measure includes one, it is not significant at alpha zero point zero five. If a confidence interval for a difference measure includes zero, same conclusion.

**Kiffer:** And eighth. The lesson opens and closes with the same point, and we should too. Statistical significance is not strength of association. They are answering different questions. Always report effect sizes alongside p-values. Always report confidence intervals alongside point estimates. The number is incomplete without the uncertainty.

**Sarah:** And as Kiffer said at the start, this is the capstone of the first half of this material. Lessons one through seven took you from causal concepts and surveillance, through sampling and questionnaire design, into measures of disease frequency and screening tests, and now into measures of association. That arc is complete.

**Kiffer:** And the second half of the course will build on this foundation. Lesson eight is a review of study design concepts. Lesson nine covers hybrid study designs, and Lesson ten covers controlled studies. Then Lesson eleven digs into validity in observational studies, and Lesson twelve closes the course with confounding and causal inference.

**Sarah:** Every one of those second-half lessons assumes you can compute and interpret the measures we covered today. They are the outputs that designs deliver and that systematic reviews eventually pool. So if any piece of Lesson seven feels shaky, this is the moment to go back and shore it up. Sit with the cistern example. Run through the smoking calculation. Play with the simulator.

**Kiffer:** And the discipline this lesson tries to teach you is bigger than the specific formulas. It is the habit of distinguishing what your data can support from what they cannot. Strength of association is one question. Statistical significance is another. The size of the population impact is a third. Causal inference, which we will come back to in Lesson twelve, is yet another. Every measure of association points at one of these questions. No single measure answers all of them.

**Sarah:** If you walk away from Lesson seven with one habit, let it be this. When you read a study, write down the point estimate, the confidence interval, the prevalence of the exposure in the relevant population, and the design. Four items, four pieces of context. Almost every misinterpretation we will catalog in the second half of the course comes from missing one of those four.

**Kiffer:** Take care, everyone.

**Sarah:** See you in Lesson eight, where we'll do a deliberate review of study design concepts before the second half kicks into gear.
