# Lesson 5 — Case-Control Studies (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,790 words • ~31 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 5, Case-Control Studies. This is the first of the three big analytic observational designs we previewed in Lesson 4, and honestly, of all the designs in this course, this is the one where the design choices matter most. The case-control study is elegant when it's done well and a complete mess when it isn't. The difference between those two outcomes usually comes down to one or two decisions you make before any data are collected.

**Sarah:** Okay, that's a strong opening. Before we go deep, can you remind everyone where this lesson sits in the arc of the course?

**Kiffer:** Sure. Lesson 4 introduced a classification scheme based on how you sample. There are three sampling approaches. First, cross-sectional, where you sample without regard to either exposure or disease, and you measure both at the same moment. Second, cohort, where you sample on exposure and follow people forward in time. And third, case-control, where you sample on the outcome itself. You find people who already have the disease, you find a comparison group who don't, and you look back at their exposures. Today is case-control. Cross-sectional was Lesson 4. Cohort is coming in Lesson 6.

**Sarah:** Right. And let me ask the obvious question. Why would anyone build a study by sampling on the outcome? It seems backwards.

**Kiffer:** Imagine you want to study a rare cancer. A particular childhood leukemia, or a specific brain tumor. If you tried to study that with a cohort design, you'd have to enroll tens of thousands, sometimes hundreds of thousands of people, follow them forward in time for years, and wait for a small handful to develop the disease. That's enormously expensive, slow, and most of the data collection is wasted on people who never get the outcome you care about.

**Sarah:** So the case-control design inverts that. Instead of waiting for the rare event, you start at the event.

**Kiffer:** Exactly. You go to the cancer registry, or the hospital pathology department, you collect the cases that have already occurred, and you find a sensible comparison group. You only need a few hundred people total instead of a few hundred thousand. You can finish in months rather than decades. The catch is the design has to be right. Historically, more case-control studies have been ruined by control selection than by anything else. Choosing the wrong comparison group can produce a confidently reported odds ratio that means nothing.

**Sarah:** Okay. The lesson has four content sections. Section 1 sets up the basic logic and introduces a concept called the study base. Section 2 covers the case series and how to choose controls. Section 3 introduces the two main flavors of the design, risk-based and rate-based, and shows what your final number, the odds ratio, actually means under each. Section 4 closes the loop on comparability, analysis, and reporting.

**Kiffer:** Two ideas from Lesson 4 carry over. First, one of the limitations of cross-sectional studies is that they can only measure prevalence, not incidence. Case-control studies are designed to overcome that limit. Second, the unified-approach discipline from Lesson 4, the thought experiment, the design before data, the forward projection, applies just as much here. Maybe more, because the choice of cases and controls creates more opportunities for things to go wrong.

**Sarah:** All right, let's get into Section 1. What is a case-control study, formally?

**Kiffer:** The basis of the case-control design is to select individuals who have newly developed the disease or outcome of interest, the cases. And, for comparison, individuals who have not developed the disease at the time of selection, the controls. Then you contrast the frequency of exposure factors in the cases with the frequency of exposure factors in the controls. If smokers are more common among cases than among controls, that's an association between smoking and the disease.

**Sarah:** And the lesson is really insistent about a clarification. A case-control study is not a comparison between cases and healthy people. It's a comparison between cases and non-case subjects.

**Kiffer:** Right. The controls might have other diseases. They might be perfectly healthy. The defining property of a control is just that they don't have the specific disease being studied. The lesson gives an operational test. Ask, would this person have been included as a case if they had developed the outcome? If yes, they're a valid control. If no, they're not.

**Sarah:** Why insist on that test?

**Kiffer:** Because it forces you to think clearly about who could have become a case in your study. The pool of people who could have become cases. That pool has a name in the lesson. It's called the study base.

**Sarah:** So the study base is basically just the source population.

**Kiffer:** Effectively, yes. The study base is the population from which the cases and the controls are obtained. The nature of that population determines how controls should be selected. Get the study base wrong and the rest of the design unravels. The lesson lays out three flavors. Primary base, secondary base, and nested.

**Sarah:** Okay, walk through them.

**Kiffer:** A primary base is a source population you can enumerate. In principle, you could write down a list of everyone in it. A provincial cancer registry. A defined geographic catchment area where every case is captured by a public health system. When you have a primary base, sampling is conceptually clean. The trade-off is operational. Primary-base studies are harder to set up and more expensive.

**Sarah:** Got it. Second flavor, secondary base.

**Kiffer:** Secondary base is when cases come from a clinic, a hospital, or a registry that doesn't represent a complete enumeration of all cases in a defined population. The tricky thing is conceptualizing the actual source population. The pool of people who could have become cases. Because if your cases all come from one hospital, the source population is, in some loose sense, all the people who would have ended up at that hospital if they had become cases. Hard to define operationally.

**Sarah:** Right. And the standard solution?

**Kiffer:** Draw controls from the same source. Same hospital. Same clinic. The logic is that those controls also would have ended up at that hospital if they had become cases. So they represent the source population by construction. Workable, but it creates the biggest single hazard in case-control design, which we'll get to in Section 2.

**Sarah:** Okay, third flavor. Nested case-control.

**Kiffer:** A nested case-control study sits inside a larger cohort study. You have a cohort already enrolled and being followed. As cases occur in the cohort, you sample controls from the cohort itself. People who are at risk at the time each case is identified. This is the methodologically powerful version, for two reasons. First, the sampling fractions are known, which means you can estimate disease frequency by exposure status. Most other case-control designs can't do that. Second, it's enormously efficient. Imagine a cohort of 50,000 people with stored blood samples and you want to measure an expensive biomarker. You can't afford to test everyone. So you nest. You measure the biomarker only in the cases plus a sample of controls. You get most of the information at a fraction of the cost.

**Sarah:** The lesson uses four worked examples that come back through the rest of the lesson. They're labeled Examples 9.1 through 9.4 because they come from Chapter 9 of the Dohoo textbook the course is built on. Let's introduce them now.

**Kiffer:** Example 9.1 is Dorgan and colleagues in 2010. They studied serum estradiol levels and breast cancer. Estradiol is the most potent of the natural estrogens, and there's been longstanding interest in whether higher levels relate to breast cancer risk. The team used blood samples from a parent cohort. Around 6,900 women donated blood between 1977 and 1989, all free of cancer at that point. They were followed for two-plus decades. Of about 6,720 women in extended follow-up, 117 developed breast cancer. So this is a nested case-control inside a much larger cohort. Each case got two controls matched on age within two years, blood draw date within one year, and menstrual cycle day within two days.

**Sarah:** And matching on menstrual cycle day is interesting. Estradiol fluctuates across the cycle. So if you don't match on cycle day, you'd be comparing levels measured at different points in the cycle, and that variation would swamp any real effect.

**Kiffer:** Exactly. That's matching to remove a known nuisance source of variation. Example 9.2 is Dore and colleagues in 2004. They studied risk factors for Salmonella Typhimurium infection. Salmonella Typhimurium is a bacterial cause of foodborne illness. Cases were people in Alberta, British Columbia, and Saskatchewan, Canada, between December 1999 and November 2000, with Salmonella Typhimurium confirmed from stool samples. Controls were matched one-to-one on age and province, randomly selected from provincial health registries. So this is a primary-base, rate-based design. The provincial health registry covers essentially the whole population, which is what makes it a primary base.

**Sarah:** Example 9.3 is Magura and colleagues in 2008. Hypercholesterolemia, which is high blood cholesterol, and prostate cancer.

**Kiffer:** Cases were men newly diagnosed with prostate cancer at MeritCare Hospital between 2004 and 2006. Controls came from the primary-care database of the same hospital. Men aged 50 to 74 without cancer who had received annual physicals and lipid profiles in the previous year. This is a textbook secondary-base, risk-based study. Cases come from one hospital. Controls come from the same hospital. The exclusion criteria included other cancers and non-Caucasian race, which I want to flag. Excluding people by race is methodologically and ethically fraught and would not pass review the same way today. Worth being honest that the example is in the textbook for the design lesson, not because we endorse that exclusion.

**Sarah:** Worth being honest about. Example 9.4 is Rodrigo and colleagues in 2011. Gastroenteritis risk factors as a community-based study nested within a larger randomized controlled trial in South Australia.

**Kiffer:** 300 households kept weekly health diaries. The outcome was something they called highly credible gastroenteritis, defined as two or more loose stools, two or more vomiting episodes, or various combinations with abdominal pain or nausea within 24 hours. Controls were matched to cases by study week. So this is nested, rate-based, and the unit of observation includes repeated occurrences within the same household. Those four examples cover the major design variants. Primary versus secondary base. Risk-based versus rate-based. Nested versus standalone. We'll keep coming back to them.

**Sarah:** Now let's set up the workhorse summary statistic. The odds ratio. This is a concept that really confuses students.

**Kiffer:** Yeah, it does. The odds ratio is the cross-product ratio of a two-by-two table. The rows are exposed versus unexposed. The columns are cases versus controls. Four cells. Call them a-one, a-zero, b-one, b-zero. Where a-one is exposed cases, a-zero is unexposed cases, b-one is exposed controls, b-zero is unexposed controls. The odds ratio is a-one times b-zero, divided by a-zero times b-one. That's it. Just the ratio of the two cross-products.
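
Written out, the table and the cross-product ratio described here look roughly like this, using the spoken cell names as $a_1$, $a_0$, $b_1$, $b_0$. The layout is a reference sketch, not a reproduction of the lesson's own figure.

```latex
\begin{array}{l|cc}
                 & \text{Cases} & \text{Controls} \\ \hline
\text{Exposed}   & a_1          & b_1             \\
\text{Unexposed} & a_0          & b_0
\end{array}
\qquad
OR = \frac{a_1 \, b_0}{a_0 \, b_1}
```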

**Sarah:** And why is this the only valid measure of association in a case-control design?

**Kiffer:** When you sampled on the outcome, the ratio of cases to controls in your study is something you chose. If you sampled one control per case, the proportion of cases is fifty percent. If you sampled four controls per case, twenty percent. Neither tells you the actual frequency of the disease in the source population. So you can't compute a risk, because risk requires the actual proportion of people who got the disease. The denominator you'd need was set by your sampling, not by the disease.

**Sarah:** But the odds ratio survives that, because of a symmetry property.

**Kiffer:** Exactly. The cross-product odds ratio gives you the same number whether you compute it as the odds of disease given exposure divided by the odds of disease given non-exposure, or as the odds of exposure given case status divided by the odds of exposure given control status. The two formulas algebraically simplify to the same number. So even though you sampled on case status, the odds ratio you get is the one you would have gotten if you'd sampled on exposure instead. That symmetry is what makes the odds ratio the right tool when you sampled on the outcome. Risk ratios don't have it. Rate ratios don't have it. Only the odds ratio.
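
A minimal sketch of both points, using made-up counts (the function names and the numbers are assumptions for illustration, not from the lesson). It checks that the two ways of computing the odds ratio agree, and that changing the number of controls per case changes the apparent "risk" in the study but leaves the odds ratio alone.

```python
def odds_ratio(a1, a0, b1, b0):
    """Cross-product ratio. a1/a0 = exposed/unexposed cases,
    b1/b0 = exposed/unexposed controls."""
    return (a1 * b0) / (a0 * b1)


def exposure_odds_ratio(a1, a0, b1, b0):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (a1 / a0) / (b1 / b0)


def disease_odds_ratio(a1, a0, b1, b0):
    """Odds of being a case among the exposed divided by the odds among the unexposed."""
    return (a1 / b1) / (a0 / b0)


def apparent_risk_exposed(a1, b1):
    """Share of exposed study subjects who are cases. This is set by the
    sampling ratio, so it is NOT an estimate of risk in the source population."""
    return a1 / (a1 + b1)


# Hypothetical study with one control per case (100 cases, 100 controls).
one_to_one = dict(a1=60, a0=40, b1=30, b0=70)
# Same source population, four controls per case: control cells scale by 4.
four_to_one = dict(a1=60, a0=40, b1=120, b0=280)

for label, table in [("1:1", one_to_one), ("4:1", four_to_one)]:
    print(
        label,
        "OR:", round(odds_ratio(**table), 3),
        "exposure-odds form:", round(exposure_odds_ratio(**table), 3),
        "disease-odds form:", round(disease_odds_ratio(**table), 3),
        "apparent risk in exposed:", round(apparent_risk_exposed(table["a1"], table["b1"]), 3),
    )
```

The apparent risk among the exposed halves when the control ratio goes from 1:1 to 4:1, which is why it cannot be read as a risk, while all three odds-ratio forms stay put.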

**Sarah:** And the lesson includes a small R box where you build a hypothetical two-by-two table for smoking and lung cancer. Fifty cases, 150 controls. You compute the odds ratio as the cross-product, you add a confidence interval using something called the Woolf method. The example gives an odds ratio of about 13.5. Cases had about 13.5 times the odds of being smokers compared with controls. The arithmetic is simple. Understanding what the odds ratio actually estimates, that's where the work is. We'll come back to that in Section 3.
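
For listeners without the R box in front of them, here is a rough Python equivalent of the same calculation. The transcript gives only the totals and the resulting odds ratio, so the four cell counts below are an assumption: one set of numbers that adds up to 50 cases and 150 controls and yields 13.5. The function name `woolf_ci` is likewise made up for this sketch.

```python
import math


def woolf_ci(a1, a0, b1, b0, z=1.96):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    a1/a0 = exposed/unexposed cases, b1/b0 = exposed/unexposed controls.
    z=1.96 gives an approximate 95% interval.
    """
    or_hat = (a1 * b0) / (a0 * b1)
    # Woolf's standard error of log(OR): square root of the summed reciprocals.
    se_log_or = math.sqrt(1 / a1 + 1 / a0 + 1 / b1 + 1 / b0)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper


# ASSUMED cell counts (not given in the transcript):
# 50 cases = 45 smokers + 5 non-smokers; 150 controls = 60 smokers + 90 non-smokers.
or_hat, lower, upper = woolf_ci(a1=45, a0=5, b1=60, b0=90)
print(f"OR = {or_hat:.1f}, approx. 95% CI {lower:.1f} to {upper:.1f}")
```

The Woolf interval is built on the log scale and then exponentiated, which is why it comes out asymmetric around the point estimate.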

**Kiffer:** Section 2. The case series and the principles of control selection. Two halves. Assembling the case series, then choosing controls. They're not equally hard. The case series side is mostly careful definition. Control selection is where things go off the rails.

**Sarah:** Start with the case series. What are the key decisions?

**Kiffer:** There are four. First, specify the disease, including diagnostic criteria. Second, identify the source or sources of the cases. Third, decide whether you'll use only incident cases or also prevalent ones. And fourth, estimate the required sample size.

**Sarah:** Quick. Define incident and prevalent. They sound similar but they're different.

**Kiffer:** An incident case is someone newly diagnosed. They just developed the disease. A prevalent case is anyone who currently has the disease, regardless of when they were diagnosed. So an incident case is a flow concept, new cases per unit time. A prevalent case is a stock concept, everyone with the disease right now.

**Sarah:** And the lesson says there's virtually unanimous agreement that, when possible, you should use incident cases. Why?

**Kiffer:** There are two reasons. First, prevalent cases bring in survival bias. If a disease kills quickly, the prevalent cases you find are the unusual people who survived long enough to be available. Your case series is enriched for survivors, not for typical cases. Second, prevalent cases let in factors associated with the duration of disease, not just with the development of it. Same idea as the prevalence-equals-incidence-times-duration relationship from Lesson 4. With incident cases, you've stripped out the duration component.

**Sarah:** Diagnostic criteria. The lesson is insistent that the criteria need to be specific, well-defined, and applied uniformly to everyone.

**Kiffer:** Right. Manifestational criteria, meaning the clinical signs and symptoms, plus laboratory or imaging criteria when relevant. And you also have to think about case ascertainment. Are you actually catching all the cases? The lesson flags a specific worry about tertiary care facilities. If your cases come from a specialized referral center, you're enriching for severe or unusual cases. The cases at a tertiary care facility can drift away from typical cases in the broader source population. Your study becomes about a specialized subset, not about the disease in general.

**Sarah:** Okay, onto the harder half. Control selection. The master principle the lesson repeats over and over.

**Kiffer:** Controls should represent the exposure experience of the source population that gave rise to the cases.

**Sarah:** Right. And what does that actually mean?

**Kiffer:** It means that if you measured the prevalence of the exposure in your control group, that prevalence should match the prevalence of the exposure in the population that produced your cases. Not the general public. Not just any healthy people. The population whose disease cases ended up in your case series. If your cases come from a hospital that serves a particular catchment area, your controls should reflect the exposure experience of people in that catchment area.

**Sarah:** And there are four formal principles, drawn from Wacholder and colleagues in 1992.

**Kiffer:** Right. Sander Wacholder and colleagues wrote a really influential set of papers in 1992 that are the standard reference for control selection. The four principles. First, controls should come from the same study base as cases. Second, in a closed population, controls are drawn from people who are still non-cases at the end of the risk period. Third, in an open population, controls are sampled from people at risk at the time each case occurs, which takes us into incidence density sampling in Section 3. Fourth, the eligibility period for controls should match the eligibility period for cases. They all have to hold together.

**Sarah:** Now the most useful part of the whole lesson. The catalogue of common control sources, with their strengths and limitations. Every source trades a different strength against a different bias. Six in the lesson.

**Kiffer:** First. Population controls. Sampled from the general population. Voter rolls, community-based registries, health administrative data. If your cases came from a primary base that covers the whole population, population controls are the right match. The limitation is low response rates. People are busy, they don't have a personal stake. Recall bias is also a problem because they have less reason than cases to remember exposures from years ago.

**Sarah:** Okay, second. Hospital controls. People hospitalized for conditions other than the disease being studied.

**Kiffer:** Strengths are accessibility, willingness to participate, and similar recall ability to cases, because they're also sick people thinking about why they ended up in the hospital. The limitation is the killer one. Whatever brought them to the hospital might be related to the exposure you're studying.

**Sarah:** Okay, walk through the canonical example.

**Kiffer:** Imagine you're studying smoking and lung cancer using hospital-based cases and controls. You go to the hospital, you find lung cancer cases. You sample controls from people in the same hospital admitted for other reasons. But many other reasons people end up in the hospital are also related to smoking. Heart attacks. Chronic obstructive pulmonary disease. Other smoking-related cancers. Stroke. So your controls have an artificially high prevalence of smoking compared to the general population. Your odds ratio gets biased toward the null. The smoking-lung-cancer association looks weaker than it really is, because both your cases and your controls are enriched for smokers.
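
To put illustrative numbers on that bias, here is a small sketch. All three smoking prevalences are invented for the example and are not from the lesson or from any study.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1 - p)


def odds_ratio_from_prevalences(p_cases, p_controls):
    """Exposure odds in cases divided by exposure odds in controls."""
    return odds(p_cases) / odds(p_controls)


# Hypothetical smoking prevalences.
smoking_in_cases = 0.80
smoking_in_source_population = 0.30   # what ideal controls would show
smoking_in_hospital_controls = 0.50   # inflated by smoking-related admissions

or_ideal = odds_ratio_from_prevalences(smoking_in_cases, smoking_in_source_population)
or_hospital = odds_ratio_from_prevalences(smoking_in_cases, smoking_in_hospital_controls)

print(f"OR with representative controls: {or_ideal:.2f}")
print(f"OR with hospital controls:       {or_hospital:.2f}")
```

Nothing about the cases changed between the two lines. Only the control group's smoking prevalence moved, and that alone pulls the odds ratio toward one.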

**Sarah:** Third, friend controls. The case names a friend who agrees to participate.

**Kiffer:** Strengths are convenience and similar recall. Friends often share age, education, engagement with the study. The limitation is over-matching. Friends share lifestyle factors. Diet, exercise, smoking, drinking. So if your exposure is anything related to lifestyle, friend controls will look too similar to cases on that exposure. Bunin and colleagues in 2011 documented that friend controls produce biased estimates of association precisely because of over-matching.

**Sarah:** Fourth, neighbourhood controls. People living near the case.

**Kiffer:** Strengths are shared socioeconomic context. Limitation is exactly that shared context. If neighbourhood itself is related to the exposure, you're matching out the variation you're trying to study. Imagine studying environmental air pollution and respiratory disease using neighbourhood controls. The controls breathe the same air. You've removed the variation in your exposure of interest. Your odds ratio collapses toward one, even if the exposure really matters.

**Sarah:** Fifth, random digit dialling. Sometimes called R-D-D.

**Kiffer:** You generate random phone numbers within an area code and call them. Strengths in the early days were theoretical population-representativeness. Limitations have grown. Distinguishing business phones from home phones requires extra work. And response rates have collapsed. People don't answer unknown numbers anymore. Cell phones changed the population of phone numbers. So the technique is much weaker than it used to be.

**Sarah:** And sixth, partner controls. The case's spouse or romantic partner.

**Kiffer:** Strengths are very high cooperation. Limitations are that partners share environment, household exposures, and many lifestyle factors. So you're again over-matching. Plus, the age and sex distribution of partners is determined by the case, which can create comparison problems.

**Sarah:** And the takeaway from the catalogue is that there's no perfect control source. Every choice gives you something and costs you something. The art of case-control design is matching the control source to your specific research question and the specific exposure you want to study.

**Kiffer:** Section 3. The risk-based versus rate-based design distinction. The densest, most important conceptual part of the lesson. We just established that the odds ratio is the right tool for case-control studies. But there are two main flavors of design, and they give the odds ratio different interpretations.

**Sarah:** First flavor, risk-based design. Sometimes called cumulative incidence design.

**Kiffer:** In a risk-based design, controls are selected from people who didn't become cases by the end of the study period. The source population is closed. Closed means everyone who's at risk is at risk for the same fixed time. Nobody enters or leaves. The risk period has a clear start and end.

**Sarah:** Give me a concrete example.

**Kiffer:** Outbreak investigations. A foodborne illness outbreak from a wedding banquet. The risk period is the meal. Cases are people who got sick within the relevant time after the meal. Controls are people who attended the same event but didn't get sick. The risk window is short and well-defined. The closed-population assumption holds easily.

**Sarah:** And so the odds ratio in this design estimates what?

**Kiffer:** The odds ratio estimates the risk ratio, but only if the disease is rare. The textbook cutoff is roughly less than five percent. This is called the rare-disease assumption.

**Sarah:** Quick definition. The risk ratio is the ratio of the probability of disease in the exposed group to the probability of disease in the unexposed group. If exposed people have a 10 percent chance of getting the disease and unexposed have a 5 percent chance, the risk ratio is two.

**Kiffer:** And the rare-disease assumption matters because, mathematically, when the disease is rare, the number of cases is small relative to the number at risk, so odds and risks become approximately equal. When the disease is common, odds and risks diverge, and the odds ratio lands farther from one than the risk ratio, so it overstates the strength of the association. Knol and colleagues in 2008 wrote what has become the standard modern reference for what the odds ratio estimates under various assumptions. Worth knowing that name.
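
A quick numerical sketch of the rare-disease assumption. It holds the risk ratio at two and varies how common the disease is. The 10 percent versus 5 percent pair is the one from the conversation; the "rare" and "common" pairs are added here as assumptions for contrast.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """Probability of disease in the exposed divided by that in the unexposed."""
    return risk_exposed / risk_unexposed


def odds_ratio(risk_exposed, risk_unexposed):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


scenarios = [
    ("rare (hypothetical)", 0.002, 0.001),
    ("10% vs 5% (from the conversation)", 0.10, 0.05),
    ("common (hypothetical)", 0.60, 0.30),
]

for label, r1, r0 in scenarios:
    print(f"{label}: RR = {risk_ratio(r1, r0):.2f}, OR = {odds_ratio(r1, r0):.2f}")
```

The risk ratio is two in every row, while the odds ratio drifts upward from it as the disease becomes more common.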

**Sarah:** Second flavor, rate-based design. Also called incidence density design.

**Kiffer:** This is the design for open populations. Where people enter and leave the population over time, where the time at risk varies from person to person. Most populations epidemiology actually studies behave that way. People are born, die, move, age. The risk period is not a fixed window. So the case-control design has to be rebuilt around person-time rather than head counts.

**Sarah:** Quick, define person-time.

**Kiffer:** Person-time is just the sum of the time each person was at risk. If one person was followed for a year, that's one person-year. A thousand people each followed for a year is a thousand person-years. Five hundred people each followed for two years is also a thousand person-years. The unit captures both how many people and how long. In an open population, person-time is the right denominator for measuring rates.
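
The arithmetic is small enough to check in a few lines. The first three groups are the ones from the conversation; the `mixed` population is a made-up example of an open population where follow-up lengths differ.

```python
def person_years(groups):
    """Total person-years for a list of (number_of_people, years_at_risk_each)."""
    return sum(n_people * years for n_people, years in groups)


print(person_years([(1, 1)]))      # one person followed for a year
print(person_years([(1000, 1)]))   # a thousand people, one year each
print(person_years([(500, 2)]))    # five hundred people, two years each

# Hypothetical open population with unequal time at risk.
mixed = [(300, 0.5), (400, 1), (150, 3)]
print(person_years(mixed))
```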

**Sarah:** So in a rate-based case-control study, you're integrating person-time into the sampling.

**Kiffer:** Right. The technique is called incidence density sampling. As each case occurs, you sample controls from the people who were at risk at that exact moment. So your control sampling reflects the time structure of the risk period.

**Sarah:** Let me make sure I have this. If a case occurs in March, the controls for that case are sampled from people at risk and disease-free in March. If another case occurs in November, the controls are sampled from people at risk in November. You're taking little snapshots throughout the study period.

**Kiffer:** Exactly. And here's the result. With incidence density sampling, the ratio of exposed controls to unexposed controls is approximately equal to the ratio of exposed person-time to unexposed person-time in the source population. That property makes the math work. The lesson states it as Equation 9.4.

**Sarah:** Right. And the consequence for the odds ratio?

**Kiffer:** In a rate-based design with incidence density sampling, the odds ratio directly estimates the incidence rate ratio. With no rare-disease assumption needed.
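
A back-of-envelope sketch of why that works, using expected counts rather than a simulation. Every number below is hypothetical, and the outcome is deliberately common to show that rarity is not needed. The one assumption carried over from the conversation is that density-sampled controls split between exposed and unexposed in proportion to person-time.

```python
# Hypothetical open source population.
person_years = {"exposed": 20_000, "unexposed": 80_000}
rate_per_person_year = {"exposed": 0.030, "unexposed": 0.010}

# Expected cases = rate x person-time in each exposure group.
cases = {g: rate_per_person_year[g] * person_years[g] for g in person_years}

# Under incidence density sampling, controls are expected to fall into
# each exposure group in proportion to that group's person-time.
n_controls = 1_400
total_person_years = sum(person_years.values())
controls = {g: n_controls * person_years[g] / total_person_years for g in person_years}

odds_ratio = (cases["exposed"] * controls["unexposed"]) / (
    cases["unexposed"] * controls["exposed"]
)
rate_ratio = rate_per_person_year["exposed"] / rate_per_person_year["unexposed"]

print(f"expected cases:    {cases}")
print(f"expected controls: {controls}")
print(f"odds ratio = {odds_ratio:.2f}, incidence rate ratio = {rate_ratio:.2f}")
```

The person-time terms cancel in the cross-product, leaving the ratio of the two rates.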

**Sarah:** Quick definition. The incidence rate ratio is the ratio of the rate of new cases per unit person-time in the exposed group to the rate in the unexposed group. The analog of the risk ratio but for studies that use time at risk in the denominator. Cohort studies in Lesson 6 will get into this in detail.

**Kiffer:** And the no-rare-disease-assumption thing is a big deal. A rate-based case-control design can study common diseases with the same statistical interpretability as cohort studies. You don't have to argue for rarity. The rate-based design just gives you a valid estimate of the rate ratio.

**Sarah:** Incidence density sampling has a few features that feel weird the first time you see them but are mathematically correct. The lesson lists four.

**Kiffer:** First, you don't need to know the time at risk for potential controls. You sample at the moment each case occurs. Second, you don't need to assume the population is stable. Third, the number of controls per case can vary. And fourth, and this is the strangest one, subjects initially identified as controls can subsequently become cases.

**Sarah:** Yeah, walk through that fourth one because it really does sound wrong.

**Kiffer:** Imagine someone is sampled as a control at time t-one, because at t-one they're at risk and disease-free. They're a valid control at that moment. At time t-two, six months later, they develop the disease. They've now become a case. Both samplings are valid. Each contributes to the analysis at the time of each sampling. Looks paradoxical, but it's just bookkeeping. Going back to the Salmonella example, that one is rate-based because the population in those three Canadian provinces over a year was open. People moved in and out. The study used incidence density sampling implicitly by drawing controls matched to cases by province and time period.
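
The bookkeeping can be sketched with a toy cohort. The six subjects, their follow-up times, and the function names are all invented for illustration. The point is only that the risk set is rebuilt at each case's diagnosis time, so a future case (subject B here) is eligible as a control for an earlier one.

```python
import random

# Hypothetical follow-up records: subject -> (entry_time, exit_time, became_case).
# exit_time is the diagnosis time for cases and the end of follow-up otherwise.
cohort = {
    "A": (0, 3, True),
    "B": (0, 9, True),
    "C": (0, 12, False),
    "D": (2, 12, False),
    "E": (5, 12, False),
    "F": (0, 6, False),
}


def at_risk(subject, t):
    """True if the subject is under follow-up and disease-free at time t."""
    entry, exit_, _ = cohort[subject]
    return entry <= t < exit_


def incidence_density_sample(controls_per_case=2, seed=1):
    """For each case, in time order, draw controls from the risk set at that time."""
    rng = random.Random(seed)
    case_times = sorted(
        (exit_, s) for s, (_, exit_, is_case) in cohort.items() if is_case
    )
    sampled = []
    for t, case in case_times:
        risk_set = sorted(s for s in cohort if s != case and at_risk(s, t))
        k = min(controls_per_case, len(risk_set))
        sampled.append((t, case, rng.sample(risk_set, k)))
    return sampled


for t, case, controls in incidence_density_sample():
    print(f"t={t}: case {case}, controls drawn from risk set: {controls}")
```

Subject B is in the risk set at t=3 and is a case at t=9. Subject E has not yet entered at t=3 and subject F has already left by t=9, so neither is eligible at those times.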

**Sarah:** Section 4. Implementation. The practical questions after the design is set.

**Kiffer:** There are five questions. Number of controls per case. Number of control groups. Exposure assessment. Comparability. And reporting.

**Sarah:** Okay, first. Number of controls per case. Most studies use one-to-one. One control per case.

**Kiffer:** The lesson is honest that the one-to-one ratio is partly statistical efficiency, partly convention, not magical. When cases are scarce and controls are easy to get, you can improve precision by adding more controls per case. The benefit tails off. Going from one to two improves precision noticeably. Two to three, smaller. Three to four, smaller still. Beyond four, the marginal gain is barely noticeable, and the cost keeps climbing. Three to four controls per case is usually the practical maximum.
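
One commonly cited rule of thumb for that diminishing return, which the transcript does not state and is assumed here, is that with m controls per case the statistical efficiency relative to an unlimited supply of controls is roughly m / (m + 1). It is an approximation that works best when the true odds ratio is near one.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency relative to unlimited controls: m / (m + 1).
    Rule-of-thumb approximation, assumed for illustration."""
    m = controls_per_case
    return m / (m + 1)


previous = None
for m in range(1, 7):
    eff = relative_efficiency(m)
    gain = "" if previous is None else f" (gain {eff - previous:+.3f})"
    print(f"{m} control(s) per case: {eff:.3f}{gain}")
    previous = eff
```

Under this approximation each extra control per case buys less than the one before, which matches the pattern described above.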

**Sarah:** Second question, number of control groups. Some studies use multiple control groups as a robustness check.

**Kiffer:** A couple of examples. Abubakar and colleagues in 2007 studied Crohn's disease, which is an inflammatory bowel disease. They used nine hospitals in England with both hospital-derived and community-derived controls. The community controls were sampled randomly from general practitioners. The choice of control group had little impact on their results, which was the point. Brenner and colleagues in 2010 evaluated lung cancer in never-smokers in Toronto. They used both population-based controls, sampled from property tax files, and hospital-based controls from a family medicine clinic. Both gave similar answers.

**Sarah:** And the rule from those examples is, multiple control groups are useful as a robustness check, but they add complexity. If they give different answers, interpretation gets harder fast. Pomp and colleagues in 2010 noted that, in general, the value of more than one control group is limited.

**Kiffer:** Third question, exposure and covariate assessment. Most case-control studies are retrospective. The disease has already occurred when the study begins. So you have to ask people about past exposures. And this is where recall bias becomes a serious threat.

**Sarah:** We met recall bias in Lesson 3 with the INTERPHONE study. Quick refresher. INTERPHONE was an international project that found mobile phone use seemed to be associated with brain tumors, especially on the same side of the head as the tumor. That pattern is biologically implausible, and was likely an artifact of cases differentially recalling phone use after their diagnosis. People with a tumor would naturally search their memory for explanations, and would over-report past phone use on the side of the tumor.

**Kiffer:** Exactly. The principle is, you need a concise definition of exposure. And critically, you need the same measurement process for cases and controls. If you measure exposure differently in the two groups, differential measurement error leaks straight into your odds ratio. Standard mitigation strategies include record-based exposure assessment, where you don't rely on memory at all. You use medical records, employment records, pharmacy records. They're not perfect, but they're not subject to recall bias the way self-report is. And blinding of the data collector. Whoever's collecting the exposure information shouldn't know whether the participant is a case or a control.

**Sarah:** Fourth question, maintaining comparability between cases and controls. The lesson presents three tools. Restriction, matching, and analytic control.

**Kiffer:** Restriction is the simplest. You exclude certain types of subjects from both groups. If you're worried that age confounds the smoking-cancer relationship, you might restrict the study to people aged 50 to 70. You've removed age as a source of variation by design. Matching is the next tool. You pair each case with one or more controls similar on certain characteristics. Same age. Same sex. The Dorgan study matched on age, blood draw date, and menstrual cycle day. Matching can be powerful but it has costs. You can't study the matched variable as an exposure. And matched studies require special analysis, like conditional logistic regression. Third tool. Analytic control. Multivariable statistical models. You collect data on potential confounders and adjust for them in the analysis using techniques like multiple logistic regression. The lesson notes that analytic control is the approach most often relied upon, sometimes combined with restriction.
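
To see why matched data need their own analysis, consider the simplest case: one-to-one matching, a yes/no exposure, and no other covariates. In that setting the conditional estimate of the odds ratio reduces to the ratio of discordant pairs. The sketch below uses invented pair counts and compares that estimate with the ordinary cross-product that ignores the pairing. It is a hand calculation for illustration, not a substitute for fitting a conditional logistic regression.

```python
def matched_pair_odds_ratio(pairs):
    """Conditional OR for 1:1 matched pairs: pairs where only the case is
    exposed, divided by pairs where only the control is exposed."""
    case_only = sum(1 for case_exp, ctrl_exp in pairs if case_exp and not ctrl_exp)
    control_only = sum(1 for case_exp, ctrl_exp in pairs if ctrl_exp and not case_exp)
    return case_only / control_only


def unmatched_odds_ratio(pairs):
    """Cross-product OR from the same subjects, ignoring the pairing."""
    a1 = sum(1 for case_exp, _ in pairs if case_exp)   # exposed cases
    a0 = len(pairs) - a1                               # unexposed cases
    b1 = sum(1 for _, ctrl_exp in pairs if ctrl_exp)   # exposed controls
    b0 = len(pairs) - b1                               # unexposed controls
    return (a1 * b0) / (a0 * b1)


# Hypothetical data: each tuple is (case_exposed, control_exposed) for one pair.
pairs = (
    [(True, True)] * 10      # both exposed
    + [(True, False)] * 30   # only the case exposed
    + [(False, True)] * 10   # only the control exposed
    + [(False, False)] * 50  # neither exposed
)

print(f"matched (discordant-pair) OR: {matched_pair_odds_ratio(pairs):.2f}")
print(f"unmatched cross-product OR:   {unmatched_odds_ratio(pairs):.2f}")
```

In this example, ignoring the pairing gives an estimate closer to one than the matched analysis does.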

**Sarah:** Fifth and final question, reporting. STROBE again. STROBE stands for Strengthening the Reporting of Observational Studies in Epidemiology. It's the international consensus checklist for what observational studies should report. We met it in Lessons 3 and 4.

**Kiffer:** There's a case-control extension within STROBE. Vandenbroucke and colleagues in 2007 wrote the original statement and the case-control extension. The items most likely to go missing in case-control reports include eligibility criteria, the sources and methods of case ascertainment, the rationale for the choice of cases and controls, the matching criteria if you used matching, and the number of controls per case.

**Sarah:** STROBE adherence makes a study evaluable. It doesn't make it good. But you need the first to even attempt the second.

**Kiffer:** And one more thing the lesson hits in Section 4. What does the odds ratio actually estimate, given all the design decisions? The lesson summarizes this as a kind of lookup table.

**Sarah:** Okay, walk through it.

**Kiffer:** First combination, risk-based design with controls sampled at the end of follow-up. The odds ratio estimates the risk ratio, provided the disease is rare in the source population, and provided censoring is unrelated to exposure. Second combination, concurrent sampling, meaning incidence density sampling. The odds ratio estimates the rate ratio in both closed and open populations. No rare-disease assumption needed. Third combination, controls selected from an open population without concurrent sampling. The odds ratio estimates the rate ratio only if the population is stable. Otherwise it's just the odds ratio, with no clean cohort-equivalent interpretation.
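
The lookup table Kiffer just described can be written down literally. This is only a sketch of the three combinations as stated in the conversation. The dictionary keys and the function name are made up for illustration and do not come from the lesson.

```python
# Each entry maps a sampling design to (what the OR estimates, conditions).
OR_ESTIMAND = {
    "risk_based_end_of_follow_up": (
        "risk ratio",
        "disease is rare in the source population; censoring unrelated to exposure",
    ),
    "concurrent_incidence_density": (
        "rate ratio",
        "closed or open population; no rare-disease assumption needed",
    ),
    "open_population_non_concurrent": (
        "rate ratio",
        "only if the population is stable; otherwise it is just an odds ratio",
    ),
}


def what_or_estimates(design):
    """Return (estimand, conditions) for one of the three designs above."""
    try:
        return OR_ESTIMAND[design]
    except KeyError:
        raise ValueError(f"Unknown design: {design!r}") from None


for design, (estimand, conditions) in OR_ESTIMAND.items():
    print(f"{design}: OR estimates the {estimand} ({conditions})")
```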

**Sarah:** And that connects all the design choices to one bottom line. The same odds ratio number, computed from the same two-by-two table, has different meanings depending on the sampling decisions you made before any data were collected. Design trumps analysis. We saw the slogan in Lesson 4. Section 3 of this lesson is the clearest demonstration of that slogan in action.

**Kiffer:** All right. Let me try to pull the takeaways together.

**Sarah:** Yeah, I want to enumerate them. There are seven I'd want a beginning student to leave with.

**Kiffer:** Yeah, go for it.

**Sarah:** First takeaway, the basic logic. Case-control studies sample on the outcome. They compare exposure frequency between cases and non-cases. Controls are not simply healthy people. Controls are people who would have been included as cases if they had developed the disease.

**Kiffer:** Second, the study base determines how controls should be selected. Three flavors. Primary base if you can enumerate the source population. Secondary base if cases come from a specialized facility. Nested if you have a parent cohort study. Nested designs let you estimate disease frequency by exposure status, which other case-control designs can't do.

**Sarah:** Third, use incident cases when possible. Apply the same diagnostic criteria uniformly. Aim for complete case ascertainment. Beware tertiary referral centers.

**Kiffer:** Fourth, controls must represent the exposure experience of the source population. Every common control source has its own bias profile. Population controls have low response rates. Hospital controls share hospitalization risk factors. Friend controls over-match on lifestyle. Neighbourhood controls over-match on context. Random digit dialling has collapsing response rates. Partner controls share household exposures. The art is matching the control source to your specific research question and exposure.

**Sarah:** Fifth, the risk-based versus rate-based distinction is the design choice that determines what your odds ratio actually means. Risk-based works for closed populations and short risk periods, like outbreak investigations. The odds ratio estimates the risk ratio under the rare-disease assumption, where rare means roughly under five to ten percent. Rate-based with incidence density sampling works for open populations, however common the disease is. The odds ratio estimates the incidence rate ratio with no rare-disease assumption needed.
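
The rare-disease assumption is easy to check with arithmetic. The sketch below holds the true risk ratio at 2 and lets the baseline risk grow. The risks are hypothetical values chosen for illustration, and the function names are not from the lesson.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """Ratio of two risks (cumulative incidences)."""
    return risk_exposed / risk_unexposed


def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Odds ratio implied by the same two risks."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical baseline risks; the exposed risk is always double the baseline.
results = {}
for baseline in (0.01, 0.05, 0.20):
    exposed = 2 * baseline
    results[baseline] = (risk_ratio(exposed, baseline),
                         odds_ratio_from_risks(exposed, baseline))
    rr, or_ = results[baseline]
    print(f"baseline risk {baseline:.0%}: RR = {rr:.2f}, OR = {or_:.2f}")
```

The odds ratio stays close to the risk ratio while the disease is rare and drifts away from it as the baseline risk climbs, which is the reason for the rough five to ten percent rule of thumb.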

**Kiffer:** Sixth, the implementation choices in Section 4. Three to four controls per case is usually the practical maximum. Multiple control groups are mainly useful as a robustness check. Use the same exposure assessment process for cases and controls, with blinding when possible. Build comparability through restriction, matching, and analytic control, often in combination. Report through STROBE.
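
One way to see why three to four controls per case is the usual ceiling is a commonly cited approximation that is not stated in the conversation itself: with r controls per case, statistical efficiency relative to an unlimited supply of controls is roughly r / (r + 1). The approximation is usually derived for odds ratios near 1, so treat the figures as a rough guide.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency of r controls per case relative to unlimited
    controls, using the rule of thumb r / (r + 1)."""
    r = controls_per_case
    if r <= 0:
        raise ValueError("controls_per_case must be positive")
    return r / (r + 1)


efficiencies = {r: relative_efficiency(r) for r in range(1, 7)}
for r, eff in efficiencies.items():
    print(f"{r} control(s) per case: about {eff:.0%} of maximum efficiency")
```

Under this approximation each extra control per case adds less than the one before, so the gain from going past four is small relative to the added cost.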

**Sarah:** And seventh, which I think the lesson really wants you to leave with. The odds ratio is the only valid measure of association in a case-control design, and the reason is the symmetry property of the cross-product. It gives you the same number whether you compute it as odds of disease given exposure, or odds of exposure given disease. So it survives sampling on either side of the table. Risk ratios and rate ratios don't have that property. That symmetry is unique to the odds ratio, and it's why the odds ratio became the workhorse statistic of case-control epidemiology.
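
The symmetry property can be checked numerically. The sketch below uses a hypothetical source population of 20,000 people, then mimics a case-control study by keeping every case and sampling one in twenty non-cases. The counts and variable names are made up for illustration.

```python
# Hypothetical full source population.
exposed_cases, exposed_noncases = 200, 9800
unexposed_cases, unexposed_noncases = 100, 9900

# Odds of disease given exposure status, compared across exposure groups.
or_disease_given_exposure = (exposed_cases / exposed_noncases) / (
    unexposed_cases / unexposed_noncases)

# Odds of exposure given disease status, compared across disease groups.
or_exposure_given_disease = (exposed_cases / unexposed_cases) / (
    exposed_noncases / unexposed_noncases)

true_risk_ratio = (exposed_cases / (exposed_cases + exposed_noncases)) / (
    unexposed_cases / (unexposed_cases + unexposed_noncases))

# Case-control sampling: keep all cases, sample 1 in 20 non-cases as controls.
SAMPLE_ONE_IN = 20
exposed_controls = exposed_noncases // SAMPLE_ONE_IN
unexposed_controls = unexposed_noncases // SAMPLE_ONE_IN

sampled_or = (exposed_cases * unexposed_controls) / (
    unexposed_cases * exposed_controls)

# A "risk ratio" computed naively from the sampled table is not meaningful.
naive_rr = (exposed_cases / (exposed_cases + exposed_controls)) / (
    unexposed_cases / (unexposed_cases + unexposed_controls))

print(f"OR, disease given exposure: {or_disease_given_exposure:.3f}")
print(f"OR, exposure given disease: {or_exposure_given_disease:.3f}")
print(f"OR after sampling controls: {sampled_or:.3f}")
print(f"True risk ratio: {true_risk_ratio:.3f}, naive RR from sample: {naive_rr:.3f}")
```

Both ways of computing the odds ratio agree, and the value is unchanged after sampling controls, while the naive risk ratio from the sampled table no longer matches the population value.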

**Kiffer:** And one practical recommendation. Don't skip the R box on computing the odds ratio. The cross-product of a two-by-two table, plus a Woolf-method confidence interval. The arithmetic is multiplication and division. But what the odds ratio actually estimates depends on every design decision we walked through. The arithmetic is simple. Understanding is the work.
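
For listeners who want the arithmetic spelled out, here is a minimal sketch of the cross-product odds ratio with a Woolf (log-based) confidence interval. The lesson's box uses R. This version is in Python, the counts are hypothetical, and the function name is illustrative, so treat it as a companion sketch and not a copy of the lesson's code.

```python
import math


def odds_ratio_woolf(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval from a two-by-two table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    z = 1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("Woolf's method needs all four cells to be positive.")
    estimate = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log_or)
    upper = math.exp(math.log(estimate) + z * se_log_or)
    return estimate, lower, upper


# Hypothetical table: 60 exposed and 40 unexposed cases,
# 30 exposed and 70 unexposed controls.
estimate, lower, upper = odds_ratio_woolf(60, 40, 30, 70)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around the point estimate.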

**Sarah:** And one connection back to the framing lessons. The unified-approach discipline from Lesson 4 applies all the way through. Every choice we discussed, study base, case definition, control source, design flavor, comparability, exposure assessment, that's all design. All of it has to be settled before you collect data. By the time the data come in, the damage from a bad control choice is already baked into your odds ratio. There's no statistical fix for sampling from the wrong source population.

**Kiffer:** Next up is Lesson 6. Cohort Studies. The same design logic flips. Sample on exposure, follow forward in time, measure incidence directly. Each design has strengths the other lacks. Knowing which to reach for in which situation is the practical pay-off of having both in your toolkit.

**Sarah:** Take care, everyone.

**Kiffer:** Yeah, see you there.