# Lesson 9 — Hybrid Study Designs (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5534 words • ~29.9 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 9, Hybrid Study Designs. The framing is clean. Lesson 8 reviewed the four standard observational designs. Cross-sectional, cohort, case-control, and ecological. Lesson 9 introduces the variants that combine, extend, or subsample from those four to solve specific problems.

**Sarah:** Before we go anywhere else, can you slow down and say what a hybrid study design actually is, in plain language? Because I want a beginning student to be able to tell me what makes something hybrid versus standard.

**Kiffer:** Sure. A hybrid design is a variant of one of the classic observational designs that has been engineered to handle a specific methodological problem. Rare or expensive exposures. Transient triggers, like the morning cup of coffee that may or may not have triggered a heart rhythm problem. Surveillance data without an obvious comparison group. The hybrids exist because someone hit a wall with the standard four designs and figured out a workaround.

**Sarah:** And the lesson lists six hybrid designs plus one important sampling strategy. Six is a lot. Is there a way to keep them organized in your head?

**Kiffer:** Yes. The lesson groups them in three buckets, and that's how we'll work through the episode. Bucket one is time-based case-only designs. The case-crossover and the self-controlled case-series. Both use the case as their own control across time. Bucket two is case-only comparison designs. Case-case, case-case-control, and gene-environment case-only. These compare different kinds of cases to each other, or use cases without any control group. Bucket three is case-cohort and two-stage sampling. These keep a cohort backbone but only pay for expensive measurements on a subsample.

**Sarah:** And the trick that the lesson recommends, which I want to make explicit, is that each hybrid was invented to solve one problem. So if you remember the problem, the design comes back to you.

**Kiffer:** Exactly. Case-crossover solves the problem of choosing controls for a transient trigger. Self-controlled case-series solves it for vaccine safety, where a healthy unvaccinated comparison group is hard to find. Case-cohort lets one comparison group support multiple outcomes. Case-only handles gene-environment interaction without controls. Two-stage sampling lets you spend money on detailed measurement only where it matters. Match the problem to the design and you've basically learned the lesson.

**Sarah:** Okay. Let's start with bucket one. Time-based case-only designs. Case-crossover first.

**Kiffer:** The case-crossover design is the observational analogue of the experimental crossover. Each case serves as their own control by comparing exposure during a defined time window before the event with exposure during one or more comparison windows at other times. Maclure in 2007 framed the design as answering the why-now question instead of the why-me question that traditional case-control studies answer.

**Sarah:** Hold on. Walk me through who Maclure is and what why-now versus why-me actually means, because that contrast is doing a lot of work.

**Kiffer:** Malcolm Maclure is an epidemiologist at the University of British Columbia who developed the case-crossover design in 1991 and revisited the framing in 2007. The why-me question is the traditional case-control question. Why did this person get a heart attack while their neighbor didn't? That's a between-person contrast. Genetics, lifestyle, environment, all the stuff that makes one person different from another. The why-now question is different. It assumes the person was already at risk and asks, why did the heart attack happen at three in the afternoon last Tuesday rather than at some other time? That's a within-person contrast across time.

**Sarah:** And the magic of the within-person contrast is that everything that's stable about that person cancels out.

**Kiffer:** Right. By using the same person as both the case and the control, the design automatically controls for every time-invariant confounder. Sex. Age. Genetics. Chronic comorbidities. Smoking status if it's stable. Even confounders the investigator never measured or never even thought of. They're held constant by design, because they're the same person.

**Sarah:** That's a clean idea. Let me try it back. If I want to know whether heavy physical exertion triggers a heart attack, I take people who had a heart attack, I look at what they were doing in the hour before, the risk window, and I compare that to what they were doing during some other comparable hour, the control window. If exertion is more common in the risk window, that's evidence the exposure may have triggered the event.

**Kiffer:** That's it. The lesson lists three conditions for that logic to work. First, the exposure must be transient. Stable exposures like ongoing smoking status or chronic medication use cannot be evaluated, because they would be present in every time window. Second, the outcome must be acute. The event has to happen close in time to the exposure if the causal link is real. Third, the exposure must not be affected by the outcome. If experiencing the event changes future exposure, like a heart attack altering subsequent activity, you have to be careful about which control windows you use.

**Sarah:** And the two big design choices are the length of the risk period and the strategy for choosing control periods.

**Kiffer:** Right. The risk period, sometimes called the case-risk window, is the time during which the exposure, if causal, would have produced the event. Pick it too long and you sweep up irrelevant exposure. Pick it too short and you miss the real effect. For physical exertion and heart attack, the risk window is a few hours. For mobile phone use and motor vehicle crashes, it's about five minutes. For air pollution effects on respiratory hospitalisations, it's typically one day. The right window depends on the biology of the trigger.

**Sarah:** Then there are three referent strategies for choosing the control periods. Walk me through them slowly.

**Kiffer:** Strategy one. Unidirectional referent selection, sometimes called backward-only. Control periods are chosen only from time before the event. This was the original case-crossover approach. It is the right choice when the event itself alters future exposure. A leg injury changes how much you run afterwards. Or food poisoning changes what you eat in the days after. Backward-only avoids that.

**Sarah:** And the limitation is that if the exposure has a long-term trend, like air pollution gradually getting worse over years, comparing only earlier periods to the case-risk period can produce biased estimates.

**Kiffer:** Right. Strategy two. Symmetric bidirectional referent selection. Control periods are chosen both before and after the event, often equally spaced. The intent is that if exposure is trending up or down, the higher and lower exposure values will roughly cancel out. This is now the most widely used approach. The limitation is that it's only valid if the event itself does not affect future exposure.
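A minimal sketch of how the first two referent strategies pick control days, for listeners following along in the notes. The one-week spacing, the two referents per side, and the event date are illustrative assumptions, not values from the lesson.

```python
from datetime import date, timedelta

def referent_days(event_day, spacing_days=7, n_each_side=2, bidirectional=False):
    """Pick control days around an event.

    Unidirectional (backward-only): referents come only from before the event.
    Symmetric bidirectional: the same offsets are mirrored after the event.
    spacing_days and n_each_side are hypothetical defaults for illustration.
    """
    before = [event_day - timedelta(days=spacing_days * k)
              for k in range(n_each_side, 0, -1)]
    if not bidirectional:
        return before
    after = [event_day + timedelta(days=spacing_days * k)
             for k in range(1, n_each_side + 1)]
    return before + after

example_event = date(2024, 7, 17)  # hypothetical event date
backward_only = referent_days(example_event)
symmetric = referent_days(example_event, bidirectional=True)
```

With these assumed settings, the backward-only call returns two earlier days and the symmetric call adds their mirror images after the event. That is why the symmetric version is only safe when the event does not change later exposure.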

**Sarah:** And strategy three is the time-stratified design from Janes, Sheppard, and Lumley in 2005.

**Kiffer:** Right. Holly Janes, Lianne Sheppard, and Thomas Lumley, biostatisticians at the University of Washington, formalized this approach. The setup is, you have shared-exposure data available across the entire observation period, like daily air pollution measurements that exist for everyone, every day. You stratify the study period a priori, often by month. When a case occurs, say on a Wednesday in July, all the other Wednesdays in July serve as control periods. That effectively matches on day-of-week and month, and it sidesteps the controversy about how to choose the spacing.
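The same-weekday, same-month rule can be sketched in a few lines. This is only an illustration of the matching logic with calendar months as strata. The function name and the event date are made up for the example.

```python
from datetime import date, timedelta

def time_stratified_referents(event_day):
    """Return every other day in the event's calendar month that falls on
    the same weekday as the event (the stratum is the calendar month)."""
    day = event_day.replace(day=1)  # walk the month from its first day
    referents = []
    while day.month == event_day.month:
        if day.weekday() == event_day.weekday() and day != event_day:
            referents.append(day)
        day += timedelta(days=1)
    return referents

event = date(2024, 7, 17)  # hypothetical case date, a Wednesday in July
controls = time_stratified_referents(event)
# controls holds the other Wednesdays in July 2024: the 3rd, 10th, 24th and 31st
```

Because the strata are fixed before looking at any case, the referents can fall before or after the event without the analyst choosing a spacing.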

**Sarah:** Okay. Let's run an example because all of this stays abstract until you put it on real data.

**Kiffer:** Yeah. Example one is Thomas and colleagues in 2006. They studied 92 waterborne disease outbreaks in Canada between 1975 and 2001. The hypothesis was that extreme rainfall and warm spring conditions might trigger outbreaks, presumably because heavy runoff washes pathogens into drinking water sources. For each outbreak, the six weeks immediately before onset served as the case-risk period. Control periods were time-stratified, six weeks long, matched to the case on month, day, and ecozone.

**Sarah:** What is an ecozone, for context?

**Kiffer:** An ecozone is one of the broad ecological regions that Canada uses to classify its land area. Things like the Pacific Maritime, the Boreal Plains, the Atlantic Maritime. They group regions with similar climate, vegetation, and hydrology. Matching on ecozone means a Pacific Maritime outbreak gets compared to other six-week windows within the Pacific Maritime, not to a window in the Arctic.

**Sarah:** And how was the data analyzed?

**Kiffer:** Conditional logistic regression, which is the standard tool for matched case-control data. The output identified warmer temperatures and extreme rainfall as plausible contributors. And the design is elegant because it avoids what is otherwise a perennial nightmare in waterborne disease epidemiology, which is finding control communities that did not have an outbreak. Each outbreak community is its own control across time.

**Sarah:** And example two is the Haegebaert salmonella study.

**Kiffer:** Right. Haegebaert and colleagues in 2003. They used a case-crossover design within a foodborne salmonella outbreak that affected mostly residents of long-term care institutions in France. Salmonella, just for definitional clarity, is a genus of bacteria that causes intestinal illness, often via contaminated meat, eggs, or produce. Food exposures during the three days before illness onset were compared with food exposures during a control period three days long, ending two days before the case-risk period.

**Sarah:** And why unidirectional?

**Kiffer:** Because the illness itself would change subsequent food intake. If you got sick on Friday, your eating pattern on Saturday is no longer a fair representation of normal exposure. So they only used earlier control periods. They computed Mantel-Haenszel matched-pair odds ratios for each meat product. Mantel-Haenszel is a stratified analysis method, developed by Nathan Mantel and William Haenszel in 1959, that combines odds ratios across strata. Here each case's risk period and control period form one matched pair, and each pair is its own stratum. The design avoided the otherwise difficult problem of selecting institutionalized controls whose food intake would have to be matched.
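For one risk window and one control window per case, the matched-pair odds ratio reduces to a ratio of discordant pairs. The sketch below shows that arithmetic. The counts are invented for illustration and are not from the Haegebaert study.

```python
def matched_pair_odds_ratio(pairs):
    """Matched-pair odds ratio for a 1:1 case-crossover comparison.

    pairs: one (exposed_in_risk_window, exposed_in_control_window) tuple of
    booleans per case. Concordant pairs carry no information, so the
    estimate is the ratio of the two kinds of discordant pairs.
    """
    pairs = list(pairs)
    risk_only = sum(1 for risk, ctrl in pairs if risk and not ctrl)
    control_only = sum(1 for risk, ctrl in pairs if ctrl and not risk)
    if control_only == 0:
        raise ValueError("no pairs exposed only in the control window")
    return risk_only / control_only

# Hypothetical data: 30 cases and one food item
example_pairs = (
    [(True, False)] * 12    # ate it only in the risk window
    + [(False, True)] * 4   # ate it only in the control window
    + [(True, True)] * 6    # ate it in both windows
    + [(False, False)] * 8  # ate it in neither window
)
pair_or = matched_pair_odds_ratio(example_pairs)  # 12 / 4 = 3.0
```

The 14 concordant cases drop out entirely, which is the within-person logic in numeric form: only people whose exposure differed between the two windows say anything about the trigger.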

**Sarah:** Okay. Now the second design in the time-based bucket. The self-controlled case-series. The lesson abbreviates it as SCCS, but spell it out for me first.

**Kiffer:** Self-controlled case-series. After this we'll just say SCCS. It's a close cousin of the case-crossover, but it generalizes the comparison from discrete control periods to all of an individual's observation time outside the risk window. It was developed by Paddy Farrington and Heather Whitaker in the United Kingdom in the nineteen-nineties for vaccine safety research.

**Sarah:** Why was vaccine safety the original use case?

**Kiffer:** Because finding a clean unvaccinated comparison group for childhood vaccines is essentially impossible. Vaccination coverage is high, kids who don't get vaccinated tend to differ systematically from those who do, and a between-person comparison is loaded with confounding. The within-person comparison side-steps all of that.

**Sarah:** Walk me through the structure of the design.

**Kiffer:** For each individual who has experienced the outcome of interest, an observation period is defined. That's a calendar window during which exposure history and event occurrence are tracked. Within the observation period, one or more risk periods are designated based on the biology of the exposure. For example, six to thirty-five days after vaccination for febrile reactions, because that captures the typical immunological response. All remaining time within the observation period constitutes the control period.

**Sarah:** And the analysis?

**Kiffer:** The analysis compares the rate of events during risk time with the rate during control time, after adjusting for the duration of each. The standard tool is conditional Poisson regression, where each case is its own stratum. The parameter of interest is the relative incidence, which is the event rate during risk time divided by the event rate during control time. This is sometimes called the incidence rate ratio.

**Sarah:** Spell out incidence rate ratio so a beginning student knows what we mean.

**Kiffer:** Sure. The incidence rate is the number of new events divided by the person-time at risk. The incidence rate ratio is one such rate divided by another, in this case the rate during the risk window divided by the rate during the control window. A ratio above one says the event happens more often during the risk window. A ratio of one says no association. Below one would be protective.
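As a quick arithmetic sketch of that definition, with made-up numbers: this is the crude pooled ratio only, not the conditional Poisson fit an SCCS analysis would actually use.

```python
def incidence_rate_ratio(events_risk, time_risk, events_control, time_control):
    """Rate in the risk window divided by rate in the control window.

    Each rate is events divided by person-time, so windows of different
    lengths are put on the same footing before they are compared.
    """
    rate_risk = events_risk / time_risk
    rate_control = events_control / time_control
    return rate_risk / rate_control

# Hypothetical totals: 30 events in 3,000 person-days of risk time,
# 100 events in 30,000 person-days of control time
irr = incidence_rate_ratio(30, 3_000, 100, 30_000)  # approximately 3
```

Note that the raw event count is higher in the control window here. The ratio lands above one only because the control window is ten times longer.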

**Sarah:** And what's the assumption that has to hold for the SCCS to be valid?

**Kiffer:** The big one is that the occurrence of the event must not alter the probability of future exposure. If a febrile reaction after a first vaccine dose causes parents to skip the booster, the exposure pattern is no longer independent of the outcome, and the relative incidence estimate gets biased. The standard fix is to ignore post-event exposures. Other assumptions. The event must not censor the observation period, so it isn't ideal for outcomes like death. Multiple recurrences must be conditionally independent. And the chosen risk window has to encompass the true biological risk period.

**Sarah:** And the practical example the lesson uses is Gribbin and colleagues from 2011.

**Kiffer:** Right. Gribbin and colleagues used United Kingdom primary-care databases to ask whether starting an antihypertensive medication transiently increased the risk of falls in adults aged sixty and older. Antihypertensives are blood-pressure-lowering drugs. They identified nine thousand eight hundred sixty-two falls between 2003 and 2006. After each prescription, the exposure period was subdivided into day zero, days one through twenty-one, and days twenty-two through sixty. Poisson regression yielded incidence rate ratios for each post-exposure period.
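One way to picture the subdivided exposure period is as a lookup from days since the prescription to a period label. The sketch below uses the three periods named above. The "baseline" label for all other time is an assumption added for illustration.

```python
def exposure_period(days_since_prescription):
    """Label a day by its position relative to a new prescription.

    The three post-prescription periods follow the transcript; labeling
    everything else "baseline" is a hypothetical choice for this sketch.
    """
    if days_since_prescription < 0:
        return "baseline"
    if days_since_prescription == 0:
        return "day 0"
    if days_since_prescription <= 21:
        return "days 1-21"
    if days_since_prescription <= 60:
        return "days 22-60"
    return "baseline"

labels = [exposure_period(d) for d in (-5, 0, 1, 21, 22, 60, 61)]
```

Each fall would then be counted in whichever period its date falls into, and the person-time in each period becomes the denominator for that period's rate.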

**Sarah:** And why is this question well-suited to an SCCS?

**Kiffer:** Because the comparison is within-person, every patient-level confounder that stays stable over the observation period is automatically controlled. Frailty. Polypharmacy. Comorbidity. Age. All of those make it hard to compare a patient who started an antihypertensive to a patient who didn't, and to the extent they don't shift much over the study window, they cancel out. The SCCS sidesteps the between-patient comparison entirely. Each patient contributes only their own time, before and after starting the drug.

**Sarah:** Okay. Bucket one done. Let's move to bucket two. Case-only comparison designs.

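**Kiffer:** Three designs in this bucket. Case-case, case-case-control, and case-only for gene-environment interactions. They share the same broad idea, which is that the informative contrast comes from the cases themselves. Case-case and case-only don't try to recruit a healthy control group at all, because a satisfactory control group is impractical or biased for the question at hand. Case-case-control does keep one, but uses it to anchor a comparison between two case series.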

**Sarah:** Let's start with case-case.

**Kiffer:** The case-case design compares two case groups drawn from the same surveillance system. McCarthy and Giesecke proposed it in 1999 as a way to identify risk factors that distinguish closely related disease subgroups using routine surveillance data. The classic example is salmonella subtyping. The cases are people infected with salmonella Typhimurium. The controls are people infected with salmonella Heidelberg. Both groups have salmonellosis, both are cases, but the design seeks to identify exposures that distinguish one serotype from the other.

**Sarah:** What's the conceptual win?

**Kiffer:** Two big wins. First, comparable selection experience. Both groups appear in the same surveillance system. Both have passed through similar diagnostic and reporting filters. Selection bias is minimised. Second, comparable recall experience. Both groups have just had an episode of gastrointestinal illness, so their motivation to recall recent food exposures is similar. That cuts down on differential recall bias.

**Sarah:** And what's the cost?

**Kiffer:** Two costs. The case-case design cannot identify shared risk factors. Anything that causes both serotypes equally, like eating any contaminated food, is invisible because it's present in both groups. And the odds ratio it produces is not a true risk measure. It reflects the relative difference in exposure between two subtypes, not the absolute risk of either disease. So if poultry consumption looks more strongly associated with Typhimurium than with Heidelberg, that does not mean eating poultry causes Typhimurium in absolute terms. It means poultry is more characteristic of the Typhimurium subgroup.

**Sarah:** Can you give me an example of that in practice?

**Kiffer:** Sure. Gillespie and colleagues in 2002 used population-based surveillance data from England and Wales to compare the exposure histories of people with Campylobacter coli infection, the rarer species, with those of people with Campylobacter jejuni infection. Campylobacter is the most common bacterial cause of foodborne diarrhoea in many high-income countries. Backward stepwise logistic regression identified differential risk factors. The authors noted explicitly that exposures common to both species would not be detected.

**Sarah:** Okay. Next design in the bucket. Case-case-control.

**Kiffer:** Case-case-control was developed by Kaye and colleagues in 2005, originally in the context of antibiotic-resistant infections. The motivating example was vancomycin-resistant Enterococcus, abbreviated VRE, versus vancomycin-susceptible Enterococcus, VSE. Enterococcus is a genus of bacteria that lives in the human gut. Vancomycin is a powerful antibiotic, often used as a last resort for serious infections.

**Sarah:** And the problem the design solves?

**Kiffer:** The problem is this. If you do a traditional case-control study comparing VRE cases to non-infected controls, you'll find that prior antibiotic use, prolonged hospitalisation, and ICU stay are strong risk factors. But those are also risk factors for VSE, and for any hospital-acquired infection. So you'd be telling the world about the causes of hospital infection in general, not what specifically drives the resistant phenotype.

**Sarah:** Why not just do a case-case design comparing VRE to VSE directly?

**Kiffer:** Because Kaye and colleagues argued that VRE often emerges from external sources, like transmission of an already-resistant strain, rather than from within-patient evolution of a susceptible strain. So contrasting VRE directly with VSE conflates the risk of acquiring a resistant strain with selection pressure on a susceptible one. The case-case-control solution uses two case series, resistant and susceptible, and one control series of people without infection. Two separate logistic regressions are fit, each comparing one case series with the controls.

**Sarah:** And the variables sort into three categories.

**Kiffer:** Three categories. Category A, variables only in the resistant model. These are unique to the resistant phenotype. In the original VRE example, exposure to vancomycin itself or to a roommate carrying VRE might fall in this category. Category B, variables only in the susceptible model. These are unique to the susceptible phenotype. Category C, variables in both models. These are risk factors for the target organism in general, regardless of resistance status. Hospitalisation, prior antibiotic exposure, severity of illness, indwelling catheters. They're real, but they don't help you distinguish resistance.
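The three-way sort is just set arithmetic on the variables retained in each model. A small sketch, where the variable names are hypothetical placeholders and not results from Kaye and colleagues:

```python
def sort_risk_factors(resistant_model, susceptible_model):
    """Sort variables by which of the two case-versus-control models kept them."""
    resistant = set(resistant_model)
    susceptible = set(susceptible_model)
    return {
        "A_resistant_only": sorted(resistant - susceptible),
        "B_susceptible_only": sorted(susceptible - resistant),
        "C_both": sorted(resistant & susceptible),
    }

# Hypothetical lists of variables retained in each logistic regression
resistant_vars = ["vancomycin_exposure", "roommate_with_vre", "icu_stay", "indwelling_catheter"]
susceptible_vars = ["icu_stay", "indwelling_catheter", "recent_surgery"]
categories = sort_risk_factors(resistant_vars, susceptible_vars)
```

In this made-up example, the ICU stay and the catheter land in category C, so they say something about Enterococcus infection in general and nothing specific about resistance.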

**Sarah:** And the third design in this bucket is the case-only design for gene-environment interactions.

**Kiffer:** Right. The case-only design uses only cases. No observed control group is recruited. The expected exposure distribution in the hypothetical control population is derived from theoretical or external sources. The design originated in genetic epidemiology, where the population frequency of common alleles can often be specified from external reference data.

**Sarah:** And the big restriction up front is what?

**Kiffer:** The case-only design cannot estimate main effects. It cannot tell you whether a gene independently raises the risk of disease, or whether an environmental exposure independently raises the risk. What it can estimate is interaction between two factors among cases, provided those two factors are independent of each other in the source population.

**Sarah:** I want to slow down on the intuition, because the case-only logic is genuinely strange the first time you see it.

**Kiffer:** Yeah. The intuition is this. Among cases of a disease, if a genetic risk factor and an environmental exposure are independent in the source population but they appear together in cases more often than you'd expect by chance, that excess co-occurrence is evidence of statistical interaction on the multiplicative scale. The two factors combined are doing more than each one would do alone. That's what interaction means.

**Sarah:** And the independence assumption is doing all the work. Why?

**Kiffer:** Because if the gene and the exposure were also causally associated in the source population, meaning carrying the gene affects whether you experience the exposure, then you'd see them co-occur in cases for that reason, not because of interaction. The whole logic depends on being able to assume that the joint distribution among cases would have been the product of the two marginals, except for whatever interaction effect is present.

**Sarah:** Which is why genetic variants and stable demographic traits are the typical effect modifiers.

**Kiffer:** Right. Genetic variants. Sex. Race. Age. They don't change over time and they're plausibly independent of short-term exposures. Armstrong in 2003 and Schwartz in 2005 extended the design beyond genetics, using sex, race, age, and socioeconomic class as effect modifiers of weather-related mortality. Because age and sex are reasonably independent of daily weather, the case-only approach yields a valid interaction estimate.

**Sarah:** Walk me through the Schwartz example, because temperature mortality is intuitive.

**Kiffer:** Schwartz in 2005 investigated whether sex, non-white race, or age over eighty-five modified the effect of extreme temperatures on mortality in Wayne County, Michigan. Wayne County contains Detroit. Weather data identified excessively hot and cold days. The analysis fitted separate logistic regressions for heat and for cold, using each demographic covariate as the outcome and the extreme-weather indicator as the predictor. All three covariates emerged as effect modifiers.

**Sarah:** Wait. The covariate is the outcome of the regression? That sounds backwards.

**Kiffer:** Yeah, this is the part that breaks people's brains. The case-only logistic regression takes the form, the log odds of being female, given you're a case, equals an intercept plus a coefficient times the extreme-heat indicator. A significant coefficient means the proportion of female cases versus male cases differs between heat-exposure days and other days. That's mathematically equivalent to a Poisson model of mortality count as a function of heat, sex, and a heat-by-sex interaction term. The case-only regression coefficient is the interaction term. We never need a control group because the control expectation is built into the assumed independence of sex and heat in the source population.
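A small numeric check of that claim. All numbers below are invented. The sketch builds expected death counts from a multiplicative rate model with a known heat-by-sex interaction, assumes sex is independent of heat in the source population, and then computes the odds ratio among cases only.

```python
def case_only_odds_ratio(cases):
    """Odds ratio between modifier and exposure among cases only.

    cases maps (female, heat) -> number of cases, with 0/1 coding.
    """
    return (cases[(1, 1)] * cases[(0, 0)]) / (cases[(1, 0)] * cases[(0, 1)])

# Hypothetical source population and rate model
p_female = 0.5               # same share on hot and ordinary days (independence)
days = {0: 900, 1: 100}      # ordinary days, extreme-heat days
base_rate = 10.0             # baseline deaths per day, arbitrary scale
rr_heat, rr_female, interaction = 1.5, 0.8, 1.2

def expected_cases(female, heat):
    share = p_female if female else 1 - p_female
    rate = base_rate * rr_heat**heat * rr_female**female * interaction**(female * heat)
    return days[heat] * share * rate

cases = {(f, h): expected_cases(f, h) for f in (0, 1) for h in (0, 1)}
estimate = case_only_odds_ratio(cases)  # recovers the interaction, about 1.2
```

The main effects of heat and sex cancel in the cross-product, and so do the day counts and the population share. Only the interaction term survives, which is why the design can estimate interaction but not main effects.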

**Sarah:** Why is this design especially appealing for studying mortality?

**Kiffer:** Because building a comparable control group of people who did not die is conceptually awkward when daily death registry data are already complete. You'd have to define survivors, sample them, match them. The case-only design just uses the death records and asks whether the demographic profile of who dies on a hot day differs from who dies on an ordinary day. That's enough.

**Sarah:** Okay. Bucket two done. Let's go to bucket three. Case-cohort and two-stage sampling.

**Kiffer:** Both designs in this bucket are about doing a cohort-style study while only paying for expensive measurements on a fraction of the participants. They make biobank-scale studies practical.

**Sarah:** Start with case-cohort. Spell out the structure.

**Kiffer:** From a defined source cohort, the investigator draws a random sample called the subcohort at the start of follow-up. Detailed exposure and covariate data are obtained on the subcohort. As follow-up proceeds, all incident cases that arise from the full source cohort, whether or not they happen to fall within the subcohort, are also studied. Detailed measurements are then obtained on those cases. The analysis compares the subcohort to the cases.

**Sarah:** And the win is that you don't have to measure every single person in the full source cohort.

**Kiffer:** Right. The full source cohort is followed for case identification. That can be cheap, especially if you use registries or administrative databases for case ascertainment. But the expensive biomarker assays, the questionnaires, the genotyping, those are done only on the subcohort plus the cases. Not on everyone in the full source cohort. That's a huge cost saving.

**Sarah:** And the lesson highlights one feature that's especially attractive.

**Kiffer:** The big one. A single subcohort can serve as the comparison group for multiple outcomes. If your big cohort is being followed for cardiovascular disease, several cancers, and diabetes, you only need one set of biomarker measurements on the subcohort. Each outcome study then adds detailed measurements on its own cases. A nested case-control study, by contrast, requires fresh control selection for each outcome.

**Sarah:** And there are two analytic flavors. Risk-based and rate-based. Walk me through both.

**Kiffer:** Risk-based case-cohort is for closed cohorts. A fixed group followed for a defined period. Stable exposures over follow-up. The subcohort is sampled by simple or stratified random sampling at the start of follow-up. Cases arising outside the subcohort during follow-up are added. The two case groups, those in and those outside the subcohort, are combined and analysed in a familiar two-by-two case-control format using logistic regression. The odds ratio approximates the relative risk when the disease is rare.

**Sarah:** And rate-based?

**Kiffer:** Rate-based case-cohort is for open cohorts, where entries and exits are possible during follow-up, or where exposures change over time. At the moment a case occurs, eligible members of the subcohort are those who haven't yet experienced the outcome. Their current exposure status, which may have been updated through repeated surveys or stored serial samples, is recorded. The standard analytic tool is a weighted Cox proportional-hazards model. The weights account for the sampling fraction. If the subcohort represents twenty percent of the source cohort, controls are typically up-weighted by five.
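The up-weighting arithmetic can be sketched as inverse sampling probabilities. This is a simplified convention for illustration. Published case-cohort weighting schemes differ in how they treat subcohort members who later become cases, and the function name is hypothetical.

```python
def case_cohort_weight(is_case, in_subcohort, sampling_fraction):
    """Simplified inverse-probability weight for a case-cohort analysis."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    if is_case:
        return 1.0  # every case is studied, so each represents only itself
    if in_subcohort:
        return 1.0 / sampling_fraction  # stands in for unsampled non-cases
    return 0.0  # non-cases outside the subcohort have no detailed data

control_weight = case_cohort_weight(False, True, 0.2)  # 1 / 0.2, i.e. five
case_weight = case_cohort_weight(True, False, 0.2)
```

With a twenty percent subcohort, each sampled non-case counts for about five cohort members, matching the figure in the episode.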

**Sarah:** What's a Cox proportional-hazards model?

**Kiffer:** It's the workhorse method for survival analysis. It models the hazard of an event over time as a baseline hazard multiplied by exponential terms for the covariates. The proportional-hazards assumption means the ratio of hazards between exposed and unexposed is constant over time. Sir David Cox proposed it in 1972. In a case-cohort context, we weight individuals by the inverse of their sampling probability so that the analysis recovers the population-level rate ratio.
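Written out, the model described above looks like the following. The symbols are notation chosen for this note: h_0(t) is the baseline hazard, x_1 through x_p are covariates, and the hazard ratio shown is for a binary covariate x_j with the other covariates held fixed.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p),
\qquad
\mathrm{HR}_j = \frac{h(t \mid x_j = 1)}{h(t \mid x_j = 0)} = e^{\beta_j}
```

The baseline hazard cancels in the ratio, so the hazard ratio does not depend on t. That is the proportional-hazards assumption in symbols.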

**Sarah:** And the lesson example is the Auvinen radon study.

**Kiffer:** Right. Auvinen and colleagues in 2005 studied radon and other radionuclides in drinking water and the risk of stomach cancer in a Finnish population of over one hundred forty-four thousand people who drew water from drilled wells between 1967 and 1980. Radon is a naturally occurring radioactive gas. Drilled wells in Finland often have elevated radon because of the local granite. The effective subcohort was three hundred seventy-one. Stomach cancer cases, one hundred seven, were identified through the cancer registry. A proportional-hazards model accounted for how long each subject had been exposed. All hazard ratios were below one, suggesting a protective association.

**Sarah:** Protective as in actually protective, or as in something else is going on?

**Kiffer:** Probably the latter. Most of the literature on radon and lung cancer shows clear harm, so a protective stomach-cancer signal is more likely a sign of confounding by some unmeasured aspect of well-water diet or geography than a real biological effect. But the methodological point of the example is that the case-cohort design made this question tractable on stored samples, with registry-based case finding, at a fraction of what a full cohort assay would have cost.

**Sarah:** Okay. Now two-stage sampling. The lesson calls it a strategy that can be layered onto any traditional design.

**Kiffer:** Right. Two-stage, sometimes called two-phase, is not a free-standing study type. It's a sampling strategy you can add to a cross-sectional, cohort, or case-control study. Stage one collects readily available, inexpensive data on a large group. Stage two collects more detailed, more expensive data on a strategically selected subsample.

**Sarah:** Let me anchor it with the kind of question it solves. The lesson has a nice example.

**Kiffer:** Yeah. Suppose you want to study whether occupational solvent exposure increases birth-defect risk. Solvents are industrial chemicals like toluene used as cleaners and degreasers. Hospital records can give you basic information on hundreds of thousands of pregnancies cheaply. But a detailed occupational exposure assessment requires a one-hour interview at two hundred dollars per participant. You can't afford that on every pregnancy. So you use stage one for the cheap step on everyone, stage two for the expensive step only where it matters most.

**Sarah:** And the lesson lists three common use cases.

**Kiffer:** Use case one, expensive covariate or exposure measurement. Stage one uses an inexpensive surrogate, like a job title from a registry. Stage two performs a detailed work-up on a subsample. This is the most common application. Use case two, validation substudy. Stage two applies a near-gold-standard measurement to a subsample, and you correct the inferences from the full stage-one data. Use case three, handling missing covariate data. The missing-data subjects become the explicit target of stage two.

**Sarah:** And the key design question is how to sample at stage two.

**Kiffer:** Right. If your stage-one design is a cohort, sample fixed numbers of exposed and unexposed at stage two. If your stage-one design is a case-control, sample fixed numbers of cases and controls. Whenever a surrogate exposure is available at stage one, the most efficient strategy is approximately equal numbers from each of the four exposure-by-disease cells. That allocates information where it matters and prevents small cells from being underrepresented.

**Sarah:** Run me through the worked example in the lesson, because the four-cell logic gets concrete fast.

**Kiffer:** Sure. Suppose your stage-one data come from a hospital registry of fifty thousand pregnancies. From the registry you can identify two thousand birth-defect cases and forty-eight thousand non-cases. A crude exposure indicator, whether the mother held a job classified as industrial, is available for everyone. The cross-classification looks like this. Industrial-exposed cases, one hundred twenty. Other-exposure cases, one thousand eight hundred eighty. Industrial-exposed non-cases, one thousand five hundred. Other-exposure non-cases, forty-six thousand five hundred.

**Sarah:** And stage two with a budget of four hundred interviews?

**Kiffer:** Roughly balanced across the four cells. Take all one hundred twenty industrial-exposed cases. Sample one hundred from the other-exposure cases. Sample one hundred from the industrial-exposed non-cases. Sample eighty from the other-exposure non-cases. Because we've oversampled the small cells, weighting must be applied in analysis to recover correct estimates. The Cain and Breslow methodology from 1988 and the Flanders and Greenland refinement from 1991 handle this. Hanley and colleagues in 2005 give worked examples.
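
To see why the weighting matters, here is a hedged sketch of the sampling fractions implied by that allocation. It uses simple inverse-sampling-fraction weights as an illustration and does not reproduce the cited methods. The variable names are hypothetical, and applying the two-hundred-dollar interview cost from the solvent example to this registry is an assumption.

```python
# Stage-one registry counts and stage-two interview counts, as stated in the episode.
stage_one = {
    ("industrial", "case"): 120,
    ("other", "case"): 1_880,
    ("industrial", "non_case"): 1_500,
    ("other", "non_case"): 46_500,
}
stage_two = {
    ("industrial", "case"): 120,
    ("other", "case"): 100,
    ("industrial", "non_case"): 100,
    ("other", "non_case"): 80,
}

budget = 400          # interviews available at stage two
interview_cost = 200  # dollars per interview (assumed to carry over from the solvent example)
assert sum(stage_two.values()) == budget

# Sampling fraction per cell, and its inverse as an illustrative analysis weight.
sampling_fraction = {cell: stage_two[cell] / stage_one[cell] for cell in stage_one}
weight = {cell: stage_one[cell] / stage_two[cell] for cell in stage_one}

# Weighted stage-two counts should add back up to the full registry.
weighted_total = sum(stage_two[cell] * weight[cell] for cell in stage_one)
stage_two_cost = budget * interview_cost

for cell in stage_one:
    print(cell, f"fraction={sampling_fraction[cell]:.4f}", f"weight={weight[cell]:.2f}")
print("weighted total:", round(weighted_total))
print("stage-two interview cost:", stage_two_cost)
```

Each interviewed mother in the other-exposure non-case cell stands in for roughly 581 registry pregnancies, while each industrial-exposed case represents only herself. The variance calculation that accounts for both stages of sampling is not shown here.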

**Sarah:** And the lesson has a real two-stage example. Asthma in Quebec children.

**Kiffer:** Right. Martel and colleagues in 2009 used a two-stage design with three linked Quebec administrative health databases. Stage one was a nested case-control within a cohort of pregnant women and their children. Five thousand two hundred twenty-six asthmatic children, with twenty non-asthmatic children per case, selected using density sampling. Stage two was a mailed questionnaire to a subsample of mothers, balanced across the cells of the stage-one cross-table. Conditional logistic regression at stage one. Unconditional logistic regression with weighting at stage two.

**Sarah:** And the design exploits the cheap administrative data to identify cases and screen on rough covariates, while reserving the expensive questionnaire for the subset where new information will most improve the estimate.

**Kiffer:** Exactly. That's the whole pitch. Let each piece of the design do what it does most cheaply.

**Sarah:** Okay. Pulling the takeaways. Let me list them.

**Kiffer:** Yeah, give me six big ones and let's go in order.

**Sarah:** First takeaway. Hybrid designs are not a separate world. They are problem-solving variants of the four standard designs. Each one was invented to solve a specific challenge with the standard four. Knowing the problem each one solves is the key to remembering when to use it.

**Kiffer:** Second takeaway. The case-crossover design uses each case as their own control across time. It is appropriate for transient exposures and acute outcomes. The three referent strategies, unidirectional, symmetric bidirectional, and time-stratified, are not interchangeable. The choice depends on whether the event itself alters subsequent exposure and on whether time trends in exposure are likely. And the within-person comparison automatically controls every time-invariant confounder, even unmeasured ones.

**Sarah:** Third takeaway. The self-controlled case-series, SCCS, generalizes the case-crossover by using all observation time outside the risk window as the control period. Originally developed for vaccine safety. Conditional Poisson regression yields the relative incidence, which is the rate during risk time over the rate during control time. The key assumption is that the event must not alter future exposure.
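
As a rough sketch of that ratio, with symbols that are assumptions of these notes and not the lesson's notation: let $n_{\text{risk}}$ and $n_{\text{ctrl}}$ be the event counts during risk and control time, and $T_{\text{risk}}$ and $T_{\text{ctrl}}$ the corresponding amounts of observation time.

```latex
\mathrm{RI} \;=\; \frac{n_{\text{risk}} / T_{\text{risk}}}{n_{\text{ctrl}} / T_{\text{ctrl}}}
```

Conditional Poisson regression estimates this quantity within each person, conditioning on that person's total number of events. The pooled ratio is only the intuition, not the estimator itself.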

**Kiffer:** Fourth takeaway. The case-only comparison designs accept the absence of a healthy control group in exchange for cleaner selection and recall comparisons. Case-case studies identify differential risk factors between disease subtypes but cannot detect shared risk factors. Case-case-control studies use two case series and one shared control series, sorting risk factors into category A, unique to resistance, B, unique to susceptibility, and C, shared by the organism. Case-only designs estimate gene-environment interaction without controls, but only when gene and environment are independent in the source population.

**Sarah:** Fifth takeaway. Case-cohort studies sample a subcohort at the start of follow-up and add all incident cases. One subcohort can support studies of multiple outcomes, which makes the design particularly attractive when expensive biomarker assays from stored specimens are involved. Risk-based analysis uses logistic regression on a combined two-by-two table. Rate-based analysis uses a weighted Cox model with weights inversely proportional to sampling probability.

**Kiffer:** Sixth takeaway. Two-stage sampling is not a freestanding design. It's a strategy you layer onto cross-sectional, cohort, or case-control studies. Stage one collects cheap, broadly available data. Stage two collects detailed expensive data on a strategically chosen subsample. For a binary surrogate exposure at stage one, the most efficient stage-two sampling allocates approximately equal numbers across the four cells of the exposure-by-disease table. And the analysis must use weighting and a variance formula that accounts for both stages of sampling.

**Sarah:** And the practical recommendation for this lesson. When you encounter a paper using one of these designs, ask the problem-first question. What was wrong with the standard four designs that pushed the authors to choose this one? If you can articulate that, you've understood the design.

**Kiffer:** And one connection back to earlier lessons worth flagging. Lesson 8 made the point that every observational design produces a particular measure of association, and that measure carries assumptions about how exposure was assigned. The hybrid designs in Lesson 9 don't escape that framework. They tighten or relax assumptions to fit a difficult question. The case-crossover relaxes the requirement to find external controls but tightens the requirement that the exposure be transient. The case-cohort relaxes the requirement to measure everyone but tightens analytic complexity. There's no free lunch.

**Sarah:** Next up is Lesson 10, Controlled Studies. That's where we cross over from observational to experimental designs and start asking what randomization buys us that none of these hybrids can.

**Kiffer:** Take care, everyone, and we'll see you in the next one.

**Sarah:** See you there. Thanks for listening.
