HSCI 341 — Lesson 9

Hybrid Study Designs

Fundamental Epidemiological Concepts and Approaches

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Describe the key features of six hybrid study designs (case-crossover, case-series, case-case, case-only, case-cohort, and case-case-control)
  • Identify source population characteristics, exposures, and outcomes for which each hybrid design is appropriate
  • Explain the logic of using cases as their own controls across time
  • Distinguish between unidirectional and bidirectional referent selection in case-crossover studies
  • Describe two-stage sampling designs and explain when they enhance the efficiency of cross-sectional, cohort, and case-control studies
  • Design the basic sampling strategy for a specific two-stage case-control study
  • Apply hybrid design concepts to select appropriate designs for research questions involving rare exposures, transient triggers, or expensive covariates

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary — Key Terms, People & Concepts


This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas
Hybrid Study Design: A study design that combines features of two or more standard observational designs (cohort, case-control, cross-sectional, ecologic) to overcome limitations such as rare exposures, transient triggers, or expensive covariates.
Source Population: The population from which cases (and controls or subcohort members) arise. Defining the source population correctly is essential for valid sampling in any hybrid design.
Referent (Control) Period: In case-crossover and self-controlled designs, a window of time used as the within-person comparison for the hazard period when the outcome occurred.
Hazard (Case) Period: The time window immediately preceding the outcome event during which exposure could plausibly have triggered the event.
Transient Exposure: An exposure with a short biologically plausible induction window (e.g., heavy meal, anger, air-pollution spike); such exposures are ideal targets for case-crossover analysis.
Unidirectional Referent Selection: Choosing referent windows only from time periods before the event. Protects against reverse causation but can be vulnerable to time-trend bias.
Bidirectional Referent Selection: Choosing referent windows from before and after the event. Controls for time-trend confounding but assumes the participant survives and remains exchangeable.
Subcohort: In case-cohort studies, a random sample drawn from the full cohort at baseline that serves as the comparison group for all cases that arise; it is reusable across multiple outcomes.
Two-Stage Sampling: A design in which an inexpensive measure (e.g., screening questionnaire) is collected on everyone in stage 1, and a more expensive measure (e.g., biomarker) is collected only on a stratified subsample in stage 2.
Cases as Their Own Controls: The unifying logic of case-crossover, self-controlled case-series, and case-time-control designs. Each person contributes both case and control time, automatically controlling for fixed individual characteristics.
Time-Trend Bias: Bias arising when exposure prevalence changes over calendar time, distorting unidirectional case-crossover estimates. Case-time-control designs adjust for this trend using a separate control group.
Rare Exposure: An exposure with low population prevalence; standard cohort studies require very large samples to capture enough exposed cases. Two-stage and case-cohort designs are useful alternatives.
Methods, Measures & Designs
Case-Crossover Design: A within-person design in which exposure during the hazard window before each case event is compared to exposure during one or more referent windows in the same person. Requires transient exposures and acute outcomes.
Self-Controlled Case-Series (SCCS): A design that uses only people who experience the outcome and compares incidence rates during exposed vs unexposed person-time within the same individual. Common in vaccine safety research.
Case-Time-Control Design: An extension of case-crossover that adds a control group of non-cases to estimate and adjust for secular time trends in exposure prevalence.
Case-Case Design: Compares two subtypes of cases (e.g., different pathogens, different outcomes) to identify exposures that distinguish them; common in outbreak surveillance.
Case-Cohort Design: A nested design in which all cases are compared to a random subcohort sampled from the full cohort at baseline. Efficient when biomarkers or genetic data are expensive and multiple outcomes are of interest.
Nested Case-Control Design: Cases that arise within a defined cohort are matched to a sample of controls drawn from cohort members still at risk at the time of the case event (incidence-density sampling).
Case-Case-Control Design: A three-arm design that compares two case subtypes to a shared control group, allowing simultaneous evaluation of distinguishing and shared risk factors.
Incidence-Density Sampling: Sampling controls at the time each case occurs from those still at risk; produces unbiased rate-ratio estimates without rare-disease assumptions.
Conditional Logistic Regression: The standard analytic method for matched case-control and case-crossover data. It conditions out matched-set or person-level fixed effects.
Section 1

Introduction & Time-Based Case-Only Designs

⏱ Estimated reading time: 18 minutes

Introduction and Overview

Lesson 8 consolidated the four standard observational designs. This lesson introduces the variants that combine or extend them. Hybrid designs answer specific limitations of the standard four: rare or expensive exposures, transient triggers (Suissa, 1995), and surveillance data without obvious controls. The three content sections move from time-based case-only designs (Section 1: case-crossover and self-controlled case-series), through case-only comparison designs (Section 2), to case-cohort and two-stage sampling designs (Section 3), which subsample from larger cohorts to make biomarker-heavy studies affordable.

Learning Objectives

  • Understand why hybrid study designs were developed and how they fit alongside the traditional cohort, case-control, and cross-sectional designs.
  • Describe the design logic of case-crossover studies and identify when they are appropriate.
  • Distinguish between unidirectional and bidirectional referent selection strategies.
  • Describe the self-controlled case-series design and recognise the contexts in which it is most useful.

What Are Hybrid Study Designs?

By now you are familiar with the classic observational designs: cohort, case-control, and cross-sectional studies. Hybrid designs are variants of these classic designs that have been developed to address particular methodological challenges — such as expensive covariates, rare exposures, transient triggers, or surveillance data where traditional control selection is problematic.

This lesson covers six hybrid designs plus one important sampling strategy. Four of the hybrid designs use only cases (no separate control group), while two use a control series. The two-stage sampling design, by contrast, is a strategy that can be layered onto any of the traditional designs to enhance efficiency.

Why a Family of Hybrid Designs?

Each hybrid design solves a specific problem. Case-crossover studies eliminate the difficulty of choosing controls for transient exposures. Case-cohort studies allow one comparison group to support the study of multiple outcomes. Case-only studies allow inferences about gene-environment interactions when a control group is impractical. Two-stage designs let researchers spend money on detailed measurement only where it matters most. Knowing the “problem” each design was created to solve makes it much easier to remember when to use it.

The Six Hybrid Designs at a Glance

  • Case-Crossover
  • Self-Controlled Case-Series
  • Case-Case
  • Case-Case-Control
  • Case-Cohort
  • Case-Only

Case-Crossover Studies

The case-crossover study is the observational analogue of the experimental crossover design. Each case serves as its own control by contrasting exposure during a defined time window before the event with exposure during one or more comparison time windows.

Maclure (1991) introduced the design to answer the “why now” question, in contrast to the “why me” question answered by traditional case-control studies. By using the same person as both case and control, the design automatically controls for all time-invariant confounders — including ones the investigator never measured or even thought of. Maclure and Mittleman (2000) review a decade of applications.

When Is a Case-Crossover Design Appropriate?

Three Conditions Must Hold

1. The exposure must be transient. Stable exposures (such as smoking status or chronic medication use) cannot be evaluated because they would be present in all time windows.

2. The outcome must be acute. The event must happen close in time to the exposure if a causal relationship exists. Diseases with long induction periods are unsuitable.

3. The exposure must not be affected by the outcome. If experiencing the event changes future exposure (e.g., a heart attack alters subsequent activity), bidirectional control selection is problematic.

Defining the Risk Period and Control Period

Two design choices drive the validity of a case-crossover study: the length of the risk period (sometimes called the case-risk window) and the strategy for selecting control periods (sometimes called referent periods).

The risk period is the time during which the exposure, if causal, would have produced the event. Choosing a risk period that is too long increases the chance of detecting spurious associations; too short, and real associations may be missed. For physical exertion and myocardial infarction the risk window might be a few hours; for mobile phone use and motor vehicle crashes it might be five minutes; for air pollution effects on respiratory hospitalisations it is typically one day.

[Diagram: a timeline (Time →) showing an earlier control period, the risk period just before the event, and a later control period. Title: Symmetric Bidirectional Design.]

Figure 11.1 — A symmetric bidirectional case-crossover design. One control window is selected before the event and one after, balancing potential time trends in exposure.

Strategies for Selecting Control Periods

Unidirectional (Backward) Referent Selection

Control periods are chosen only from time before the event. This was the original case-crossover approach. It is the appropriate choice when the event itself alters future exposure — for example, a leg injury changes subsequent training distance, or food poisoning alters what someone eats afterward.

Limitation: If exposure prevalence changes over time (a long-term trend), comparing only earlier control periods with the case-risk period can produce biased estimates.

Symmetric Bidirectional Referent Selection

Control periods are selected both before and after the case event, often equally spaced. The intent is that, if exposure is trending, the higher and lower exposure values from the two flanking control periods will roughly cancel out. This is now the most widely used approach.

Limitation: Cases that occur very early or very late in the study period may have only one control period feasible. Bidirectional selection is only valid if the event itself does not affect future exposure.
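The two spacing-based strategies above can be sketched in a few lines of base R. This is a minimal illustration only: the helper name referent_dates, the 7-day spacing, and the study dates are hypothetical choices, not values from the lesson. It also shows the limitation just described, where a case near the end of the study period keeps only one feasible control period.

```r
# Hypothetical helper: pick referent (control) dates for one case event.
referent_dates <- function(event_date, lag_days = 7,
                           direction = c("bidirectional", "unidirectional"),
                           study_start, study_end) {
  direction  <- match.arg(direction)
  event_date <- as.Date(event_date)

  # Unidirectional: only the earlier control period
  candidates <- event_date - lag_days

  # Symmetric bidirectional: add the equally spaced later control period
  if (direction == "bidirectional") {
    candidates <- c(candidates, event_date + lag_days)
  }

  # Drop referents that fall outside the study period
  keep <- candidates >= as.Date(study_start) & candidates <= as.Date(study_end)
  candidates[keep]
}

# A mid-study case keeps both flanking control periods
mid_case <- referent_dates("2024-03-15", 7, "bidirectional",
                           "2024-01-01", "2024-12-31")

# A case near the end of the study keeps only the earlier one
late_case <- referent_dates("2024-12-28", 7, "bidirectional",
                            "2024-01-01", "2024-12-31")

# Backward-only selection, for when the event alters later exposure
uni_case <- referent_dates("2024-03-15", 7, "unidirectional",
                           "2024-01-01", "2024-12-31")

print(mid_case)
print(late_case)
print(uni_case)
```

In a real study the spacing would be chosen to avoid autocorrelated exposure and overlap with the risk period; the fixed 7-day lag here is only a placeholder.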

Time-Stratified Referent Selection

Janes, Sheppard, & Lumley (2005) proposed this method when shared-exposure data (such as daily air pollution measurements) are available across the entire observation period. The study period is stratified a priori (e.g., by month). When a case occurs — say, on a Wednesday in July — all the other Wednesdays in July serve as control periods. This effectively matches on day-of-week and month and avoids the need to specify a single lag time.

Advantage: Eliminates the controversy over how to choose the spacing between case and control periods. Naturally accommodates shared exposure data.
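As a rough sketch of the time-stratified rule described above, the base R function below returns every other day in the same calendar month that falls on the same weekday as the event. The function name time_stratified_referents and the example date are assumptions made for illustration; month-by-weekday strata are one common choice, not the only one.

```r
# Hypothetical helper: time-stratified referents for one case date,
# using calendar month as the stratum and matching on day of week.
time_stratified_referents <- function(event_date) {
  event_date <- as.Date(event_date)

  # All days in the same calendar month and year as the event
  month_start <- as.Date(format(event_date, "%Y-%m-01"))
  next_month  <- seq(month_start, by = "month", length.out = 2)[2]
  days        <- seq(month_start, next_month - 1, by = "day")

  # "%u" gives the ISO weekday number (1 = Monday, 7 = Sunday),
  # which does not depend on the locale's weekday names
  same_weekday <- format(days, "%u") == format(event_date, "%u")

  # Referents are the matching days other than the event day itself
  days[same_weekday & days != event_date]
}

# A case on Wednesday 17 July 2024: the other Wednesdays in July 2024
refs <- time_stratified_referents("2024-07-17")
print(refs)
```

Because the strata are fixed before any case is observed, referents can fall both before and after the event without the investigator choosing a lag.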

Example 11.1: Weather Events and Waterborne Disease Outbreaks

Thomas and colleagues (2006) studied 92 waterborne disease outbreaks in Canada between 1975 and 2001. They hypothesised that extreme rainfall and warm spring conditions might trigger outbreaks. For each outbreak, the six weeks immediately before onset served as the case-risk period. The 27-year period was stratified into six time windows, and within each non-case window a six-week control period was selected, matched to the case on month, day, and ecozone. Conditional logistic regression identified warmer temperatures and extreme rainfall as plausible contributors.

Notice how the design eliminates the need to find “control communities” that did not have an outbreak — a perennial difficulty in waterborne disease epidemiology. Each outbreak community is its own control.

Example 11.2: Salmonella Outbreak in Long-Term Care

Haegebaert and colleagues (2003) used a case-crossover design within a foodborne Salmonella outbreak that affected mostly residents of chronic-care institutions. Food exposures during the three days before illness onset were compared with food exposures during a control period three days long, ending two days before the case-risk period. Because the illness itself would change subsequent food intake, only earlier (unidirectional) control periods were used. Mantel-Haenszel matched-pair odds ratios were calculated for each meat product. Notice that the design avoided the difficult problem of selecting institutionalised “controls” whose food intake would otherwise have to be matched.

Analysis of Case-Crossover Data

Because each case is matched to one or more control periods within the same individual, the data are analysed as if from a matched case-control study. With one control period per case, the data fit a 2×2 table and McNemar's test applies. With multiple control periods, conditional logistic regression is the standard approach, and the exponentiated coefficient is the odds ratio for the event associated with a one-unit short-term increase in exposure.

When daily exposure data are available for the entire observation period (the “shared exposure” setting common in air pollution studies), the data can equivalently be analysed as a Poisson time series. The two analytical frameworks are mathematically linked when full-stratum bidirectional referents are used.
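For the simplest situation described above (one control period per case), the matched-pair analysis can be sketched in base R. The counts below are hypothetical and chosen only to make the arithmetic easy to follow. Only the discordant cases, those exposed in one period but not the other, contribute to the odds ratio.

```r
# Hypothetical counts: each case is classified by exposure in the
# risk period (rows) and in its single control period (columns).
pairs <- matrix(c(12, 30,
                  10, 48),
                nrow = 2, byrow = TRUE,
                dimnames = list(risk_period    = c("exposed", "unexposed"),
                                control_period = c("exposed", "unexposed")))

# Discordant cases carry all the information in a matched analysis
n_risk_only    <- pairs["exposed", "unexposed"]   # exposed in risk period only
n_control_only <- pairs["unexposed", "exposed"]   # exposed in control period only

# Matched-pair odds ratio: ratio of the two discordant counts
matched_or <- n_risk_only / n_control_only

# McNemar's test of whether the discordant counts differ
mcnemar_result <- mcnemar.test(pairs)

print(matched_or)
print(mcnemar_result)
```

With several control periods per case, the same question would usually be handled with conditional logistic regression, stratified on the individual, as the text describes.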

Self-Controlled Case-Series Studies

The self-controlled case-series design (often shortened to “case-series” in this literature, but distinct from the descriptive case-series of clinical reports) was developed by Farrington (1995) largely for vaccine safety research. It is a close cousin of the case-crossover design but generalises the comparison from discrete control periods to all of an individual's observation time outside the risk window. Whitaker, Farrington, Spiessens, & Musonda (2006) provide an accessible tutorial.

The Logic of the Design

For each individual who has experienced the outcome of interest, an observation period is defined — a calendar window during which exposure history and event occurrence are tracked. Within that observation period, one or more risk periods are designated based on the biology of the exposure (e.g., 6–35 days after vaccination for febrile conditions). All remaining time within the observation period constitutes the control period.

The analysis compares the rate of events during risk time with the rate during control time, after adjusting for the duration of each. As with the case-crossover design, this is a within-person comparison: every time-invariant characteristic of the case (genetics, sex, baseline health) is automatically controlled by design. Age and season can be adjusted for analytically because they vary across the observation period.

[Diagram: an observation period containing a risk period after vaccine 1 and another after vaccine 2, with example events falling in risk time and in control time. Title: Self-Controlled Case-Series Structure.]

Figure 11.2 — The observation period for a single case is partitioned into risk periods (after each exposure) and control periods (everything else). The number of events and the duration of each period type drive the relative incidence estimate.

R Self-controlled case-series in conditional Poisson regression

The SCCS is just a conditional Poisson model where each case acts as its own stratum. Below: build a long-format dataset where each case contributes one row of risk-period person-time and one row of control person-time, then fit the model with gnm.

# install.packages(c("gnm", "SCCS"))
library(gnm)

# Long-format SCCS data: 4 cases, each with one risk row and one control row
sccs <- data.frame(
  case   = rep(1:4, each = 2),
  # "control" is listed first so that it is the reference level
  period = factor(rep(c("risk", "control"), 4), levels = c("control", "risk")),
  events = c(1, 0,  1, 0,  2, 1,  0, 1),           # events per interval
  pt     = c(30, 335, 28, 337, 30, 335, 29, 336)   # person-time per interval
)

# Conditional Poisson: eliminate = factor(case) gives each case its own intercept
fit <- gnm(events ~ period,
           offset = log(pt),
           family = poisson,
           eliminate = factor(case),
           data = sccs)
exp(coef(fit))                    # incidence rate ratio (risk vs. control)
exp(confint.default(fit))         # Wald 95% confidence interval

Why this works. All time-invariant features of each case (sex, genes, baseline frailty) drop out because we condition on the case ID. What remains is the risk-vs-control rate ratio — exactly the design's quantity of interest. The dedicated SCCS package wraps this with helpful diagnostics.

R Reflect on what you just ran

Use the questions below to interpret the actual numbers you produced. Look at your console output before answering.

1. exp(coef(fit)) returned the incidence rate ratio (IRR) for risk vs. control periods. What value did you get, and in one sentence what does it mean for a case during their risk window?

Model answer: exp(coef(fit)) returns an IRR of roughly 23 for the risk vs. control period. The data are fixed rather than simulated, so the value does not depend on a seed, and it sits close to the crude ratio (4 events / 117 units of risk time) ÷ (2 events / 1,343 units of control time) ≈ 23. Interpretation: during a case's risk window, their rate of the event is roughly 23 times the rate during the same person's control time. The self-controlled case-series lets you read this off directly because each case serves as their own control.

2. From exp(confint.default(fit)), report the 95% CI. Does it cross 1, and what does that tell you about statistical significance?

Model answer: The Wald 95% CI on the IRR is very wide, roughly 4 to 124. It excludes 1, so the elevated rate during the risk window is statistically significant at α = 0.05. The width reflects how little information four cases and six events provide, so the point estimate is imprecise and a Wald interval from such sparse data should be read cautiously.

3. The model used eliminate = factor(case). Explain in your own words why this design feature means you do NOT need to adjust for sex, genetics, or any other time-invariant confounder.

Model answer: eliminate = factor(case) stratifies the conditional likelihood by individual: each case acts as their own control, so any characteristic that does not change within the person (sex, genetics, baseline lifestyle, chronic comorbidities) is held constant within the comparison. Mathematically, time-invariant person-level effects drop out of the conditional likelihood because both the risk and control windows belong to the same person. That is the whole point of a self-controlled design: it controls for unmeasured time-fixed confounders without ever measuring them.

Key Assumptions

Occurrence of the event must not alter probability of future exposure

If a febrile reaction after a first vaccine dose causes parents to skip the booster, the exposure pattern is no longer independent of outcome. One way to deal with this is to ignore post-event exposures (i.e., consider only the first vaccination). Whitaker and colleagues note that the bias from violating this assumption is often small in practice, but it should be considered explicitly.

The event must not censor or truncate the observation period

If the outcome is death (which clearly ends observation) or a serious illness that prompts withdrawal from the study, the design's assumptions are violated. The standard self-controlled case-series is designed for outcomes that occur and resolve, allowing observation to continue.

Multiple events per person must be independent

Multiple recurrences of the outcome can be included as long as they are conditionally independent given exposure. If they are not (e.g., one event makes another more likely), only first events should be analysed.

The risk window must encompass the true biological risk period

If the observation period or risk window does not cover the full duration over which exposure can affect the outcome, any resulting estimate of relative incidence is biased toward the null. Sample size formulae are given in Whitaker, Hocine, & Farrington (2009).

Analysis

The standard analytic tool is a conditional Poisson regression model, where the outcome is the count of events in each risk and control time interval and the logarithm of the duration of each interval is the offset. The parameter of interest is the relative incidence — the rate during the risk period relative to the rate during the control period.

Relative Incidence = (Events during risk time / Person-time at risk) ÷ (Events during control time / Person-time in control)
Equivalent to a rate ratio, estimated by conditional Poisson regression.
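The formula can be checked by hand with a short base R sketch. The totals below are the pooled events and person-time from the four-case toy dataset in the R example earlier in this section (4 events in 117 units of risk time, 2 events in 1,343 units of control time). This pooled calculation is only a crude check: the conditional Poisson model compares risk and control time within each person, so its estimate need not match the pooled ratio exactly.

```r
# Pooled totals from the toy SCCS dataset used earlier in this section
sccs_totals <- data.frame(
  period      = c("risk", "control"),
  events      = c(4, 2),        # 1 + 1 + 2 + 0 and 0 + 0 + 1 + 1
  person_time = c(117, 1343)    # 30 + 28 + 30 + 29 and 335 + 337 + 335 + 336
)

# Event rate in each period type
rate <- sccs_totals$events / sccs_totals$person_time
names(rate) <- sccs_totals$period

# Relative incidence = rate in risk time / rate in control time
relative_incidence <- unname(rate["risk"] / rate["control"])

print(round(rate, 5))
print(round(relative_incidence, 2))
```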

Example 11.3: Falls and Antihypertensive Medication

Gribbin and colleagues (2011) used UK primary-care databases to study whether starting an antihypertensive medication transiently increased the risk of falls in adults aged 60 and older. They identified 9,862 falls between 2003 and 2006. For each patient, episodes of continuous medication exposure of up to 60 days were defined. After each prescription, the exposure period was further subdivided into day 0, days 1–21, and days 22–60. All remaining person-time was the unexposed baseline. Poisson regression yielded incidence rate ratios for each post-exposure period, allowing the temporal pattern of risk after initiation to be characterised.

Why is this question well-suited to a self-controlled case-series? Because the comparison is within-person, all the patient-level confounders that complicate fall risk — frailty, polypharmacy, comorbidity, age — are automatically controlled.

Key Takeaways

  • Hybrid designs are variants of the classic observational designs developed to address specific methodological challenges.
  • Case-crossover studies use each case as its own control by comparing exposure during a risk period with exposure during one or more control periods at other times. They are best for transient exposures and acute outcomes.
  • Three referent-selection strategies for case-crossover studies are unidirectional, symmetric bidirectional, and time-stratified. The choice depends on whether the event itself alters subsequent exposure and whether time trends in exposure are likely.
  • Case-crossover data are typically analysed by conditional logistic regression; with shared daily exposure data, an equivalent Poisson time-series approach is available.
  • The self-controlled case-series partitions each case's observation period into risk windows (defined by exposure timing) and control time (everything else). Conditional Poisson regression yields the relative incidence.
  • Both designs automatically control all time-invariant confounders — including those the investigator has not measured.
Knowledge Check — Section 1

1. The case-crossover design is most appropriate for studying:

Case-crossover designs require transient exposures and acute outcomes, since each case must alternate between exposed and unexposed states across short time windows.

2. A unidirectional (backward-only) referent selection strategy is preferred when:

When the event itself changes subsequent exposure (for example, a leg injury changes future running mileage), control periods drawn from the post-event interval are no longer comparable to the pre-event period. Backward-only selection avoids this problem.

3. The chief advantage shared by both case-crossover and self-controlled case-series designs is that they:

Because each individual serves as their own comparator, every characteristic that is stable across the observation period (genetics, sex, baseline health, lifestyle) is automatically held constant — even if the investigator never measured it.


Section 2

Case-Only Comparison Designs

⏱ Estimated reading time: 17 minutes

Introduction and Overview

Section 1 covered designs that use each case as their own control across time. Section 2 turns to designs that compare different kinds of cases to one another — useful when traditional controls are unavailable, when subtypes of disease have different aetiologies, or when surveillance data only contain ill people. Each design in this section sacrifices something in exchange for not needing a healthy control group.

Learning Objectives

  • Describe the design and applications of case-case studies.
  • Distinguish case-case studies from case-case-control studies.
  • Explain the logic of case-only studies for evaluating gene-environment interactions.
  • Recognise the assumptions and limitations of each design.

Case-Case Studies

The case-case design is a variant of the case-control design in which the comparison group consists of cases of a different disease subtype drawn from the same surveillance system. McCarthy and Giesecke (1999) proposed it as an efficient way to identify risk factors that distinguish closely related etiological subgroups using routine surveillance data.

For example, the cases might be people infected with Salmonella Typhimurium, while the “controls” might be people infected with Salmonella Heidelberg. Both groups have salmonellosis — both are cases — but the design seeks to identify exposures that distinguish one serotype from the other.

When Is the Case-Case Design Useful?

Two Common Settings

1. Identifying differential risk factors for related endemic diseases. When all subjects who appear in a surveillance system have undergone similar selection (e.g., they all sought medical care, all had stool cultured), comparing them with one another minimises selection bias and recall bias. Comparing them to community controls who never had any salmonellosis would be far more vulnerable to these biases.

2. Distinguishing outbreak cases from sporadic cases of the same organism. In an outbreak investigation, the cases are people whose isolates match the outbreak strain. The “controls” are sporadic cases of the same serotype during the same time window. The exposures that differentiate them point to the outbreak vehicle.

Strengths and Limitations

Comparable selection experience. Because both groups appear in the same surveillance system, both have passed through similar diagnostic and reporting filters. Selection bias is minimised.

Comparable recall experience. Both groups have had a similar clinical experience (an episode of gastrointestinal illness). Their motivation to recall recent food exposures is similar, reducing differential recall bias.

Efficient use of surveillance data. No new control recruitment is required; the comparison group is already in the database.

Cannot identify shared risk factors. Exposures that cause both serotypes equally (such as eating any contaminated food) will not be detected because they are present in both groups.

Surveillance limitations. Wilson and colleagues (2008) note tendencies for selection bias (only severe cases reported), information bias (data collected by people who know the diagnosis), confounding (limited covariate information), and lack of detail on exposure.

The OR is not a true risk measure. Because the “controls” are not drawn from the underlying source population, the odds ratio reflects the relative difference in exposure between two case subtypes, not the absolute risk of either disease.

If the analysis identifies poultry consumption as a stronger risk factor for S. Typhimurium than for S. Heidelberg, this does not tell us that eating poultry causes Typhimurium in absolute terms. Rather, it tells us that poultry consumption is more strongly associated with the Typhimurium subgroup than with the Heidelberg subgroup — useful for tracing distinct food sources or transmission routes.

To get an absolute risk estimate, a traditional case-control study with population-based controls would still be required. Case-case findings are best treated as hypothesis-generating about subtype-specific exposures.

Example 11.4: Two Campylobacter Species

Gillespie and colleagues (2002) used population-based surveillance data from England and Wales to compare the exposure histories of people with Campylobacter coli infection (the much rarer species) with those of people with Campylobacter jejuni infection. Standard structured questionnaires from the surveillance system provided the exposure data. Backward stepwise logistic regression identified differential risk factors and tested for interaction. The authors emphasised that exposures common to both species would not be detected by this design — only those that distinguish the two species could emerge.

If you wanted to know which exposures were common to both species, what design would you use instead?

Example 11.5: A Salmonella Outbreak in Germany

Krumkamp and colleagues (2008) investigated a 2003 outbreak of Salmonella 1,4,[5],12:i:- in a German district. Ten outbreak cases were compared with 97 sporadic cases of other Salmonella serotypes that occurred in the same area during the same year. Telephone interviews collected exposure histories. Fisher's exact tests and odds ratios identified meat sold from a single butcher shop as the only significant risk factor — a finding that would have been very difficult to achieve with a traditional community-based control group.

Analysis

Case-case data are analysed by the same techniques as risk-based case-control studies — typically logistic regression. The exponentiated coefficient is interpreted as the relative odds of exposure between the two case subtypes, not as a risk ratio.
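A minimal sketch of such an analysis in base R follows, using the Salmonella serotype illustration from earlier in this section. The counts are hypothetical and are not taken from any of the studies cited. The "outcome" in the model is membership in the index subtype, with the comparison subtype playing the role of controls.

```r
# Hypothetical aggregated surveillance data (not from a real study):
# counts of cases by serotype and poultry consumption.
case_case <- data.frame(
  subtype = c("typhimurium", "typhimurium", "heidelberg", "heidelberg"),
  poultry = c(1, 0, 1, 0),
  n       = c(40, 60, 20, 80)
)

# Index subtype coded 1, comparison subtype coded 0
case_case$is_index <- as.integer(case_case$subtype == "typhimurium")

# Logistic regression with the cell counts as frequency weights
fit_cc <- glm(is_index ~ poultry, family = binomial,
              weights = n, data = case_case)

# Relative odds of poultry exposure, index subtype vs. comparison subtype
or_poultry <- exp(coef(fit_cc)[["poultry"]])

print(round(or_poultry, 2))
```

With a single binary exposure the result equals the cross-product ratio (40 × 80) / (60 × 20). It describes how much more common the exposure is in one subtype than the other, not the risk of either disease relative to the source population.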

Case-Case-Control Studies

The case-case-control design (Kaye, Harris, Samore, & Carmeli, 2005) was developed to overcome a specific limitation of traditional case-control studies in the context of antimicrobial resistance. The original example was vancomycin-resistant Enterococcus (VRE) versus vancomycin-susceptible Enterococcus (VSE).

The Problem the Design Solves

Suppose you want to identify risk factors for VRE infection. A traditional case-control approach would compare VRE cases with non-infected controls. But many of the exposures associated with VRE (prior antibiotic use, prolonged hospitalisation, ICU stay) are also strong risk factors for VSE — and indeed for any hospital-acquired infection. So a traditional design tells you what causes hospital infection in general, not what specifically drives the resistant phenotype.

An alternative is a case-case design comparing VRE with VSE. But Kaye and colleagues argue against this: VRE often emerges from external sources (transmission of an already-resistant strain) rather than from within-patient evolution of a susceptible strain. So contrasting VRE directly with VSE conflates “risk of acquiring a resistant strain” with “risk of selection pressure on a susceptible one.”

The Case-Case-Control Solution

The design uses two case series (resistant and susceptible) and one control series (people without infection from the same source population). Two separate logistic regression models are fitted — one comparing each case series with the controls. The risk factors are then sorted into three categories.

[Figure: Two Cases, One Control Series. Resistant cases (e.g., VRE) and susceptible cases (e.g., VSE) are each compared with the same non-infected controls in two separate models (Model 1 and Model 2).]

Figure 11.3 — In the case-case-control design, two case series are each compared separately with the same control series. Comparison of the two resulting models identifies which risk factors are unique to the resistant phenotype.

Interpreting the Three Variable Categories

Category A: Variables only in the resistant model

These are risk factors unique to the resistant phenotype. They are the variables most useful for understanding what drives resistance specifically. In the original VRE example, exposure to vancomycin itself or to a roommate carrying VRE might fall in this category.

Category B: Variables only in the susceptible model

These are risk factors unique to the susceptible phenotype. They tell us what predisposes to acquiring the susceptible strain in particular (perhaps community sources for the susceptible organism but not for the resistant one).

Category C: Variables in both models

These are risk factors for the target organism in general, regardless of resistance status. Hospitalisation, prior antibiotic exposure, indwelling catheters, and severity of illness typically appear here. They are real risk factors but they do not distinguish resistance from susceptibility.
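
The sorting step can be sketched as simple set logic once each model's retained variables are known. The helper name and the variable lists below are hypothetical, and the sketch assumes that "in the model" means the variable was retained as a significant risk factor.

```python
def sort_risk_factors(resistant_model, susceptible_model):
    """Sort retained variables from the two models into categories A, B and C."""
    resistant = set(resistant_model)
    susceptible = set(susceptible_model)
    return {
        "A (resistant model only)": sorted(resistant - susceptible),
        "B (susceptible model only)": sorted(susceptible - resistant),
        "C (both models)": sorted(resistant & susceptible),
    }

# Hypothetical lists of variables retained in each model.
resistant_vars = ["vancomycin exposure", "prior antibiotic use", "indwelling catheter"]
susceptible_vars = ["prior antibiotic use", "indwelling catheter", "community source"]

categories = sort_risk_factors(resistant_vars, susceptible_vars)
for label, variables in categories.items():
    print(label, "->", variables)
```

In practice the comparison also looks at the size of each coefficient in the two models, not only at whether a variable appears.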

Design Considerations

  • First positive culture per patient. Only the first positive culture should be included to avoid double-counting; for nosocomial infection studies, restrict to cultures taken >48 hours after admission.
  • Source population matters. Controls should come from the same source population as the cases. For nosocomial infections, controls should be other patients hospitalised >48 hours, ideally with documented negative cultures for both phenotypes.
  • Confounding control. Because there is only a single control series, restricted sampling and matching are difficult. Confounding is typically handled through multivariable unconditional logistic regression.

Example 11.6: MRSA Colonisation in an ICU

Melo and Fortaleza (2009) investigated risk factors for nasopharyngeal colonisation with methicillin-resistant Staphylococcus aureus (MRSA) in an ICU. They enrolled 122 patients who had been screened weekly for S. aureus colonisation. The two case series were patients colonised with MRSA and patients colonised with methicillin-susceptible S. aureus (MSSA). Controls were patients in whom no colonisation was detected during their ICU stay. Comparing the resulting two models revealed which exposures were specifically associated with the resistant phenotype rather than with general susceptibility to S. aureus colonisation.

Case-Only Studies

The case-only design uses only cases — no observed control group is recruited. The expected exposure distribution in the hypothetical “control population” is derived from theoretical or external sources. The design originated in genetic epidemiology, where the population frequency of common alleles can often be specified from external reference data (Khoury & Flanders, 1996).

What the Design Can and Cannot Estimate

Important Restriction

The case-only design cannot estimate main effects — it cannot tell you whether a gene or an environmental exposure independently raises the risk of disease. What it can estimate is interaction between two factors among cases, provided the two factors are independent of each other in the source population.

The intuition is this: among cases, if a genetic risk factor and an environmental exposure are independent in the source population but appear together more often than expected by chance, that excess co-occurrence is evidence of statistical interaction on the multiplicative scale. If the gene and exposure were associated in the source population (that is, not independent), the excess co-occurrence among cases could simply reflect that association, and the interaction estimate would be biased.

Required Assumptions

  • Independence in the source population. The exposure and the proposed effect modifier (often a gene, but can be sex, age, or another stable trait) must be independent in the population from which cases arose. For a heritable polymorphism not influenced by the environmental exposure, this is biologically plausible.
  • Stable, well-defined effect modifier. Genetic variants, sex, race, and age are common choices because they don't change over time and can be measured reliably.
  • The disease must be rare. Like many odds-ratio-based estimators, the case-only interaction estimate approximates the true interaction parameter most closely when the outcome is rare in the source population.

Recent Extensions Beyond Genetics

The design has been extended to study how non-genetic stable characteristics modify the effects of time-varying exposures. Armstrong (2003) and Schwartz (2005) used case-only designs to ask whether sex, race, age, or socioeconomic class modify the effect of extreme weather on mortality. Because age, sex, and socioeconomic class can reasonably be considered independent of daily weather exposures, the case-only approach yields a valid interaction estimate.

Analytic Logic

Suppose we are interested in whether sex modifies the effect of an extreme heat day on mortality. A case-only logistic regression takes the form:

logit P(sex = female) = β0 + β1 × (extreme heat exposure)
A β1 that differs significantly from zero indicates that the proportion of female cases (vs. male) differs between heat-exposure days and other days; that is, sex modifies the heat–mortality association.

The logic feels strange at first because we appear to be modelling a covariate (sex) as a function of an exposure. But this is mathematically equivalent to a Poisson model of mortality count as a function of heat, sex, and a heat×sex interaction term, where the case-only regression coefficient is the interaction term. The trick is that we never need a control group at all — the “control” expectation is built into the assumed independence of sex and heat in the source population.
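
A small numerical check can make this equivalence concrete. All person-day and death counts below are hypothetical, and sex and heat are independent by construction. With a single binary predictor, the fitted slope of the case-only logistic model equals the log of the 2×2 odds ratio computed here.

```python
import math

# Hypothetical person-days at risk. The female:male ratio is identical on
# heat days and other days, so sex and heat are independent by construction.
person_days = {
    ("female", "heat"): 1_000, ("female", "other"): 9_000,
    ("male", "heat"): 1_000, ("male", "other"): 9_000,
}
# Hypothetical deaths in each sex-by-heat cell.
deaths = {
    ("female", "heat"): 30, ("female", "other"): 90,
    ("male", "heat"): 20, ("male", "other"): 90,
}

def rate(sex, day):
    """Mortality rate per person-day in one sex-by-heat cell."""
    return deaths[(sex, day)] / person_days[(sex, day)]

# Full-population view: heat rate ratio within each sex, then their ratio.
rr_female = rate("female", "heat") / rate("female", "other")
rr_male = rate("male", "heat") / rate("male", "other")
interaction_ratio = rr_female / rr_male

# Case-only view: odds of being female among deaths, heat days vs other days.
case_only_or = (deaths[("female", "heat")] / deaths[("male", "heat")]) / (
    deaths[("female", "other")] / deaths[("male", "other")]
)
beta1 = math.log(case_only_or)  # slope a case-only logistic model would estimate

print(f"Interaction ratio from rates: {interaction_ratio:.3f}")
print(f"Case-only odds ratio:         {case_only_or:.3f}")
```

The two quantities agree only because the person-day denominators cancel under independence. If women were over-represented on heat days in the source population, the case-only odds ratio would absorb that imbalance.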

Example 11.7: Effect Modifiers of Mortality from Temperature Extremes

Schwartz (2005) investigated whether sex, non-white race, or age over 85 modified the effect of extreme temperatures on mortality in Wayne County, Michigan. Weather data identified excessively hot and cold days. Demographic data on people who died came from medical records. Separate models were fitted for heat and for cold, and one-day and three-day average temperature exposures were both examined. All three covariates emerged as effect modifiers. Notice that no control group of survivors was needed — the inference depended on whether the demographic profile of cases differed between extreme-weather days and other days.

Why is this design especially appealing for studying mortality? Because building a comparable control group of “people who did not die” is conceptually awkward when daily death registry data are already complete.

Example 11.8: Heat Waves and Hospital Admissions in New South Wales

Khalaj and colleagues (2010) used a case-only design to identify which underlying medical conditions raised the risk of hospital admission during heat waves across five regions of New South Wales, Australia. Daily admission records and weather data covered the warm months of 1998–2006. The analysis fitted logistic regression models with each primary diagnosis as the “outcome” and an extreme-heat indicator as the predictor. Sine and cosine terms were included to control for season, since otherwise season could confound the interaction (some chronic conditions have stronger seasonal patterns than others).

Key Takeaways

  • Case-case studies compare two related disease subtypes drawn from the same surveillance system, identifying differential risk factors while minimising selection and recall bias. The OR reflects relative differences in exposure between subtypes, not a true risk measure.
  • Case-case-control studies use two case series (e.g., resistant and susceptible) compared separately with one control series, sorting risk factors into category A (unique to resistance), B (unique to susceptibility), and C (shared by the organism in general).
  • Case-only studies use only cases and rely on external knowledge of the exposure distribution in “controls.” They estimate interaction between an exposure and an effect modifier, but not main effects, and require independence between the two factors in the source population.
  • All three designs rely on case-only or two-case-series data structures because constructing a satisfactory traditional control group would be impractical, biased, or uninformative for the specific research question.
Knowledge Check — Section 2

1. A case-case study comparing Salmonella Typhimurium with Salmonella Heidelberg cases:

Case-case studies can only identify risk factors that differ between the two case subtypes. Exposures common to both groups will not show up because they cancel out in the comparison.

2. In a case-case-control study of vancomycin-resistant Enterococcus, a Category C variable is one that:

Category C variables are risk factors for the organism in general (e.g., prior antibiotic use, prolonged hospitalisation) and appear in both models. Category A is unique to resistance; Category B is unique to susceptibility.

3. The case-only design's most important limitation is that it:

Without an observed control group, the design cannot estimate main effects (whether a gene or an exposure independently raises risk). It can only detect statistical interaction between two factors that are independent in the source population.

Section 3

Case-Cohort & Two-Stage Designs

⏱ Estimated reading time: 18 minutes

Introduction and Overview

Sections 1 and 2 covered designs built around case series: case-crossover and self-controlled case-series, then case-case, case-case-control, and case-only studies. Section 3 returns to designs that use cohorts as their backbone but subsample from them to make expensive measurements feasible. Both case-cohort and two-stage designs let you mount what is essentially a cohort study while paying for biomarker measurements on only a fraction of the participants, the kind of design that makes large biobanks practical.

Learning Objectives

  • Describe the structure and rationale of the case-cohort design.
  • Distinguish risk-based from rate-based case-cohort analyses.
  • Explain why a single subcohort can support investigation of multiple outcomes.
  • Describe the logic of two-stage sampling and identify when it is most efficient.
  • Design a basic two-stage sampling strategy for a case-control study.

Case-Cohort Studies

The case-cohort design, introduced by Prentice (1986), combines features of cohort and case-control studies. From a defined source cohort, the investigator draws a random sample called the subcohort at the start of follow-up. Detailed exposure and covariate data are obtained on the subcohort. As follow-up proceeds, all incident cases that arise from the full source cohort — whether or not they happen to fall within the subcohort — are also studied.

The design has the same advantages as a full cohort study (clear temporal ordering, multiple outcomes, direct disease frequency estimates) but achieves them with much smaller measurement costs because expensive covariate or biomarker assays are performed only on the subcohort plus the cases — not on the entire source cohort.

[Figure: Full source cohort (everyone at risk) containing a subcohort (random sample, measured at baseline). Legend: incident cases (all measured) and subcohort members (all measured).]

Figure 11.4 — The case-cohort layout. Detailed exposure and covariate data are needed only on the random subcohort plus the cases. Most of the full source cohort never requires expensive measurement.

The Big Win: Multiple Outcomes from a Single Subcohort

Why Researchers Love Case-Cohort Designs

One subcohort can serve as the comparison group for multiple disease outcomes. If researchers are interested in cardiovascular disease, several cancers, and diabetes within the same large cohort, they need only one set of expensive biomarker measurements on the subcohort. Each outcome study then adds detailed measurements on its own cases. By contrast, a nested case-control study requires fresh control selection for each outcome, and a full cohort analysis would require measuring everyone for everything.

Risk-Based vs. Rate-Based Designs

Risk-Based (Closed-Cohort) Case-Cohort

Suitable when the source cohort is closed (a fixed group followed for a defined period) and exposures are stable over follow-up. The subcohort is sampled by simple or stratified random sampling at the start of follow-up. Cases arising outside the subcohort during follow-up are added.

Analysis: Combine the two case groups (those in and those outside the subcohort) and analyse the data in the familiar 2×2 case-control format using logistic regression. The odds ratio approximates the relative risk when the disease is rare.

Example: Matsuda and colleagues (2011) studied placental abruption and placenta previa among 5,036 of 242,715 births in Japan, using multivariable logistic regression with the subcohort plus all cases.

Rate-Based (Open-Cohort) Case-Cohort

Suitable when the source cohort is open (entries and exits possible during follow-up) or when exposures change over time. At the moment a case occurs, eligible members of the subcohort are those who have not yet experienced the outcome. Their current exposure status (which may have been updated through repeated surveys or stored serial samples) is recorded.

Analysis: A weighted Cox proportional-hazards model is the standard approach. Weights account for the sampling fraction: for example, if the subcohort represents 20% of the source cohort, non-case subcohort members are typically up-weighted by a factor of 5 (1/0.20). Three Cox-weighting schemes have been proposed historically; Prentice's method most closely reproduces the estimates from a full-cohort analysis.

Example: Agalliu and colleagues (2011) followed a subcohort of 1,979 men for prostate cancer risk, with exposure and supplement use updated through repeated surveys.
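
The risk-set logic of the rate-based design can be sketched as follows. This is a simplified illustration, not a full weighted Cox analysis: the `Member` class, its field names, and the follow-up times are all hypothetical, and the sketch assumes that a case arising outside the subcohort contributes only to the risk set at its own event time.

```python
from dataclasses import dataclass

@dataclass
class Member:
    member_id: str
    entry_time: float   # time of entering follow-up
    exit_time: float    # time of event, loss to follow-up, or end of study
    in_subcohort: bool

def risk_set(members, case, event_time):
    """Subcohort members under follow-up at event_time, plus the case itself."""
    at_risk = [
        m for m in members
        if m.in_subcohort and m.entry_time <= event_time <= m.exit_time
    ]
    if case not in at_risk:
        at_risk.append(case)  # non-subcohort case joins only its own risk set
    return at_risk

# Hypothetical open cohort: s2 left follow-up at time 4, s3 entered late at time 6.
subcohort = [
    Member("s1", 0, 10, True),
    Member("s2", 0, 4, True),
    Member("s3", 6, 12, True),
]
outside_case = Member("c1", 0, 5, False)

ids_at_5 = [m.member_id for m in risk_set(subcohort, outside_case, event_time=5)]
print(ids_at_5)
```

Exposure values for each person in the risk set would then be taken as of the event time, which is what allows time-varying exposures to be handled.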

Practical Considerations

  • Eligibility for the subcohort. Members must be willing to provide health history, lifestyle data, and (often) biological samples. Stratified sampling can ensure that the subcohort's covariate profile matches anticipated cases (e.g., over-sampling young adults if young-adult disease is the focus).
  • Stored specimens. Serially stored tissue or blood samples allow detection of exposure changes over time and support post-hoc biomarker assays as new hypotheses emerge.
  • Sampling adjustments for non-response. If 20% are sampled but only 80% of those agree to participate, the weighting should reflect the actual participation, not the original sampling probability.
  • Robust standard errors are recommended for case-cohort analyses to account for the sampling variability.
  • Clustering of cases. If cases tend to be diagnosed at the same clinic, marginal models with adjusted variances or frailty models should be used to account for within-cluster correlation.
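
The weighting arithmetic behind the non-response point can be sketched in a few lines. The helper below is hypothetical and assumes that non-participation is unrelated to exposure and outcome, so that the realised sampling fraction is simply the planned fraction multiplied by the participation proportion.

```python
def subcohort_weight(sampling_fraction, participation=1.0):
    """Inverse-probability weight for a non-case subcohort member."""
    realised_fraction = sampling_fraction * participation
    if not 0 < realised_fraction <= 1:
        raise ValueError("realised fraction must lie in (0, 1]")
    return 1 / realised_fraction

# 20% sampled and everyone participates.
planned_weight = subcohort_weight(0.20)
# 20% sampled but only 80% of those sampled participate.
realised_weight = subcohort_weight(0.20, participation=0.80)

print(f"Planned weight:  {planned_weight:.2f}")
print(f"Realised weight: {realised_weight:.2f}")
```

Using the planned weight when participation is incomplete would make each subcohort member stand in for too few people in the source cohort.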

Example 11.9: Drinking Water Quality and Stomach Cancer

Auvinen and colleagues (2005) studied radon and other radionuclides in drinking water and the risk of stomach cancer in a Finnish population of over 144,000 people who drew their water from drilled wells between 1967 and 1980. An initial subcohort of 4,590 was sampled with stratification by age and sex. Many of these did not actually meet the long-term-exposure criterion, leaving an effective subcohort of 371 long-term users. Stomach cancer cases (n=107) were identified through the cancer registry. Water samples were collected blindly with respect to case status and analysed for radionuclides. A proportional-hazards model accounted for how long each subject had been exposed to each level of radon. All hazard ratios were below 1, suggesting a protective association — a surprising finding that illustrates how case-cohort designs can efficiently support investigation of unusual exposures using stored samples and registry data.

Two-Stage Sampling Designs

A two-stage (or two-phase) sampling design is a strategy that can be layered on top of any traditional design — cohort, case-control, or cross-sectional. The first stage collects readily available, inexpensive data on a large group. The second stage collects more detailed (and usually more expensive) data on a strategically selected subsample.

Why Two-Stage Designs Make Sense

The Core Problem

Imagine you want to study whether occupational solvent exposure increases birth defect risk. Hospital records can give you basic information on hundreds of thousands of pregnancies cheaply, but a detailed occupational exposure assessment requires a one-hour interview at $200 per participant. Spending $200 on every pregnancy is unaffordable. A two-stage design lets you do the cheap step on everyone and the expensive step only where the information is most valuable.

Three Common Use Cases

Use Case 1: Expensive covariate or exposure measurement

The first stage uses an inexpensive surrogate exposure measure (e.g., job title from a registry). The second stage performs a detailed work-up (e.g., personal interview about specific solvent contacts, dose, duration) on a subsample. This is the most common application.

Use Case 2: Validation substudy

If the inexpensive first-stage measure has known measurement error, the second stage applies a near-gold-standard measurement to a subsample. The relationship between the two measures (the “measurement model”) is estimated, and inferences from the full first-stage data set can be corrected for the measurement error. McNamee (2002, 2005) describes optimal designs for this purpose.

Use Case 3: Handling missing covariate data

When key covariate data are missing for many subjects, instead of assuming missingness at random, the missing-data subjects can be the explicit target of the second-stage data collection. This concentrates resources on filling specific gaps rather than dropping incomplete records.

How to Sample at Stage 2

The key design question in any two-stage study is: how should we choose whom to include in the expensive second stage? The optimal answer depends on the design.

Stage 1 Design | Recommended Stage 2 Sampling | Rationale
Cohort | Fixed numbers of exposed and unexposed | Balanced sampling on exposure ensures precision in the exposure–outcome estimate.
Case-control | Fixed numbers of cases and controls | Balanced sampling on disease ensures precision; oversampling cases is efficient when the disease is rare.
Either, when surrogate exposure is available | Approximately equal numbers from each of the four exposure×disease cells | Optimal efficiency: extracts the most information from a fixed second-stage budget by ensuring that small cells (rare combinations) are not under-represented.

A Worked Two-Stage Case-Control Sampling Strategy

Suppose your stage 1 data come from a hospital registry of 50,000 pregnancies. From the registry you can identify 2,000 birth-defect cases and 48,000 non-cases. A crude (and possibly mismeasured) exposure indicator — whether the mother held a job classified as “industrial” — is available for everyone.

Stage 1 cross-classification might look like this:

Group | Industrial job (surrogate exposed) | Other job (surrogate unexposed) | Total
Cases | 120 | 1,880 | 2,000
Non-cases | 1,500 | 46,500 | 48,000

For stage 2, balanced sampling across the four cells is the most efficient strategy. Suppose your budget allows 400 detailed interviews:

  • Industrial-job cases (120 available): Take all 120.
  • Other-job cases (1,880 available): Sample 100.
  • Industrial-job non-cases (1,500 available): Sample 100.
  • Other-job non-cases (46,500 available): Sample 80.

Because we have oversampled the small cells, weighting must be applied in analysis to recover correct association estimates — this is what the Cain & Breslow (1988) and later Flanders & Greenland (1991) methodologies handle. Hanley and colleagues (2005) provide worked examples of the adjusted odds ratio and its variance.
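
To see why weighting is needed, the sketch below applies inverse-sampling-fraction weights to the worked example. It uses only the surrogate exposure, which is an assumption made for illustration: in a real stage 2 analysis the weights would be applied to the detailed exposure data, and variance estimation would follow the published methods.

```python
# Stage 1 counts and stage 2 sample sizes from the worked example above.
stage1 = {
    ("case", "industrial"): 120, ("case", "other"): 1_880,
    ("non-case", "industrial"): 1_500, ("non-case", "other"): 46_500,
}
stage2 = {
    ("case", "industrial"): 120, ("case", "other"): 100,
    ("non-case", "industrial"): 100, ("non-case", "other"): 80,
}

def odds_ratio(table):
    """Odds ratio for industrial vs other job, cases vs non-cases."""
    return (table[("case", "industrial")] * table[("non-case", "other")]) / (
        table[("case", "other")] * table[("non-case", "industrial")]
    )

# Weight for each cell = 1 / sampling fraction = stage 1 count / stage 2 count.
weights = {cell: stage1[cell] / stage2[cell] for cell in stage1}
weighted_counts = {cell: stage2[cell] * weights[cell] for cell in stage1}

naive_or = odds_ratio(stage2)              # ignores the unequal sampling
weighted_or = odds_ratio(weighted_counts)  # recovers the stage 1 association

for cell, weight in weights.items():
    print(cell, f"weight = {weight:.2f}")
print(f"Naive OR = {naive_or:.2f}, weighted OR = {weighted_or:.2f}")
```

The unweighted stage 2 counts distort the surrogate association because the four cells were sampled at very different fractions. Applying the weights restores the stage 1 odds ratio.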

Example 11.10: A Two-Stage Case-Control Study of Childhood Asthma

Martel and colleagues (2009) used a two-stage design with three linked Quebec administrative health databases. Stage 1 was a nested case-control study within a cohort of pregnant women and their children: 5,226 asthmatic children (cases) and 20 non-asthmatic children per case were selected using density sampling matched to time of case occurrence. Covariate data from the administrative databases were used at this stage. Stage 2 was a mailed questionnaire to a subsample of mothers, balanced across the cells of the first-stage exposure–outcome cross-table to overrepresent small cells. Conditional logistic regression was used at stage 1; unconditional logistic regression with sample-fraction weighting at stage 2. Final corrected estimates were obtained by combining the stages.

Notice how the design exploits the cheap administrative data to identify cases and screen on rough covariates, while reserving the expensive questionnaire for the subset where new information will most improve the estimate.

Practical Pitfalls

  • Budget allocation between stages. Hanley and colleagues (2005) note that tools for optimal allocation have not advanced much in the past two decades; in practice, simulation can help determine the relative number of stage 1 and stage 2 subjects to maximise precision under a fixed budget.
  • Variance estimation. The variance of the final estimate depends on both stages. Naive variances that ignore stage 2 sampling will be too small. Hanley and colleagues provide details for dichotomous covariates; software exists for more complex situations.
  • Sampling fractions must be known. Weighting requires knowing exactly what proportion of each stage 1 cell was sampled at stage 2. If non-response or other losses change those fractions, the realised (not planned) fractions should be used.

Reflection: Choosing the Right Hybrid Design

Consider a research question of interest to you. It might involve a transient environmental trigger of an acute event, an outbreak you would like to characterise, a gene-environment interaction, or a long cohort follow-up where biomarker measurement is expensive. Which hybrid design would you choose, and why? What assumptions would you need to defend, and what limitations would you have to acknowledge in your discussion?

Model answer: For a question of acute transient triggers (e.g., does heavy traffic exposure increase the rate of asthma exacerbations in the next 24 h?), the case-crossover is the natural fit: the within-person comparison automatically controls for time-invariant individual factors, and the design is cheap. Assumptions to defend: (1) the trigger varies within persons over time (otherwise there is no within-person variation to leverage); (2) referent-window selection is unbiased (use random or symmetric bidirectional referent selection); (3) no time-trend confounders in the exposure. Limitations: the design is only useful for transient acute triggers with short induction times, cannot address chronic exposures, and remains open to confounding by time-varying factors (e.g., daily activity patterns). For a gene-environment study a case-cohort design can be more efficient: case-only genotyping plus a subcohort for the environment comparison.

Key Takeaways

  • Case-cohort studies sample a random subcohort at the start of follow-up and add all incident cases. Detailed measurement is needed only on the subcohort plus the cases — not the full source cohort.
  • A single subcohort can serve as the comparison group for many outcomes, making the design particularly attractive for large prospective studies with stored biological specimens.
  • Risk-based case-cohort analyses combine subcohort and outside cases in a logistic regression. Rate-based analyses use weighted Cox models, with weights reflecting the inverse sampling probability.
  • Two-stage sampling lets investigators pay for cheap, low-quality data on everyone and high-quality data only on a strategically chosen subsample.
  • For two-stage case-control studies, the most efficient stage 2 sampling allocates approximately equal numbers across the four cells of the stage 1 exposure×disease table.
  • Two-stage analyses must use weighting to recover unbiased estimates and must use variance formulae that account for both stages of sampling.
Knowledge Check — Section 3

1. The key efficiency advantage of a case-cohort design over a full cohort study is that:

A random subcohort, plus all incident cases, provides enough data for a valid analysis. The full source cohort is followed for case identification but does not need expensive measurements.

2. In a rate-based case-cohort analysis of an open cohort, the standard analytic approach is:

Open-cohort rate-based analyses use a weighted Cox model. The weights account for the fact that the subcohort represents only a fraction of the source population.

3. For a two-stage case-control study with a binary surrogate exposure measure available at stage 1, the most statistically efficient stage 2 sampling strategy is to:

Balanced sampling across the four cells maximises information per stage 2 subject and ensures that small cells (often the most informative) are not under-represented. Weighting must then be applied in analysis to recover the population-level estimates.

Section 4

Knowledge Check & Final Assessment

⏱ Estimated time: 15 minutes

Bringing It All Together

This lesson moved beyond the three classical observational designs into the family of hybrid studies that mix and match their strengths. You worked through case-crossover and self-controlled case-series designs (which use each case as its own control across time), case-case and case-case-control designs (which substitute one case series for the traditional control group), case-cohort designs (which let one subcohort serve multiple outcomes), case-only designs (which estimate interaction without a separate control series), and two-stage sampling (which layers cheap and expensive measurement strategically).

The unifying logic is efficiency: each hybrid responds to a specific limitation of cohort or case-control designs — recall bias, between-person confounding, costly biomarker measurement, rare outcomes — by changing the comparison group or the sampling rule. Lesson 10 will close this design arc with controlled trials, where the investigator finally takes over exposure assignment.

Key Takeaways from Lesson 9

  • Case-crossover studies use each case as its own control across time, which makes them ideal for transient exposures and acute outcomes; control-period selection depends on whether the event alters future exposure.
  • Self-controlled case-series compare event rates inside vs. outside exposure-defined risk windows within one person; widely used for vaccine safety and analysed by conditional Poisson regression.
  • Case-case and case-case-control designs compare disease subtypes drawn from the same surveillance system, isolating differential risk factors and reducing selection and recall bias.
  • Case-cohort studies sample a random subcohort plus all incident cases, so a single subcohort can support studies of multiple outcomes.
  • Case-only designs estimate gene–environment or exposure–modifier interaction when the two factors are independent in the source population.
  • Two-stage sampling layers cheap stage-1 data on everyone with detailed stage-2 measurement on a strategically chosen subsample — a powerful way to control measurement cost in any base design.

Final Reflection

Imagine you are designing a study to test whether short bursts of vigorous physical activity during the workday increase the risk of acute myocardial infarction within the next two hours among middle-aged office workers. Which hybrid design (or combination of designs) would you propose? Defend your choice by addressing the nature of the exposure, the timing of the outcome, the assumptions your design requires, and the limitations you would need to acknowledge in your discussion.

Model answer: For acute MI within 2 h of a vigorous burst, the case-crossover design is the right primary analysis: each MI patient's hazard window (the 2 h before MI) is compared with multiple control windows on prior matched days and times for the same person. This automatically controls for time-invariant person-level factors (age, sex, baseline fitness, genetic risk, statin use). Assumptions to defend: (1) the exposure (vigorous bursts) is transient and varies within person across days; (2) referent windows are exchangeable in exposure-frequency terms (use symmetric bidirectional referents to handle secular trends and exposure-frequency drift); (3) no within-day confounders that vary with both burst occurrence and instantaneous MI risk (e.g., caffeine or acute stress). Sensitivity analyses: vary the referent window definition (1, 3, or 7 days before) and report stability of the IRR. Limitation: case-crossover gives a relative measure for the transient effect; for the question of whether vigorous activity is net beneficial or harmful, complement it with a longitudinal cohort design that tracks cumulative activity and long-term MI incidence. The two designs answer different questions (acute trigger vs. chronic protective effect).

Final Knowledge Assessment

Complete the following 12-question assessment. A score of 100% is required to complete the lesson. You may retake the assessment as many times as needed.

Final Assessment — 12 Questions

1. Which of the following is NOT a hybrid study design discussed in this chapter?

Cluster-randomised trials are a form of controlled trial, not an observational hybrid design. Case-crossover, case-cohort, and self-controlled case-series are all hybrid designs.

2. The case-crossover design controls for time-invariant confounders by:

By comparing each case's exposure across time periods within the same person, all stable individual characteristics are held constant by design — including those the investigator never measured.

3. A case-crossover study where control periods are selected only from before the event would be most appropriate when:

Unidirectional (backward-only) referent selection is the appropriate choice when the event would alter future exposure, since post-event control periods would no longer reflect “normal” exposure.

4. The self-controlled case-series design is most commonly used to study:

The self-controlled case-series design was largely developed for vaccine safety studies and is well suited to time-varying exposures with acute outcomes that resolve and allow continued observation.

5. The conditional Poisson regression model used to analyse self-controlled case-series data estimates:

The conditional Poisson model with the log of period duration as offset yields the relative incidence — how much more (or less) frequent events are during risk windows compared with control time.

6. A case-case study comparing Campylobacter coli with Campylobacter jejuni would NOT be useful for identifying:

Exposures shared by both case groups cannot be detected by a case-case design because they are equally distributed in both groups and cancel out in the comparison.

7. In a case-case-control study of methicillin-resistant Staphylococcus aureus, prior antibiotic use appears as a significant risk factor in both the resistant and the susceptible model. This makes prior antibiotic use:

Variables that appear in both models are Category C: risk factors for the target organism in general. They are real risk factors but they do not distinguish the resistant from the susceptible phenotype.

8. Which assumption is essential for the validity of a case-only study of gene-environment interaction?

If the gene and the exposure are correlated in the source population (e.g., the gene also influences exposure-seeking behaviour), the case-only interaction estimate will be confounded.

9. A major attractive feature of the case-cohort design is that:

Because the subcohort is selected without reference to outcome, it can serve as a comparison group for any number of outcomes that arise during follow-up — a major efficiency advantage when expensive biomarker measurement is involved.

10. In a rate-based case-cohort study where 25% of the source cohort was sampled into the subcohort, the typical Cox-model weight for non-case subcohort members is:

Weights are typically inversely proportional to the sampling probability. If 25% of the source cohort was sampled, each non-case subcohort member represents 4 source-cohort members and is weighted accordingly.

11. A two-stage sampling design is most useful when:

Two-stage designs solve the problem of expensive, high-quality measurement by combining cheap data on everyone (stage 1) with detailed measurement on a strategically chosen subsample (stage 2).

12. You are designing a study of whether starting a new statin medication transiently increases the risk of muscle pain in adults aged 60+. Health-system records contain prescription dates and clinical encounter data for thousands of older adults. Which design is best suited to this question?

The exposure is time-varying (newly initiated medication), the outcome is acute (muscle pain shortly after starting), and the data are available in routine records. This is exactly the situation in which the self-controlled case-series excels — and is the design used by Gribbin and colleagues for the analogous question of falls and antihypertensives.
