Hybrid Study Designs

Fundamental Epidemiological Concepts and Approaches

Learning objectives for this lesson:

Describe the key features of six hybrid study designs (case-crossover, case-series, case-case, case-only, case-cohort, and case-case-control)
Identify source population characteristics, exposures, and outcomes for which each hybrid design is appropriate
Explain the logic of using cases as their own controls across time
Distinguish between unidirectional and bidirectional referent selection in case-crossover studies
Describe two-stage sampling designs and explain when they enhance the efficiency of cross-sectional, cohort, and case-control studies
Design the basic sampling strategy for a specific two-stage case-control study
Apply hybrid design concepts to select appropriate designs for research questions involving rare exposures, transient triggers, or expensive covariates

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Hybrid Study Design A study design that combines features of two or more standard observational designs (cohort, case-control, cross-sectional, ecologic) to overcome limitations such as rare exposures, transient triggers, or expensive covariates.

Source Population The population from which cases (and controls or subcohort members) arise. Defining the source population correctly is essential for valid sampling in any hybrid design.

Referent (Control) Period In case-crossover and self-controlled designs, a window of time used as the within-person comparison for the hazard period when the outcome occurred.

Hazard (Case) Period The time window immediately preceding the outcome event during which exposure could plausibly have triggered the event.

Transient Exposure An exposure with a short biologically plausible induction window (e.g., heavy meal, anger, air-pollution spike); these are ideal targets for case-crossover analysis.

Unidirectional Referent Selection Choosing referent windows only from time periods before the event; this protects against reverse causation but can be vulnerable to time-trend bias.

Bidirectional Referent Selection Choosing referent windows from before and after the event. Controls for time-trend confounding but assumes the participant survives and remains exchangeable.

Subcohort In case-cohort studies, a random sample drawn from the full cohort at baseline that serves as the comparison group for all cases that arise, and is reusable across multiple outcomes.

Two-Stage Sampling A design in which an inexpensive measure (e.g., screening questionnaire) is collected on everyone in stage 1, and a more expensive measure (e.g., biomarker) is collected only on a stratified subsample in stage 2.

Cases as Their Own Controls The unifying logic of case-crossover, self-controlled case-series, and case-time-control designs, in which each person contributes both case and control time, automatically controlling for fixed individual characteristics.

Time-Trend Bias Bias arising when exposure prevalence changes over calendar time, distorting unidirectional case-crossover estimates. Case-time-control designs adjust for this trend using a separate control group.

Rare Exposure An exposure with low population prevalence; standard cohort studies require very large samples to capture enough exposed cases. Two-stage and case-cohort designs are useful alternatives.

Methods, Measures & Designs

Case-Crossover Design A within-person design in which exposure during the hazard window before each case event is compared to exposure during one or more referent windows in the same person. Requires transient exposures and acute outcomes.

Self-Controlled Case-Series (SCCS) A design that uses only people who experience the outcome and compares incidence rates during exposed vs unexposed person-time within the same individual. Common in vaccine safety research.

Case-Time-Control Design An extension of case-crossover that adds a control group of non-cases to estimate and adjust for secular time trends in exposure prevalence.

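Case-Case Design Compares two subtypes of cases (e.g., different pathogens, different outcomes) to identify exposures that distinguish them; this is common in outbreak surveillance. This lesson lists it separately from the case-only design, which uses cases alone to study gene-environment interaction.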

Case-Cohort Design A nested design in which all cases are compared to a random subcohort sampled from the full cohort at baseline. Efficient when biomarkers or genetic data are expensive and multiple outcomes are of interest.

Nested Case-Control Design Cases that arise within a defined cohort are matched to a sample of controls drawn from cohort members still at risk at the time of the case event (incidence-density sampling).

Case-Case-Control Design A three-arm design that compares two case subtypes to a shared control group, allowing simultaneous evaluation of distinguishing and shared risk factors.

Incidence-Density Sampling Sampling controls at the time each case occurs from those still at risk; produces unbiased rate-ratio estimates without rare-disease assumptions.

Conditional Logistic Regression The standard analytic method for matched case-control and case-crossover data; it conditions out matched-set or person-level fixed effects.

Section 1

Introduction & Time-Based Case-Only Designs

⏱ Estimated reading time: 18 minutes

Lesson 9 · HSCI 341

Hybrid Study Designs

Reworking the standard observational designs to handle problems they cannot solve on their own.

Section 1 of 3

Case-crossover and self-controlled case-series studies, where each case acts as their own control across time.

Maclure, 1991

Case-crossover: the “why now” question

Each case serves as their own control by comparing exposure in the risk window before the event with exposure in one or more earlier or later control windows.

Traditional case-control

Compares cases with separate control individuals. Asks: why did this person get sick?

Case-crossover

Compares the same person across time. Asks: why did it happen at this moment?

When it applies

Three conditions for a valid case-crossover

Transient exposure

The exposure must vary within a person across time. Stable exposures produce no within-person contrast.

Acute outcome

The event must occur close in time to the trigger. Long induction periods make the risk window undefined.

Event does not alter exposure

If the outcome changes future exposure, post-event control windows are no longer comparable to the risk window.

Design choice

Three referent-selection strategies

Unidirectional = before only. Bidirectional = before and after. Time-stratified = all matching days in the same calendar stratum.

Analysis

Matched analysis for case-crossover data

One control period per case

McNemar’s test applies. The matched odds ratio is the standard measure of association.

Multiple control periods

Conditional logistic regression. The exponentiated coefficient is the odds ratio for a one-unit increase in short-term exposure.

With shared daily exposure data (e.g., air pollution) and time-stratified referents, a Poisson time-series model is mathematically equivalent.

Farrington, 1995

Self-controlled case-series

Relative incidence = (event rate in risk periods) ÷ (event rate in control periods), estimated by conditional Poisson regression.

Carry forward

Next: comparing case subtypes

Within-person comparison eliminates all time-invariant confounding by design.
Case-crossover suits discrete transient exposures; self-controlled case-series suits continuous risk-window modelling.
Both require that the outcome does not alter subsequent exposure, that the outcome does not censor observation, and that the risk window is well defined.

Introduction and Overview

An earlier lesson consolidated the four standard observational designs. This lesson introduces the variants that combine or extend them. Hybrid designs address specific limitations of the standard four, such as rare or expensive exposures, transient triggers (Suissa, 1995), and surveillance data without obvious controls. The three content sections move from time-based case-only designs (this section: case-crossover and self-controlled case-series), through case-only comparison designs (Section 2), to case-cohort and two-stage sampling designs (Section 3), which subsample from larger cohorts to make biomarker-heavy studies affordable.

Learning Objectives

Understand why hybrid study designs were developed and how they fit alongside the traditional cohort, case-control, and cross-sectional designs.
Describe the design logic of case-crossover studies and identify when they are appropriate.
Distinguish between unidirectional and bidirectional referent selection strategies.
Describe the self-controlled case-series design and recognise the contexts in which it is most useful.

What Are Hybrid Study Designs?

By now you are familiar with the classic observational designs: cohort, case-control, and cross-sectional studies. Hybrid designs are variants of these classic designs that have been developed to address particular methodological challenges, such as expensive covariates, rare exposures, transient triggers, and surveillance data where traditional control selection is problematic.

This lesson covers six hybrid designs plus one important sampling strategy. Four of the hybrid designs use only cases (no separate control group), while two use a control series. The two-stage sampling design, by contrast, is a strategy that can be layered onto any of the traditional designs to enhance efficiency.

Why a Family of Hybrid Designs?

Each hybrid design solves a specific problem. Case-crossover studies eliminate the difficulty of choosing controls for transient exposures. Case-cohort studies allow one comparison group to support the study of multiple outcomes. Case-only studies allow inferences about gene-environment interactions when a control group is impractical. Two-stage designs let researchers spend money on detailed measurement only where it matters most. Knowing the “problem” each design was created to solve makes it much easier to remember when to use it.

The Six Hybrid Designs at a Glance

Case-Crossover
Self-Controlled Case-Series
Case-Case
Case-Case-Control
Case-Cohort
Case-Only

Case-Crossover Studies

The case-crossover study is the observational analogue of the experimental crossover design. Each case serves as its own control by contrasting exposure during a defined time window before the event with exposure during one or more comparison time windows.

Maclure (1991) introduced the design to answer the “why now” question, in contrast to the “why me” question answered by traditional case-control studies. By using the same person as both case and control, the design automatically controls for all time-invariant confounders, including ones the investigator never measured or even thought of. Maclure and Mittleman (2000) review a decade of applications.

When Is a Case-Crossover Design Appropriate?

Three Conditions Must Hold

1. The exposure must be transient. Stable exposures (such as smoking status or chronic medication use) cannot be evaluated because they would be present in all time windows.

2. The outcome must be acute. The event must happen close in time to the exposure if a causal relationship exists. Diseases with long induction periods are unsuitable.

3. The exposure must not be affected by the outcome. If experiencing the event changes future exposure (e.g., a heart attack alters subsequent activity), bidirectional control selection is problematic.

Defining the Risk Period and Control Period

Two design choices drive the validity of a case-crossover study: the length of the risk period (sometimes called the case-risk window) and the strategy for selecting control periods (sometimes called referent periods).

The risk period is the time during which the exposure, if causal, would have produced the event. Choosing a risk period that is too long increases the chance of detecting spurious associations; too short, and real associations may be missed. For physical exertion and myocardial infarction the risk window might be a few hours; for mobile phone use and motor vehicle crashes it might be five minutes; for air pollution effects on respiratory hospitalisations it is typically one day.

Figure 11.1. A symmetric bidirectional case-crossover design. One control window is selected before the event and one after, balancing potential time trends in exposure.

Strategies for Selecting Control Periods

Unidirectional (Backward) Referent Selection

Control periods are chosen only from time before the event. This was the original case-crossover approach. It is the appropriate choice when the event itself alters future exposure, for example when a leg injury changes subsequent training distance, or food poisoning alters what someone eats afterward.

Limitation: If exposure prevalence changes over time (a long-term trend), comparing only earlier control periods with the case-risk period can produce biased estimates.

Symmetric Bidirectional Referent Selection

Control periods are selected both before and after the case event, often equally spaced. The intent is that, if exposure is trending, the higher and lower exposure values from the two flanking control periods will roughly cancel out. This is now the most widely used approach.

Limitation: Cases that occur very early or very late in the study period may have only one control period feasible. Bidirectional selection is only valid if the event itself does not affect future exposure.

Time-Stratified Referent Selection

Janes, Sheppard, & Lumley (2005) proposed this method when shared-exposure data (such as daily air pollution measurements) are available across the entire observation period. The study period is stratified a priori (e.g., by month). When a case occurs, say on a Wednesday in July, all the other Wednesdays in July serve as control periods. This effectively matches on day-of-week and month and avoids the need to specify a single lag time.

Advantage: Eliminates the controversy over how to choose the spacing between case and control periods. Naturally accommodates shared exposure data.
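As a rough illustration of the matching rule described above, the sketch below lists the referent days for a single event when strata are defined by calendar month and day of the week. The function name time_stratified_referents and the example date are hypothetical and chosen only for illustration; a real study would apply the same rule to every case date.

```r
# Hypothetical helper: referent days for one event under time-stratified selection.
# Stratum = same calendar month and same day of the week as the event.
time_stratified_referents <- function(event_date) {
  event_date <- as.Date(event_date)
  month_start <- as.Date(format(event_date, "%Y-%m-01"))
  next_month_start <- seq(month_start, by = "month", length.out = 2)[2]
  month_days <- seq(month_start, next_month_start - 1, by = "day")

  # Keep days that share the event's weekday (0 = Sunday, ..., 6 = Saturday)
  same_weekday <- as.POSIXlt(month_days)$wday == as.POSIXlt(event_date)$wday
  referents <- month_days[same_weekday]

  # The event day itself is the case period, not a referent
  referents[referents != event_date]
}

# A case on Wednesday 17 July 2024: the other Wednesdays in July are referents
referents <- time_stratified_referents("2024-07-17")
format(referents)
```

For this date the function returns 3, 10, 24, and 31 July 2024, so the case day is compared with four referent days drawn from both before and after the event.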

Example 11.1: Weather Events and Waterborne Disease Outbreaks

Thomas and colleagues (2006) studied 92 waterborne disease outbreaks in Canada between 1975 and 2001. They hypothesised that extreme rainfall and warm spring conditions might trigger outbreaks. For each outbreak, the six weeks immediately before onset served as the case-risk period. The 27-year period was stratified into six time windows, and within each non-case window a six-week control period was selected, matched to the case on month, day, and ecozone. Conditional logistic regression identified warmer temperatures and extreme rainfall as plausible contributors.

Notice how the design eliminates the need to find “control communities” that did not have an outbreak, a perennial difficulty in waterborne disease epidemiology. Each outbreak community is its own control.

Example 11.2: Salmonella Outbreak in Long-Term Care

Haegebaert and colleagues (2003) used a case-crossover design within a foodborne Salmonella outbreak that affected mostly residents of chronic-care institutions. Food exposures during the three days before illness onset were compared with food exposures during a control period three days long, ending two days before the case-risk period. Because the illness itself would change subsequent food intake, only earlier (unidirectional) control periods were used. Mantel-Haenszel matched-pair odds ratios were calculated for each meat product. Notice that the design avoided the difficult problem of selecting institutionalised “controls” whose food intake would otherwise have to be matched.

Analysis of Case-Crossover Data

Because each case is matched to one or more control periods within the same individual, the data are analysed as if from a matched case-control study. With one control period per case, the data fit a 2×2 table and McNemar's test applies. Intuitively, this comparison ignores occasions when exposure was the same in both windows and asks, among the occasions where it differed, whether exposure fell in the risk window more often than in the control window. With multiple control periods, conditional logistic regression is the standard approach, and the exponentiated coefficient represents the change in odds of the event associated with a one-unit short-term increase in exposure.
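The one-control-period case can be sketched in a few lines of R. The data below are invented (20 hypothetical cases), and the variable names are illustrative only. The sketch computes the matched odds ratio from the discordant cases, runs McNemar's test, and then shows that conditional logistic regression with one stratum per case gives the same odds ratio. It assumes the survival package is available, which is the case in standard R installations.

```r
library(survival)  # clogit(); survival is a recommended package in standard R installs

# Hypothetical data: 20 cases, exposure (1/0) in the risk and control windows
risk_exposed    <- c(rep(1, 9), rep(0, 3), rep(1, 4), rep(0, 4))
control_exposed <- c(rep(0, 9), rep(1, 3), rep(1, 4), rep(0, 4))

# Only discordant cases (exposure differs between windows) carry information
n_risk_only    <- sum(risk_exposed == 1 & control_exposed == 0)
n_control_only <- sum(risk_exposed == 0 & control_exposed == 1)
matched_or <- n_risk_only / n_control_only

# McNemar's test on the paired 2x2 table
mcnemar_result <- mcnemar.test(table(risk_exposed, control_exposed))

# Same odds ratio from conditional logistic regression, one stratum per case
long <- data.frame(
  id      = rep(seq_along(risk_exposed), times = 2),
  window  = rep(c(1, 0), each = length(risk_exposed)),  # 1 = risk window
  exposed = c(risk_exposed, control_exposed)
)
clogit_fit <- clogit(window ~ exposed + strata(id), data = long)
clogit_or <- unname(exp(coef(clogit_fit)))

c(discordant_ratio = matched_or, clogit = clogit_or)
```

With 9 cases exposed only in the risk window and 3 exposed only in the control window, the matched odds ratio is 9 / 3 = 3. The conditional logistic model becomes necessary once each case has more than one control period or the exposure is continuous.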

When daily exposure data are available for the entire observation period (the “shared exposure” setting common in air pollution studies), the data can equivalently be analysed as a Poisson time series. The two analytical frameworks are mathematically linked when time-stratified referents are used.
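The link between the two frameworks can be checked numerically. The sketch below simulates one year of hypothetical daily counts with a shared exposure, fits a Poisson model with an indicator for every month-by-weekday stratum, and separately maximises the time-stratified case-crossover conditional likelihood written out by hand. All data and names are invented for illustration, and the simulated effect size (0.1 on the log scale) is an arbitrary assumption.

```r
set.seed(1)

# Hypothetical shared-exposure data: one year of daily counts
dates    <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
exposure <- rnorm(length(dates))                      # standardised daily exposure
count    <- rpois(length(dates), lambda = exp(log(5) + 0.1 * exposure))

# Time strata: same calendar month and same day of the week
stratum <- interaction(format(dates, "%Y-%m"), as.POSIXlt(dates)$wday, drop = TRUE)

# (a) Poisson time series with one indicator per stratum
pois_fit  <- glm(count ~ exposure + stratum, family = poisson)
beta_pois <- unname(coef(pois_fit)["exposure"])

# (b) Time-stratified case-crossover: each event on day d contributes
#     exp(beta * x_d) / sum of exp(beta * x_j) over all days j in d's stratum
neg_cond_loglik <- function(beta) {
  log_denominator <- log(ave(exp(beta * exposure), stratum, FUN = sum))
  -sum(count * (beta * exposure - log_denominator))
}
beta_cc <- optimize(neg_cond_loglik, interval = c(-2, 2), tol = 1e-10)$minimum

c(poisson = beta_pois, case_crossover = beta_cc)
```

The two coefficients agree up to numerical tolerance because profiling out the stratum intercepts from the Poisson likelihood leaves exactly the conditional likelihood in (b).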

Self-Controlled Case-Series Studies

The self-controlled case-series design (often shortened to “case-series” in this literature, but distinct from the descriptive case-series of clinical reports) was developed by Farrington (1995) largely for vaccine safety research. It is a close cousin of the case-crossover design but generalises the comparison from discrete control periods to all of an individual's observation time outside the risk window. Whitaker, Farrington, Spiessens, & Musonda (2006) provide an accessible tutorial.

The Logic of the Design

For each individual who has experienced the outcome of interest, an observation period is defined, namely a calendar window during which exposure history and event occurrence are tracked. Within that observation period, one or more risk periods are designated based on the biology of the exposure (e.g., 6–35 days after vaccination for febrile conditions). All remaining time within the observation period constitutes the control period.

The analysis compares the rate of events during risk time with the rate during control time, after adjusting for the duration of each. As with the case-crossover design, this is a within-person comparison: every time-invariant characteristic of the case (genetics, sex, baseline health) is automatically controlled by design. Age and season can be adjusted for analytically because they vary across the observation period.

Figure 11.2. The observation period for a single case is partitioned into risk periods (after each exposure) and control periods (everything else). The number of events and the duration of each period type drive the relative incidence estimate.
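To make the partition concrete, the sketch below splits one case's observation period into risk and control person-days. The helper partition_person_time, the day numbering, and the inputs (a 365-day observation period with exposure on day 100) are hypothetical; the default 6 to 35 day risk window follows the example given in the text.

```r
# Hypothetical helper: split one case's observation period into risk and control days.
# Days are numbered 1..obs_days; the risk window runs from risk_start to risk_end
# days after the exposure day (inclusive).
partition_person_time <- function(obs_days, exposure_day,
                                  risk_start = 6, risk_end = 35) {
  days <- seq_len(obs_days)
  in_risk <- days >= exposure_day + risk_start & days <= exposure_day + risk_end
  c(risk = sum(in_risk), control = sum(!in_risk))
}

# One year of observation, exposed (e.g., vaccinated) on day 100
person_time <- partition_person_time(obs_days = 365, exposure_day = 100)
person_time
```

With these assumed inputs the case contributes 30 risk days (days 106 to 135) and 335 control days. These durations are what enter the model as the offset.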

R Self-controlled case-series in conditional Poisson regression

The SCCS is just a conditional Poisson model where each case acts as its own stratum. Below: build a long-format dataset where each case contributes one row of risk-period person-time and one row of control person-time, then fit the model with gnm.

# install.packages(c("gnm", "SCCS"))
library(gnm)

# Long-format SCCS data: 4 cases, each with risk and control periods
sccs <- data.frame(
  case   = rep(1:4, each = 2),
  period = rep(c("risk", "control"), 4),
  events = c(1, 0,  1, 0,  2, 1,  0, 1),
  pt     = c(30, 335, 28, 337, 30, 335, 29, 336)
)

# Conditional Poisson on case (each case is its own intercept)
fit <- gnm(events ~ period,
           offset = log(pt),
           family = poisson,
           eliminate = factor(case),
           data = sccs)
exp(coef(fit))                    # incidence rate ratio (risk vs. control)
exp(confint.default(fit))

Why this works. All time-invariant features of each case (sex, genes, baseline frailty) drop out because we condition on the case ID. What remains is the risk-vs-control rate ratio, which is exactly the design's quantity of interest. The dedicated SCCS package wraps this with helpful diagnostics.

R Reflect on what you just ran

Use the questions below to interpret the actual numbers you produced. Look at your console output before answering.

1. exp(coef(fit)) returned the incidence rate ratio (IRR) for risk vs. control periods. What value did you get, and in one sentence what does it mean for a case during their risk window?

Model answer: exp(coef(fit)) returns an incidence rate ratio of about 23 for the risk vs. control period. Because the dataset is fixed (there is no random simulation here), everyone gets the same value. Interpretation: during a case's risk window, the interval right after the trigger, their event rate is roughly 23 times higher than during their control time. The self-controlled case-series lets you read this off directly, because each case serves as their own control and the ratio is a pure within-person contrast.

2. From exp(confint.default(fit)), report the 95% CI. Does it cross 1, and what does that tell you about statistical significance?

Model answer: The 95% CI on the IRR is wide, roughly (4, 124), because the toy dataset has only a handful of events; it still excludes 1, so the elevated rate during the risk window is statistically significant at α = 0.05. The trigger is associated with an acute increase in event rate that is unlikely to be chance under the constant-rate (log-linear) assumption of the conditional Poisson model.

3. The model used eliminate = factor(case). Explain in your own words why this design feature means you do NOT need to adjust for sex, genetics, or any other time-invariant confounder.

Model answer: eliminate = factor(case) stratifies the conditional likelihood by individual: each case acts as their own control, so any characteristic that does not change within the person (sex, genetics, baseline lifestyle, chronic comorbidities) is held constant within the comparison. Mathematically, time-invariant person-level effects drop out of the conditional likelihood because both the risk and control windows belong to the same person. That is the whole point of the self-controlled case-series (and of its close relative the case-crossover): it controls perfectly for unmeasured time-fixed confounders without ever measuring them.

Key Assumptions

Occurrence of the event must not alter probability of future exposure

If a febrile reaction after a first vaccine dose causes parents to skip the booster, the exposure pattern is no longer independent of outcome. One way to deal with this is to ignore post-event exposures (i.e., consider only the first vaccination). Whitaker and colleagues note that the bias from violating this assumption is often small in practice, but it should be considered explicitly.

The event must not censor or truncate the observation period

If the outcome is death (which clearly ends observation) or a serious illness that prompts withdrawal from the study, the design's assumptions are violated. The standard self-controlled case-series is designed for outcomes that occur and resolve, allowing observation to continue.

Multiple events per person must be independent

Multiple recurrences of the outcome can be included as long as they are conditionally independent given exposure. If they are not (e.g., one event makes another more likely), only first events should be analysed.

The risk window must encompass the true biological risk period

If the observation period or risk window does not cover the full duration over which exposure can affect the outcome, any resulting estimate of relative incidence is biased toward the null. Sample size formulae are given in Whitaker, Hocine, & Farrington (2009).

Analysis

The standard analytic tool is a conditional Poisson regression model, where the outcome is the count of events in each risk and control time interval and the logarithm of the duration of each interval is the offset. The parameter of interest is the relative incidence, that is, the rate during the risk period relative to the rate during the control period.

Relative incidence

Relative Incidence = (Events during risk time / Person-time at risk) ÷ (Events during control time / Person-time in control)

The relative incidence compares the event rate during risk periods with the event rate during control periods; it is a rate ratio, estimated by conditional Poisson regression.
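As a quick arithmetic check of the formula, the sketch below plugs in the pooled totals from the four-case toy dataset used in the R example earlier in this section. Note the assumption: this is a crude pooled ratio, not the conditional estimate. The two happen to be close here only because every case has almost the same split of risk and control time.

```r
# Pooled totals from the four-case toy dataset in the earlier R example
events_risk    <- sum(c(1, 1, 2, 0))          # events in risk windows
events_control <- sum(c(0, 0, 1, 1))          # events in control time
pt_risk        <- sum(c(30, 28, 30, 29))      # person-days in risk windows
pt_control     <- sum(c(335, 337, 335, 336))  # person-days in control time

rate_risk    <- events_risk / pt_risk
rate_control <- events_control / pt_control

relative_incidence <- rate_risk / rate_control
round(relative_incidence, 1)
```

The result is (4 / 117) ÷ (2 / 1343), or about 23, in line with the conditional Poisson estimate discussed in the model answers above.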

Example 11.3: Falls and Antihypertensive Medication

Gribbin and colleagues (2011) used UK primary-care databases to study whether starting an antihypertensive medication transiently increased the risk of falls in adults aged 60 and older. They identified 9,862 falls between 2003 and 2006. For each patient, episodes of continuous medication exposure of up to 60 days were defined. After each prescription, the exposure period was further subdivided into day 0, days 1–21, and days 22–60. All remaining person-time was the unexposed baseline. Conditional Poisson regression yielded incidence rate ratios for each post-exposure period, allowing the temporal pattern of risk after initiation to be characterised.

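Why is this question well-suited to a self-controlled case-series? Because the comparison is within-person, the stable patient-level characteristics that complicate fall risk (such as baseline frailty and chronic comorbidity) are automatically controlled. Factors that change during the observation period, such as age or newly added medications, are not removed by the design and must be adjusted for analytically.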

Key Takeaways

Hybrid designs are variants of the classic observational designs developed to address specific methodological challenges.
Case-crossover studies use each case as its own control by comparing exposure during a risk period with exposure during one or more control periods at other times. They are best for transient exposures and acute outcomes.
Three referent-selection strategies for case-crossover studies are unidirectional, symmetric bidirectional, and time-stratified. The choice depends on whether the event itself alters subsequent exposure and whether time trends in exposure are likely.
Case-crossover data are typically analysed by conditional logistic regression; with shared daily exposure data, an equivalent Poisson time-series approach is available.
The self-controlled case-series partitions each case's observation period into risk windows (defined by exposure timing) and control time (everything else). Conditional Poisson regression yields the relative incidence.
Both designs automatically control all time-invariant confounders, including those the investigator has not measured.

Section 3

Case-Cohort & Two-Stage Designs

⏱ Estimated reading time: 18 minutes

Section 3 of 3

Subsampling from large cohorts to make expensive measurement affordable while preserving validity.

Prentice, 1986

Case-cohort design: one subcohort, many outcomes

Two analytic forms

Risk-based vs. rate-based case-cohort

Risk-based (closed cohort)

Logistic regression combining subcohort and outside cases. OR approximates relative risk when disease is rare.

Rate-based (open cohort)

Weighted Cox proportional-hazards model. Weights are the inverse of the subcohort sampling fraction.

Weighted Cox hazard ratio

\[h(t) = h_0(t) \exp(\beta X), \quad w_i = \frac{1}{\pi_i}\]

h(t): hazard at time t
h₀(t): baseline hazard
β: log hazard ratio
X: covariates
w_i: sampling weight (inverse of the selection probability π_i)

Two-stage sampling

Cheap data on everyone; detailed data on the right subset

Stage 1 (everyone)

Inexpensive surrogate data from registries, job titles, or routine records. Fast and cheap.

Stage 2 (subsample)

Detailed exposure assessment, biomarker assay, or in-person interview, on the fraction where precision is gained.

Stage 2 sampling rule

Balanced four-cell sampling is most efficient

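Sample roughly equal numbers from each of the four cells formed by crossing outcome status with the stage 1 exposure measure. Analysis uses inverse probability weights equal to the reciprocal of each cell’s sampling fraction.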
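A minimal sketch of the stage 2 sampling rule, using invented stage 1 counts for a case-control study. The cell counts, the target of 100 participants per cell, and all column names are assumptions made for illustration.

```r
# Hypothetical stage 1 counts, cross-classified by outcome and surrogate exposure
stage1 <- data.frame(
  outcome   = c("case", "case", "control", "control"),
  surrogate = c("exposed", "unexposed", "exposed", "unexposed"),
  n_stage1  = c(100, 400, 200, 1800)
)

# Balanced stage 2 sample: aim for 100 per cell (or everyone, if the cell is smaller)
stage1$n_stage2 <- pmin(100, stage1$n_stage1)

# Inverse probability weight = reciprocal of each cell's sampling fraction
stage1$sampling_fraction <- stage1$n_stage2 / stage1$n_stage1
stage1$weight <- 1 / stage1$sampling_fraction

# Weighted stage 2 counts reproduce the stage 1 cell sizes
stage1$weighted_n <- stage1$n_stage2 * stage1$weight
stage1
```

Here the weights are 1, 4, 2, and 18. The large, uninformative cell of unexposed controls is sampled most sparsely, and each sampled member of that cell stands in for 18 stage 1 participants.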

Analytical pitfalls

Variance, weighting, and non-response

Naive standard errors are too small. Variance must account for both stages of sampling.
Use realised sampling fractions (post-non-response) rather than planned ones in the inverse probability weights.
Budget allocation between stages is not standardised; simulation under a fixed total cost is the most practical guide.

Wrapping up

The lesson’s unifying logic

Case-cohort

One subcohort supports multiple outcomes; weighted Cox or logistic regression recovers valid estimates.

Two-stage sampling

Stage 1 cheap data on everyone; stage 2 detailed data on the most informative subsample. Weighting required.

Every hybrid design answers a specific question that a classical design handles poorly. Match the design to the problem.

Introduction and Overview

Earlier sections covered case-only designs. This section returns to designs that use a cohort as their backbone but subsample from it to make expensive measurements feasible. Both case-cohort and two-stage designs let you run what is essentially a cohort study while paying for biomarker measurements on only a fraction of the participants. This is the kind of design that makes large biobanks practical.

Learning Objectives

Describe the structure and rationale of the case-cohort design.
Distinguish risk-based from rate-based case-cohort analyses.
Explain why a single subcohort can support investigation of multiple outcomes.
Describe the logic of two-stage sampling and identify when it is most efficient.
Design a basic two-stage sampling strategy for a case-control study.

Case-Cohort Studies

The case-cohort design, introduced by Prentice (1986), combines features of cohort and case-control studies. From a defined source cohort, the investigator draws a random sample called the subcohort at the start of follow-up. Detailed exposure and covariate data are obtained on the subcohort. As follow-up proceeds, all incident cases that arise from the full source cohort, whether or not they happen to fall within the subcohort, are also studied.

The design has the same advantages as a full cohort study (clear temporal ordering, multiple outcomes, direct disease frequency estimates) but achieves them with much smaller measurement costs because expensive covariate or biomarker assays are performed only on the subcohort plus the cases, not on the entire source cohort.

Figure 11.4. The case-cohort layout. Detailed exposure and covariate data are needed only on the random subcohort plus the cases. Most of the full source cohort never requires expensive measurement.

The Big Win: Multiple Outcomes from a Single Subcohort

Why Researchers Love Case-Cohort Designs

One subcohort can serve as the comparison group for multiple disease outcomes. If researchers are interested in cardiovascular disease, several cancers, and diabetes within the same large cohort, they need only one set of expensive biomarker measurements on the subcohort. Each outcome study then adds detailed measurements on its own cases. By contrast, a nested case-control study requires fresh control selection for each outcome, and a full cohort analysis would require measuring everyone for everything.

Risk-Based vs. Rate-Based Designs

Risk-Based (Closed-Cohort) Case-Cohort

Suitable when the source cohort is closed (a fixed group followed for a defined period) and exposures are stable over follow-up. The subcohort is sampled by simple or stratified random sampling at the start of follow-up. Cases arising outside the subcohort during follow-up are added.

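Analysis: Combine the two case groups (those in and those outside the subcohort) and analyse the data in the familiar 2×2 case-control format using logistic regression. When the whole subcohort, including members who later become cases, serves as the comparison group, the odds ratio estimates the risk ratio directly; if only non-cases are used as the comparison group, the odds ratio approximates the relative risk when the disease is rare.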

Example: Matsuda and colleagues (2011) studied placental abruption and placenta previa among 5,036 of 242,715 births in Japan, using multivariable logistic regression with the subcohort plus all cases.

Rate-Based (Open-Cohort) Case-Cohort

Suitable when the source cohort is open (entries and exits possible during follow-up) or when exposures change over time. At the moment a case occurs, eligible members of the subcohort are those who have not yet experienced the outcome. Their current exposure status (which may have been updated through repeated surveys or stored serial samples) is recorded.
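The eligibility rule can be sketched as a risk-set lookup: at each case's event time, find the subcohort members who are under observation and still event-free. The `Member` class, its field names, and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    member_id: str
    entry_time: float  # time at which follow-up starts
    exit_time: float   # time of event, loss to follow-up, or end of study
    had_event: bool    # True if exit_time marks the outcome


def risk_set(subcohort, case_time):
    """IDs of subcohort members under observation and event-free at case_time.

    A member is at risk if they entered before case_time and have not yet
    left follow-up (through the outcome or censoring) before case_time.
    """
    return [
        m.member_id
        for m in subcohort
        if m.entry_time < case_time <= m.exit_time
    ]


# Hypothetical open cohort: B has the outcome at t=3, C and D enter late.
subcohort = [
    Member("A", 0.0, 10.0, False),
    Member("B", 0.0, 3.0, True),
    Member("C", 4.0, 10.0, False),
    Member("D", 6.0, 10.0, False),
]
eligible_at_5 = risk_set(subcohort, case_time=5.0)
print(eligible_at_5)
```

Each eligible member's exposure would then be taken as its value at `case_time`, which is why repeated surveys or serial samples matter in this version of the design.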

Analysis: A weighted Cox proportional-hazards model is the standard approach. Weights account for the sampling fraction; for example, if the subcohort represents 20% of the source cohort, controls are typically up-weighted by 5. Three Cox-weighting schemes have been proposed historically; Prentice's method most closely reproduces the estimates from a full-cohort analysis.
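The up-weighting rule can be written as a small helper. This is a sketch of simple inverse-probability weighting, assuming every case is included (weight 1) and subcohort non-cases are weighted by the inverse of the sampling fraction. It is not a reproduction of any one of the three published schemes, and the function name is hypothetical.

```python
def analysis_weight(is_case, sampling_fraction):
    """Inverse-probability weight for one subject in a case-cohort analysis.

    Cases are all included, so they carry weight 1. Subcohort non-cases stand
    in for the unsampled part of the cohort, so they carry 1 / sampling_fraction.
    """
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    return 1.0 if is_case else 1.0 / sampling_fraction


# A 20% subcohort, as in the text.
control_weight = analysis_weight(is_case=False, sampling_fraction=0.20)
case_weight = analysis_weight(is_case=True, sampling_fraction=0.20)
print(control_weight, case_weight)
```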

Example: Agalliu and colleagues (2011) followed a subcohort of 1,979 men for prostate cancer risk, with exposure and supplement use updated through repeated surveys.

Practical Considerations

Eligibility for the subcohort. Members must be willing to provide health history, lifestyle data, and (often) biological samples. Stratified sampling can ensure that the subcohort's covariate profile matches anticipated cases (e.g., over-sampling young adults if young-adult disease is the focus).
Stored specimens. Serially stored tissue or blood samples allow detection of exposure changes over time and support post-hoc biomarker assays as new hypotheses emerge.
Sampling adjustments for non-response. If 20% are sampled but only 80% of those agree to participate, the weighting should reflect the actual participation, not the original sampling probability.
Robust standard errors. These are recommended for case-cohort analyses to account for the sampling variability.
Clustering of cases. If cases tend to be diagnosed at the same clinic, marginal models with adjusted variances or frailty models should be used to account for within-cluster correlation.
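The non-response adjustment described above comes down to dividing by the realised rather than the planned sampling fraction. The sketch below uses a hypothetical cohort of 10,000 and assumes that non-response is unrelated to exposure and outcome, which a real study would need to examine.

```python
from fractions import Fraction

cohort_size = 10_000                       # hypothetical source cohort
n_sampled = cohort_size * 20 // 100        # 20% invited into the subcohort
n_participating = n_sampled * 80 // 100    # 80% of those agree to take part

# Weight = cohort members represented by each subcohort member.
planned_weight = Fraction(cohort_size, n_sampled)
realised_weight = Fraction(cohort_size, n_participating)

print(float(planned_weight), float(realised_weight))
```

Using the planned weight here would leave the participating subcohort representing only 80% of the source cohort.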

Example 11.9: Drinking Water Quality and Stomach Cancer

Auvinen and colleagues (2005) studied radon and other radionuclides in drinking water and the risk of stomach cancer in a Finnish population of over 144,000 people who drew their water from drilled wells between 1967 and 1980. An initial subcohort of 4,590 was sampled with stratification by age and sex. Many of these did not actually meet the long-term-exposure criterion, leaving an effective subcohort of 371 long-term users. Stomach cancer cases (n=107) were identified through the cancer registry. Water samples were collected blindly with respect to case status and analysed for radionuclides. A proportional-hazards model accounted for how long each subject had been exposed to each level of radon. All hazard ratios were below 1, suggesting a protective association, a surprising finding that illustrates how case-cohort designs can efficiently support investigation of unusual exposures using stored samples and registry data.

Two-Stage Sampling Designs

A two-stage (or two-phase) sampling design is a strategy that can be layered on top of any traditional design, whether cohort, case-control, or cross-sectional. The first stage collects readily available, inexpensive data on a large group. The second stage collects more detailed (and usually more expensive) data on a strategically selected subsample.

Why Two-Stage Designs Make Sense

The Core Problem

Imagine you want to study whether occupational solvent exposure increases birth defect risk. Hospital records can give you basic information on hundreds of thousands of pregnancies cheaply, but a detailed occupational exposure assessment requires a one-hour interview at $200 per participant. Spending $200 on every pregnancy is unaffordable. A two-stage design lets you do the cheap step on everyone and the expensive step only where the information is most valuable.

Three Common Use Cases

Use Case 1: Expensive covariate or exposure measurement

The first stage uses an inexpensive surrogate exposure measure (e.g., job title from a registry). The second stage performs a detailed work-up (e.g., personal interview about specific solvent contacts, dose, duration) on a subsample. This is the most common application.

Use Case 2: Validation substudy

If the inexpensive first-stage measure has known measurement error, the second stage applies a near-gold-standard measurement to a subsample. The relationship between the two measures (the “measurement model”) is estimated, and inferences from the full first-stage data set can be corrected for the measurement error. McNamee (2002, 2005) describes optimal designs for this purpose.
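One very simple measurement model, for a dichotomous exposure with non-differential misclassification, uses the sensitivity and specificity estimated in the validation subsample to correct the first-stage exposure prevalence. The sketch below assumes a randomly chosen validation subsample and hypothetical counts. It illustrates the general idea only and is not the specific method of the cited papers.

```python
def corrected_prevalence(observed_prevalence, sensitivity, specificity):
    """Correct an observed prevalence for misclassification.

    Uses p_true = (p_obs + specificity - 1) / (sensitivity + specificity - 1),
    which requires sensitivity + specificity > 1.
    """
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        raise ValueError("surrogate measure is no better than chance")
    return (observed_prevalence + specificity - 1) / denominator


# Hypothetical validation subsample of 200 subjects with both measurements.
true_pos, false_neg = 40, 10     # gold-standard exposed: surrogate +/-
false_pos, true_neg = 15, 135    # gold-standard unexposed: surrogate +/-

sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)

# Hypothetical surrogate-exposed proportion in the full first-stage data.
p_true = corrected_prevalence(0.15, sensitivity, specificity)
print(round(p_true, 4))
```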

Use Case 3: Handling missing covariate data

When key covariate data are missing for many subjects, instead of assuming missingness at random, the missing-data subjects can be the explicit target of the second-stage data collection. This concentrates resources on filling specific gaps rather than dropping incomplete records.

How to Sample at Stage 2

The key design question in any two-stage study is: how should we choose whom to include in the expensive second stage? The optimal answer depends on the design.

Stage 1 Design	Recommended Stage 2 Sampling	Rationale
Cohort	Fixed numbers of exposed and unexposed	Balanced sampling on exposure ensures precision in the exposure–outcome estimate.
Case-control	Fixed numbers of cases and controls	Balanced sampling on disease ensures precision; oversampling cases is efficient when the disease is rare.
Either, when surrogate exposure is available	Approximately equal numbers from each of the four exposure×disease cells	Optimal efficiency: extracts the most information from a fixed second-stage budget by ensuring that small cells (rare combinations) are not under-represented.

A Worked Two-Stage Case-Control Sampling Strategy

Suppose your stage 1 data come from a hospital registry of 50,000 pregnancies. From the registry you can identify 2,000 birth-defect cases and 48,000 non-cases. A crude (and possibly mismeasured) exposure indicator, namely whether the mother held a job classified as “industrial”, is available for everyone.

Stage 1 cross-classification might look like this:

	Industrial job (surrogate exposed)	Other job (surrogate unexposed)	Total
Cases	120	1,880	2,000
Non-cases	1,500	46,500	48,000
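As a quick check on the table, the sketch below recomputes the margins and the crude odds ratio for the surrogate exposure. This is the stage 1 association only, based on the possibly mismeasured job classification. The variable names are illustrative.

```python
# Stage 1 counts from the table: (industrial job, other job).
cases = (120, 1880)
non_cases = (1500, 46500)

# Row and column totals.
total_cases = sum(cases)
total_non_cases = sum(non_cases)
total_industrial = cases[0] + non_cases[0]
total_other = cases[1] + non_cases[1]

# Crude odds ratio for the surrogate exposure (cross-product ratio).
crude_or = (cases[0] * non_cases[1]) / (cases[1] * non_cases[0])

print(total_industrial, total_other, round(crude_or, 2))
```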

For stage 2, balanced sampling across the four cells is the most efficient strategy. Suppose your budget allows 400 detailed interviews:

Industrial-job cases (120 available): take all 120.
Other-job cases (1,880 available): sample 100.
Industrial-job non-cases (1,500 available): sample 100.
Other-job non-cases (46,500 available): sample 80.

Because we have oversampled the small cells, weighting must be applied in the analysis to recover unbiased association estimates. The methods of Cain & Breslow (1988) and, later, Flanders & Greenland (1991) address this. Hanley and colleagues (2005) provide worked examples of the adjusted odds ratio and its variance.
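The weighting logic for this allocation can be sketched as follows. The cell sizes and sample sizes come from the worked example. The interview results in `truly_exposed` are invented purely for illustration. The estimator is a simple inverse-sampling-fraction point estimate, not the published estimators cited above, and it gives no variance.

```python
from fractions import Fraction

# Stage 1 cell sizes and stage 2 sample sizes from the worked example.
available = {
    ("case", "industrial"): 120,
    ("case", "other"): 1880,
    ("non-case", "industrial"): 1500,
    ("non-case", "other"): 46500,
}
sampled = {
    ("case", "industrial"): 120,
    ("case", "other"): 100,
    ("non-case", "industrial"): 100,
    ("non-case", "other"): 80,
}

# Weight = inverse of the stage 2 sampling fraction in each cell.
weights = {cell: Fraction(available[cell], sampled[cell]) for cell in available}

# HYPOTHETICAL interview results: number in each sampled cell found to be
# truly solvent-exposed by the detailed stage 2 assessment.
truly_exposed = {
    ("case", "industrial"): 90,
    ("case", "other"): 10,
    ("non-case", "industrial"): 60,
    ("non-case", "other"): 4,
}


def weighted_counts(status):
    """Weighted numbers of truly exposed and unexposed subjects for one status."""
    cells = [cell for cell in available if cell[0] == status]
    exposed = sum(weights[cell] * truly_exposed[cell] for cell in cells)
    unexposed = sum(
        weights[cell] * (sampled[cell] - truly_exposed[cell]) for cell in cells
    )
    return exposed, unexposed


exposed_cases, unexposed_cases = weighted_counts("case")
exposed_non_cases, unexposed_non_cases = weighted_counts("non-case")
weighted_or = float(
    (exposed_cases * unexposed_non_cases) / (unexposed_cases * exposed_non_cases)
)

print({cell: float(w) for cell, w in weights.items()})
print(round(weighted_or, 2))
```

The weighted counts add back up to the stage 1 totals (2,000 cases and 48,000 non-cases), which is a useful sanity check that the realised sampling fractions have been applied correctly.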

Example 11.10: A Two-Stage Case-Control Study of Childhood Asthma

Martel and colleagues (2009) used a two-stage design with three linked Quebec administrative health databases. Stage 1 was a nested case-control study within a cohort of pregnant women and their children: 5,226 asthmatic children (cases) and 20 non-asthmatic children per case were selected using density sampling matched to time of case occurrence. Covariate data from the administrative databases were used at this stage. Stage 2 was a mailed questionnaire to a subsample of mothers, balanced across the cells of the first-stage exposure–outcome cross-table to overrepresent small cells. Conditional logistic regression was used at stage 1; unconditional logistic regression with sample-fraction weighting at stage 2. Final corrected estimates were obtained by combining the stages.

Notice how the design exploits the cheap administrative data to identify cases and screen on rough covariates, while reserving the expensive questionnaire for the subset where new information will most improve the estimate.

Practical Pitfalls

Budget allocation between stages. Hanley and colleagues (2005) note that tools for optimal allocation have not advanced much in the past two decades; in practice, simulation can help determine the relative number of stage 1 and stage 2 subjects to maximise precision under a fixed budget.
Variance estimation. The variance of the final estimate depends on both stages. Naive variances that ignore stage 2 sampling will be too small. Hanley and colleagues provide details for dichotomous covariates; software exists for more complex situations.
Sampling fractions must be known. Weighting requires knowing exactly what proportion of each stage 1 cell was sampled at stage 2. If non-response or other losses change those fractions, the realised (not planned) fractions should be used.

Reflection: Choosing the Right Hybrid Design

Consider a research question of interest to you. It might involve a transient environmental trigger of an acute event, an outbreak you would like to characterise, a gene-environment interaction, or a long cohort follow-up where biomarker measurement is expensive. Which hybrid design would you choose, and why? What assumptions would you need to defend, and what limitations would you have to acknowledge in your discussion?

Model answer: For a question about acute transient triggers (e.g., does heavy traffic exposure increase the rate of asthma exacerbations in the next 24 hours?), the case-crossover design is the natural fit, because the within-person comparison automatically controls for time-invariant individual factors and the design is cheap. Assumptions to defend: (1) the trigger varies within persons over time (otherwise there is no within-person variation to leverage); (2) referent-window selection is unbiased (use random or symmetric bidirectional referent selection); (3) there are no time-trend confounders in the exposure. Limitations: the design is useful only for transient acute triggers with short induction times, it cannot address chronic exposures, and confounding by time-varying factors (e.g., daily activity patterns) remains. For a gene-environment study, a case-cohort design can be more efficient: case-only genotyping plus a subcohort for the environment comparison.

Key Takeaways

Case-cohort studies sample a random subcohort at the start of follow-up and add all incident cases. Detailed measurement is needed only on the subcohort plus the cases, not the full source cohort.
A single subcohort can serve as the comparison group for many outcomes, making the design particularly attractive for large prospective studies with stored biological specimens.
Risk-based case-cohort analyses combine subcohort and outside cases in a logistic regression. Rate-based analyses use weighted Cox models, with weights reflecting the inverse sampling probability.
Two-stage sampling lets investigators collect inexpensive, readily available data on everyone and reserve detailed, expensive measurement for a strategically chosen subsample.
For two-stage case-control studies, the most efficient stage 2 sampling allocates approximately equal numbers across the four cells of the stage 1 exposure×disease table.
Two-stage analyses must use weighting to recover unbiased estimates and must use variance formulae that account for both stages of sampling.

HSCI 341, Lesson 9
