Cohort Studies

Evaluating Epidemiological Research

Learning objectives for this lesson:

Distinguish between open and closed source populations as they relate to cohort study design
Describe the major design features of risk-based and rate-based cohort studies
Identify hypotheses and population types consistent with risk-based and rate-based cohort studies
Elaborate the principles used to select and measure the exposure in cohort studies
Design and implement a valid cohort study to investigate a specific hypothesis

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Cohort A defined group of people followed over time. In a cohort study, the cohort is classified by exposure status at baseline (or during follow-up) and then watched for outcomes.

Closed (Fixed) Cohort A cohort with fixed membership, all of whom are followed (or attempted to be followed) for a defined window. The natural setting for risk-based analyses.

Open (Dynamic) Cohort A cohort whose membership changes over time as people enter and leave. Person-time is the appropriate denominator and incidence rates the natural measure.

Prospective Cohort A cohort study in which exposure is measured at the start, and outcomes are then watched for as time passes. Best protection against differential measurement of exposure.

Retrospective (Historical) Cohort A cohort study assembled using existing records about exposure that occurred in the past, with outcomes also already observed. Faster than prospective designs but constrained by available data quality.

Ambidirectional Cohort A cohort study that uses both retrospective and prospective elements, e.g., reconstructing past exposures from records, then continuing prospective follow-up for new outcomes.

Exposure A factor whose effect on a health outcome is being investigated. In cohort studies, exposure status is fixed (or measured longitudinally) before the outcome occurs.

Outcome The health state or event whose occurrence the cohort is being followed for. Must be defined operationally and measured consistently across exposure groups.

Person-Time The sum of time each individual is observed and at risk for the outcome, e.g., person-years. Denominator for incidence-rate calculations in open cohorts.

Loss to Follow-Up Participants who can no longer be observed before the study ends. Threatens validity if loss is differential by exposure or outcome.

Censoring When follow-up ends before the outcome is observed. Right-censoring (most common) is handled in survival analysis; informative censoring biases estimates.

Risk Ratio (RR, Relative Risk) The cumulative incidence in the exposed divided by that in the unexposed. The natural effect measure from a closed cohort.

Incidence Rate Ratio (IRR) The incidence rate in the exposed divided by the incidence rate in the unexposed. The natural effect measure from an open cohort with person-time follow-up.

Hazard Ratio (HR) A ratio of instantaneous failure rates, the effect measure produced by Cox proportional-hazards models. Often interpreted similarly to an IRR when the proportional-hazards assumption holds.

Risk Difference (Attributable Risk) Cumulative incidence in the exposed minus that in the unexposed. Captures the absolute, not relative, public-health impact of exposure.

Population Attributable Fraction (PAF) The proportion of disease in the population that would be eliminated if the exposure were removed (assuming causality). Combines effect size with how common the exposure is.

Confounding A distortion of the exposure–outcome association by a third variable associated with both. Cohort studies handle it through restriction, stratification, and multivariable adjustment.

Healthy-Worker Effect A specific selection bias in occupational cohorts: people who are employed are systematically healthier than the general population, biasing comparisons against external referents.

Immortal Time Bias A specific bias arising when, by design, members of one exposure group cannot have the outcome during a stretch of follow-up, e.g., classifying treatment status using post-baseline information.

External Comparison Group A reference group drawn from outside the cohort (e.g., national rates) when an internal unexposed group is unavailable. Used in occupational cohorts; vulnerable to the healthy-worker effect.

Internal Comparison Group An unexposed (or differently exposed) reference group sampled from the same source population as the exposed. Generally less biased than external comparisons.

STROBE Checklist Reporting guideline (Strengthening the Reporting of Observational Studies in Epidemiology) with specific items for cohort designs, sampling, exposure measurement, follow-up, statistical methods (von Elm et al, 2007).

Methods & Study Designs

Cohort Study An observational design that classifies individuals by exposure and follows them over time to compare outcome occurrence. The closest observational analogue to an experiment.

Risk-Based Cohort Set in a closed cohort with full follow-up. Reports cumulative incidence and the risk ratio.

Rate-Based Cohort Set in an open or dynamic cohort with person-time follow-up. Reports incidence rates and the incidence-rate ratio, well suited to long follow-up with entries, exits, and censoring.

Kaplan–Meier Estimator A nonparametric estimator of the survival function from time-to-event data, accommodating right-censoring. Visualized as the familiar “step” survival curve.

Cox Proportional-Hazards Model A semi-parametric regression for time-to-event data that estimates hazard ratios while leaving the baseline hazard unspecified. Workhorse of modern cohort analysis.

Key People & Cohorts

Framingham Heart Study (1948–) A landmark prospective cohort begun in Framingham, Massachusetts that gave epidemiology the term “risk factor” (Kannel et al, 1961) and identified hypertension, smoking, cholesterol, and diabetes as major drivers of cardiovascular disease. See Dawber, Meadors, & Moore (1951) for the original design paper and Mahmood et al (2014) for a historical overview.

British Doctors Study (1951–2001) Doll & Hill's (1954) prospective cohort of UK physicians that established smoking as a cause of lung cancer. Followed for half a century with extraordinary retention; the 50-year follow-up appears in Doll et al (2004).

Nurses' Health Study (1976–) A massive prospective cohort of US nurses that has produced foundational evidence on diet, hormones, and chronic disease (Colditz, Manson, & Hankinson, 1997).

Richard Doll (1912–2005) British epidemiologist whose case-control and cohort work with Bradford Hill established the smoking–lung cancer link and modeled rigorous long-term cohort follow-up.

Austin Bradford Hill (1897–1991) British statistician and epidemiologist; co-author with Doll of the British Doctors Study, and author of the “Hill viewpoints” for assessing causality from observational data (Hill, 1965).

David Cox (1924–2022) British statistician who introduced the proportional-hazards model that bears his name (Cox, 1972); one of the most cited statistical methods in cohort analysis.

Section 1

Introduction & Cohort Study Design

⏱ Estimated reading time: 15 minutes

Section 1 of 4

The logic of the design, the study-group choices, and the open versus closed population question.

The core logic

What a cohort study is

Follow disease-free subjects from exposure to outcome, then compare disease frequency between exposed and non-exposed groups.

Landmarks

The studies that defined the method

Framingham Heart Study (1948–)

Enrolled residents of one Massachusetts town; coined the term “risk factor” (Kannel et al, 1961). Three generations now under study.

British Doctors Study (1951–2001)

Doll & Hill followed United Kingdom physicians for fifty years. Established smoking as a cause of lung cancer with extraordinary retention across half a century.

Study-group selection

Three structures

Two-cohort design

Exposure status known in advance. Recruit an exposed and a non-exposed group separately.

Single (longitudinal) cohort

Exposure unknown at recruitment. Select one group with a range of exposures, classify later.

Virtual cohort

Assembled from existing records. McCartney et al (2010) used deli purchase data to trace an E. coli O157 outbreak.

Timing

Prospective vs. retrospective

Prospective

Disease has not yet occurred. Exposure measured at baseline; investigators follow forward in time. Richer data, slower, more costly.

Retrospective

Follow-up has already ended. Both exposure and outcome reconstructed from existing records. Faster and cheaper, but constrained by data quality.

A third option, the ambidirectional cohort, reconstructs past exposure from records and then continues prospective follow-up for new outcomes.

Population structure

Open vs. closed source populations

Feature	Closed	Open
Membership	Fixed at start	Enters and leaves
Follow-up	Full risk period	Variable per subject
Best for	Short risk periods	Chronic disease
Measure	Risk (cumulative incidence)	Rate (incidence density)

Carry forward

Into the next section

Cohort studies follow disease-free subjects from exposure to outcome and compare incidence directly.
The design closely resembles a trial; exposure is classified, not randomised.
Three selection structures (two-cohort, single longitudinal, virtual) depend on what data you already have.
Open vs. closed population dictates whether the denominator is persons or person-time, and which measure of disease frequency is valid.

Introduction and Overview

An earlier lesson ended with a promise: cohort studies invert the case-control logic by sampling on the exposure rather than the disease, and that inversion lets us measure incidence directly, without the rare-disease assumption that complicated odds-ratio interpretation. This lesson delivers on that promise. Across four content sections we move from the basic logic of cohort design (this section), to the choice between risk-based and rate-based designs (a later section), to the surprisingly difficult problem of measuring exposure in a longitudinal setting (a later section), and finally to the practical questions of comparability, follow-up, outcome ascertainment, and analysis (a later section). The unified-design discipline from an earlier lesson still applies, and the earlier guidance on pre-specified analysis plans applies with extra force, because cohort studies often run for decades and offer many opportunities for selective reporting.

Learning Objectives

Describe the fundamental logic of the cohort study design.
Distinguish between open and closed source populations.
Differentiate between prospective and retrospective cohort designs.
Recognise how cohort studies relate to controlled trials.

What Is a Cohort Study?

The word cohort denotes a group of study subjects that has a defined characteristic in common. In epidemiological study design, that characteristic is usually exposure status. In a cohort study, we follow subjects from exposure to outcome (Grimes and Schulz, 2002).

▸ INTERACTIVE STORY, THE TOWN THAT WAS FOLLOWED

Walk through the Framingham Heart Study from 1948 enrollment to three generations of follow-up. Next ▶ advances scenes.

A 7-scene retelling of the most famous cohort study ever launched: town enrollment (Dawber, Meadors, & Moore, 1951), baseline measurements, decades of follow-up, incidence comparisons, the birth of the term "risk factor" (Kannel et al, 1961), and the three generations still under study today (Mahmood et al, 2014).

Key Idea

A cohort study closely resembles a controlled trial, without the randomisation of exposure. We start with subjects who do not yet have the disease, classify them by exposure, follow them forward in time, and compare the frequency of the outcome between exposure groups.

Most frequently, the outcome is the occurrence of a specific disease, but cohort studies can also examine outcomes such as birth weight, body mass index, blood pressure, or quality of life. Subjects are usually individuals, but can also be groups (e.g., families).

The cohort design's modern reputation rests on a small number of landmark studies that the rest of this lesson will return to repeatedly: the Framingham Heart Study (Dawber, Meadors, & Moore, 1951), which gave epidemiology the term “risk factor” (Kannel et al, 1961); the British Doctors Study (Doll & Hill, 1954; Doll et al, 2004), which followed UK physicians for 50 years and pinned down the smoking–lung cancer link; the Whitehall II civil-servant cohort (Marmot et al, 1991), which exposed a graded socioeconomic gradient in chronic disease; the Nurses' Health Study (Colditz, Manson, & Hankinson, 1997); the multinational EPIC cohort (Riboli et al, 2002); and the recent generation of population biobanks, UK Biobank (Sudlow et al, 2015) and the Canadian Longitudinal Study on Aging (Raina et al, 2019).

Figure: The logic of cohort design: classify disease-free subjects by exposure, follow them forward, compare the disease frequency between groups.

Selecting the Study Group

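How we select the cohort depends on what we know in advance. The three standard choices are listed below; notice that the choice flows from the data already in hand, not from any abstract preference for one design over another.

Two-Cohort Design

Exposure status is known in advance, so an exposed group and a non-exposed group are recruited separately.

Single (Longitudinal) Cohort

Exposure is unknown at recruitment, so one group with a range of exposures is selected and classified afterwards.

Virtual Cohort

The cohort is assembled from existing records; for example, McCartney et al (2010) used deli purchase data to trace an E. coli O157 outbreak.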

In both two-cohort and single-cohort designs, after selecting subjects we (1) verify they meet inclusion criteria, (2) confirm exposure status, (3) ensure they do not yet have the outcome, then (4) follow them for a defined period and compare incidence between exposure groups.
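The four steps above can be sketched in R. This is a minimal illustration on made-up records, not a procedure taken from any of the cited studies; the data frame `candidates`, its column names, and the adult-only inclusion rule are all assumptions introduced for the example.

```r
# Hypothetical candidate records; every column name here is an assumption.
candidates <- data.frame(
  id               = 1:8,
  age              = c(34, 51, 47, 17, 62, 45, 58, 39),
  exposed          = c(TRUE, TRUE, FALSE, TRUE, NA, FALSE, TRUE, FALSE),
  disease_baseline = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  event_followup   = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# (1) inclusion criteria (assumed here: adults only),
# (2) exposure status confirmed (not missing),
# (3) free of the outcome at baseline.
cohort <- subset(candidates,
                 age >= 18 & !is.na(exposed) & !disease_baseline)

# (4) follow for the defined period, then compare incidence by exposure group.
incidence <- c(
  exposed   = mean(cohort$event_followup[cohort$exposed]),
  unexposed = mean(cohort$event_followup[!cohort$exposed])
)
incidence
```

Subjects who fail any of the first three checks never enter the denominator, which is why the comparison in step (4) is between groups that were all at risk when follow-up began.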

Whichever cohort structure you pick, the next decision is whether the follow-up has already happened or whether you will be doing it as the study runs.

Prospective vs. Retrospective Designs

Cohort studies can be conducted either way, depending on whether suitable records already exist (Euser et al, 2009). The trade-offs of each approach are set out below.

In a prospective cohort study, the disease has not yet occurred when the study begins. Subjects are recruited, exposure is assessed at baseline, and they are followed forward in time as outcomes develop.

Advantages: Allows more detailed information-gathering and careful recording of exposure, confounders, and outcome timing (see Examples 8.6, 8.7, and 8.9).

Disadvantages: Time-consuming and expensive; vulnerable to losses to follow-up over long study periods.

In a retrospective cohort study, the follow-up period has already ended and the disease event has already occurred when subjects are selected (Hudson et al, 2005). Investigators reconstruct exposure and outcome from existing records.

Advantages: Faster and cheaper; useful when good historical records exist (Examples 8.1, 8.4, 8.5).

Disadvantages: Requires suitable existing databases; depth of information is limited to what was recorded.

Beyond the timing question is a structural one about the population itself: does its membership stay fixed for the duration of follow-up, or do people enter and leave? You met this distinction in an earlier lesson; it returns here as a more central concern, because cohort follow-up is what makes it operational.

Open vs. Closed Source Populations

The nature of the source population determines the appropriate design. This is a critical decision that affects everything from sample-size calculations to the choice of analytic methods. Read the table below as a checklist for matching disease type to design type: chronic outcomes almost always require open-population, rate-based handling, and a later section builds out exactly what that requires.

Feature	Closed Population	Open Population
Membership	Fixed at start of study	Subjects can enter and leave
Follow-up	All subjects observed for full risk period	Variable time-at-risk per subject
Best disease type	Short risk period (e.g., outbreaks)	Long or chronic risk period (e.g., cancers)
Disease frequency	Risk (cumulative incidence)	Rate (incidence density)
Acceptable losses	Few or none preferred (<10%)	Time-at-risk accounted for explicitly

Key Examples

Three published examples bring the design choices we have just enumerated into one place. We will refer back to these throughout the lesson by number, so it is worth pausing on each one to identify which boxes the investigators ticked: prospective vs. retrospective, two-cohort vs. single, open vs. closed, risk-based vs. rate-based.

Example 8.1: Retrospective Risk-Based (Discharge Against Medical Advice)

Choi et al (2011) conducted a hospital-based cohort study where the exposure was discharge against medical advice (DAMA) versus discharged with medical advice (DWMA). The outcome was readmission within 14 days. Each DAMA patient was matched with one DWMA patient by 10-year age group, gender, and clinical characteristics. Because all patients were observable for the full 14-day risk period, this is a classic risk-based design. Conditional logistic regression accounted for the matching. Result: 26% of DAMA patients were readmitted within 14 days versus only 3% of DWMA patients.

Example 8.2: Continuous-Scale Outcome (Environmental Tobacco Smoke)

Crane et al (2011) conducted a retrospective cohort study based on interviews with 11,000+ women who gave birth in two Canadian provinces (2001–2009). Eleven per cent self-declared exposure to environmental tobacco smoke. Outcomes included infant body dimensions, Apgar scores, respiratory distress syndrome, and stillbirth. Multiple regression was used to control for confounders. Tobacco smoke was associated with lower birth weight, smaller body size, and increased stillbirths. Note: when outcomes are on a continuous scale (e.g., birth weight), the cohort design still applies; we just use linear rather than logistic regression.

Example 8.3: Propensity-Score Matching (Antipsychotics & Falls)

Mehta et al (2010) used a population-based retrospective cohort to investigate falls and fractures in adults ≥50 years. The exposure was atypical versus typical antipsychotic agents. More than 60 covariates were combined into a propensity score, and the “Greedy 5-1 matching technique” was used to match subjects with similar scores. Each exposure group contained 5,580 people. While the hazard ratio did not differ significantly between drug classes, taking any antipsychotic for >90 days was associated with HR = 1.8 for falls or fractures.

Stating the Study Objective

Each study should clearly specify:

The target population (to which inferences will be made)
The source population (from which the study group will be drawn)
The unit of observation (individuals or groups)
The exposure, the disease, and the follow-up period
The setting (context or venue) of interest
If biology is known: the amount or duration of exposure thought to cause disease, and the relevant time window for exposure (current vs. lifetime vs. historical)

Key Takeaways

Cohort studies follow disease-free subjects from exposure forward to outcome.
The design parallels a controlled trial, minus randomisation.
Two-cohort designs select by exposure status; longitudinal designs select a single group with a range of exposures.
Studies can be prospective or retrospective; the difference is timing relative to outcome occurrence.
Closed populations call for risk-based designs; open populations call for rate-based designs.

The takeaways above name what changed conceptually compared with case-control designs. The R box that follows makes the change concrete: because we sampled on exposure, the same kind of 2×2 table you met in earlier lessons now yields a risk ratio and an incidence rate ratio directly, no rare-disease assumption required.

R Risk ratio and incidence rate from cohort data

What you'll do: compute risk, incidence rate, the risk ratio, and the incidence rate ratio from a small simulated cohort. What to take away: sampling on exposure unlocks measures of disease frequency that case-control designs simply cannot deliver, and a later section will show why the choice between risk-based and rate-based handling determines which of these two ratios is the right summary in any given study.

A cohort lets you compute a risk (cumulative incidence) or a rate (incidence density) directly, because you sampled by exposure, not by outcome. Below is a hand calculation of both from a small simulated cohort.

# 1000 exposed and 1000 unexposed individuals followed for up to 5 years.
#   exposed: 80 events in 4500 person-years
# unexposed: 30 events in 4900 person-years

events <- c(exposed = 80,   unexposed = 30)
n      <- c(exposed = 1000, unexposed = 1000)
py     <- c(exposed = 4500, unexposed = 4900)

risk <- events / n
rate <- events / py * 1000          # per 1000 person-years

RR   <- risk["exposed"] / risk["unexposed"]   # risk ratio
IRR  <- rate["exposed"] / rate["unexposed"]   # incidence rate ratio

round(data.frame(risk, rate, RR = RR, IRR = IRR), 3)

Console output

           risk   rate    RR   IRR
exposed   0.080 17.778 2.667 2.904
unexposed 0.030  6.122 2.667 2.904

Why both? The risk ratio answers "how many times more likely is an exposed person to develop disease over the follow-up window?" The rate ratio answers "per unit of person-time, how much more frequent is the event in exposed people?" In an open cohort with variable follow-up, the rate-based answer is usually the right one.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console table before answering.

1. The risk in the exposed group was 0.080 (80/1000) and in the unexposed group 0.030 (30/1000), giving RR = 2.667. Translate that risk ratio into a plain-English sentence about how the cohort's 5-year cumulative incidence differs by exposure.

Model answer: Over 5 years of follow-up, 8 per 100 exposed people developed the outcome versus 3 per 100 unexposed, so exposed individuals had 2.67 times the cumulative risk of the outcome compared with unexposed. In absolute terms, the risk difference is 5 per 100 (50 per 1000) attributable to the exposure over 5 years, a meaningful effect size whether you report it as a ratio or a difference.

2. The incidence rate ratio (IRR = 2.904) is slightly larger than the risk ratio (RR = 2.667). Looking at the person-time denominators (4500 vs 4900), why does dividing by person-years instead of headcount nudge the ratio upward? Which group lost more person-time, and what does that suggest about follow-up in this cohort?

Model answer: The exposed group contributed 4500 person-years vs. 4900 for the unexposed, meaning the exposed group lost more time on average, either through earlier events (cases stop accruing person-time at the event) or through earlier loss-to-follow-up / censoring. Because the rate is cases / person-time, a smaller denominator makes the exposed rate larger relative to headcount, so the IRR (2.904) sits a bit above the RR (2.667). The pattern is the canonical sign that events or censoring are unevenly distributed across exposure groups, and is exactly what rates were designed to handle.

3. If the cohort were instead a closed population with everyone followed exactly 5 years (no losses), would the RR and IRR converge? Explain which measure you would report and why.

Model answer: Yes, they converge. With equal follow-up and no losses, the only thing that stops a person adding person-time is developing the outcome, so each group’s person-time stays close to (people × time) and the two ratios track each other. They match almost exactly when the outcome is uncommon over the window, because then the person-time lost to early cases is negligible; a small gap can remain when the outcome is common, since the group with more events loses a little more person-time. In that closed-cohort setting, reporting the risk ratio is cleaner because cumulative incidence (a probability) is easier to interpret than a rate. Once you have differential censoring or open-cohort dynamics, the rate-based (IRR/HR) framework is required because the risk-based estimator becomes biased.
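The convergence described in this answer can be checked numerically. The sketch below is illustrative only: it assumes a constant hazard in each group (an assumption not made in the lesson), and the helper `compare_ratios()` and the hazard values passed to it are hypothetical.

```r
# Closed cohort: everyone is followed for `years` unless the event occurs first.
# Assumption (not from the lesson): a constant hazard in each group, so that
#   risk                 = 1 - exp(-hazard * years)
#   expected person-time = n * risk / hazard
compare_ratios <- function(h_exposed, h_unexposed, years = 5, n = 1000) {
  hazard <- c(exposed = h_exposed, unexposed = h_unexposed)
  risk   <- 1 - exp(-hazard * years)
  events <- n * risk
  py     <- n * risk / hazard          # time at risk stops at the event
  rate   <- events / py
  c(RR  = unname(risk["exposed"] / risk["unexposed"]),
    IRR = unname(rate["exposed"] / rate["unexposed"]))
}

rare   <- compare_ratios(h_exposed = 0.002, h_unexposed = 0.001)
common <- compare_ratios(h_exposed = 0.20,  h_unexposed = 0.10)
round(rbind(rare, common), 3)
```

Under these assumptions the IRR equals the ratio of the two hazards in both scenarios, while the RR sits very close to it when the outcome is rare and noticeably further away when the outcome is common.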

The reflection below asks you to use the timing distinction in a concrete research scenario. After working through it and the knowledge check, a later section returns to the risk-vs.-rate split that the R box just previewed and shows what each design buys, costs, and assumes.

Reflection

Think of a health question you find compelling. Would you address it with a prospective or retrospective cohort design? What records or recruitment infrastructure would you need? What might be lost or gained by each choice in your specific case?

Model answer: For a fast-moving exposure–outcome pair (e.g., antibiotic prescribing patterns and 30-day Clostridioides difficile infection), a retrospective cohort assembled from administrative data is faster, cheaper, and feasible: the records already exist, the inclusion window can be defined by ICD-coded events, and follow-up is short enough that linkage is reliable. Trade-offs: you inherit the data quality of the chart, miss any exposure not coded, and have no biomarker corroboration. For a slower-moving exposure (e.g., air pollution and dementia), a prospective cohort is the right choice: you can pre-specify the exposure assessment (personal monitors, residential geocoding), measure covariates before the outcome, and avoid recall bias, at the cost of decades of follow-up and study budget. The infrastructure you need is different too: administrative-data prospective cohorts (UK Biobank, Sudlow et al, 2015; the Canadian Longitudinal Study on Aging, Raina et al, 2019) sit between the two.

Section 2

Risk-Based & Rate-Based Designs

⏱ Estimated reading time: 15 minutes

Section 2 of 4

Two denominators, two effect measures, and the logic that chooses between them.

Risk-based design

Closed cohort, cumulative incidence

Eq 8.1 · Risk and risk ratio

\[ \color{#0B7B6B}{R_1} = \frac{\color{#C2410C}{a_1}}{\color{#1D4ED8}{n_1}} \qquad \color{#6D28D9}{R_0} = \frac{\color{#C2410C}{a_0}}{\color{#1D4ED8}{n_0}} \qquad \color{#BE185D}{\text{RR}} = \frac{\color{#0B7B6B}{R_1}}{\color{#6D28D9}{R_0}} \]

R₁: risk in exposed; R₀: risk in unexposed; a: new cases; n: people at risk; RR: risk ratio

Valid only when the cohort is closed and all subjects are observable for the full risk period. Losses should be few (some authors use <10% as a threshold).

Best for: acute outbreaks, surgical complications, short-window outcomes.

Rate-based design

Open cohort, incidence density

Eq 8.2 · Rate and rate ratio

\[ \color{#0B7B6B}{I_1} = \frac{\color{#C2410C}{a_1}}{\color{#1D4ED8}{t_1}} \qquad \color{#6D28D9}{I_0} = \frac{\color{#C2410C}{a_0}}{\color{#1D4ED8}{t_0}} \qquad \color{#BE185D}{\text{IRR}} = \frac{\color{#0B7B6B}{I_1}}{\color{#6D28D9}{I_0}} \]

I₁: rate in exposed; I₀: rate in unexposed; a: new cases; t: person-time; IRR: rate ratio

Each subject contributes person-time until the event, censoring, or study end. Accommodates variable follow-up and dynamic populations.

Best for: chronic disease, long follow-up, open cohorts (e.g., 10-year breast cancer study, Example 8.7).

Analytic choice

Poisson vs. Cox

Poisson regression

Person-time as the offset. Use when the incidence rate is reasonably constant over follow-up. Produces the incidence rate ratio directly.

Example: rugby injury rates over one season (Example 8.6).
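As a rough sketch of the offset idea, the aggregated counts from the lesson's simulated cohort (80 events in 4500 person-years among the exposed, 30 in 4900 among the unexposed) can be passed to base R's `glm()`. The data frame and column names are assumptions made for the example, and a real analysis would normally include covariates.

```r
# Aggregated cohort data reused from the lesson's simulated example.
cohort <- data.frame(
  exposed = c(1, 0),
  events  = c(80, 30),
  py      = c(4500, 4900)
)

# log(person-years) enters as an offset, so the model describes the event rate.
fit <- glm(events ~ exposed + offset(log(py)),
           family = poisson(link = "log"), data = cohort)

irr <- exp(coef(fit)[["exposed"]])   # incidence rate ratio
round(irr, 3)
```

With a single binary exposure and no covariates, the fitted IRR reproduces the hand calculation of the exposed rate divided by the unexposed rate.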

Cox proportional hazards

Semi-parametric. Use when the rate changes substantially over follow-up. The workhorse of modern cohort analysis.

Example: smoking and invasive breast cancer over 10 years (Example 8.7).
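A minimal sketch of a Cox fit follows, using the `survival` package (a recommended package normally installed alongside R). The data are simulated purely for illustration: the sample size, hazards, 10-year censoring point, and variable names are all assumptions and do not come from Example 8.7. The simulated hazards happen to be constant, which the Cox model does not require.

```r
library(survival)  # provides Surv() and coxph()

set.seed(2024)
n <- 2000
exposed <- rep(c(1, 0), each = n)

# Assumption for illustration: exponential event times with a true hazard
# ratio of 2, administratively censored at 10 years.
event_time <- rexp(2 * n, rate = ifelse(exposed == 1, 0.04, 0.02))
time   <- pmin(event_time, 10)
status <- as.integer(event_time <= 10)   # 1 = event observed, 0 = censored

fit <- coxph(Surv(time, status) ~ exposed)
hr  <- exp(coef(fit)[["exposed"]])       # estimated hazard ratio
hr
```

The estimate should land near the simulated hazard ratio of 2; the baseline hazard itself is never specified, which is what "semi-parametric" refers to.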

Sample size

Planning the study

Initial sample-size estimates typically assume a risk-based design even when a rate-based analysis is planned. That gives a workable ballpark for early planning.

Modern software extends to: unequal group sizes, survival-time outcomes (Matsui, 2005), strata-matched designs (Mazumdar et al, 2006), and time-varying exposures (Basagana et al, 2011).

For rate-based designs, the person-time denominator can be derived from the expected rate and the follow-up window once preliminary estimates exist.

Carry forward

Into the next section

Risk-based: closed cohort, subjects as denominator, cumulative incidence and risk ratio.
Rate-based: open cohort or variable follow-up, person-time as denominator, incidence rates and rate ratio.
Poisson regression suits constant rates; Cox proportional hazards suits long follow-up with changing rates.
Risk ratio and rate ratio are distinct quantities. They agree closely only in a closed cohort with no censoring, and most closely when the outcome is uncommon.

Introduction and Overview

An earlier section established what a cohort study is and named the design choices its investigators have to make. This section drills into the most consequential of those choices: whether to count events per person (risk) or events per unit of person-time (rate). The two designs share a 2×2 layout but differ in what they assume about the population and what they let you say about disease frequency. Sample-size planning, surprisingly, is a useful place to start, because the initial calculation is usually the same for both designs even when the analysis ends up being different.

Learning Objectives

Describe the design and assumptions of risk-based (cumulative incidence) cohort studies.
Describe the design and analysis of rate-based (incidence density) cohort studies.
Identify hypotheses and population types appropriate for each design.
Calculate and interpret the basic measures of disease frequency for each design.

Sample Size

Initial sample-size estimates are usually performed assuming an equal number of exposed and non-exposed subjects, and assuming the disease is measured by risk (Section 8.2.2). This approach is often sufficient for initial planning even if the population is open and a rate-based design must ultimately be used.
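A first-pass calculation of this kind might look like the sketch below, which uses base R's `power.prop.test()` (equal group sizes, two-sided test). The planning values of 3% and 8% risk, 5% significance, and 80% power are illustrative assumptions, not figures from a real study.

```r
# Illustrative planning values (assumptions, not taken from a real study):
# expected risk of 3% in the non-exposed and 8% in the exposed.
plan <- power.prop.test(p1 = 0.03, p2 = 0.08,
                        sig.level = 0.05, power = 0.80)

n_per_group <- ceiling(plan$n)   # equal numbers of exposed and non-exposed
n_per_group
```

The result is only a ballpark: it ignores losses to follow-up, confounder adjustment, and unequal group sizes, which the specialised methods cited in the lesson are designed to address.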

Modern Sample-Size Software

Recent software allows for unequal sample sizes, repeated measures, multivariable regression models, and proportional hazards models. Specialised methods exist for competing risks (Latouche and Porcher, 2007), survival-time outcomes (Matsui, 2005), strata-matched designs (Mazumdar et al, 2006), and time-varying exposures (Basagana et al, 2011).

Risk-Based (Cumulative Incidence) Designs

This is the simplest form of cohort study, but several assumptions must hold:

Exposure groups are defined at the start of the study and remain unchanged (fixed cohorts).
The study groups are closed: all subjects must be observed for the full risk period.
There should be few or no losses (some authors use >10% losses as a cut-point that casts doubt on validity).

When Risk-Based Designs Work Best

Risk-based designs work best for diseases with a relatively short risk period (e.g., acute infections, post-surgical complications). For chronic diseases such as many cancers, where the risk period is lifelong and often longer than feasible follow-up, a rate-based design is preferred.

2×2 Table: Risk-Based Cohort Design

	Exposed	Non-exposed	Total
Diseased	a₁	a₀	m₁
Non-diseased	b₁	b₀	m₀
Total	n₁	n₀	n

We select n₁ exposed and n₀ non-exposed individuals (free of disease) from the source population, follow them for the full follow-up period, and observe a₁ exposed cases and a₀ non-exposed cases. The two risks of interest are:

Eq 8.1

\[ \color{#0B7B6B}{R_1} = \frac{\color{#C2410C}{a_1}}{\color{#1D4ED8}{n_1}} \qquad \color{#6D28D9}{R_0} = \frac{\color{#C2410C}{a_0}}{\color{#1D4ED8}{n_0}} \]

The risk in the exposed and the risk in the unexposed are each the new cases in that group divided by the number of people at risk in it.
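
A minimal sketch of Eq 8.1 in R follows. The counts are invented for illustration (40 cases among 500 exposed, 15 cases among 500 non-exposed), and the object names simply mirror the symbols in the 2×2 table.

```r
# Hypothetical closed cohort (counts are illustrative, not from a real study).
a1 <- 40;  n1 <- 500   # exposed: cases, subjects at risk at baseline
a0 <- 15;  n0 <- 500   # non-exposed: cases, subjects at risk at baseline

# Lay the counts out as in the risk-based 2x2 table (filled column by column).
tab <- matrix(c(a1, n1 - a1,
                a0, n0 - a0),
              nrow = 2,
              dimnames = list(c("Diseased", "Non-diseased"),
                              c("Exposed", "Non-exposed")))

# Eq 8.1: risk = new cases / subjects at risk, per exposure group.
R1 <- tab["Diseased", "Exposed"]     / sum(tab[, "Exposed"])
R0 <- tab["Diseased", "Non-exposed"] / sum(tab[, "Non-exposed"])
c(R1 = R1, R0 = R0)
```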

The Denominator

In risk-based designs, the denominator is the number of subjects in each exposure category. This is valid only because every subject is observed for the full risk period; otherwise, who you count and who you don’t would depend on follow-up time.

Risk-based designs are conceptually clean but operationally fragile: their assumptions break the moment people leave the cohort or the risk period extends beyond a few months. The rate-based alternative was developed precisely to handle the populations where those assumptions do not hold.

Rate-Based (Incidence Density) Designs

In many cohort studies, not every subject is under observation for the full risk period, especially when:

The source population is dynamic (subjects enter and leave).
The follow-up period is long.
Subjects are added part-way through the biological risk period.
A significant proportion of subjects withdraw from the study.
Exposure status itself changes during the study.

In these situations, we cannot just count exposed and non-exposed subjects. Instead, we accumulate the amount of ‘at-risk time’ contributed by each subject in each exposure category. The denominator becomes person-time, not persons. For example, 10 people each followed for 2 years and 20 people each followed for 1 year both contribute 20 person-years, so this denominator credits every subject for exactly as long as we actually observed them.
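
The person-time arithmetic in the example above can be checked with a few lines of R. The two groups are hypothetical and exist only to reproduce the 20 person-years described in the text.

```r
# Follow-up time in years for each subject (hypothetical values).
group_a <- rep(2, times = 10)   # 10 people, 2 years each
group_b <- rep(1, times = 20)   # 20 people, 1 year each

# Person-time is the sum of individual follow-up times.
py_a <- sum(group_a)
py_b <- sum(group_b)
c(group_a = py_a, group_b = py_b)   # both equal 20 person-years
```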

2×2 Table: Rate-Based Cohort Design

	Exposed	Non-exposed	Total
Diseased	a₁	a₀	m₁
Person-time at risk	t₁	t₀	T

Each subject contributes ‘at-risk’ time until they develop the disease, are lost to follow-up, or the study ends. The two rates of interest are:

Eq 8.2

\[ \color{#0B7B6B}{I_1} = \frac{\color{#C2410C}{a_1}}{\color{#1D4ED8}{t_1}} \qquad \color{#6D28D9}{I_0} = \frac{\color{#C2410C}{a_0}}{\color{#1D4ED8}{t_0}} \]

The incidence rate in the exposed and the rate in the unexposed are each the new cases divided by the person-time at risk accumulated in that group.
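
To make Eq 8.2 concrete, here is a small sketch that starts from subject-level follow-up rather than from totals. The data frame and its column names (exposed, years, disease) are made up for the example; each subject's years value is the time at risk until disease, loss to follow-up, or the end of the study, whichever came first.

```r
# Hypothetical subject-level follow-up (illustrative values only).
# years   = time at risk until disease, loss to follow-up, or study end
# disease = 1 if follow-up ended because the subject developed the disease
cohort <- data.frame(
  exposed = c(1, 1, 1, 1, 0, 0, 0, 0),
  years   = c(5.0, 2.5, 1.0, 4.0, 5.0, 5.0, 3.0, 4.5),
  disease = c(0,   1,   1,   0,   0,   0,   1,   0)
)

# Totals per exposure group: cases (a) and person-time (t).
totals <- aggregate(cbind(disease, years) ~ exposed, data = cohort, FUN = sum)

# Eq 8.2: incidence rate = cases / person-time at risk.
totals$rate <- totals$disease / totals$years
totals
```

The rates here are expressed per person-year; multiplying by 1,000 would give cases per 1,000 person-years.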

Choice of Analysis

If follow-up is relatively short and rates are reasonably constant, Poisson models are appropriate. If follow-up is long and the assumption of a constant rate is not tenable, survival analysis (e.g., Cox proportional hazards) is preferred (Cox, 1972; see Chapter 19).
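
For the constant-rate case, one common way to fit a Poisson model to cohort data is base R's glm() with the log of person-time as an offset. The sketch below uses assumed aggregate counts (80 cases in 4,500 person-years among the exposed, 30 cases in 4,900 person-years among the unexposed); the data frame and variable names are hypothetical. A Cox model would need the survival package and subject-level event times, so it is not shown here.

```r
# Aggregated hypothetical cohort: cases and person-years by exposure group.
d <- data.frame(
  exposed = c(1, 0),
  cases   = c(80, 30),
  py      = c(4500, 4900)
)

# Poisson regression with log person-time as the offset models the rate.
fit <- glm(cases ~ exposed + offset(log(py)), family = poisson, data = d)

# exp(coefficient) for `exposed` is the incidence rate ratio.
irr <- unname(exp(coef(fit)["exposed"]))
irr
```

With a single binary exposure and no other covariates, the fitted rate ratio matches the ratio of the two crude rates; the regression form becomes useful once confounders are added to the model.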

You have now seen both designs from the inside. The next subsection puts them side by side; read it as a decision aid for matching design to research situation, not as a statement that one design is generally better than the other.

Comparing the Two Designs

Risk-based designs are best when:

The population is closed (fixed cohort).
The risk period is short (so all subjects can be observed for the full period).
Losses to follow-up are minimal (under ~10%).
Examples: acute outbreaks, surgical complications within 30 days, hospital readmissions within 14 days (Example 8.1).

Rate-based designs are best when:

The population is open (dynamic).
Follow-up is long or the risk period is chronic.
Subjects enter or leave the study at different times.
Exposure status may change during follow-up.
Examples: rugby injury rates over a season (Example 8.6), invasive breast cancer over 10+ years (Example 8.7), fracture incidence over decades (Example 8.8).
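
When exposure status changes during follow-up, one simple bookkeeping approach is to split each subject's follow-up into intervals of constant exposure and add up the time in each category. The sketch below assumes a hypothetical table with one row per interval; the column names (id, start, stop, exposed) are illustrative and not taken from any of the cited studies.

```r
# Hypothetical follow-up intervals; one row per period of constant exposure.
intervals <- data.frame(
  id      = c(1, 1, 2, 3, 3),
  start   = c(0, 3, 0, 0, 2),   # years since entry
  stop    = c(3, 5, 4, 2, 6),
  exposed = c(0, 1, 1, 1, 0)
)
intervals$years <- intervals$stop - intervals$start

# Person-time accumulated in each exposure category across all subjects.
pt <- tapply(intervals$years, intervals$exposed, sum)
pt
```

Subjects 1 and 3 contribute person-time to both the exposed and unexposed denominators, which a headcount-based risk design cannot represent.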

Risk denominator: the number of subjects in each exposure category. Counts people.

Rate denominator: the cumulative person-time at risk in each exposure category. Counts time.

This means risk is dimensionless (a proportion between 0 and 1), while a rate has units of cases per person-time (e.g., 4.0 per 1,000 person-years).

Key Examples

The four examples below illustrate the design choice with real published studies. The first two are risk-based; the last two are rate-based. As you expand each, ask yourself why the investigators chose what they chose. In every case, the source population's behaviour and the length of the follow-up window will be doing most of the work.

Example 8.4: Risk-Based (Time-of-Day & Surgical Complications)

Kelz et al (2009) compared morbidity and mortality following 56,000+ general and vascular surgical procedures (2001–2004). Time of operation was grouped into seven 2-hour periods. Risk of mortality within 30 days had a moderately strong association with start times after 9:30 pm (OR = 1.22), and morbidity had OR = 1.32 for late-night surgeries. However, when emergency cases were excluded, no odds ratios were significant. The excess crude risk was largely explained by the nature of the clinical cases, an important reminder about confounding by indication.

Example 8.5: Risk-Based (Cervical Screening in HIV-Positive Women)

Leece et al (2010) followed approximately 250 HIV-positive women receiving care at the Ottawa Hospital General Campus Immunodeficiency Clinic (2002–2005). The outcome was undergoing cervical screening; predictors included demographics, HIV status, and primary care provider status. Analysis combined χ² tests with logistic regression. The 12 women without a primary-care provider were less likely to undergo screening (RR = 1.6) than the 84 women with providers. The authors noted that abnormal screening results were common and that recent low CD4 cell count was the only significant predictor.

Example 8.6: Rate-Based (Rugby Injury Rates)

Chalmers et al (2011) followed 704 male amateur rugby players (aged 13+) over a season. The ‘time’ component was a game, with a total of 6,263 player-games of follow-up. Exposures included age, ethnicity, experience, BMI, smoking, previous injury, training, weather, ground conditions, foul play, and protective equipment. Because rates were reasonably constant over the period, Poisson regression was appropriate. Notable findings: Pacific Island vs. Maori ethnicity (IR = 1.5), ≥40 hours of strenuous activity weekly (IR = 1.5), playing while injured (IR = 1.5), foul play (IR = 1.9), and headgear use (IR = 1.2).

Example 8.7: Rate-Based (Smoking & Breast Cancer)

Luo et al (2011) drew on the Women’s Health Initiative Observational Study: 90,000+ women aged 50–79 followed across 40 US clinical centres. Smoking exposure was characterised in detail (status, age started, age quit, cigarettes/day, pack-years). Over an average of 10.3 years of follow-up, 3,520 incident invasive breast cancers were identified. Because of the long follow-up, Cox proportional hazards models (rather than Poisson) were used. Findings: HR = 1.09 for former smokers and HR = 1.16 for current smokers. Among lifetime non-smokers, only those with the highest passive-smoke exposure had increased risk; no significant dose-response trend was seen.

Key Takeaways

Risk-based designs use number of subjects as the denominator and require a closed cohort followed for the full risk period.
Rate-based designs use person-time as the denominator and accommodate dynamic populations and variable follow-up.
The choice between Poisson and Cox proportional hazards depends on whether the rate is reasonably constant or changes substantially over follow-up.
Initial sample-size calculations can be done assuming a risk-based design even when the analysis will ultimately be rate-based.

The reflection below asks you to apply the choice from this section to a specific long-running occupational cohort. After working through it, a later section turns to a problem that stays mostly hidden in textbook treatments: how do you actually measure exposure when it can change over years or decades of follow-up?

Reflection

Suppose you are studying the effect of a workplace exposure (e.g., shift work) on cardiovascular disease over 20 years. Workers can join or leave the company at any time. Which design (risk-based or rate-based) is more appropriate, and why? What practical issues would arise that wouldn’t arise in a 14-day hospital readmission study?

Model answer: Rate-based is appropriate. Workers entering and leaving the company across a 20-year window create an open cohort where person-time, not headcount, is the natural denominator. Practical issues: (a) defining exposure time: how to handle workers who move in and out of shift work; (b) healthy-worker effect: long-tenured workers are healthier than the general working population, biasing toward the null; (c) healthy-worker survivor effect: those who keep doing shift work are those who tolerate it, dynamically depleting the susceptible from the exposed group; (d) time-varying confounders like BMI or hypertension that may be affected by past shift work; (e) loss to follow-up when workers leave the company. None of these arise in a 14-day readmission study because the window is too short for healthy-worker dynamics to matter, the cohort stays effectively closed, and confounder profiles are fixed at index admission. Methods: standardised mortality ratios with internal comparators, g-methods for time-varying exposure, sensitivity analyses for unmeasured occupational confounders.

Section 5

Final Review & Assessment

⏱ Estimated time: 20 minutes

Bringing It All Together

Where an earlier lesson looked backward from outcome to exposure, this lesson followed the arrow forward. An earlier section framed the cohort study as something close to a controlled trial without randomisation: pick a source population, classify exposure, follow people, and watch incidence accumulate. A later section then forced the central design choice: risk-based (cumulative incidence) for closed populations followed for a fixed time, or rate-based (incidence density) for open populations where person-time is the natural denominator. It also connected each design to its analytic toolkit (binomial / log-binomial vs. Poisson / survival).

Earlier sections zoomed in on the operational details that decide whether a cohort study survives appraisal. How exposure is scaled (dichotomous, ordinal, continuous, compound), whether it changes over time, how the induction period is handled, and how comparability is engineered through restriction, matching, and analytic control all determine whether the rate ratio you eventually report is estimating what you think it is. Blinded outcome ascertainment, careful handling of loss to follow-up, and STROBE-aligned reporting then turn a defensible design into a study other researchers can actually use.

The final reflection asks you to put the entire arc to work as a brief cohort proposal of your own; the 15-question assessment then checks the conceptual content directly. A later lesson will pull the camera back further to look at ecological and group-level designs (where the unit of analysis stops being the person), and the comparability and inference logic you just built will keep paying off there.

Key Takeaways from this lesson

Cohort studies follow exposed and unexposed people forward in time, giving you direct access to incidence, something case-control studies cannot deliver.
The choice between risk-based (closed cohort, cumulative incidence) and rate-based (open cohort, incidence density) designs is dictated by the source population and the follow-up structure, not by preference.
Person-time is the right denominator whenever follow-up is uneven or membership is dynamic; it makes Poisson and survival analyses possible.
Exposure must be measured on a meaningful scale, with explicit handling of induction periods and time-varying status; misclassification here flows through the whole analysis.
Comparability is engineered through restriction, matching, and analytic control; blinded outcome ascertainment and rigorous tracking of loss-to-follow-up protect internal validity.
STROBE-aligned reporting closes the loop: a cohort study is only as useful as its design, conduct, and analysis are visible to the reader.

R Activity: Risk, rate, RR/IRR, and a 95% CI for the rate ratio

The companion R script r-activities/HSCI_230_Lesson_5_Cohort_Studies.R walks through a small simulated cohort end-to-end: compute cumulative incidence (risk) and incidence rate (per 1000 person-years) for exposed vs. unexposed groups, derive the risk ratio (RR) and incidence rate ratio (IRR), then build a Wald 95% confidence interval for the IRR on the log scale, the same workflow you will reach for when appraising a published cohort.

# 1000 exposed and 1000 unexposed individuals followed for up to 5 years.
#   exposed:  80 events in 4500 person-years
# unexposed:  30 events in 4900 person-years

events <- c(exposed = 80,   unexposed = 30)
n      <- c(exposed = 1000, unexposed = 1000)
py     <- c(exposed = 4500, unexposed = 4900)

risk <- events / n
rate <- events / py * 1000          # per 1000 person-years

# unname() drops the inherited "exposed" label so later output is labelled cleanly
RR  <- unname(risk["exposed"] / risk["unexposed"])   # risk ratio
IRR <- unname(rate["exposed"] / rate["unexposed"])   # incidence rate ratio

round(data.frame(risk, rate, RR = RR, IRR = IRR), 3)

## -----------------------------------------------------------------------------
## Stretch: 95% CI for the rate ratio (Wald approximation on the log scale)
## -----------------------------------------------------------------------------
log_irr   <- log(IRR)
se_logirr <- sqrt(1/events["exposed"] + 1/events["unexposed"])
ci_irr    <- exp(log_irr + c(-1, 1) * 1.96 * se_logirr)
round(c(IRR = IRR, lower = ci_irr[1], upper = ci_irr[2]), 3)

Reflection

Design a brief cohort study proposal for a health question of your choice. Specify: (1) the research question and hypothesis, (2) whether the source population is open or closed, (3) whether you would use a risk-based or rate-based design and why, (4) how you would define and measure exposure (and on what scale), (5) how you would ensure comparability of groups, and (6) what analytic approach you would use.

Model answer: (1) Question/hypothesis: Among adults 18–45, does sustained intake of ultra-processed food (NOVA-4) > 30% of total energy increase 10-year incidence of metabolic syndrome? (2) Source population: open cohort, community-recruited Vancouver-area adults; movers can be retained. (3) Design: rate-based with risk-set methods (Cox PH), to use person-time and handle censoring/loss cleanly. (4) Exposure measurement: 24-h dietary recalls (3 per year) coded under NOVA, then collapsed to % energy from NOVA-4; analysed continuously with restricted cubic splines, plus a pre-specified clinical threshold at 30%. (5) Comparability: baseline equivalence on income, education, ethnicity, physical activity, family history; restriction to participants without metabolic syndrome at baseline; DAG-guided adjustment for SES and physical activity (NOT for BMI, which is a mediator). (6) Analysis: Cox PH with the continuous exposure, multiple imputation for missing covariates, sensitivity analyses lagging exposure 2 years to address reverse causation, and pre-registration on OSF.

Final Knowledge Assessment

This assessment covers all sections of this lesson. You must score 100% to complete the lesson. Review the feedback after each attempt.

🎉 Congratulations!

You have completed this lesson: Cohort Studies.

You now understand the design, implementation, analysis, and reporting of cohort studies, including risk-based and rate-based designs, exposure measurement principles, comparability strategies, and STROBE reporting guidelines.

Earlier lessons covered the three workhorse observational designs at the level of the individual: cross-sectional, case-control, and cohort. A later lesson changes the unit of analysis. Ecological and Group-Level Studies uses populations rather than individuals as the unit of observation, a strategy that opens up routinely-collected data for epidemiology but introduces a famous interpretive trap (the ecological fallacy) you will need to recognize.

HSCI 230, Lesson 5
