Modelling Survival Data
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Distinguish between non-parametric, semi-parametric, and parametric analyses of survival data
- Carry out non-parametric analyses using actuarial and Kaplan-Meier life tables
- Generate and interpret survivor and cumulative hazard function graphs
- Understand relationships among survivor S(t), failure F(t), hazard h(t), and cumulative hazard H(t) functions
- Fit and interpret a Cox proportional hazards model including hazard ratios
- Evaluate the proportional hazards assumption using graphical and statistical methods
- Incorporate time-varying covariates and stratified analyses in Cox models
- Describe parametric survival models (exponential, Weibull, Gompertz) and accelerated failure time models
- Understand frailty models that account for unmeasured covariates
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Non-Parametric Analyses
What Is Survival Analysis?
Survival analysis (also called time-to-event analysis) is concerned with a specific type of outcome: the time that passes before a particular event of interest occurs. The event might be death, disease onset, recovery, relapse, or any other well-defined transition. What distinguishes survival data from other continuous outcomes is the presence of censoring—incomplete information about when the event occurred for some individuals.
Survival data have two important structural features. First, survival times are bounded below at zero—negative values are impossible. Second, the distribution of survival times is typically right-skewed, meaning that a few individuals have very long survival times that pull the mean upward. These features make standard linear regression inappropriate for modelling survival outcomes.
Censoring is a unique and defining feature of survival analysis. It occurs when we have incomplete information about a subject's survival time. A censored observation tells us that the true event time exceeds the observed follow-up time, but we do not know the exact event time. Standard regression methods cannot properly handle censored observations—survival analysis methods are specifically designed to extract the maximum information from both censored and uncensored observations.
Types of Censoring
There are several types of censoring, each arising from different circumstances:
- Right censoring—the most common type: follow-up ends before the event occurs (e.g., loss to follow-up, study termination), so we know only that the true event time exceeds the observed time
- Left censoring—the event occurred before observation began, so we know only that the event time is less than the first observation time
- Interval censoring—the event is known to have occurred between two observation times, but the exact time is unknown
Understanding the type of censoring present in your data is critical for choosing the appropriate analytical method.
Truncation differs from censoring in an important way. With censoring, we know the subject exists but have incomplete event time information. With truncation, the subject is entirely excluded from the study because their event time falls outside an observable window. Left truncation (delayed entry) occurs when subjects are only observed if they have survived long enough to enter the study. This distinction affects how the risk set is calculated.
Quantifying Survival Time
Several summary measures can describe survival data:
- Mean survival time—often difficult to estimate accurately because the longest survival times may be censored
- Median survival time—the time at which 50% of subjects have experienced the event; more robust to censoring than the mean
- n-year survival—the proportion of subjects surviving beyond a specified time point (e.g., 5-year survival rate)
- Incidence rate—the number of events divided by the total person-time at risk
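As a minimal sketch, the incidence rate can be computed directly from a set of (time, event) pairs. The data below are hypothetical:

```python
# Hypothetical follow-up data: (time, event) pairs, where event = 1 if the
# event occurred and 0 if the observation was censored at that time.
data = [(2.0, 1), (5.0, 0), (3.5, 1), (8.0, 0), (1.2, 1), (6.0, 1)]

events = sum(e for _, e in data)          # number of observed events
person_time = sum(t for t, _ in data)     # total person-time at risk
incidence_rate = events / person_time     # events per unit of person-time

print(round(incidence_rate, 3))
```

Note that censored subjects still contribute their full observed follow-up to the person-time denominator, which is exactly the information standard regression would discard.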
Three Approaches to Survival Analysis
There are three broad approaches to analysing survival data, each with different assumptions and capabilities:
- Non-parametric methods—make no assumptions about the shape of the survival or hazard function (e.g., Kaplan-Meier, actuarial life tables)
- Semi-parametric methods—leave the baseline hazard unspecified while modelling the effect of predictors parametrically (e.g., Cox proportional hazards model)
- Parametric methods—specify a particular distributional form for the baseline hazard (e.g., exponential, Weibull, Gompertz models)
Actuarial Life Tables
The actuarial (or life-table) method is one of the oldest approaches to estimating survival. Time is divided into pre-specified intervals, and survival is estimated within each interval. The key quantities in an actuarial life table are:
| Symbol | Quantity | Description |
|---|---|---|
| lj | Starting number at risk | Number of subjects alive at the start of interval j |
| wj | Withdrawals | Number censored (withdrawn) during interval j |
| rj | Adjusted risk set | rj = lj − wj/2 (assumes withdrawals at midpoint) |
| dj | Failures (events) | Number experiencing the event during interval j |
| qj | Risk of failure | qj = dj / rj |
| pj | Survival probability | pj = 1 − qj |
| Sj | Cumulative survival | Product of all pj values up to interval j |
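The table's columns can be computed mechanically. A sketch in Python, using hypothetical withdrawal and failure counts over three intervals:

```python
# Actuarial life-table columns for three hypothetical intervals.
l = 100                       # l_j: subjects at risk at start of interval 1
withdrawals = [5, 8, 10]      # w_j: censored (withdrawn) during interval j
failures = [10, 12, 9]        # d_j: events during interval j

S = 1.0
for w, d in zip(withdrawals, failures):
    r = l - w / 2             # adjusted risk set: withdrawals at midpoint
    q = d / r                 # q_j: conditional risk of failure
    p = 1 - q                 # p_j: conditional survival probability
    S *= p                    # S_j: cumulative survival (product of p_j)
    print(f"r={r:.1f}  q={q:.3f}  p={p:.3f}  S={S:.3f}")
    l -= (w + d)              # carry survivors forward to the next interval
```

The halving of withdrawals in the risk set encodes the actuarial assumption that censored subjects were, on average, at risk for half the interval.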
The Kaplan-Meier Estimator
The Kaplan-Meier (KM) estimator, also called the product-limit estimator, is the most widely used non-parametric method for estimating the survivor function. Unlike the actuarial method, the KM estimator recalculates the survival probability at each actual event time rather than at pre-specified intervals.
Key properties of the Kaplan-Meier estimator:
- It is a step function that only changes at observed failure times
- It is piecewise constant between events
- It is non-increasing—the estimated survival can only stay the same or decrease over time
- It is right-continuous—at each event time, the function takes the value after the drop
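These properties fall directly out of the product-limit calculation. A hand-rolled sketch (hypothetical data; real analyses would use a survival library):

```python
# Kaplan-Meier product-limit estimate for a small hypothetical dataset.
# Pairs are (time, event): event = 1 is a failure, 0 is censored.
obs = [(1, 1), (2, 0), (3, 1), (3, 1), (5, 0), (7, 1), (9, 0)]

S = 1.0
km = {}  # survival estimate just after each distinct failure time
for t in sorted({t for t, e in obs if e == 1}):
    at_risk = sum(1 for ti, _ in obs if ti >= t)        # risk set at t
    d = sum(1 for ti, e in obs if ti == t and e == 1)   # failures at t
    S *= (1 - d / at_risk)    # step drop only at observed failure times
    km[t] = S

print(km)
```

Censored observations (times 2, 5, 9) never trigger a drop; they only shrink the risk set for later failure times, which is how the estimator extracts information from incomplete follow-up.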
The Nelson-Aalen Estimator
While the Kaplan-Meier estimator directly estimates the survivor function S(t), the Nelson-Aalen estimator provides a non-parametric estimate of the cumulative hazard function H(t).
The Nelson-Aalen estimator sums the ratio of events to the risk set at each failure time up to time t. It estimates the expected number of events that would have occurred up to time t if the process could be repeated. Note that the survivor function can be estimated from the cumulative hazard as S(t) = exp(−H(t)).
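A sketch of this sum, again on hypothetical (time, event) data:

```python
# Nelson-Aalen cumulative hazard and the implied survivor estimate
# S(t) = exp(-H(t)), for a small hypothetical dataset.
import math

obs = [(1, 1), (2, 0), (3, 1), (3, 1), (5, 0), (7, 1), (9, 0)]

H = 0.0
for t in sorted({t for t, e in obs if e == 1}):
    at_risk = sum(1 for ti, _ in obs if ti >= t)
    d = sum(1 for ti, e in obs if ti == t and e == 1)
    H += d / at_risk          # accumulate d_j / n_j over failure times <= t

S_from_H = math.exp(-H)       # survivor estimate implied by H(t)
print(round(H, 4), round(S_from_H, 4))
```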
Section 1 Knowledge Check
1. Right censoring occurs when:
2. The Kaplan-Meier estimator differs from the actuarial method because:
3. The Nelson-Aalen estimator provides an estimate of:
Reflection
Why is censoring a unique challenge in survival analysis compared to other regression approaches? How might ignoring censored observations bias your results?
Survivor, Failure & Hazard Functions
The Survivor Function
The survivor function (also called the survival function) is the cornerstone quantity in survival analysis. It gives the probability that an individual survives beyond a specified time t.
At time t = 0, S(0) = 1 (everyone is alive). As time increases, S(t) decreases. If we follow subjects long enough, S(t) will approach zero. A steep decline in the survivor function indicates rapid event occurrence, while a gradual decline indicates slow event occurrence.
The Failure Function
The failure function (also called the cumulative distribution function, CDF) is simply the complement of the survivor function. It gives the probability that the event has already occurred by time t.
The Probability Density Function and Hazard Function
The probability density function f(t) = dF(t)/dt describes the instantaneous rate of failure events at time t. However, in survival analysis, the most important function is the hazard function, which conditions on survival up to time t.
The hazard function gives the instantaneous rate of the event at time t, conditional on having survived to that point. It is not a probability (it can exceed 1) but rather a rate per unit time. The hazard is the quantity most directly related to the biological or clinical mechanism driving events.
The Cumulative Hazard Function
The cumulative hazard integrates the hazard over time and has a direct relationship with the survivor function: H(t) = −ln S(t), or equivalently S(t) = exp(−H(t)).
The remarkable feature of these functions is that knowing any one determines all others (Eq 19.9–19.10). For example, from the survivor function alone we can derive: F(t) = 1 − S(t), H(t) = −ln S(t), f(t) = −dS(t)/dt, and h(t) = f(t)/S(t). This means that if we can estimate any one of these functions, we automatically have estimates of all the others.
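These identities are easy to verify numerically. A sketch for the exponential case, using an illustrative rate λ = 0.3:

```python
# Numerical check of the relationships among S, F, H, f, and h for an
# exponential model with rate lam (illustrative values, no dataset).
import math

lam, t = 0.3, 2.0
S = math.exp(-lam * t)        # survivor function S(t)
F = 1 - S                     # failure function F(t) = 1 - S(t)
H = -math.log(S)              # cumulative hazard H(t) = -ln S(t)
f = lam * math.exp(-lam * t)  # density f(t) = dF(t)/dt
h = f / S                     # hazard h(t) = f(t)/S(t); here h = lam

assert abs(H - lam * t) < 1e-12   # for the exponential, H(t) = lam * t
assert abs(h - lam) < 1e-12       # constant hazard, as expected
```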
Hazard Function Shapes
The shape of the hazard function tells us about the underlying process generating the events. Different parametric distributions correspond to different hazard shapes.
Constant Hazard (Exponential Distribution)
When the hazard is constant over time—h(t) = λ—the underlying survival times follow an exponential distribution (Eq 19.11). The survivor function is S(t) = exp(−λt). A constant hazard means the risk of the event is the same regardless of how long the subject has already survived. This implies a “memoryless” property: the risk at time t = 100 is the same as at time t = 1.
Example: Certain infectious diseases where the risk of contracting the infection is roughly constant over time, regardless of how long the individual has been at risk.
Increasing Hazard (Weibull with p > 1)
When the Weibull shape parameter p > 1, the hazard h(t) = λp·t^(p−1) is monotonically increasing over time (Eq 19.12). The survivor function is S(t) = exp(−λt^p). This means the risk of the event grows as time passes—a phenomenon often seen in aging-related diseases or mechanical wear-out.
Example: Age-related diseases such as cancer or cardiovascular disease, where the risk increases progressively with duration of exposure or age.
Decreasing Hazard (Weibull with p < 1)
When p < 1, the Weibull hazard is monotonically decreasing over time. The risk is highest at the beginning of the observation period and diminishes as time progresses. This pattern is common in situations where the most vulnerable individuals fail early, leaving a “healthier” surviving cohort.
Example: Post-surgical mortality, where the risk is highest immediately after surgery and decreases as patients recover; or infant mortality, where risk is highest in the neonatal period.
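The three regimes can be illustrated by evaluating the Weibull hazard at two time points (parameter values are illustrative only):

```python
# Weibull hazard h(t) = lam * p * t**(p-1) under the three shape regimes.
def weibull_hazard(t, lam, p):
    return lam * p * t ** (p - 1)

for p in (0.5, 1.0, 2.0):     # decreasing, constant, increasing hazard
    h1 = weibull_hazard(1.0, 0.1, p)
    h4 = weibull_hazard(4.0, 0.1, p)
    trend = "down" if h4 < h1 else ("flat" if h4 == h1 else "up")
    print(f"p={p}: h(1)={h1:.4f}  h(4)={h4:.4f}  {trend}")
```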
Comparing Survival Curves
When comparing the survival experience of two or more groups, several statistical tests are available. Each test weights the contributions of different time points differently.
The log-rank test assigns equal weight to all time points. It is the most commonly used test for comparing survival curves and is optimal when the hazards are truly proportional (constant hazard ratio over time). It is equivalent to a stratified Mantel-Haenszel test.
The Wilcoxon test (also called the Breslow or generalised Wilcoxon test) weights each time point by the number of subjects still at risk. This gives more weight to early differences in survival (when the risk set is large) and is more sensitive to detecting differences that occur in the early part of follow-up.
The Tarone-Ware test uses the square root of the number at risk as the weight. It represents a compromise between the log-rank and Wilcoxon tests, giving moderate emphasis to early time points.
The Peto-Peto-Prentice test uses the Kaplan-Meier estimate of the survival function as weights. Like the Wilcoxon test, it is more sensitive to early differences but is less influenced by outlying late events. It is also more robust when censoring patterns differ between groups.
Imagine you are comparing two-year survival after a heart attack between patients who received a new drug versus standard care. You plot Kaplan-Meier curves for both groups. The curves separate early (within the first 3 months) and then run roughly parallel. In this case, the log-rank test (which assumes proportional hazards) would be appropriate since the hazard ratio appears constant after the initial separation. However, if the curves crossed midway through follow-up, you would need to consider that the proportional hazards assumption is violated and might report results from different tests or time-specific analyses.
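The log-rank machinery itself is simple enough to sketch by hand. Under the null hypothesis of a common hazard, events at each failure time are distributed across groups hypergeometrically, and the test accumulates observed-minus-expected events. The data below are hypothetical; a real analysis would use a survival library:

```python
# Two-group log-rank statistic: at each failure time, compare observed
# events in group 1 with the number expected under a common hazard.
obs = [  # (time, event, group)
    (1, 1, 0), (2, 1, 0), (3, 0, 0), (4, 1, 0), (5, 1, 0),
    (2, 1, 1), (4, 0, 1), (6, 1, 1), (8, 0, 1), (9, 1, 1),
]

O_minus_E, V = 0.0, 0.0
for t in sorted({t for t, e, _ in obs if e == 1}):
    risk = [(ti, ei, g) for ti, ei, g in obs if ti >= t]  # risk set at t
    n = len(risk)
    n1 = sum(1 for _, _, g in risk if g == 1)
    d = sum(1 for ti, e, _ in obs if ti == t and e == 1)
    d1 = sum(1 for ti, e, g in obs if ti == t and e == 1 and g == 1)
    O_minus_E += d1 - d * n1 / n                  # observed - expected
    if n > 1:                                     # hypergeometric variance
        V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

chi2 = O_minus_E ** 2 / V                         # ~ chi-square, 1 df
print(round(chi2, 3))
```

The weighted variants (Wilcoxon, Tarone-Ware, Peto-Peto-Prentice) differ only in multiplying each time point's contribution by a weight before summing.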
Section 2 Knowledge Check
1. The hazard function h(t) represents:
2. If a Weibull model has shape parameter p < 1, the hazard is:
3. The log-rank test assumes:
Reflection
Consider how different hazard function shapes (constant, increasing, decreasing) might apply to different diseases or health conditions. What does the shape of the hazard tell us about the underlying biological process?
Cox Proportional Hazards Model
The Cox Model
The Cox proportional hazards model is the most widely used regression model for survival data. Introduced by Sir David Cox in 1972, it is a semi-parametric model—it models the effect of predictors on the hazard without making any assumptions about the shape of the baseline hazard function.
In this model, h0(t) is the baseline hazard (the hazard when all predictors are zero), and exp(βX) is the multiplicative effect of the predictor(s). The baseline hazard h0(t) is left completely unspecified—it can take any shape. This is the key advantage of the Cox model: no distributional assumption about the underlying survival times is needed.
The Hazard Ratio
The hazard ratio (HR) is the primary measure of effect in the Cox model. For a one-unit increase in a predictor X, the hazard is multiplied by exp(β). For example, if β = 0.693 for a treatment variable, then HR = exp(0.693) = 2.0, meaning the treated group has twice the hazard (rate of events) compared to the reference group. An HR > 1 indicates increased hazard (shorter survival), while HR < 1 indicates decreased hazard (longer survival).
The Cox model assumes that the hazard ratio between any two individuals is constant over time. This means the effect of a predictor does not change as time passes. For example, if treatment halves the hazard at 1 month, it must also halve the hazard at 12 months. When this assumption is violated, the estimated hazard ratio represents a weighted average of the time-varying effects, and its interpretation becomes ambiguous. Testing this assumption is a critical step in Cox model validation.
Estimation: Partial Likelihood
Because the baseline hazard is left unspecified, the Cox model cannot use standard maximum likelihood estimation. Instead, it uses partial likelihood (also called conditional likelihood), which estimates the β coefficients without needing to estimate h0(t). The partial likelihood considers only the ordering of events—at each failure time, it asks: given the current risk set, what is the probability that this particular individual was the one to fail?
Handling Ties
When two or more events occur at exactly the same time (ties), the exact partial likelihood becomes computationally expensive. Several approximations are available:
- Breslow method—the simplest approximation, adequate when ties are few
- Efron method—a better approximation that is the default in many software packages
- Exact methods—computationally intensive but most accurate when ties are common
Example: Cox Model Results
| Predictor | β | SE(β) | HR = exp(β) | 95% CI | P-value |
|---|---|---|---|---|---|
| Treatment (1 = new drug) | −0.47 | 0.14 | 0.63 | 0.48–0.82 | 0.001 |
| Age (per 10 years) | 0.35 | 0.08 | 1.42 | 1.21–1.66 | <0.001 |
| Stage III vs I | 1.10 | 0.20 | 3.00 | 2.04–4.44 | <0.001 |
| Stage II vs I | 0.52 | 0.18 | 1.68 | 1.18–2.40 | 0.004 |
In this example, patients receiving the new drug have 37% lower hazard of death (HR = 0.63) compared to the control group, after adjusting for age and disease stage. Each 10-year increase in age is associated with a 42% increase in the hazard. Stage III patients have 3 times the hazard compared to Stage I patients.
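The HR and CI columns follow mechanically from β and SE(β): HR = exp(β) and 95% CI = exp(β ± 1.96·SE). A quick check of the table's arithmetic:

```python
# Reproducing the HR and 95% CI columns of the table above from the
# reported beta and SE(beta).
import math

rows = {
    "treatment": (-0.47, 0.14),
    "age_per_10y": (0.35, 0.08),
    "stage_III_vs_I": (1.10, 0.20),
    "stage_II_vs_I": (0.52, 0.18),
}
for name, (beta, se) in rows.items():
    hr = math.exp(beta)
    lo = math.exp(beta - 1.96 * se)   # lower 95% confidence limit
    hi = math.exp(beta + 1.96 * se)   # upper 95% confidence limit
    print(f"{name}: HR={hr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```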
Stratified Cox Models
When the proportional hazards assumption is violated for a particular variable, one solution is to stratify on that variable. In the stratified Cox model, each stratum j has its own baseline hazard h0j(t), but the regression coefficients (β) are assumed to be the same across all strata. This allows the stratifying variable to have a completely flexible effect on survival without needing to estimate or specify its functional form.
Time-Varying Predictors and Effects
The standard Cox model assumes that predictor values are measured at baseline and remain constant. However, some predictors change over the course of follow-up:
- Time-varying predictors: The actual value of the predictor changes during follow-up (e.g., a patient is discharged from hospital and their treatment status changes from “inpatient” to “outpatient”). This requires splitting the follow-up time into intervals and updating predictor values.
- Time-varying effects: The predictor itself may be fixed at baseline, but its effect on the hazard changes over time. This is modelled by including a predictor × time interaction term and indicates a violation of the proportional hazards assumption.
Validating the Cox Model
Thorough model validation involves checking several aspects of model performance. Different types of residuals serve different diagnostic purposes.
Two graphical methods are commonly used. First, log-cumulative hazard plots: plot ln H(t) (or equivalently, ln(−ln S(t))) against ln(t) for each group. If the curves are roughly parallel, the PH assumption is reasonable. Second, observed vs predicted plots: compare the Kaplan-Meier survival curves to the Cox model-predicted curves for each group. Good agreement supports the model.
The most common statistical test uses Schoenfeld residuals. The test regresses the scaled Schoenfeld residuals against time (or a function of time). A significant P-value indicates that the effect of the predictor changes with time, violating the PH assumption. A global test that combines results across all predictors is also available.
Several measures assess how well the model fits and discriminates. Cox-Snell residuals assess overall goodness-of-fit. Harrell’s C concordance statistic measures discriminative ability—the proportion of all pairs of subjects that the model correctly orders by predicted risk. Values of C range from 0.5 (chance) to 1.0 (perfect discrimination). An R² analogue has also been proposed for survival models.
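A minimal sketch of the pairwise logic behind Harrell's C, with hypothetical risk scores (ties in event time are ignored here for simplicity):

```python
# Harrell's C: among usable pairs, the proportion where the subject with
# the higher predicted risk score fails earlier. Hypothetical data.
subjects = [  # (event_time, event, predicted_risk_score)
    (2, 1, 0.9), (4, 1, 0.5), (5, 0, 0.6), (7, 1, 0.3), (9, 0, 0.2),
]

concordant, usable = 0.0, 0
for ti, ei, ri in subjects:
    for tj, ej, rj in subjects:
        # A pair is usable only if the earlier time is an observed event;
        # otherwise we cannot tell which subject truly failed first.
        if ei == 1 and ti < tj:
            usable += 1
            if ri > rj:
                concordant += 1      # model ordered the pair correctly
            elif ri == rj:
                concordant += 0.5    # tied risk scores count half

C = concordant / usable
print(round(C, 3))
```

Censored subjects still contribute as the "later" member of a pair whenever the comparison subject's event was observed first, which is why C uses more information than simply ranking event times.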
All standard survival analysis methods assume that censoring is independent (non-informative)—meaning that censored subjects have the same future risk of the event as uncensored subjects who are still being followed. If sicker patients are more likely to drop out (informative censoring), the survival estimates will be biased. This assumption cannot be fully tested from the data, but sensitivity analyses can explore the impact of different censoring mechanisms.
Section 3 Knowledge Check
1. The Cox proportional hazards model assumes:
2. Schoenfeld residuals are primarily used to:
3. In a stratified Cox model, the strata:
Reflection
When the proportional hazards assumption is violated for a predictor, what are the practical implications for interpreting the hazard ratio? How would you communicate this to a clinical audience?
Parametric Models, AFT & Frailty
Why Parametric Models?
While the Cox model’s flexibility is a strength, parametric models offer important advantages when the distributional assumption is correct. By specifying the form of the baseline hazard h0(t), parametric models are more statistically efficient—they produce narrower confidence intervals and more powerful tests. They also allow direct estimation of the baseline hazard, prediction of survival times, and extrapolation beyond the observed data.
The Cox model (semi-parametric) leaves the baseline hazard unspecified—maximum flexibility but less efficiency. Parametric models specify the baseline hazard—more efficient if the assumption is correct, but biased if it is wrong. In practice, the Cox model is preferred when the shape of the baseline hazard is unknown or when the primary interest is in hazard ratios rather than absolute survival predictions. Parametric models are preferred when the distributional form is well justified or when extrapolation is needed.
Common Parametric Models
Exponential Model
The simplest parametric model assumes a constant hazard over time (Eq 19.17): h(t) = λ, with the predictors entering through ln λ = β0 + β1X1 + ⋯
The exponential model includes an intercept β0 (unlike the Cox model). The constant hazard assumption is very restrictive and is appropriate only when the risk truly does not change over time. This model has the “memoryless” property—the predicted future survival does not depend on how long the subject has already survived.
Weibull Model
The Weibull model extends the exponential by adding a shape parameter p (Eq 19.18): h(t) = λp·t^(p−1)
When p = 1, the Weibull reduces to the exponential. When p < 1, the hazard decreases over time; when p > 1, the hazard increases over time. The Weibull model can be assessed by plotting ln H(t) vs ln(t)—if the data follow a Weibull distribution, this plot should be approximately linear with slope p and intercept ln λ. The Weibull is the most commonly used parametric survival model because of its flexibility.
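The linearity diagnostic can be confirmed numerically: if H(t) = λt^p, then ln H(t) = ln λ + p·ln t, so successive slopes of ln H against ln t all equal p (parameter values below are illustrative):

```python
# Weibull diagnostic check: ln H(t) vs ln t should be linear with slope p.
import math

lam, p = 0.2, 1.5
ts = [1.0, 2.0, 4.0, 8.0]
lnH = [math.log(lam * t ** p) for t in ts]
lnt = [math.log(t) for t in ts]

# Slope between each pair of successive points recovers p exactly.
slopes = [(lnH[i + 1] - lnH[i]) / (lnt[i + 1] - lnt[i]) for i in range(3)]
print([round(s, 6) for s in slopes])
```

With real data, the empirical version of this plot uses the Nelson-Aalen estimate of H(t); approximate linearity then supports the Weibull assumption.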
Gompertz Model
The Gompertz model has a baseline hazard that changes exponentially with time (Eq 19.19): h(t) = λ·exp(pt)
The log of the baseline hazard is linear in time: ln h0(t) = ln λ + pt. When p > 0, the hazard increases exponentially; when p < 0, it decreases. The Gompertz model is widely used in demography and actuarial science because human mortality rates often increase approximately exponentially with age over much of the adult lifespan.
Accelerated Failure Time (AFT) Models
Parametric survival models can also be formulated as accelerated failure time models, which model the effect of predictors on the log of survival time directly, rather than on the hazard.
In the AFT framework, the measure of effect is the time ratio (TR = exp(β)) rather than the hazard ratio. A time ratio of 2 means the expected survival time is doubled; a TR of 0.5 means it is halved. The AFT model essentially “accelerates” or “decelerates” time for different covariate patterns.
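Time-ratio arithmetic in miniature (the coefficient and baseline median below are illustrative, not from any fitted model):

```python
# AFT interpretation: a coefficient beta on the log-time scale multiplies
# expected survival time by exp(beta). Illustrative values only.
import math

beta = 0.405
TR = math.exp(beta)            # time ratio, roughly 1.5
baseline_median = 10.0         # hypothetical median survival (months)

# Under the AFT model, the covariate pattern with this coefficient has a
# median survival stretched by the time ratio.
print(round(TR * baseline_median, 1))
```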
Choosing a Parametric Model
Several strategies help guide the choice of parametric distribution:
- Fit the generalised gamma model and test whether simpler models are adequate (κ = 1 for Weibull, κ = 0 for log-normal, etc.)
- Compare models using AIC (Akaike Information Criterion)—lower values indicate better fit after penalising for complexity
- Examine diagnostic plots (e.g., ln H(t) vs ln(t) for Weibull, ln H(t) vs t for Gompertz)
- Consider the biological plausibility of the hazard shape implied by each distribution
You are analysing time to recurrence of a tumour after surgical removal. You suspect the hazard may not be constant—risk is likely highest in the first year post-surgery and then declines. You fit several models: the exponential model gives AIC = 842; the Weibull gives AIC = 810 with p = 0.72 (decreasing hazard); the log-logistic gives AIC = 806 with a hazard that rises briefly and then falls. Testing within the generalised gamma family rejects the hypothesis that the shape parameter equals 1, confirming that the Weibull fits significantly better than the exponential. Based on AIC and biological plausibility, you select the log-logistic AFT model, which captures the initial rise in hazard followed by a decline as patients who survive the early period enter a lower-risk phase.
Frailty Models
Frailty models address the problem of unmeasured heterogeneity in survival data. Even after including all known predictors, individuals may differ in their underlying susceptibility to the event due to unmeasured factors. Frailty models introduce a random effect α that represents this unobserved heterogeneity.
Individual frailty accounts for unmeasured individual-level covariates that create extra variation in survival times beyond what measured predictors explain. It is analogous to overdispersion in Poisson models—just as the negative binomial adds extra-Poisson variation, the frailty model adds extra variation to the survival model. Subjects with α > 1 are more “frail” (higher hazard), while those with α < 1 are more robust. The frailty is typically assumed to follow a gamma or inverse Gaussian distribution with mean 1.
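For gamma-distributed frailty with mean 1 and variance θ acting multiplicatively on the hazard, the population-averaged survivor function has a closed form, S(t) = (1 + θH(t))^(−1/θ). The sketch below shows it collapsing to exp(−H(t)) as θ → 0, i.e., as unmeasured heterogeneity vanishes (values are illustrative):

```python
# Population-averaged survival under gamma frailty (mean 1, variance theta).
import math

def marginal_survival(H, theta):
    # Laplace transform of a gamma(1/theta, 1/theta) frailty at H.
    return (1 + theta * H) ** (-1 / theta)

H = 0.8                                  # cumulative hazard at some time t
for theta in (1.0, 0.5, 0.01):
    print(theta, round(marginal_survival(H, theta), 4))
print(round(math.exp(-H), 4))            # no-frailty reference value
```

Note that the marginal survival exceeds exp(−H(t)) whenever θ > 0: frail individuals fail early, so the surviving population is progressively selected toward lower-risk subjects.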
Shared frailty models account for clustering of observations within groups. All members of a group (e.g., patients treated at the same hospital, animals in the same herd) share the same frailty value αk. This is analogous to random effects in mixed models. The shared frailty induces positive within-group correlation in survival times—subjects in the same cluster tend to have more similar survival experiences than subjects in different clusters.
Several extensions handle multiple or recurrent events. Competing risks models handle situations where different types of events can occur (e.g., death from cancer vs death from heart disease). Recurrence data models handle repeated events (e.g., hospital readmissions). The Andersen-Gill model treats recurrent events within a subject as independent, while the Prentice-Williams-Peterson models account for event ordering. Discrete-time survival analysis is used when event times are recorded in intervals rather than exactly.
Section 4 Knowledge Check
1. The Weibull model reduces to the exponential model when:
2. In an AFT model, the time ratio (TR) is interpreted as:
3. Shared frailty models account for:
Reflection
When would you prefer a parametric survival model over a Cox model? What are the trade-offs between flexibility and efficiency?
Lesson 7 — Comprehensive Assessment
This final assessment covers all material from this lesson. You must answer all 15 questions correctly (100%) and complete the final reflection to finish the lesson.
Final Reflection
This chapter covered a wide range of survival analysis methods. If you were planning a study with time-to-event outcomes, what factors would guide your choice between non-parametric, semi-parametric (Cox), and parametric approaches?
Final Assessment (15 Questions)
1. Survival data is also known as:
2. Right censoring occurs when:
3. In an actuarial life table, the adjusted number at risk (rj) accounts for:
4. The Kaplan-Meier survivor function is:
5. The cumulative hazard H(t) is related to the survivor function by:
6. The log-rank test is equivalent to:
7. In the Cox model h(t) = h0(t)·exp(βX), the baseline hazard h0(t):
8. A hazard ratio of 1.5 for a predictor means:
9. The proportional hazards assumption can be tested using:
10. A log cumulative hazard plot (ln H(t) vs ln t) is used to assess:
11. In the Weibull model, if the shape parameter p = 0.5, the hazard:
12. The accelerated failure time model expresses the effect of predictors on:
13. Harrell’s C concordance statistic for a Cox model:
14. Individual frailty models are analogous to:
15. In recurrence data analysis, the Andersen-Gill model assumes: