# Lesson 8 — Survival Data (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5482 words • ~29.7 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 8, Survival Data. And this is one of my favorite lessons in the whole methods sequence, because survival analysis is genuinely its own world.

**Sarah:** What do you mean by its own world? Why isn't it just another flavor of regression?

**Kiffer:** Because the outcome itself is structured differently. In every regression we've done so far, the outcome is a single number, or a category, or a count. You measure it once and you're done. In survival analysis, the outcome is two pieces of information bound together. How long someone was followed, and whether the event happened during that follow-up window.

**Sarah:** And the question shifts from whether something happens to when it happens. That's a really important reframe. Time to death after a heart attack. Time to disease recurrence after cancer treatment. Time to first job after graduation. Time to next infection after recovery. Time to relapse after addiction treatment. Time to the next birth in a fertility study.

**Kiffer:** All of those have the same shape. There's a starting point, often called time zero or the origin. There's a clock that ticks forward. And there's an event of interest that may or may not happen during the period we're watching.

**Sarah:** And the wrinkle, the thing that makes survival analysis its own toolbox rather than a special case of linear regression, is that we usually don't get to see the event happen for everybody. The clock runs out. People disappear. The study ends.

**Kiffer:** That problem is called censoring, and it's the central methodological challenge of the whole field. Spend ten minutes with censoring and you understand most of what makes survival analysis distinctive.

**Sarah:** So let's set the scene. The lesson has four sections in the textbook, but we're going to focus on the three big content sections that matter most for a beginning epidemiology student. Section one is descriptive survival analysis with Kaplan-Meier curves, the log-rank test, and median survival time. Section two is Cox proportional hazards regression, which is the workhorse multivariable model. And section three is diagnostics and competing risks, which is where things get interesting in the real world.

**Kiffer:** And before we go deeper, two quick notes about why survival data are different from anything else we've covered. First, survival times have a hard floor at zero. You can't have negative survival time. Time runs forward. So the distribution is bounded below at zero by construction.

**Sarah:** Second, the distribution of survival times is almost always right-skewed. Most events happen relatively early. A few people last a really long time. That long tail pulls the mean to the right and makes the mean a bad summary statistic. The median tends to be more honest.

**Kiffer:** Right. And those two structural features, the floor at zero and the right skew, are part of why standard linear regression is inappropriate. Linear regression assumes roughly symmetric, normally distributed residuals around a mean. Survival times don't behave that way. And linear regression has no way to use a partially observed outcome, which is the bigger problem.

**Sarah:** Okay. Section one. Kaplan-Meier curves. Let's start with censoring, because that's the foundation everything else sits on.

**Kiffer:** Censoring is what happens when we have incomplete information about someone's event time. We know they made it to a certain point without the event. We don't know what happened after that point. The simplest example is right censoring, which is also the most common.

**Sarah:** Right censoring means the event time, if it ever occurs, is to the right of where we last observed the person on the time axis. They survived at least until the censoring time. We don't know whether they had the event later.

**Kiffer:** And there are three big reasons people get right-censored in epidemiologic studies. Reason one. They get lost to follow-up. They moved. They stopped answering the phone. They dropped out of the registry. The clinic lost their chart. Whatever the reason, the study can no longer track them, and at the moment they disappear we know only that they were event-free.

**Sarah:** Reason two. They die from a competing cause before the outcome of interest can occur. Imagine you're studying time to cancer recurrence and one of your participants gets hit by a bus. They didn't recur, but they also can't recur anymore. We'll come back to competing risks because they need their own treatment. But for the standard Kaplan-Meier framework, that participant gets censored at the time of the competing event.

**Kiffer:** And reason three. The study ends. This one is structural rather than behavioral. You enrolled participants over a five-year window. You followed them until December thirty-first of a particular year. Anyone still event-free at that closing date is administratively censored. Their time variable is the time from enrollment to the end of the study. Their event indicator is zero.

**Sarah:** And the methodologically critical point is that censoring carries real information. A censored observation is not the same as a missing observation. We know something. We know the person made it to time t without the event. The methods are designed to use that information rather than throw it away.

**Kiffer:** If you ignored censoring and just analyzed the people who had the event, you'd be doing something pretty weird. You'd be saying the average survival time among people who happened to die within the study window is the average survival time, full stop. But that's not the average survival time of the population. That's the average among the unlucky subset.

**Sarah:** Or if you treated censoring as if the event happened at the censoring time, you'd be claiming everyone died at their censoring time. Which is the opposite kind of mistake. You'd dramatically overestimate the event rate.

**Kiffer:** So we need a method that uses the censored observations honestly. They contribute information up until the moment they're censored, and then they leave the dataset. That's exactly what the Kaplan-Meier estimator does.

**Sarah:** Let's name it carefully. Kaplan-Meier, or K M for short. Sometimes called the product-limit estimator. Introduced by Edward Kaplan and Paul Meier in nineteen fifty-eight. It's the most widely used non-parametric method for estimating the survival function in the world. Non-parametric meaning we make no assumption about the shape of the underlying distribution.

**Kiffer:** And the survival function itself, often written as S of t, is the probability that an individual is still event-free at time t. So S of zero equals one. Everyone is alive at the start. As time goes on, S of t can only stay the same or go down. It can never go up. And if we follow people long enough, in most settings, it will eventually approach zero.

**Sarah:** And the Kaplan-Meier estimator computes that survival probability empirically by walking through the observed event times and asking, at each event time, what fraction of the people who were still being followed survived this particular moment? Then it multiplies those fractions together to get the cumulative survival probability.

**Kiffer:** That's why it's called the product-limit estimator. It's a product of conditional survival probabilities. Step by step, event by event.
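
A minimal Python sketch of the product-limit calculation just described, using only the standard library. The lesson's own activity uses R; the function name `kaplan_meier` and the six-person toy cohort below are made up purely for illustration.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(event_time, at_risk, deaths, survival)] for right-censored data.

    times  : follow-up time for each person
    events : 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    steps = []
    for t in sorted(deaths):
        # Risk set: everyone whose follow-up lasts at least until t.
        at_risk = sum(1 for u in times if u >= t)
        # Multiply in the conditional probability of surviving this event time.
        survival *= 1 - deaths[t] / at_risk
        steps.append((t, at_risk, deaths[t], survival))
    return steps

# Hypothetical toy cohort (made-up numbers, not from the lesson).
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]

for t, n, d, s in kaplan_meier(times, events):
    print(f"t={t}: at risk={n}, events={d}, S(t)={s:.3f}")
```

Censored people (event flag 0) never trigger a drop here. They simply stop counting toward the risk set once their follow-up time has passed.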

**Sarah:** Let me try to walk through what reading a Kaplan-Meier curve actually looks like. Picture the plot. The horizontal axis is time. Days, months, years, whatever your follow-up unit is. The vertical axis is the estimated probability of remaining event-free, which runs from zero to one.

**Kiffer:** The curve starts in the upper left corner, at time zero and probability one. Everyone is event-free at the start. Then as time moves forward to the right, the curve drops at each observed event time. Each drop multiplies the current height by one minus the number of events divided by the number of people still at risk just before the event.

**Sarah:** And between events the curve is flat. It only changes at event times. That's why it looks like a staircase. It's a step function. Piecewise constant between events. Right-continuous, meaning at each event time the function takes the lower value. And non-increasing. The estimated survival probability can only go down or stay the same.

**Kiffer:** Censored observations don't cause the curve to drop. Instead they're typically marked with little tick marks on the curve. A tick at year three means somebody was censored at year three. They left the risk set without having the event. The curve continues onward but with a smaller risk set, so the next drop will be slightly bigger.

**Sarah:** And the at-risk set is one of those concepts that beginning students sometimes glide past, but it really does the work. The at-risk set at time t is the group of people who, just before time t, are still being followed and have not yet had the event. They could in principle have the event right now. Censored observations are removed from the at-risk set at their censoring time.

**Kiffer:** So as time moves forward, the risk set shrinks. People leave it either because they had the event or because they were censored. Each event multiplies the curve by one minus one over the current size of the risk set. With no censoring at all, every event would drop the curve by the same absolute amount, one over the original sample size. It's censoring that thins the risk set without lowering the curve. So an event later in follow-up, after people have been censored, produces a bigger visible drop than an event near the start when the risk set was large.

**Sarah:** Now an important assumption underneath all of this. Independent censoring. Sometimes called non-informative censoring. The idea is that the people who get censored at time t have, on average, the same future risk of the event as the people who remain under observation.

**Kiffer:** If that assumption is wrong, the Kaplan-Meier estimator is biased. Suppose sicker patients drop out of follow-up because they're feeling worse and stop showing up to clinic. Then the people who remain under observation are systematically healthier. The Kaplan-Meier curve will overestimate survival, because the at-risk set has been enriched with the people who weren't going to have the event anyway.

**Sarah:** There's a really nice interactive widget in the lesson that lets students dial up what's called informative censoring. Keep the censoring rate low and crank the informative censoring slider, so sicker patients drop out preferentially. Then you watch the Kaplan-Meier curve drift higher and higher above the truth. It's a beautiful demonstration of how an assumption violation breaks an estimator.
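
The same effect can be reproduced with a small simulation. This is not the lesson's widget, only a sketch under stated assumptions: half the cohort is "sick" with an event hazard of 0.30 per year, half is "healthy" at 0.05 per year, and the dropout rates are equally invented. All function names are hypothetical.

```python
import math
import random

def simulate(n, informative, seed=0):
    """Simulate (times, events). Every rate below is a made-up assumption."""
    rng = random.Random(seed)
    times, events = [], []
    for i in range(n):
        sick = i % 2 == 0
        event_time = rng.expovariate(0.30 if sick else 0.05)
        if informative:
            # Only sick people drop out, so censoring depends on future risk.
            censor_time = rng.expovariate(0.30) if sick else math.inf
        else:
            # Everyone drops out at the same rate, regardless of risk.
            censor_time = rng.expovariate(0.15)
        times.append(min(event_time, censor_time))
        events.append(1 if event_time <= censor_time else 0)
    return times, events

def km_at(times, events, horizon):
    """Kaplan-Meier survival at `horizon`, assuming no tied times."""
    survival, at_risk = 1.0, len(times)
    for t, e in sorted(zip(times, events)):
        if t > horizon:
            break
        if e == 1:
            survival *= 1 - 1 / at_risk
        at_risk -= 1  # the person leaves the risk set either way
    return survival

horizon = 5.0
# True survival at the horizon is the 50/50 mixture of the two groups.
truth = 0.5 * math.exp(-0.30 * horizon) + 0.5 * math.exp(-0.05 * horizon)
km_independent = km_at(*simulate(20000, informative=False), horizon)
km_informative = km_at(*simulate(20000, informative=True), horizon)

print(f"true S(5)               = {truth:.3f}")
print(f"KM, independent dropout = {km_independent:.3f}")
print(f"KM, informative dropout = {km_informative:.3f}")
```

Under these assumptions the informative-dropout estimate should land noticeably above the true value, while the independent-dropout estimate should sit close to it.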

**Kiffer:** And the lesson is honest that this assumption can't really be tested from the data. You have to argue it on substantive grounds. Why are people getting censored? Is the censoring mechanism plausibly independent of future event risk? Sensitivity analyses can explore the impact of different scenarios, but you can't prove the assumption holds.

**Sarah:** Okay, now we have a Kaplan-Meier curve. The next question is how to compare two curves. Imagine you have a treatment group and a control group, each with their own Kaplan-Meier curve, and the curves separate. Is that separation real or just chance?

**Kiffer:** That's the log-rank test. Sometimes also called the Mantel-Cox test. It's the most widely used test for comparing survival curves between two or more groups. It's a non-parametric test, which means it doesn't assume any particular distribution for survival times.

**Sarah:** The intuition is that at each observed event time, we calculate how many events would be expected in each group if the survival distributions were truly identical. We compare that to how many events actually occurred. We sum the discrepancies across all event times. The resulting statistic follows approximately a chi-squared distribution.
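
The observed-versus-expected bookkeeping can be sketched in a few lines of Python. This is an illustrative two-group version with the usual hypergeometric variance, not the lesson's R code. The function name and the six-person data set are hypothetical.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-group log-rank test; `groups` holds 0 or 1 for each person."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed = expected = variance = 0.0
    for t in event_times:
        n = n1 = d = d1 = 0
        for u, e, g in zip(times, events, groups):
            if u >= t:                 # still at risk just before t
                n += 1
                n1 += g
                if u == t and e == 1:  # has the event exactly at t
                    d += 1
                    d1 += g
        observed += d1
        expected += d * n1 / n         # group-1 events expected if curves are identical
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return observed, expected, chi2, p_value

# Hypothetical toy data: group 0 is control, group 1 is treated.
times = [1, 2, 4, 3, 5, 6]
events = [1, 1, 1, 1, 0, 1]
groups = [0, 0, 0, 1, 1, 1]

obs, exp_, chi2, p = logrank_two_groups(times, events, groups)
print(f"observed={obs:.0f}, expected={exp_:.2f}, chi2={chi2:.2f}, p={p:.3f}")
```

With only six people the chi-squared approximation is not trustworthy. The toy data are there to make the arithmetic easy to follow by hand.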

**Kiffer:** And the log-rank test is most powerful when the hazards are proportional, meaning the ratio of hazards between the two groups is constant over time. We're going to come back to that assumption when we talk about Cox regression because it's the same assumption.

**Sarah:** If the curves cross, the log-rank test loses power. It assumes a consistent direction of difference. When two survival curves cross, one group does better early and the other does better later, and the log-rank test averages those effects toward zero.

**Kiffer:** There are alternatives. The Wilcoxon test, also called the Breslow or generalized Wilcoxon, gives more weight to early time points when the risk set is larger. Tarone-Ware uses the square root of the at-risk number as a weight, splitting the difference. Peto-Peto-Prentice uses the Kaplan-Meier estimate itself as a weight. Each of these is more sensitive to early differences than the standard log-rank.

**Sarah:** But the log-rank is the default. If you only know one survival comparison test, that's the one to know.

**Kiffer:** Then there's median survival time. The simplest summary statistic for survival data. The median is the first time at which the Kaplan-Meier curve drops to zero point five or below. By that time, an estimated half of the cohort has had the event. Half hasn't.

**Sarah:** And it's a much better summary than the mean, for two reasons. First, the mean is heavily affected by the long right tail. A few people who survive forever can pull the mean way up. Second, you can compute the median even when the curve never reaches zero. As long as the curve crosses zero point five somewhere in the observed window, you have a median.

**Kiffer:** If the curve doesn't reach zero point five, the median survival time is undefined. Sometimes you'll see a paper report something like, median survival not reached, with however many years of follow-up. That's a clue the cohort was followed long enough to make some claims, but not long enough for the event to happen to half of them.
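
A short sketch of how the median falls out of the Kaplan-Meier steps, including the "not reached" case. The function names and both toy cohorts are hypothetical. The rule used is the common convention of taking the first event time at which the curve is at or below 0.5.

```python
def km_curve(times, events):
    """Kaplan-Meier steps as [(event_time, survival)]."""
    survival, steps = 1.0, []
    for t in sorted({u for u, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        steps.append((t, survival))
    return steps

def median_survival(times, events):
    """First event time where S(t) <= 0.5, or None if the median is not reached."""
    for t, s in km_curve(times, events):
        if s <= 0.5:
            return t
    return None

times = [2, 3, 3, 5, 7, 8]

# Four of six people have the event: the curve crosses 0.5 at t = 5.
print(median_survival(times, [1, 1, 0, 1, 0, 1]))  # 5

# Heavy censoring: the curve never gets down to 0.5.
print(median_survival(times, [1, 0, 0, 0, 0, 0]))  # None
```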

**Sarah:** And you can also report n-year survival, like five-year survival or ten-year survival. The proportion of people event-free at a fixed time horizon. That's standard in oncology, where five-year and ten-year survival are the canonical clinical summaries.

**Kiffer:** Okay. That's section one. Kaplan-Meier curves describe survival in a single sample. Log-rank tests compare survival across groups. Median survival summarizes the curve in a single number. All non-parametric. No assumptions about the shape of the hazard.

**Sarah:** And those tools are descriptive. They tell you what's happening. They don't help much when you have multiple predictors and you want to estimate adjusted effects. For that we need regression. Section two.

**Kiffer:** Section two. Cox proportional hazards regression. Introduced by Sir David Cox in nineteen seventy-two. The most widely used regression model for survival data on the planet.

**Sarah:** And before we write down the model, let's define what we're modeling. Cox regression models the hazard function. Not the survival function. Not the cumulative incidence function. The hazard function.

**Kiffer:** The hazard function, often written h of t, is the instantaneous rate at which events are occurring at time t, conditional on having survived up to time t. So it's a rate, not a probability. It can exceed one. The units are events per unit time.

**Sarah:** The conditional part is critical. The hazard at time t is asking, among the people who are still event-free at time t, how fast are events happening right now? It's not the unconditional rate. It's the rate among the survivors.

**Kiffer:** And the hazard is the function most directly tied to biological or clinical mechanism. If you want to know how risk evolves over time, you look at the hazard. A constant hazard means risk is the same regardless of how long you've already survived. An increasing hazard means risk grows over time, like with most age-related diseases. A decreasing hazard means risk is highest early and falls, which is what you see in post-surgical mortality where the riskiest period is right after the operation.

**Sarah:** Okay. The Cox model. Let me say it in words because we agreed not to use formula symbols. The hazard for individual i at time t equals the baseline hazard at time t, multiplied by the exponential of a linear combination of predictors. The linear combination is beta one times X one plus beta two times X two and so on, where beta one through beta k are coefficients and X one through X k are the predictors.

**Kiffer:** And the baseline hazard at time t is the hazard for someone whose predictors are all set to zero. It's a function of time. It can take any shape. Increasing, decreasing, U-shaped, anything. The Cox model leaves it completely unspecified.

**Sarah:** That's the magic move. That's why the Cox model is called semi-parametric. The effect of the predictors is parametric. We assume each predictor multiplies the hazard by a fixed factor. But the underlying hazard over time, the baseline hazard, is non-parametric. We don't assume it follows any particular distribution.

**Kiffer:** And that semi-parametric structure is why the Cox model is the default first model in most applied survival analyses. You don't have to commit to a Weibull or exponential or Gompertz or log-logistic distribution before you've even started analyzing the data. You let the baseline hazard be whatever it wants to be, and you focus on estimating the effects of the predictors.

**Sarah:** Now the coefficients. The betas. They're log hazard ratios. So beta one is the natural log of the hazard ratio for a one-unit increase in X one. The exponential of beta one is the hazard ratio itself.

**Kiffer:** Let me make this concrete. Suppose your predictor is treatment, coded zero for control and one for the new drug. And suppose the estimated coefficient is negative zero point four seven. The exponential of negative zero point four seven is zero point six three. So the hazard ratio is zero point six three.

**Sarah:** Which means the new drug group has zero point six three times the instantaneous event rate of the control group at every time point. Or equivalently, a thirty-seven percent reduction in the instantaneous event rate.
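
The arithmetic behind that example, as a small Python sketch. The helper name `relative_hazard` is hypothetical; the coefficient is the one quoted in the episode.

```python
import math

def relative_hazard(betas, x):
    """Hazard relative to baseline: exp(beta_1 * x_1 + ... + beta_k * x_k)."""
    return math.exp(sum(b * v for b, v in zip(betas, x)))

beta_treatment = -0.47  # coefficient quoted in the episode

# Treated person (x = 1) compared with control (x = 0, relative hazard exactly 1).
hr = relative_hazard([beta_treatment], [1])
print(f"hazard ratio     = {hr:.3f}")
print(f"change in hazard = {(hr - 1) * 100:.1f}%")
```

Carried to three decimals the hazard ratio is about 0.625, which is where the rounded figures of zero point six three and roughly thirty-seven percent come from.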

**Kiffer:** And a hazard ratio above one means increased hazard. Below one means decreased hazard. Right at one means no effect. Same logic as a relative risk or an odds ratio. The interpretation feels familiar.

**Sarah:** But hazard ratios are not the same as relative risks. The hazard is a rate, not a probability. So a hazard ratio of two doesn't mean exposed people are twice as likely to have the event by some specific time. It means their instantaneous event rate, given survival up to that time, is twice as high.

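**Kiffer:** Over short time windows, or when events are rare, hazard ratios and relative risks can look similar. Over long time windows, they can diverge a lot, because cumulative probabilities are capped at one. As events pile up in both groups, the relative risk drifts toward one even while the hazard ratio stays put.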
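
A quick numerical illustration, assuming constant hazards in both groups (an exponential model chosen only to keep the arithmetic simple). The control hazard of 0.10 per person-year and the hazard ratio of two are hypothetical.

```python
import math

def risk_by(t, hazard):
    """Probability of having had the event by time t under a constant hazard."""
    return 1 - math.exp(-hazard * t)

control_hazard = 0.10                  # hypothetical: events per person-year
hazard_ratio = 2.0
exposed_hazard = hazard_ratio * control_hazard

for years in (0.5, 5, 20):
    rr = risk_by(years, exposed_hazard) / risk_by(years, control_hazard)
    print(f"{years:>4} years: risk ratio = {rr:.2f}")
```

The hazard ratio is two at every moment by construction, yet the risk ratio starts just under two and shrinks as follow-up lengthens and most people in both groups have had the event.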

**Sarah:** Now let's talk about the proportional hazards assumption. This is the assumption that gives the Cox model its name and also its biggest limitation.

**Kiffer:** The proportional hazards assumption says that the ratio of hazards between any two individuals, or any two groups, is constant over time. Both hazards can change however they like over time. They can increase, decrease, do whatever. But the ratio between them stays fixed.

**Sarah:** So if treatment halves the hazard at one month, treatment must also halve the hazard at twelve months. The effect is constant on the multiplicative scale, even if the absolute risk is changing.

**Kiffer:** And graphically, on a log-cumulative-hazard plot, this means the curves for two groups should be roughly parallel. The log of the cumulative hazard is just a vertical translation between groups.

**Sarah:** When the assumption is wrong, the estimated hazard ratio becomes a weighted average of the time-varying effects. Which is fine if you just want a summary, but the interpretation gets murky. You can't say cleanly that the hazard ratio is one point five at all time points. It's some average, weighted in a way you didn't choose.

**Kiffer:** Three main remedies when the assumption is violated. Remedy one is stratification. You let the offending variable have its own baseline hazard. The variable doesn't get a coefficient. Instead, each stratum gets its own baseline hazard, of whatever shape the data suggest. The other coefficients remain interpretable.

**Sarah:** Stratification works really well for variables you have to adjust for but don't care about estimating. Like study site. If hazards differ between sites in a non-proportional way, stratify. The site effect becomes flexible. The treatment effect is still estimated cleanly across all sites.

**Kiffer:** Remedy two is time-varying coefficients. You let the coefficient for the offending variable depend on time. You add a predictor times function-of-time interaction term. The hazard ratio is now allowed to change over follow-up.

**Sarah:** And remedy three is to switch model families entirely. Accelerated failure time models, or A F T for short, take a different parameterization. They model the log of survival time directly as a linear combination of predictors. The measure of effect is a time ratio rather than a hazard ratio.

**Kiffer:** A time ratio of two means the expected survival time is doubled. A time ratio of zero point five means it's halved. The accelerated failure time model essentially stretches or shrinks the time axis based on covariates rather than scaling the hazard.
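
The "stretching the time axis" idea can be checked numerically. The sketch below assumes a Weibull baseline with made-up shape and scale values; the function names are hypothetical.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # hypothetical Weibull baseline parameters

def baseline_survival(t):
    """Baseline survival: S0(t) = exp(-(t / SCALE) ** SHAPE)."""
    return math.exp(-((t / SCALE) ** SHAPE))

def aft_survival(t, time_ratio):
    """AFT model: the covariate rescales the clock, S1(t) = S0(t / time_ratio)."""
    return baseline_survival(t / time_ratio)

# Baseline median solves S0(m) = 0.5.
baseline_median = SCALE * math.log(2) ** (1 / SHAPE)

for tr in (0.5, 2.0):
    stretched_median = tr * baseline_median
    print(f"time ratio {tr}: S1({stretched_median:.2f}) = "
          f"{aft_survival(stretched_median, tr):.3f}")
```

In both cases the survival probability at the rescaled median comes out at one half, so a time ratio of two doubles the median and a time ratio of zero point five halves it.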

**Sarah:** And accelerated failure time models don't require proportional hazards. They have their own assumptions, like a particular parametric form for the survival distribution, but they accommodate non-proportional effects more naturally.

**Kiffer:** One more piece before we leave Cox. How do you actually estimate the model? I want to mention this because it explains why the model is even mathematically possible. The technique is called partial likelihood.

**Sarah:** Standard maximum likelihood would require you to specify the baseline hazard. Otherwise the likelihood isn't defined. But Cox showed that you can construct a partial likelihood that doesn't depend on the baseline hazard at all.

**Kiffer:** The partial likelihood considers only the ordering of events. At each event time, it asks, given who was at risk and given that one event occurred right now, what's the probability that this particular individual was the one to fail? The baseline hazard cancels out of that conditional probability.

**Sarah:** Which is mathematically beautiful. You estimate the betas by maximizing the partial likelihood. The baseline hazard never enters. You can recover an estimate of the baseline hazard afterwards if you want it, using something called the Breslow estimator. But for inference about the predictors, you don't need it.
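
A bare-bones sketch of partial-likelihood estimation for a single covariate with no tied event times, fitted by Newton-Raphson. This is an illustration, not how production software is implemented. The function names and the six-person data set are hypothetical.

```python
import math

def partial_loglik_terms(beta, times, events, x):
    """Return (log partial likelihood, score, information) for one covariate."""
    loglik = score = info = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored people only ever appear inside risk sets
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk]
        total = sum(weights)
        mean_x = sum(w * x_j for w, x_j in zip(weights, risk)) / total
        mean_x2 = sum(w * x_j ** 2 for w, x_j in zip(weights, risk)) / total
        # Probability that this person, out of the risk set, is the one who fails.
        loglik += beta * x_i - math.log(total)
        score += x_i - mean_x
        info += mean_x2 - mean_x ** 2
    return loglik, score, info

def fit_cox_one_covariate(times, events, x, steps=25):
    """Maximize the partial likelihood with a fixed number of Newton steps."""
    beta = 0.0
    for _ in range(steps):
        _, score, info = partial_loglik_terms(beta, times, events, x)
        beta += score / info
    return beta

# Hypothetical toy data: x = 1 for treated, 0 for control.
times = [1, 2, 4, 3, 5, 6]
events = [1, 1, 1, 1, 0, 1]
x = [0, 0, 0, 1, 1, 1]

beta_hat = fit_cox_one_covariate(times, events, x)
print(f"beta = {beta_hat:.3f}, hazard ratio = {math.exp(beta_hat):.3f}")
```

Notice that nothing in the code refers to a baseline hazard. Only the ordering of events and the membership of each risk set enter the calculation.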

**Kiffer:** There's also a small technical issue called ties. When two or more events occur at exactly the same observed time, the partial likelihood gets complicated. Three approaches are commonly used. Breslow, the simplest, fine when ties are rare. Efron, slightly better, the default in R's survival package, while several other packages default to Breslow. And exact methods, computationally intensive but accurate when ties are common.

**Sarah:** Most of the time you don't think about ties. The defaults are fine. But it's worth knowing that when survival times are recorded only in coarse units, like whole days or whole months, ties become common and the choice of approximation can matter.

**Kiffer:** Section three. Diagnostics and competing risks. Because once you've fit a Cox model, you really should check whether the proportional hazards assumption holds for each predictor.

**Sarah:** And the standard tool for that is Schoenfeld residuals. Sometimes called partial residuals. Let me try to define them informally. At each event time, we look at the predictor values for the person who failed and compare them to the average predictor values among everyone who was still at risk, weighted by each person's estimated relative hazard.

**Kiffer:** If the proportional hazards assumption holds, those discrepancies should average out to zero, with no systematic relationship to time. The plot of Schoenfeld residuals against time should look like random scatter around a horizontal line at zero.

**Sarah:** If the assumption is violated, the residuals will trend over time. An upward trend means the log hazard ratio for that predictor is increasing over follow-up. A downward trend means it is decreasing. A U-shape means it falls and then rises again. The pattern tells you what kind of violation you have.
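
The informal definition translates into a few lines of Python. This sketch computes unscaled Schoenfeld residuals for one covariate with no tied event times. The function name and toy data are hypothetical, and `beta_hat` is assumed to have come from a previously fitted Cox model.

```python
import math

def schoenfeld_residuals(beta, times, events, x):
    """Unscaled Schoenfeld residuals as [(event_time, residual)], one covariate."""
    residuals = []
    for t_i, e_i, x_i in sorted(zip(times, events, x)):
        if e_i != 1:
            continue  # residuals are defined only at event times
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk]
        # Risk-weighted average covariate value among people still at risk.
        expected_x = sum(w * x_j for w, x_j in zip(weights, risk)) / sum(weights)
        residuals.append((t_i, x_i - expected_x))
    return residuals

# Hypothetical toy data: x = 1 for treated, 0 for control.
times = [1, 2, 4, 3, 5, 6]
events = [1, 1, 1, 1, 0, 1]
x = [0, 0, 0, 1, 1, 1]
beta_hat = -1.688  # assumed: close to the partial-likelihood estimate for these data

for t, r in schoenfeld_residuals(beta_hat, times, events, x):
    print(f"t={t}: residual={r:+.3f}")
```

At the fitted coefficient the residuals sum to approximately zero. The diagnostic question is whether they drift with time, which takes far more than six observations to judge.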

**Kiffer:** There's a formal statistical test based on the Schoenfeld residuals. In R, the statistical software, the function cox dot z p h runs this test. It regresses the scaled Schoenfeld residuals against time, or some function of time, and tests whether the slope is zero. A small p-value flags a predictor whose effect is changing over time.

**Sarah:** And there's a global test as well, which combines results across all predictors. If the global test is significant, you know something is wrong somewhere in the model. The per-predictor tests then tell you which variables are the culprits.

**Kiffer:** There are graphical complements too. The log-cumulative hazard plot we mentioned earlier. Plot the log of the negative log of the survival function against the log of time, separately for each group. Roughly parallel curves support proportional hazards. Crossing or fanning curves suggest violation.

**Sarah:** Now let's talk about competing risks, which is one of the most important real-world wrinkles in survival analysis. Competing risks come up when there are multiple possible event types and the occurrence of one prevents the others from happening.

**Kiffer:** The classic example is cause-specific mortality. Suppose you're studying death from cancer in a cohort of older adults. Some of those adults will die from cardiovascular disease before they have a chance to die from cancer. Death from cardiovascular disease is a competing event. Once it happens, the participant can no longer die from cancer. The two outcomes are not just different, they're mutually exclusive.

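**Sarah:** And the standard Kaplan-Meier approach handles this incorrectly if you treat the competing event as censoring and then read one minus the curve as the probability of the event. That reading assumes the censored individuals would have continued to be at risk for the event of interest. But people who died from cardiovascular disease are not at risk for cancer death. They're permanently out of the running. A Cox model with the competing event censored has a related limitation. Its hazard ratios describe the event rate among people still alive, but they don't translate directly into absolute risk.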

**Kiffer:** If you treat death from cardiovascular disease as censoring, one minus the Kaplan-Meier estimate will overstate the cumulative incidence of cancer death. It assumes those censored people would have eventually died from cancer at the same rate as the people still alive. But they wouldn't, because they're already dead.

**Sarah:** So the methods need to acknowledge competing events explicitly. There are two main approaches and they answer slightly different questions.

**Kiffer:** Approach one is cause-specific hazards. You fit a separate Cox model for each cause. For cancer death, the event indicator is one only for cancer deaths. Deaths from other causes are censored. For cardiovascular death, the event indicator is one only for cardiovascular deaths. Cancer deaths are censored. You get cause-specific hazard ratios for each event type.

**Sarah:** And cause-specific hazards are the natural quantity if your scientific question is about etiology. Among people still alive, what's the rate of cancer death? What predictors affect that rate?

**Kiffer:** But cause-specific hazard ratios don't translate directly into the cumulative incidence of the event of interest. Because the cumulative incidence of cancer death depends on both the hazard of cancer death and the hazards of all the competing causes. If a treatment lowers the cause-specific hazard of cancer death but also lowers the hazard of cardiovascular death, the cumulative incidence of cancer death might actually go up because people are surviving longer to eventually die of cancer.

**Sarah:** Which leads to approach two. The Fine and Gray subdistribution hazard model. Introduced by Jason Fine and Robert Gray in nineteen ninety-nine. This model directly estimates the cumulative incidence function while accounting for competing risks.

**Kiffer:** The clever piece in the Fine and Gray model is the redefinition of the at-risk set. People who experience a competing event are kept in the at-risk set for the event of interest, but with a weight that decays over time. That sounds weird because they can't actually have the event of interest anymore. But mathematically it's what produces a hazard that maps directly onto the cumulative incidence function.

**Sarah:** And the coefficients in a Fine and Gray model are subdistribution hazard ratios. They tell you about the cumulative incidence of the event of interest in the presence of competing risks. So the interpretation is a bit different than a cause-specific hazard ratio.

**Kiffer:** Practically, if your scientific question is about absolute risk, like what's the five-year probability of dying from cancer in this cohort given competing causes of death, the Fine and Gray model is what you want. If your question is about etiology, like what biological factors affect the rate of cancer death itself, cause-specific hazards are more appropriate.

**Sarah:** And in practice, careful analyses often report both. They give the cause-specific hazard ratios for understanding mechanism. And they give the Fine and Gray subdistribution hazard ratios or directly the cumulative incidence functions for understanding absolute risk.

**Kiffer:** There's a related technique worth mentioning. The cumulative incidence function itself, sometimes abbreviated C I F, can be estimated non-parametrically, like a competing-risks-adjusted Kaplan-Meier. The Aalen-Johansen estimator gives you cumulative incidence curves for each event type that respect the competing risks structure.
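
To see the size of the gap, here is a sketch that computes the Aalen-Johansen cumulative incidence next to the naive one-minus-Kaplan-Meier figure. The function names, cause codes, and six-person data set are all hypothetical.

```python
def aalen_johansen(times, causes, cause_of_interest):
    """Cumulative incidence of one cause; `causes` uses 0 for censored."""
    overall_survival, cif, curve = 1.0, 0.0, []
    for t in sorted({u for u, c in zip(times, causes) if c != 0}):
        at_risk = sum(1 for u in times if u >= t)
        d_interest = sum(1 for u, c in zip(times, causes)
                         if u == t and c == cause_of_interest)
        d_any = sum(1 for u, c in zip(times, causes) if u == t and c != 0)
        # Chance of being event-free just before t, then failing from this cause at t.
        cif += overall_survival * d_interest / at_risk
        overall_survival *= 1 - d_any / at_risk
        curve.append((t, cif))
    return curve

def one_minus_km(times, causes, cause_of_interest):
    """Naive estimate: treat every other cause as censoring, then take 1 - KM."""
    survival, curve = 1.0, []
    for t in sorted({u for u, c in zip(times, causes) if c == cause_of_interest}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, c in zip(times, causes)
                if u == t and c == cause_of_interest)
        survival *= 1 - d / at_risk
        curve.append((t, 1 - survival))
    return curve

# Hypothetical coding: 0 = censored, 1 = cancer death, 2 = cardiovascular death.
times = [1, 2, 3, 4, 5, 6]
causes = [2, 1, 2, 1, 0, 1]

aj = aalen_johansen(times, causes, cause_of_interest=1)
naive = one_minus_km(times, causes, cause_of_interest=1)
print(f"Aalen-Johansen cancer-death incidence at end: {aj[-1][1]:.3f}")
print(f"Naive 1 - KM at end:                          {naive[-1][1]:.3f}")
```

In this toy cohort the naive figure climbs all the way to one, even though two of the six people died of cardiovascular disease and could never die of cancer. The Aalen-Johansen figure stays below it at every event time.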

**Sarah:** Okay. Let me try to pull all of this together because we covered a lot.

**Kiffer:** Yeah, takeaways. There are six big ones I'd want a beginning epidemiology student to leave with.

**Sarah:** Takeaway one. Survival analysis is for time-to-event outcomes. The question is when, not whether. The outcome is a pair, time and event indicator. The fundamental challenge is censoring, which is incomplete information about event times for some individuals. Censoring carries real information and the methods are designed to use it honestly.

**Kiffer:** Takeaway two. The Kaplan-Meier estimator is the non-parametric workhorse for describing survival in a sample. It's a step function that drops at each event time. It assumes independent censoring, which means that censoring is unrelated to future event risk. That assumption can't be tested directly. You argue it on substantive grounds.

**Sarah:** Takeaway three. The log-rank test compares survival curves across groups. It's most powerful when hazards are proportional. Median survival time is a robust summary. It's the time at which the curve crosses zero point five. If the curve doesn't reach zero point five, the median is undefined.

**Kiffer:** Takeaway four. Cox proportional hazards regression is the default multivariable model for survival data. It models the hazard function. It's semi-parametric because the baseline hazard is left unspecified. Coefficients are log hazard ratios. Exponentials of coefficients are hazard ratios. Estimation uses partial likelihood, which sidesteps the baseline hazard entirely.

**Sarah:** Takeaway five. The proportional hazards assumption is that the ratio of hazards between any two individuals is constant over time. Test it with Schoenfeld residuals plotted against time, or use the cox dot z p h statistical test. When the assumption is violated, options include stratification by the offending variable, time-varying coefficients, or accelerated failure time models that don't require proportional hazards.

**Kiffer:** And takeaway six. Competing risks arise when multiple event types are mutually exclusive. Treating the competing event as censoring and then reading one minus the Kaplan-Meier curve as risk overstates the cumulative incidence. Use cause-specific hazards models when the question is etiologic. Use the Fine and Gray subdistribution hazard model when the question is about cumulative incidence and absolute risk in the presence of competing causes.

**Sarah:** And one practical recommendation. The R activity in the lesson uses the survival and survminer packages and walks through fitting a Kaplan-Meier curve, running a log-rank test, fitting a Cox model, and testing proportional hazards with cox dot z p h. If you only do one R exercise this week, do that one. The functions you learn there are what every applied survival analysis you'll ever read uses.

**Kiffer:** And the conceptual recommendation. Sit with censoring. The whole field is organized around the fact that we don't get to see every event. If you internalize what censoring is, what it means for a Kaplan-Meier curve to drop at events but not at censoring, what independent censoring assumes, you've understood the philosophical core of survival analysis.

**Sarah:** One thing I want to flag for clinical communication. When you report a hazard ratio to a clinical audience, be careful with the language. A hazard ratio of zero point seven is often translated as a thirty percent reduction in the risk of the event. That's loose. The hazard is a rate, not a probability. Over a short window the translation is roughly accurate. Over a long window it can be misleading.

**Kiffer:** And if the proportional hazards assumption is violated, the hazard ratio you report represents some weighted average of effects over time. Not a constant effect at every moment. It's worth saying that out loud rather than implying the effect is steady.

**Sarah:** Connecting back to the broader course arc. Lessons three through seven of this material covered regression for continuous, binary, ordered, and count outcomes. Each was a different likelihood and a different link. Lesson eight is the natural extension to time-to-event outcomes, with censoring as the new wrinkle. The semi-parametric structure of the Cox model is something we haven't seen before. The baseline hazard left unspecified, the partial likelihood used for estimation. That's a genuinely new idea.

**Kiffer:** And after this we move into the clustered-data world. Lesson nine starts that arc. Mixed effects models, generalized estimating equations, frailty models for survival, all of which acknowledge that observations within clusters are correlated. Most of the methods we've covered through Lesson eight assume independent observations. That assumption is rarely true in real data.

**Sarah:** And the frailty models Kiffer just mentioned are the survival-analysis analog of mixed effects. They add a random effect to the hazard function to account for unmeasured heterogeneity at the individual or cluster level. Lesson nine begins to develop that machinery in general.

**Kiffer:** But for today, you have the toolkit. Kaplan-Meier curves, log-rank tests, median survival, Cox regression, hazard ratios, the proportional hazards assumption, Schoenfeld residuals, and competing risks methods. That's the canonical survival analysis curriculum. Master those and you can read most published time-to-event papers with confidence.

**Sarah:** Next up is Lesson nine. Introduction to Clustered Data. Where we leave the world of independent observations behind.

**Kiffer:** Take care, everyone.

**Sarah:** See you there.
