# Lesson 5 — Logistic Regression (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5970 words • ~32.3 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 5, Logistic Regression. And this is one of those lessons that you'll come back to again and again throughout your career, because logistic regression is probably the single most-used statistical model in epidemiology.

**Sarah:** Right. The reason it's so common is that so many of the outcomes we care about in public health are binary. Disease present or absent. Vaccinated or unvaccinated. Alive or dead at five years. Hospitalized or not. Whenever your outcome is a yes-no kind of variable, logistic regression is usually the place you start.

**Kiffer:** And one quick framing note before we dive in. If you've already worked through Lessons 3 and 4 of this material, those lessons covered linear regression and model building strategies. Logistic regression builds directly on that foundation. Most of the strategy carries over. The mechanics shift in some specific ways, and that's what we want to walk through today.

**Sarah:** We've got three big chunks. Section 1 is the model itself. Why we can't just use linear regression on a binary outcome, what logistic regression actually does, and how to interpret a single coefficient. Section 2 is multiple logistic regression. Adding more predictors, adjustment for confounding, categorical variables, interactions, and prediction. And Section 3 is model fit and diagnostics. How do we know whether the model we've built is actually any good.

**Kiffer:** Let's start with Section 1. Why not just use linear regression on a binary outcome?

**Sarah:** It's a fair question. After all, you can write the regression. If your outcome is coded zero for no disease and one for disease, you could in principle just regress that on a predictor. The statistical software won't even complain. So why don't we?

**Kiffer:** Three reasons, all of which boil down to assumption violations. First, predicted probabilities can fall outside zero to one. The whole point of modeling a binary outcome is that you want to know the probability of the outcome occurring. Probabilities have to be between zero and one. They can't be negative. They can't exceed one. But a linear model is a straight line, and a straight line keeps going. So if you fit a linear model to a binary outcome, eventually for some values of the predictor, the predicted probability will be less than zero, or greater than one. That's nonsensical.

**Sarah:** And it's not just a theoretical worry. It happens in real data. As soon as your predictor takes values that are far enough from the average, the linear model will produce a prediction that doesn't make sense as a probability. Imagine modeling the probability of a heart attack as a linear function of age. The model might predict that an eighty-year-old has a probability of one point three. That's not a probability. It's gibberish.

**Kiffer:** Right. The second reason. The residuals from a linear regression on a binary outcome are inherently non-Normal. Quick definition. A residual is the difference between an observed outcome and the predicted outcome from the model. In a linear regression on a continuous outcome, we usually assume those residuals are roughly Normally distributed. That's the bell-curve assumption.

**Sarah:** But if your outcome is binary, the only two possible values are zero and one. So your residual is either zero minus the predicted probability, or one minus the predicted probability. There are only two possible residual values for any given covariate pattern. That's a two-point distribution, not a Normal one. You've broken the Normality assumption by design.

**Kiffer:** And the third reason. The variance of the residuals is not constant. It depends on the predicted probability. The variance of a binary outcome is p times one minus p, where p is the probability of the event. That's biggest when p equals zero point five, and smallest when p is near zero or near one.

**Sarah:** And the property of constant variance, called homoscedasticity, is another standard assumption of ordinary linear regression. So now we've broken three assumptions. Predictions outside zero and one. Non-Normal residuals. Non-constant variance, which is also called heteroscedasticity. Linear regression is the wrong tool. We need something else.

**Kiffer:** And that something else is logistic regression. Here's the core idea. Instead of modeling the probability directly, logistic regression models the log odds of the outcome. That sounds technical but the move is actually elegant once you see it.

**Sarah:** Let me unpack that step by step. Probability is bounded between zero and one. Odds are the probability divided by one minus the probability. So if the probability of an event is zero point five, the odds are one. If the probability is zero point seven five, the odds are three. If the probability is zero point nine, the odds are nine. Odds are bounded below by zero, but they can go arbitrarily high. There is no upper limit.

**Kiffer:** And then we take the natural log of the odds. The natural log of zero is negative infinity. The natural log of one is zero. The natural log of infinity is positive infinity. So the log odds, which is also called the logit, ranges from negative infinity to positive infinity. The whole real number line. Unbounded.
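
The chain from probability to odds to log odds can be checked numerically. The following is a minimal sketch; the function names `odds` and `logit` are chosen here for illustration and are not from the lesson.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds: the natural log of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))

# Probabilities near 0 or 1 map to large negative or positive logits.
for p in (0.01, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:.2f}   odds = {odds(p):7.3f}   logit = {logit(p):+.3f}")
```

The logit is symmetric around a probability of one half: the values for 0.01 and 0.99 differ only in sign.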

**Sarah:** And once your outcome is unbounded, a linear function works just fine. So the logistic regression model says, the logit of pi, which is the natural log of the odds, equals beta zero plus beta one times X. Pi is the probability of the outcome. Beta zero is the intercept. Beta one is the slope on the predictor X.
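
Written out, the model described here and its back-transformation to the probability scale are as follows, using the same symbols as the discussion:

$$
\operatorname{logit}(\pi) = \ln\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X
\qquad\Longleftrightarrow\qquad
\pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
$$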

**Kiffer:** And on the log-odds scale, the relationship between the predictor and the outcome is linear. Same shape as ordinary linear regression. But on the probability scale, when you transform back, the relationship is S-shaped. It's a curve that approaches zero on one end and approaches one on the other, never quite reaching either.

**Sarah:** And that S-shape is exactly what we want. It guarantees that no matter how extreme the predictor gets, the predicted probability stays inside zero to one. We get to use a linear modeling framework while respecting the boundedness of probability.

**Kiffer:** There's a really clean visual intuition here, by the way. If you imagine the S-curve, it has a steep middle where small changes in the predictor produce big changes in probability. And it has flat tails on the ends, where the probability is already close to zero or close to one and even big changes in the predictor barely move it. The steepness of the middle is set by beta one. The location of the middle, the inflection point, is set by minus beta zero divided by beta one.
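
The shape of the S-curve can be explored with a few lines of code. This is a sketch under assumed values: the coefficients `beta0` and `beta1` are hypothetical, and `inv_logit` and `slope` are illustrative helper names.

```python
import math

def inv_logit(eta):
    """Map a log-odds value back to a probability, avoiding overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical coefficients, chosen only to illustrate the shape.
beta0, beta1 = -6.0, 0.1

midpoint = -beta0 / beta1  # predictor value where the probability is 0.5

def slope(x, h=1e-6):
    """Numerical slope of the probability curve at x (central difference)."""
    upper = inv_logit(beta0 + beta1 * (x + h))
    lower = inv_logit(beta0 + beta1 * (x - h))
    return (upper - lower) / (2 * h)

for x in (0, 30, 60, 90, 120):
    prob = inv_logit(beta0 + beta1 * x)
    print(f"x = {x:3d}   probability = {prob:.3f}   slope = {slope(x):.5f}")
```

The slope is largest at the midpoint, where it equals beta one divided by four, and it shrinks toward zero in both tails.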

**Sarah:** So how do we interpret the coefficients? On the log-odds scale, beta one is the change in the log odds per one-unit increase in X. That's the linear interpretation. But log odds aren't very intuitive on their own. Most people don't have a gut sense for what a log-odds change of zero point six nine means.

**Kiffer:** Which is why we exponentiate. The exponential of beta one is the odds ratio per one-unit increase in X. So if beta one equals zero point six nine, the exponential is about two. The odds ratio per one-unit increase is about two. Meaning, a one-unit increase in X roughly doubles the odds of the outcome.

**Sarah:** And odds ratios are familiar. Anyone who's worked through this material has computed odds ratios by hand from two-by-two tables. The same odds ratio you computed there reappears here as the exponentiated coefficient from a logistic regression model. The model is doing the same comparison, just adjusted for whatever else is in the model.

**Kiffer:** So that's the practical translation. When you fit a logistic regression in software, the raw output is in log-odds units. The exponential of each coefficient is the odds ratio, abbreviated OR, for that predictor. When you write up your paper or present your result, you almost always report the odds ratio. Nobody wants to read log-odds coefficients. They're not interpretable to a clinical or public health audience.

**Sarah:** And one note on continuous predictors. The odds ratio you get by exponentiating the coefficient is the odds ratio per one-unit increase in the predictor. Sometimes one unit isn't a meaningful increment. For age, you usually report the odds ratio per ten-year increase, not per one-year. To get that, you exponentiate ten times the coefficient, or equivalently, raise the per-one-year odds ratio to the tenth power.
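
The rescaling for a ten-year increment is simple arithmetic. In this sketch the coefficient `beta_age` is a hypothetical value, not an estimate from any real data.

```python
import math

beta_age = 0.05  # hypothetical log-odds change per one year of age

or_per_year = math.exp(beta_age)
or_per_decade = math.exp(10 * beta_age)  # exponentiate ten times the coefficient

print(f"OR per 1 year:   {or_per_year:.3f}")
print(f"OR per 10 years: {or_per_decade:.3f}")
print(f"Per-year OR raised to the 10th power: {or_per_year ** 10:.3f}")
```

The last two printed values agree, which is the equivalence described above.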

**Kiffer:** Now estimation. Linear regression uses ordinary least squares, abbreviated OLS. Ordinary least squares finds the coefficients that minimize the sum of squared residuals. It has a nice closed-form solution. You can write it as a single matrix equation.

**Sarah:** Logistic regression doesn't use ordinary least squares. The residuals don't behave the way least squares assumes, so least squares isn't the right tool. Instead, logistic regression uses maximum likelihood estimation.

**Kiffer:** Quick definition. Maximum likelihood estimation, abbreviated MLE, finds the parameter values that make the observed data most probable under the model. You write down the likelihood function, which is the probability of seeing exactly the data you saw, as a function of the unknown coefficients. Then you find the coefficients that make that likelihood as large as possible.

**Sarah:** And for logistic regression, that maximization doesn't have a nice closed-form solution. There's no single matrix equation that produces the answer. So the algorithm is iterative. The software starts with initial estimates, computes the likelihood, takes a step in a direction that increases the likelihood, and repeats. Eventually the change between iterations gets smaller than some convergence criterion, and the algorithm stops.
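
To make the iterative idea concrete, here is a minimal sketch of one common choice of algorithm, Newton-Raphson, for a single predictor. The lesson does not say which algorithm any particular software uses, so treat this as an illustration only. The data, the function name `fit_logistic`, and the tolerance are all assumptions made for the example.

```python
import math

def fit_logistic(x, y, tol=1e-8, max_iter=50):
    """Newton-Raphson maximum likelihood for logit(pi) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0  # starting values
    for iteration in range(1, max_iter + 1):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: first derivatives of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix: negative second derivatives.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:  # convergence criterion
            return b0, b1, iteration
    raise RuntimeError("did not converge")

# Made-up data: the outcome becomes more common as x increases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1, n_iter = fit_logistic(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, iterations = {n_iter}")
```

Each pass recomputes the fitted probabilities, then moves the coefficients in the direction that raises the likelihood. The loop stops once the step is smaller than the tolerance.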

**Kiffer:** In practice, this is reliable for most applied problems. Modern software has robust implementations. You don't usually have to think about it. But the iterative nature does occasionally matter. If the data are too sparse, or if a predictor perfectly classifies the outcome, the algorithm can fail to converge. We'll come back to that under separation in Section 3.

**Sarah:** Inference. Once we have the coefficient estimates, we want to put confidence intervals around them and test whether they differ significantly from zero. The standard approach computes standard errors on the log-odds scale. So your point estimate, your standard error, and your confidence interval all live in log-odds units.

**Kiffer:** And to get the confidence interval for the odds ratio, you exponentiate the bounds. So if your beta is zero point six nine and the ninety-five percent confidence interval on the log-odds scale runs from zero point two to one point one eight, you exponentiate the lower bound, exponentiate the upper bound, and the odds ratio confidence interval runs from about one point two two to about three point two five.

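**Sarah:** And one technical note. The confidence interval is symmetric around the estimate on the log-odds scale, but asymmetric on the odds ratio scale. The lower bound sits closer to the point estimate. The upper bound sits farther from it. In that example, one point two two is about zero point seven eight below the odds ratio of two, and three point two five is about one point two five above it. That's the result of exponentiating. It's not a software glitch. It's a property of the transformation.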
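
The numbers from that example can be reproduced directly. This sketch uses only the values quoted in the discussion (an estimate of 0.69 with log-odds bounds of 0.20 and 1.18).

```python
import math

beta, lower, upper = 0.69, 0.20, 1.18  # log-odds scale, from the example

or_hat = math.exp(beta)
or_lower = math.exp(lower)
or_upper = math.exp(upper)

print(f"Log-odds scale:   {beta:.2f} ({lower:.2f}, {upper:.2f})")
print(f"Odds ratio scale: {or_hat:.2f} ({or_lower:.2f}, {or_upper:.2f})")
print(f"Distance below the estimate: {or_hat - or_lower:.2f}")
print(f"Distance above the estimate: {or_upper - or_hat:.2f}")
```

The two distances are equal on the log-odds scale (0.49 each) and unequal after exponentiating.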

**Kiffer:** Two formal tests for individual coefficients. The Wald test divides the coefficient by its standard error. That ratio approximately follows a standard Normal distribution under the null. It's what most software prints by default.
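
A Wald test is a one-line calculation once the coefficient and its standard error are known. In this sketch the standard error of 0.25 is a hypothetical value and `wald_test` is an illustrative name. The two-sided p-value comes from the standard Normal distribution via `math.erfc`.

```python
import math

def wald_test(beta, se):
    """Return the Wald z statistic and its two-sided Normal p-value."""
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

z, p = wald_test(beta=0.69, se=0.25)  # hypothetical estimate and standard error
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```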

**Sarah:** And the likelihood ratio test compares the fitted model to a reduced model that excludes the variable of interest. The test statistic is two times the difference in log-likelihoods, and it follows an approximate chi-squared distribution with degrees of freedom equal to the number of parameters being tested.

**Kiffer:** Both are widely used. The Wald test is faster because it doesn't require fitting two models. But the Wald test can be unreliable when the true probability is very close to zero or one, or when the sample size is small. In those situations, the likelihood ratio test is generally preferred. We'll see this come up again in Section 3.

**Sarah:** Okay, that wraps Section 1. We've got the model. The logit of pi equals beta zero plus beta one times X. Coefficients are on the log-odds scale. The exponential of the coefficient is the odds ratio. Estimation by maximum likelihood. Inference on the log-odds scale, exponentiate the bounds for the odds ratio confidence interval.

**Kiffer:** Okay, that brings us to Section 2, multiple logistic regression.

**Sarah:** And honestly, this section is mostly about extending the same logic we've just built up to more predictors. The shape of the model gets richer. The interpretation of any one coefficient becomes a conditional interpretation. But the underlying machinery is identical.

**Kiffer:** The model. The logit of pi equals beta zero plus beta one times X one plus beta two times X two, and so on for as many predictors as you have. Each beta is the change in log odds per one-unit increase in that predictor, holding all the other predictors constant.

**Sarah:** And exponentiating any one beta gives you the adjusted odds ratio for that predictor. Adjusted meaning, holding the others constant. This is the standard idea of multivariable adjustment. We covered the linear-regression version of this in Lesson 3 of this material. Logistic regression does the same thing, just on the log-odds scale.

**Kiffer:** And just like multiple linear regression, the question of what variables to put in the model is not just a statistical question. It's a causal question. You adjust for confounders. You do not adjust for mediators. You do not adjust for colliders.

**Sarah:** Which means we need a directed acyclic graph, abbreviated DAG. Quick definition. A directed acyclic graph is a diagram where each variable is a node and each direct causal relationship is an arrow. The graph encodes your assumptions about how the variables in your study are causally related. Acyclic just means there are no loops. No variable is its own ancestor.

**Kiffer:** And from the DAG you can read off which variables are confounders, which are mediators, and which are colliders. A confounder is a common cause of both the exposure and the outcome. A mediator is on the causal path from exposure to outcome. A collider is a common effect of two variables.

**Sarah:** And the rule. To estimate the total effect of an exposure on an outcome, you adjust for confounders, you do not adjust for mediators, and you do not adjust for colliders. Adjusting for a mediator blocks part of the effect you're trying to estimate. Adjusting for a collider can introduce a spurious association. The DAG is what tells you which is which.

**Kiffer:** So before you run a logistic regression with five predictors, you should be able to draw a DAG and explain why each of those predictors belongs in the model. If you can't, you don't yet understand the analysis you're about to run.

**Sarah:** Quick example to make this concrete. Suppose you're interested in whether smoking causes hypertension. You fit a logistic regression with smoking as the exposure and hypertension as the outcome. What else should be in the model? Age is a confounder. Older people are both more likely to have a history of smoking and more likely to have hypertension. So age belongs. Sex is similar. Maybe socioeconomic status. But what about body mass index, BMI? If smoking affects BMI, and BMI affects hypertension, then BMI is a mediator, not a confounder. Adjusting for it would block part of the effect you're trying to estimate.

**Kiffer:** And there's also a confounding diagnostic that's worth knowing. If you fit the crude model with just smoking, then add a candidate confounder and refit, you compare the two coefficients on smoking. If the coefficient changes by more than ten or twenty percent, that variable is doing meaningful confounding work and should be retained. If the coefficient barely moves, the variable wasn't doing much. This is the change-in-estimate criterion, and it complements the DAG-based approach.
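
The change-in-estimate check is a small calculation. This sketch compares coefficients, as described above, and expresses the change relative to the crude coefficient. That choice of denominator is an assumption, since conventions differ, and both coefficient values are hypothetical.

```python
def percent_change_in_estimate(crude_beta, adjusted_beta):
    """Absolute change in the coefficient, as a percentage of the crude value."""
    return 100.0 * abs(adjusted_beta - crude_beta) / abs(crude_beta)

crude_beta = 0.69     # hypothetical smoking coefficient, crude model
adjusted_beta = 0.53  # hypothetical smoking coefficient after adding age

change = percent_change_in_estimate(crude_beta, adjusted_beta)
print(f"Change in estimate: {change:.1f}%")
print("Exceeds a 10% cutoff" if change > 10 else "Within a 10% cutoff")
```

Some analysts apply the same idea to the odds ratios rather than the coefficients. The percentages will differ, so it is worth stating which scale was used.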

**Sarah:** Categorical predictors. When a predictor has more than two categories, like racial group or province or smoking status with multiple levels, you can't just throw it in as a single variable. The model needs numeric inputs. So you use dummy coding. Sometimes called indicator coding.

**Kiffer:** Quick definition. Dummy coding takes a categorical predictor with say four levels, and creates three binary indicator variables. Each one equals one for a specific category and zero otherwise. One category is left out. That left-out category is the reference category.

**Sarah:** And in the logistic regression, each of the three dummy variables gets its own coefficient. The interpretation. The coefficient on dummy one is the change in log odds for being in category one versus the reference. Exponentiated, it's the odds ratio comparing category one to the reference category, holding all other predictors constant.
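
Dummy coding is easy to see on a small example. This is a minimal sketch with a hypothetical four-level smoking variable. The function name `dummy_code` and the `is_` column prefix are illustrative choices.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per level, excluding the reference level."""
    levels = sorted(set(values) - {reference})
    return {f"is_{level}": [int(v == level) for v in values] for level in levels}

# Hypothetical four-level predictor with "never" as the reference category.
smoking = ["never", "former", "light", "heavy", "never", "heavy"]

dummies = dummy_code(smoking, reference="never")
for name, column in dummies.items():
    print(f"{name:10s} {column}")
```

Four levels produce three indicator columns. People in the reference category have zeros in all three.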

**Kiffer:** And to test whether the categorical variable as a whole matters, you do a multi-degree-of-freedom test. You can use a Wald test with multiple degrees of freedom, or a likelihood ratio test that compares the model with all the dummies to a model without any of them. That test asks, does this categorical variable contribute meaningfully to the model, considered as a whole.

**Sarah:** And one practical note. Picking the reference category matters for interpretability. By default most software picks the first level alphabetically, which is often not what you want. The convention in epidemiology is to pick the largest or the lowest-risk category as the reference, so the odds ratios you report compare other groups to that baseline.

**Kiffer:** Interactions. Sometimes called effect modification. The idea is that the effect of one predictor depends on the level of another. Smoking might affect the risk of disease more strongly for women than for men, say. You handle that by including a product term in the model.

**Sarah:** Specifically, you create a new variable that's the product of the two predictors that interact. Smoking times sex. You include that product term in the model alongside the main effects of smoking and sex. The coefficient on the product term tells you how much the log-odds effect of smoking differs between the two sexes.

**Kiffer:** And on the odds ratio scale, the interaction is multiplicative, not additive. The exponentiated interaction coefficient is the ratio of odds ratios. So if sex is coded one for women and zero for men, and the smoking-by-sex interaction coefficient exponentiates to one point five, the odds ratio for smoking among women is one point five times the odds ratio for smoking among men.

**Sarah:** Test the interaction with a likelihood ratio test or a Wald test. If the interaction is significant, you can't summarize the effect of smoking with a single number. You have to report stratum-specific odds ratios, one for women, one for men.
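
Stratum-specific odds ratios come straight from the coefficients. This sketch assumes sex is coded one for women and zero for men, and both coefficient values are hypothetical.

```python
import math

# Hypothetical coefficients from a model containing smoking, female,
# and the product term smoking x female (female = 1 for women, 0 for men).
b_smoking = 0.40
b_interaction = math.log(1.5)

or_smoking_men = math.exp(b_smoking)                    # female = 0
or_smoking_women = math.exp(b_smoking + b_interaction)  # female = 1
ratio_of_ors = math.exp(b_interaction)

print(f"OR for smoking among men:   {or_smoking_men:.2f}")
print(f"OR for smoking among women: {or_smoking_women:.2f}")
print(f"Ratio of odds ratios:       {ratio_of_ors:.2f}")
```

A confidence interval for the odds ratio among women would also need the covariance between the two coefficients, which this sketch does not cover.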

**Kiffer:** And whether to include an interaction is partly a substantive question. Are there theoretical or biological reasons to expect that the effect of one variable would vary across levels of another? If yes, fit the interaction. If no, you usually leave it out and report the main effect. You also want to be careful about interaction-fishing. Testing many interactions and reporting the ones that look interesting is a recipe for false positives.

**Sarah:** Predicted probabilities. Once you have a fitted model, you can plug in any combination of predictor values, compute the linear predictor, and back-transform to get the predicted probability of the outcome. The back-transformation is the inverse logit, which is one over one plus the exponential of negative the linear predictor.
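
Here is a minimal sketch of that back-transformation for a few covariate profiles. The coefficients and predictor names are hypothetical, and `predicted_probability` is an illustrative helper.

```python
import math

def predicted_probability(coefficients, profile):
    """Inverse logit of the linear predictor for one covariate profile."""
    eta = coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in profile.items()
    )
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted coefficients on the log-odds scale.
coefficients = {"intercept": -5.0, "age": 0.05, "smoker": 0.7}

profiles = (
    {"age": 40, "smoker": 0},
    {"age": 40, "smoker": 1},
    {"age": 70, "smoker": 1},
)
for profile in profiles:
    print(profile, round(predicted_probability(coefficients, profile), 3))
```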

**Kiffer:** And those predicted probabilities are useful for several things. You can use them descriptively, to summarize how the predicted risk varies across patient profiles. You can use them prospectively, to estimate the risk for a new patient. Or you can use them as a classifier, to predict whether a particular individual will have the outcome.

**Sarah:** Classification requires picking a decision threshold. Often the default is zero point five. So if the predicted probability is above zero point five, you classify the person as positive. Below, you classify them as negative.

**Kiffer:** But zero point five is just a convention. The optimal threshold depends on the relative costs of false positives versus false negatives. If a false negative is much worse than a false positive, like missing a cancer diagnosis, you want to lower the threshold so you catch more true positives, even at the cost of more false alarms.

**Sarah:** And if a false positive is much worse, like recommending a risky surgery for someone who doesn't need it, you want to raise the threshold. The point is that the threshold is a clinical or operational decision, not just a statistical one. The Receiver Operating Characteristic curve, which we'll talk about in Section 3, helps you visualize how sensitivity and specificity trade off across thresholds.
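
The trade-off is easy to see by recomputing sensitivity and specificity at several thresholds. The predicted probabilities and outcomes below are made up for illustration. Classifying a value exactly at the threshold as positive is an arbitrary choice in this sketch.

```python
def sensitivity_specificity(probs, outcomes, threshold):
    """Classify as positive when the predicted probability is at or above the threshold."""
    tp = sum(1 for p, y in zip(probs, outcomes) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, outcomes) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, outcomes) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, outcomes) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up predicted probabilities and observed outcomes.
probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(probs, outcomes, threshold)
    print(f"threshold = {threshold:.1f}   sensitivity = {sens:.2f}   specificity = {spec:.2f}")
```

Lowering the threshold raises sensitivity and lowers specificity. Raising it does the opposite.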

**Kiffer:** Okay, that's Section 2. Multiple predictors, adjusted odds ratios, dummy coding, interactions, predicted probabilities, decision thresholds. Section 3 is where we step back and ask whether the model is any good.

**Sarah:** Right, so that brings us to Section 3, model fit and diagnostics. There are several questions you want to ask of any logistic regression model. Does the model fit the data well overall? Are the predicted probabilities accurate? Does the model discriminate between people who have the outcome and people who don't? Are there individual observations that are unduly influencing the result? And is there any pathological structure in the data, like separation, that's breaking the model?

**Kiffer:** Let's take those one at a time. The likelihood ratio test for nested models. We mentioned this in Section 1, but it deserves more attention. Two models are nested if one is a special case of the other. Specifically, the smaller model can be obtained from the larger one by setting some coefficients to zero.

**Sarah:** So if you have a five-predictor model, and you compare it to a three-predictor model that contains the same three predictors as a subset, those two models are nested. You can use a likelihood ratio test to ask whether the two extra predictors in the larger model significantly improve fit.

**Kiffer:** Test statistic. Two times the log-likelihood of the larger model minus two times the log-likelihood of the smaller model. That difference follows an approximate chi-squared distribution with degrees of freedom equal to the number of additional parameters in the larger model. So in our example, two degrees of freedom, because we added two predictors.
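
The test statistic and its p-value can be sketched with the standard library alone. The log-likelihood values are hypothetical. The function handles only one or two degrees of freedom, because those have simple closed-form chi-squared tail probabilities. A general version would need a statistics library.

```python
import math

def likelihood_ratio_test(loglik_small, loglik_large, df):
    """LRT statistic and chi-squared p-value (closed forms for df = 1 or 2 only)."""
    stat = max(2.0 * loglik_large - 2.0 * loglik_small, 0.0)
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p_value = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p_value

# Hypothetical log-likelihoods for the three-predictor and five-predictor models.
stat, p = likelihood_ratio_test(loglik_small=-120.0, loglik_large=-115.5, df=2)
print(f"LRT statistic = {stat:.2f} on 2 df, p = {p:.4f}")
```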

**Sarah:** And the likelihood ratio test is the standard tool for comparing nested logistic regression models. It's used to test whether a single variable is needed, whether a categorical variable as a whole is needed, whether an interaction term is needed. It's the workhorse for variable selection on a substantive basis.

**Kiffer:** Now goodness of fit in the absolute sense. Not, is one model better than another, but, does this model actually fit the data we have? The most commonly used test for that in logistic regression is the Hosmer-Lemeshow test.

**Sarah:** The Hosmer-Lemeshow test works like this. You take the predicted probability for each observation. You sort observations by predicted probability and divide them into deciles, ten groups of equal size. Within each decile, you compute the observed number of events and the expected number of events under the model. Then you compute a chi-squared-like statistic that compares the observed and expected counts across the ten groups.

**Kiffer:** If the model fits well, observed and expected counts should match closely in each decile. The test statistic should be small. The p-value should be large. A small p-value, conventionally below zero point zero five, suggests the model is poorly calibrated. The predicted probabilities don't match observed frequencies.
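
The grouping-and-comparing procedure can be sketched as follows. This is a simplified illustration, not a reference implementation. Ties in predicted probability are split arbitrarily, groups are formed by position after sorting, and the data are simulated from the "model" itself so that calibration is good by construction. The statistic is conventionally compared to a chi-squared distribution with the number of groups minus two degrees of freedom.

```python
import random

def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return the Hosmer-Lemeshow statistic and its degrees of freedom."""
    pairs = sorted(zip(probs, outcomes))  # sort by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)  # expected events in this group
        observed = sum(y for _, y in chunk)  # observed events in this group
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return stat, groups - 2

# Simulated data: each outcome is drawn using its own predicted probability.
random.seed(1)
probs = [random.uniform(0.05, 0.95) for _ in range(500)]
outcomes = [1 if random.random() < p else 0 for p in probs]

stat, df = hosmer_lemeshow(probs, outcomes)
print(f"Hosmer-Lemeshow statistic = {stat:.2f} on {df} df")
```

Rerunning with `groups=8` or `groups=12` gives a different statistic, which is one reason the test is sensitive to the grouping choice.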

**Sarah:** And there are some real limitations to be aware of. First, the test is sensitive to the number of groups. The default is ten, but the result can change if you use eight or twelve. Second, the test is sensitive to sample size. With a very large sample, even tiny departures from perfect calibration become significant. With a small sample, the test has low power and can miss real problems.

**Kiffer:** Third, a non-significant Hosmer-Lemeshow test does not prove the model is well calibrated. It just means we don't have evidence of miscalibration. As with all tests of this kind, absence of evidence is not evidence of absence.

**Sarah:** Which leads us to a really important conceptual distinction. Calibration versus discrimination. These are two different things, and a model can be good at one and bad at the other.

**Kiffer:** Calibration. A well-calibrated model produces predicted probabilities that match observed frequencies. If your model says thirty percent of people in some group will develop disease, then in groups where the model predicts thirty percent, about thirty percent should actually develop disease.

**Sarah:** And the way to assess calibration visually is a calibration plot. You bin observations by predicted probability, then plot the predicted probability on the x-axis against the observed frequency on the y-axis. A well-calibrated model lies on the forty-five-degree line. Predicted equals observed.

**Kiffer:** If the points lie consistently above the line, the model is under-predicting. If they lie consistently below the line, the model is over-predicting. Either pattern is a calibration problem.

**Sarah:** Discrimination is a different question. Discrimination asks, can the model rank-order people correctly by their risk? Given two people, one with the outcome and one without, what's the probability the model gives a higher predicted probability to the one with the outcome?

**Kiffer:** And that probability is exactly what's quantified by the Area Under the Receiver Operating Characteristic Curve. The Receiver Operating Characteristic, abbreviated ROC, is a curve we covered in an earlier lesson, on screening and diagnostic tests. The Area Under the Curve, abbreviated AUC, is the discrimination summary.

**Sarah:** An AUC of zero point five means the model performs no better than chance at distinguishing cases from non-cases. The ROC curve is a forty-five-degree diagonal. An AUC of one means perfect discrimination. The model never gets the ranking wrong.

**Kiffer:** In real applied work, AUCs above zero point eight are considered strong, between zero point seven and zero point eight are moderate, and below zero point seven are weak. But the threshold for adequate depends on the application. For some clinical risk scores, even an AUC of zero point six five is useful if the alternative is no model at all.

**Sarah:** And here's the key conceptual point. A model can have excellent discrimination and terrible calibration. Or excellent calibration and terrible discrimination. They measure different things.

**Kiffer:** Concrete example. Suppose your model gives every case a predicted probability of zero point one one and every non-case a predicted probability of zero point one zero. Discrimination is perfect. AUC is one. The model always ranks cases above non-cases. But calibration is terrible if the actual outcome rate is much higher than ten percent. The probabilities are too low.
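
That example can be checked with a direct implementation of the pairwise definition of AUC. The sample below is hypothetical: sixty cases and forty non-cases, giving a sixty percent event rate. Counting tied pairs as one half is a common convention, assumed here.

```python
def auc(probs, outcomes):
    """Probability that a random case is ranked above a random non-case."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    noncases = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in noncases
    )
    return wins / (len(cases) * len(noncases))

# Every case gets 0.11 and every non-case gets 0.10, as in the example.
outcomes = [1] * 60 + [0] * 40
probs = [0.11] * 60 + [0.10] * 40

print("AUC:", auc(probs, outcomes))
print("Mean predicted probability:", round(sum(probs) / len(probs), 3))
print("Observed event rate:", sum(outcomes) / len(outcomes))
```

The ranking is perfect, yet the average predicted risk is far below the observed event rate. That gap is the calibration failure.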

**Sarah:** And the reverse. Suppose your model produces predicted probabilities that match observed frequencies on average across the whole sample, but the predictions don't actually distinguish cases from non-cases. Calibration is fine in aggregate. Discrimination is poor.

**Kiffer:** Which one matters more depends on what you're using the model for. For prediction, especially individual-level prediction, you usually need both. For ranking high-risk patients to prioritize for an intervention, discrimination matters more. For estimating expected event counts in a population, calibration matters more.

**Sarah:** Let's shift to influential observations. Once you've assessed overall fit and predictive performance, you want to ask whether any individual observations are exerting unusually large effects on the model. Two diagnostics are standard here.

**Kiffer:** Cook's distance. This is a single number for each observation that summarizes how much the regression coefficients would change if that observation were removed. Big Cook's distance means the observation is influential. Small Cook's distance means the observation is mostly going along with the crowd.

**Sarah:** And DFBETAs. DFBETA stands for difference in beta. There's one DFBETA per observation per coefficient. It tells you how much that specific coefficient would change if you removed that specific observation. So you can identify whether a single observation is driving a particular result.
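
The idea behind DFBETA can be illustrated by brute force: drop one observation, refit, and record how the slope changes. Statistical packages typically report a scaled, approximate version instead, so this sketch shows the concept rather than any package's formula. The data and the small Newton-Raphson fitter are assumptions made for the example.

```python
import math

def fit_logistic(x, y, tol=1e-8, max_iter=50):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1
    raise RuntimeError("did not converge")

# Made-up data with some overlap between outcome groups.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

full_b0, full_b1 = fit_logistic(x, y)

# Unscaled deletion DFBETA for the slope: full-data estimate minus leave-one-out estimate.
dfbeta_slope = []
for i in range(len(x)):
    _, b1_drop = fit_logistic(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    dfbeta_slope.append(full_b1 - b1_drop)

for xi, yi, d in zip(x, y, dfbeta_slope):
    print(f"x = {xi}, y = {yi}: change in slope when removed = {d:+.3f}")
```

Observations with the largest absolute values are the ones to look at first.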

**Kiffer:** And the practical advice on influential observations is, investigate, don't automatically delete. If you find a small number of observations with big Cook's distances or big DFBETAs, look at them. Are they data-entry errors? Are they unusual but legitimate cases? Are they evidence of structural heterogeneity in the population that the model isn't capturing?

**Sarah:** If they're errors, fix them. If they're legitimate, you might choose to fit the model with and without them and report both. You generally do not delete legitimate observations just because they're influential. That's a form of cooking the data.

**Kiffer:** Last topic for this section is separation. This is a specific pathology that can break logistic regression. Separation occurs when one or more predictors perfectly classify the outcome. So everyone with a particular predictor value has the outcome, and nobody without it does. Or some linear combination of predictors perfectly separates cases from non-cases.

**Sarah:** And when separation happens, the maximum likelihood algorithm tries to push the corresponding coefficient toward infinity. Because if you can perfectly classify the outcome by setting the coefficient really large, the likelihood keeps going up the more you increase it. There's no finite maximum.
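
The "no finite maximum" behaviour shows up clearly in a toy data set where the sign of the predictor perfectly predicts the outcome. The data are made up, and the intercept is held at zero to keep the sketch short.

```python
import math

# Perfectly separated toy data: y is 1 exactly when x is positive.
x = [-3, -2, -1, 1, 2, 3]
y = [0, 0, 0, 1, 1, 1]

def log_likelihood(b1, b0=0.0):
    """Bernoulli log-likelihood for logit(pi) = b0 + b1 * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        total += yi * eta - math.log1p(math.exp(eta))
    return total

for b1 in (1, 2, 5, 10, 20):
    print(f"slope = {b1:2d}   log-likelihood = {log_likelihood(b1):.10f}")
```

The log-likelihood keeps climbing toward zero as the slope grows, so an iterative fitter has no finite value to settle on.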

**Kiffer:** What you see in the software output is suspicious behavior. Coefficients with very large estimates. Standard errors with absurdly large values. Wald tests with z-statistics very close to zero, even though the predictor obviously matters. Or sometimes a warning message saying the algorithm did not converge.

**Sarah:** Two main solutions. First, penalized maximum likelihood. The most popular version is Firth's method, which adds a small bias-reducing penalty to the likelihood. Firth's method handles separation gracefully and produces finite coefficient estimates and finite standard errors.

**Kiffer:** Second, merging categories. If the separation is happening because a particular categorical predictor has a level with very few observations, and all those observations happen to fall in one outcome group, merging that level with another can resolve the problem.

**Sarah:** And the underlying point is that separation is a sample-size problem dressed up in algorithmic clothing. Either you don't have enough events, or you don't have enough variation in some predictor, for the standard maximum likelihood estimator to work. Penalized methods and category merging are different ways of acknowledging that and getting a sensible answer anyway.

**Kiffer:** One related sample-size point. There's a rule of thumb in logistic regression that you want at least ten events per predictor, where "events" means whichever outcome group is less common. So if you have five predictors, you need at least fifty events. If you have ten predictors, you need at least one hundred. With fewer events, your coefficients become unstable, your standard errors get huge, and the model is prone to overfitting. The rule isn't a hard cutoff, but it's a useful sanity check before you start interpreting results.
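
The arithmetic behind that sanity check fits in a few lines. This sketch counts events in the less common outcome group, which is the usual reading of the rule. The function name and the example counts are hypothetical.

```python
def max_predictors(n_events, n_nonevents, events_per_predictor=10):
    """Rough ceiling on the number of predictors under the events-per-predictor rule."""
    limiting = min(n_events, n_nonevents)  # the rarer outcome group drives stability
    return limiting // events_per_predictor

print(max_predictors(50, 950))   # 50 events supports about 5 predictors
print(max_predictors(100, 400))  # 100 events supports about 10 predictors
print(max_predictors(37, 500))   # 37 events supports about 3 predictors
```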

**Sarah:** And before we wrap, two specialised cousins of standard logistic regression are worth flagging, because they show up in the module and you'll see them in published work. The first is exact logistic regression. When the sample is very small, or when one of those separation problems just won't go away, exact logistic regression uses a conditional maximum likelihood approach that doesn't rely on large-sample approximations. It produces exact p-values and confidence intervals. The cost is that it's computationally intensive and gets impractical with many predictors.

**Kiffer:** The second is conditional logistic regression. That's the right tool for matched case-control studies. In a matched design, if you try to fit a standard logistic regression with one indicator per matched set, you end up estimating an enormous number of stratum parameters, and your odds ratio estimates get biased. Conditional logistic regression conditions those stratum parameters out of the likelihood entirely, so you get unbiased odds ratios for your exposure of interest. The trade-off is that you lose the intercept, you can't estimate effects for variables that are constant within matched sets, and predicted probabilities aren't directly available.
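
For the simplest matched design, one control per case and a single binary exposure, the conditional likelihood depends only on the discordant pairs, and the conditional estimate of the odds ratio is their ratio. The sketch below shows that special case with made-up pair counts. The function name is hypothetical, and a general conditional logistic fit would need proper software.

```python
import math

def matched_pair_or(pairs, z=1.96):
    """Conditional odds ratio and Wald interval for 1:1 matched pairs.

    Each pair is (case_exposed, control_exposed) coded 0/1.
    Assumes at least one discordant pair of each type.
    """
    b = sum(1 for case, ctrl in pairs if case == 1 and ctrl == 0)
    c = sum(1 for case, ctrl in pairs if case == 0 and ctrl == 1)
    log_or = math.log(b / c)
    se = math.sqrt(1 / b + 1 / c)
    return b / c, math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical study: 15 + 5 discordant pairs, 50 concordant pairs.
pairs = [(1, 0)] * 15 + [(0, 1)] * 5 + [(1, 1)] * 20 + [(0, 0)] * 30
or_hat, lower, upper = matched_pair_or(pairs)
print(f"Conditional OR = {or_hat:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The fifty concordant pairs contribute nothing to the estimate, which is the matched-pair version of not being able to estimate effects for variables that are constant within a matched set.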

**Sarah:** So three flavours, depending on the data. Standard logistic regression for most applied problems. Exact logistic regression when the sample is tiny or separation breaks the standard fit. Conditional logistic regression when the data are matched. Okay, let's pull the takeaways.

**Kiffer:** First takeaway. Logistic regression is the standard tool for binary outcomes. It models the log odds as a linear combination of predictors. The reason is that probabilities live in zero to one, but log odds live on the entire real line, so a linear model on the logit scale is well-behaved while still respecting probability bounds.
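
A minimal sketch of the two transformations involved, using only the standard library. The function names are illustrative.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Map any real-valued linear predictor back to a probability."""
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    return math.exp(eta) / (1 + math.exp(eta))

# Whatever value the linear predictor takes, the probability stays in bounds.
for eta in [-10, -2, 0, 2, 10]:
    print(f"log odds = {eta:>3}: probability = {inv_logit(eta):.5f}")
```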

**Sarah:** Second takeaway. Coefficients are on the log-odds scale, so each one is a log odds ratio. Exponentiate to get odds ratios. Confidence intervals are computed on the log-odds scale, then exponentiated for the odds ratio interval. The interval is asymmetric on the odds ratio scale because of the exponential transformation, and that's expected, not a bug.
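
Here is that translation as a short sketch. The coefficient and standard error are made-up numbers standing in for a line of software output.

```python
import math

beta, se = 0.70, 0.20  # hypothetical coefficient and standard error (log-odds scale)
z = 1.96               # normal quantile for a 95% Wald interval

# Build the interval on the log-odds scale, where it is symmetric...
lo_log, hi_log = beta - z * se, beta + z * se

# ...then exponentiate the estimate and both endpoints.
odds_ratio = math.exp(beta)
ci = (math.exp(lo_log), math.exp(hi_log))

print(f"OR = {odds_ratio:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
print(f"Distance below: {odds_ratio - ci[0]:.2f}, above: {ci[1] - odds_ratio:.2f}")
```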

**Kiffer:** Third takeaway. Estimation is by maximum likelihood, not ordinary least squares. The algorithm is iterative. For most applied problems it converges fine, but watch out for separation when one or more predictors perfectly classify the outcome.
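
To make "iterative" concrete, here is a bare-bones Newton-Raphson sketch for a model with an intercept and one predictor. It is a teaching illustration under simplifying assumptions (two parameters, no step-halving or separation safeguards), not how any particular package implements its fitting routine. The data are hypothetical.

```python
import math

def fit_logistic(xs, ys, n_iter=25, tol=1e-10):
    """Newton-Raphson for logit P(y = 1) = b0 + b1 * x. Returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data: 2 of 10 unexposed and 6 of 10 exposed have the outcome.
xs = [0] * 10 + [1] * 10
ys = [1] * 2 + [0] * 8 + [1] * 6 + [0] * 4

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"intercept = {b0_hat:.4f}, slope = {b1_hat:.4f}")
print(f"log OR from the 2x2 table = {math.log((6 / 4) / (2 / 8)):.4f}")
```

With a single binary predictor the fitted slope matches the log odds ratio from the two-by-two table, which is a handy check that the iterations landed in the right place.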

**Sarah:** Fourth takeaway. Multiple logistic regression follows the same logic as multiple linear regression. The decisions about which variables to include and which to exclude should be guided by a directed acyclic graph. You adjust for confounders, you do not adjust for mediators, and you do not adjust for colliders.

**Kiffer:** Fifth takeaway. Categorical predictors get dummy coded, with one reference category. Each dummy gets its own coefficient, a log odds ratio, which you exponentiate to get the odds ratio relative to the reference category. Test the categorical variable as a whole with a multi-degree-of-freedom likelihood ratio test or Wald test.
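
A minimal sketch of dummy coding by hand, mostly to show where the reference category goes. The variable, its levels, and the function name are hypothetical, and statistical software normally does this step for you.

```python
def dummy_code(values, reference):
    """Return (non-reference levels, rows of 0/1 indicators).

    The reference level gets no column, so it is coded as all zeros.
    """
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows

smoking = ["never", "former", "current", "never"]
levels, rows = dummy_code(smoking, reference="never")

print("Indicator columns:", levels)
for value, row in zip(smoking, rows):
    print(f"{value:>8} -> {row}")
```

A three-level variable produces two indicator columns, which is why the overall test for the variable has two degrees of freedom.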

**Sarah:** Sixth takeaway. Interactions are handled with product terms. The exponentiated interaction coefficient is a ratio of odds ratios. If the interaction is significant, you can't summarize the effect of one variable with a single number. You have to report stratum-specific odds ratios.
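
The "ratio of odds ratios" reading can be checked with a few lines of arithmetic. The coefficients below are invented for illustration, for a model with a binary exposure, a binary modifier, and their product term.

```python
import math

# Hypothetical fitted coefficients on the log-odds scale.
b_exposure = 0.5     # exposure effect when modifier = 0
b_modifier = 0.2     # modifier effect when exposure = 0 (not needed below)
b_interaction = 0.4  # coefficient on the product term exposure * modifier

# Stratum-specific odds ratios for the exposure.
or_modifier0 = math.exp(b_exposure)
or_modifier1 = math.exp(b_exposure + b_interaction)

print(f"OR for exposure when modifier = 0: {or_modifier0:.2f}")
print(f"OR for exposure when modifier = 1: {or_modifier1:.2f}")
print(f"Ratio of the two ORs: {or_modifier1 / or_modifier0:.2f}")
print(f"exp(interaction coefficient): {math.exp(b_interaction):.2f}")
```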

**Kiffer:** Seventh takeaway. Predicted probabilities come from the inverse logit transformation. Decision thresholds for classification are not statistical defaults. They depend on the relative costs of false positives and false negatives in your application.
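
One common way to formalize "depends on the relative costs" is to classify as positive whenever the expected cost of a missed case exceeds the expected cost of a false alarm, which gives a threshold of cost_fp / (cost_fp + cost_fn). The sketch below uses that rule with invented coefficients and costs. It is one reasonable framing, not the only one.

```python
import math

def cost_based_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical model: logit(p) = -2.0 + 0.05 * age
b0, b_age = -2.0, 0.05
age = 30
p = 1 / (1 + math.exp(-(b0 + b_age * age)))  # inverse logit

equal_costs = cost_based_threshold(cost_fp=1, cost_fn=1)
costly_miss = cost_based_threshold(cost_fp=1, cost_fn=4)

print(f"Predicted probability at age {age}: {p:.3f}")
print(f"Equal costs: threshold {equal_costs:.2f}, classify positive? {p > equal_costs}")
print(f"Miss costs 4x: threshold {costly_miss:.2f}, classify positive? {p > costly_miss}")
```

The same predicted probability leads to different decisions once a missed case is treated as more costly than a false alarm.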

**Sarah:** Eighth takeaway. Goodness of fit. Use the likelihood ratio test for comparing nested models. Use the Hosmer-Lemeshow test for absolute calibration, but be aware of its limitations. It's sensitive to the number of groups and to sample size.
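
The likelihood ratio test itself is simple arithmetic once you have the two log-likelihoods. The sketch below uses made-up log-likelihood values and a small hand-written chi-square tail function so that it runs without any statistics library. In practice you would take the p-value from your software.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite correction series.
    total = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """LR statistic and p-value for nested models differing by df parameters."""
    stat = 2 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)

# Hypothetical fits: the full model adds a three-level categorical predictor (2 df).
stat, p_value = likelihood_ratio_test(loglik_reduced=-120.5, loglik_full=-115.2, df=2)
print(f"LR statistic = {stat:.2f}, p = {p_value:.4f}")
```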

**Kiffer:** Ninth takeaway. Calibration and discrimination are different things. A well-calibrated model produces predicted probabilities that match observed frequencies. A well-discriminating model rank-orders cases above non-cases. The Area Under the Receiver Operating Characteristic Curve, abbreviated AUC, quantifies discrimination. Calibration plots assess calibration. A model can be good at one and bad at the other.
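
A small sketch of the "good at one, bad at the other" point. It computes the AUC as the share of case and non-case pairs that are ranked correctly, then distorts the predictions in a way that keeps the ranking but ruins the match to observed frequencies. The predictions and outcomes are invented.

```python
def auc(scores, labels):
    """Proportion of (case, non-case) pairs where the case scores higher; ties count half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                concordant += 1
            elif c == n:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical predicted probabilities and observed outcomes.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

# Cubing preserves the rank order but shrinks every probability.
distorted = [p ** 3 for p in probs]

observed_rate = sum(labels) / len(labels)
mean_original = sum(probs) / len(probs)
mean_distorted = sum(distorted) / len(distorted)

print(f"AUC original = {auc(probs, labels):.4f}, distorted = {auc(distorted, labels):.4f}")
print(f"Observed rate = {observed_rate:.2f}")
print(f"Mean prediction original = {mean_original:.2f}, distorted = {mean_distorted:.2f}")
```

Comparing the mean prediction to the observed rate is only the crudest calibration check, but it is enough to show the two properties coming apart.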

**Sarah:** Tenth takeaway. Influential observations are identified by Cook's distance and DFBETAs. Investigate them, don't automatically delete them. If they're data-entry errors, fix them. If they're legitimate, consider reporting analyses with and without them.

**Kiffer:** Eleventh takeaway. Separation breaks maximum likelihood estimation. Solutions include Firth's penalized maximum likelihood and merging sparse categories. Don't ignore weird coefficient estimates or convergence warnings. They're usually telling you something real about the data.

**Sarah:** Twelfth takeaway. Two specialised variants. Exact logistic regression for very small or severely unbalanced samples where standard maximum likelihood fails. Conditional logistic regression for matched case-control studies, where you condition the stratum parameters out and get unbiased odds ratios on your exposure of interest.

**Sarah:** And one final framing point. Logistic regression is a member of a broader family called generalized linear models, which we'll see again later when we get to Poisson regression for count outcomes. The whole family shares the same logic. A linear predictor on a transformed scale, plus a distribution for the outcome. Logistic regression is just the binary special case.

**Kiffer:** Next up is Lesson 6, Ordinal and Multinomial Models, for outcomes that have more than two categories. The link function changes, but the model-building discipline you built up here carries over directly.

**Sarah:** Take the rest of the day. Go back to one of the homework datasets if you have one, fit a logistic regression, and practice translating the coefficient output into odds ratios and confidence intervals. The mechanical fluency really matters.

**Kiffer:** Take care, everyone.

**Sarah:** See you next time.