Logistic Regression

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Understand log odds as a measure of disease and how it relates to a linear combination of predictors
Build and interpret logistic regression models
Compute and interpret odds ratios derived from a logistic regression model
Understand how logistic regression fits in the family of generalised linear models
Evaluate logistic regression models using goodness-of-fit tests, ROC curves, and residual analysis
Fit conditional logistic regression models for matched data

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Binary Outcome An outcome variable taking only two values (e.g., diseased/healthy, yes/no, 1/0). The natural target of logistic regression.

Probability (p) The chance an event occurs, ranging from 0 to 1. Logistic regression models the probability of the outcome as a function of predictors.

Odds The ratio of the probability of the event to its complement: p / (1−p). Ranges from 0 to infinity. Multiplicative in nature, easy to compare across groups.

Log Odds (Logit) The natural logarithm of the odds: logit(p) = log(p/(1−p)). Ranges from −∞ to +∞; the linear scale on which logistic regression is fit.

Odds Ratio (OR) The ratio of odds in two groups. The exponentiated logistic regression coefficient: OR = exp(β). OR > 1 indicates higher odds; OR < 1 lower.

Separation When a predictor (or combination) perfectly classifies the outcome, the maximum likelihood estimate diverges to infinity. Solutions include exact logistic regression or penalised likelihood (Firth).

Conditional Logistic Regression A logistic model for matched-pair or matched-set data. Conditions on the matched-set totals, eliminating matching variables from the likelihood.

Outcome Prevalence The proportion of cases in the dataset. Affects intercept interpretation and influences whether OR approximates RR (close only when outcome is rare).

Methods & Statistical Concepts

Logit Link Function The link function in logistic regression: g(p) = log(p/(1−p)). Maps probabilities (0,1) onto the real line so a linear model can be fit.

Generalised Linear Model (GLM) A unified framework (McCullagh & Nelder, 1989) extending linear regression to any exponential-family distribution via a link function. Logistic, Poisson, and linear regression are all GLMs.

Maximum Likelihood Estimation (MLE) An estimation method that picks parameter values maximising the likelihood of observing the data. The standard approach for fitting logistic regression.

Deviance −2 times the log-likelihood (measured relative to a saturated model), used as the GLM analogue of the residual sum of squares. Differences in deviance between nested models approximately follow a χ² distribution.

Wald Test A test of a coefficient using (β/SE)² ~ χ²₁. Standard in regression output but unreliable when SE is very large (e.g., separation).

Hosmer-Lemeshow Test A goodness-of-fit test grouping observations into deciles of predicted probability and comparing observed to expected counts. Sensitive to group choice and sample size.

ROC Curve A plot of sensitivity against 1 − specificity across all classification thresholds. Summarises a model's discriminative ability.

AUC / C-Statistic Area under the ROC curve. AUC = 0.5 means no discrimination; 1.0 means perfect. Often used to compare predictive models.

Pseudo-R² Various analogues of linear-regression R² for GLMs (e.g., McFadden's, Cox-Snell, Nagelkerke). Should not be interpreted as proportion of variance explained.

Exact Logistic Regression An estimation method based on conditional permutation distributions, useful with small samples or sparse data where MLE fails or is biased.

Firth's Penalised Likelihood A bias-reducing modification to MLE that yields finite estimates under separation, often preferred over exact methods for moderate samples.

Key People

David R. Cox (1924–2022) British statistician who introduced logistic regression as a tool for binary data (1958) and later the proportional hazards (Cox) model for survival analysis.

McCullagh & Nelder Peter McCullagh (1952– ) and John Nelder (1924–2010), British statisticians whose 1983/1989 textbook formalised the unified theory of generalised linear models.

Hosmer & Lemeshow David Hosmer and Stanley Lemeshow, American biostatisticians whose textbook Applied Logistic Regression standardised practice including the goodness-of-fit test and purposeful model selection.


Section 2

Interpreting Coefficients & Assessing Confounding

⏱ Estimated time: 20 minutes

Section 2 of 4

Checking the assumptions, testing the model, reading each coefficient as an odds ratio, and spotting confounding and interaction.

The two assumptions

What logistic regression assumes

Independence

Each observation's outcome stands on its own. Clustered data (patients within a hospital) violate this.

Linearity on the logit scale

Each one-unit rise in a continuous predictor shifts the log odds by a constant amount. On the probability scale the curve is S-shaped, not straight.

Testing the model

Likelihood ratio test and Wald test

Likelihood ratio statistic, full vs null (Eq 16.9)

\[ \color{#0B7B6B}{G^2_0} = 2(\ln \color{#C2410C}{L} - \ln \color{#6D28D9}{L_0}) \]

G²₀: likelihood ratio statistic; L: likelihood of the full model; L₀: likelihood of the null model

Compares a fitted model to the intercept-only model (or any two nested models). Follows a chi-squared distribution; degrees of freedom equal the number of added predictors.

The Wald test divides one coefficient by its standard error and compares the result with a standard normal distribution. Most software reports it by default.

A caution

When the Wald test misleads

Near a probability of 0 or 1, or with a small sample, the coefficient and its standard error are poorly approximated, so the Wald test can mislead. The likelihood ratio test has better properties and is preferred there (Vittinghoff & McCulloch, 2007).

Interpretation, part 1

Dichotomous and continuous predictors

Two-state predictor

Coefficient = log odds ratio for group 1 vs group 0. If β = 0.69, OR = e^0.69 = 2.0: twice the odds.

Continuous predictor

OR per unit = e^β. For any change, OR = e^{β(x₂−x₁)}. Age β = 0.04 → per 10 years, e^0.4 = 1.49.

OR for a change of (x₂ − x₁) units (Eq 16.13)

\[ \color{#0B7B6B}{\text{OR}} = e^{\color{#C2410C}{\beta}(\color{#1D4ED8}{x_2} - \color{#1D4ED8}{x_1})} \]

OR: odds ratio for the change; β: coefficient; x₂ − x₁: size of the change

Interpretation, part 2

Categorical predictors and the intercept

Categorical (3+ levels)

Split into indicator variables against one reference level. Test the whole set with a multi-df Wald test or a likelihood ratio test.

Intercept (β₀)

Log odds when all predictors are zero. Often not meaningful on its own, but needed for predicted probabilities.

Effects on the probability scale are non-linear: the same predictor change moves probability differently depending on the baseline.

Confounding and interaction

When another variable changes the picture

Confounding

Add the suspected confounder; if the coefficient of interest shifts by more than about 10–20%, keep it in, regardless of its own p-value.

Interaction (effect modification)

Add a cross-product term. A significant term means the odds ratio for one variable depends on the level of another.

Multiplicative interaction on the logit scale does not imply additive interaction on the risk scale, an important distinction for public-health interpretation (Knol et al., 2008).

Introduction and Overview

An earlier section set up the model and showed how its coefficients become odds ratios on the exponentiated scale. This section turns to the same questions you asked of linear regression in an earlier lesson: are the assumptions met, is the overall model significant, what do the individual coefficients mean, and how do confounding and interaction enter the picture? Most of the framework is identical; the differences are mostly in how we compute and interpret coefficients on the log-odds scale.

Learning Objectives

State the two key assumptions of logistic regression and how to check them.
Compare the likelihood ratio test and Wald test for the overall model and individual coefficients.
Interpret coefficients for dichotomous, categorical, and continuous predictors as adjusted odds ratios.
Use changes in coefficients and stratified analysis to evaluate confounding and interaction.

Assumptions of Logistic Regression

Logistic regression requires two key assumptions: (1) independence of observations, and (2) linearity on the logit scale, that is, the relationship between each continuous predictor and the log odds of the outcome is linear. Note that the relationship on the probability scale will be non-linear (S-shaped).

Testing the Overall Model

The likelihood ratio test (LRT) compares the fitted model to the null model (intercept only). The test statistic is:

Likelihood ratio test: full vs null model (Eq 16.9)

\[ \color{#0B7B6B}{G^2_0} = 2(\ln \color{#C2410C}{L} - \ln \color{#6D28D9}{L_0}) \]

The likelihood ratio statistic is twice the difference between the log-likelihood of the full model and that of the null model. Larger values favour the full model.

This statistic follows an approximate chi-squared distribution with degrees of freedom equal to the number of predictors. It can also be used to compare any two nested models (Eq 16.10), a full model versus a reduced model, to test whether the excluded variables contribute significantly.
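As a minimal sketch of this calculation, the statistic can be computed by hand from the two log-likelihoods and checked against R's built-in comparison of nested models. The data are simulated and the variable names (x1, x2, y) are made up for illustration; they are not part of the course dataset.

```r
# Simulated data for illustration only (hypothetical variables x1, x2, y)
set.seed(410)
n  <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 0.5 * x2))

fit_null <- glm(y ~ 1,       family = binomial)   # intercept only
fit_full <- glm(y ~ x1 + x2, family = binomial)   # two predictors

# G^2_0 = 2 * (ln L - ln L_0)
g2 <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))
df <- length(coef(fit_full)) - length(coef(fit_null))   # number of added predictors
p_value <- pchisq(g2, df = df, lower.tail = FALSE)

# The same test from the built-in comparison of nested models
lrt <- anova(fit_null, fit_full, test = "Chisq")
c(G2 = g2, df = df, p = p_value)
```

For ungrouped binary data the statistic is also the drop in residual deviance between the two models, which is what the anova() table reports in its Deviance column.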

The Wald test divides the coefficient by its standard error (the ratio is compared with a standard normal, or Z, distribution) and is more commonly reported by software. However, Wald tests can be unreliable when the true probability is near 0 or 1, or when the sample size is small (Vittinghoff & McCulloch, 2007).

⚠ Wald Test Limitations

The Wald test can be unreliable when the estimated probability is near the boundary (0 or 1), because the coefficient estimate and its standard error may be poor approximations. In such cases, the likelihood ratio test is preferred as it has better statistical properties.
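An extreme toy case makes the problem visible. In the sketch below the ten made-up observations are completely separated by x, so every fitted probability sits at the boundary. The Wald p-value suggests no association at all, while the likelihood ratio test shows strong evidence. This illustrates the failure mode only; it is not a recommended analysis of separated data.

```r
# Toy data with complete separation: x <= 5 always gives y = 0, x >= 6 always gives y = 1
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# glm() warns that fitted probabilities are numerically 0 or 1; suppressed here
fit      <- suppressWarnings(glm(y ~ x, family = binomial))
fit_null <- glm(y ~ 1, family = binomial)

# Wald test: estimate / SE, compared with a standard normal
wald_p <- coef(summary(fit))["x", "Pr(>|z|)"]

# Likelihood ratio test: drop in deviance on 1 df
g2    <- deviance(fit_null) - deviance(fit)
lrt_p <- pchisq(g2, df = 1, lower.tail = FALSE)

c(wald_p = wald_p, lrt_p = lrt_p)
```

The coefficient estimate is very large but its standard error is far larger, so the Wald ratio collapses towards zero even though x predicts the outcome perfectly.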

Interpreting Coefficients

Dichotomous Predictors

For a dichotomous predictor (coded 0/1), the coefficient β represents the log odds ratio comparing the group coded 1 to the group coded 0, adjusted for all other variables. The odds ratio is simply OR = e^β. For example, if β_smoking = 0.69, then OR = e^0.69 = 2.0, meaning the odds of the outcome are twice as high for smokers compared to non-smokers.

Continuous Predictors

For a continuous predictor, β represents the change in the log odds for each 1-unit increase in the predictor. The OR = e^β gives the multiplicative change in odds per unit increase. To compute the OR for any arbitrary change from x₁ to x₂:

OR for a change of (x₂ − x₁) units (Eq 16.13)

\[ \color{#0B7B6B}{\text{OR}} = e^{\color{#C2410C}{\beta}(\color{#1D4ED8}{x_2} - \color{#1D4ED8}{x_1})} \]

For a change from one predictor value to another, the odds ratio is the exponential of the coefficient times the size of that change.

For example, if β_age = 0.04, the OR per 10-year increase in age is e^{0.04 × 10} = e^0.4 = 1.49.
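The two worked examples above can be reproduced in a few lines. The helper or_for_change() is a hypothetical name introduced here for illustration; it simply applies Eq 16.13.

```r
# Odds ratio for a change from x1 to x2, given a log-odds coefficient beta (Eq 16.13)
or_for_change <- function(beta, x1, x2) exp(beta * (x2 - x1))

or_smoking <- exp(0.69)                    # dichotomous predictor, about 2.0
or_age_1   <- or_for_change(0.04, 0, 1)    # per 1 year, about 1.04
or_age_10  <- or_for_change(0.04, 50, 60)  # per 10 years, about 1.49

round(c(smoking = or_smoking, age_1yr = or_age_1, age_10yr = or_age_10), 2)
```

Only the size of the change matters, not the starting value: going from age 50 to 60 gives the same odds ratio as going from 20 to 30, and the 10-year odds ratio is the 1-year odds ratio raised to the tenth power.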

Categorical Predictors (Multiple Levels)

Categorical predictors with more than two levels are represented using indicator (dummy) variables. One category serves as the baseline/reference, and each coefficient represents the log OR comparing that category to the reference. To evaluate the overall significance of the categorical variable, use a multi-degree-of-freedom Wald test or an LRT comparing models with and without the entire set of indicator variables.
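A minimal sketch of this workflow follows, using simulated data and a hypothetical three-level smoking variable. R builds the indicator variables automatically from a factor, with the first level as the reference.

```r
set.seed(410)
n <- 600
# Hypothetical three-level exposure; "never" is the reference level
smoke <- factor(sample(c("never", "former", "current"), n, replace = TRUE),
                levels = c("never", "former", "current"))
age   <- rnorm(n, mean = 50, sd = 10)
lp    <- -3 + 0.04 * age + 0.3 * (smoke == "former") + 0.7 * (smoke == "current")
y     <- rbinom(n, 1, plogis(lp))

fit_with    <- glm(y ~ age + smoke, family = binomial)
fit_without <- glm(y ~ age,         family = binomial)

# Each coefficient is a log OR versus the reference level
exp(coef(fit_with)[c("smokeformer", "smokecurrent")])

# One 2-df likelihood ratio test for the whole set of indicators
lrt <- anova(fit_without, fit_with, test = "Chisq")
lrt
```

The overall test has two degrees of freedom because a three-level variable contributes two indicator variables; the individual Wald p-values in summary(fit_with) answer a different question (each level versus the reference).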

Intercept Interpretation

The intercept (β₀) represents the logit of the probability of the outcome when all predictors equal zero. On the probability scale, this is: p = 1/(1 + e^(−β₀)). The intercept is often not substantively meaningful (e.g., if age = 0 is not a plausible value), but it is essential for computing predicted probabilities. Note that effects on the probability scale are non-linear: the same change in a predictor produces different changes in probability depending on the baseline values of all predictors.
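The sketch below illustrates both points with made-up coefficients: converting an intercept to a probability, and applying the same odds ratio at different baseline probabilities. plogis() is the inverse logit and qlogis() is the logit in base R.

```r
# Hypothetical coefficients, for illustration only
b0 <- -2.0          # intercept: log odds when all predictors are zero
b1 <- log(2)        # a binary exposure with OR = 2

p_baseline <- plogis(b0)          # 1 / (1 + exp(-b0)), about 0.119
p_exposed  <- plogis(b0 + b1)     # about 0.213

# Same OR = 2 applied at different baseline probabilities
baseline <- c(0.05, 0.20, 0.50, 0.80)
exposed  <- plogis(qlogis(baseline) + b1)
round(data.frame(baseline, exposed, risk_difference = exposed - baseline), 3)
```

The odds ratio is 2 in every row, yet the change in probability is largest near a baseline of 0.5 and much smaller near 0 or 1.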

R Activity: fit a logistic regression and convert coefficients to odds ratios

The course dataset includes a binary hypertension outcome (Yes/No) we will use here. The full annotated script is in r-activities/HSCI_410_Lesson_5_Logistic_Regression.R; the highlights:

# 0. Load + ensure outcome has the REFERENCE level FIRST ("No")
phaa <- read.csv("phaa_survey_clean.csv", stringsAsFactors = FALSE)
phaa$hypertension <- factor(phaa$hypertension, levels = c("No", "Yes"))
phaa$smoker       <- factor(phaa$smoker,       levels = c("No", "Yes"))

# 1. Crude (unadjusted) model -----------------------------------------------
glm_1 <- glm(hypertension ~ smoker,
             data   = phaa,
             family = binomial(link = "logit"))
summary(glm_1)
exp(coef(glm_1))      # crude odds ratio
exp(confint(glm_1))   # 95% CI for the OR

# 2. Adjusted (multivariable) model -----------------------------------------
glm_2 <- glm(hypertension ~ smoker + age + gender + bmi + dep_score,
             data   = phaa,
             family = binomial)
summary(glm_2)
or  <- exp(coef(glm_2))
ci  <- exp(confint(glm_2))
round(cbind(OR = or, ci, p = summary(glm_2)$coef[,"Pr(>|z|)"]), 3)

# 3. Likelihood-ratio test for nested models ---------------------------------
anova(glm_1, glm_2, test = "Chisq")

# 4. Goodness-of-fit and discrimination --------------------------------------
library(generalhoslem);  library(DescTools);  library(pROC)
logitgof(obs = glm_2$y, fitted(glm_2))   # Hosmer-Lemeshow
PseudoR2(glm_2, which = "all")             # McFadden / Nagelkerke
phaa$pred_htn <- predict(glm_2, newdata = phaa, type = "response")  # NA where a covariate is missing
auc(roc(phaa$hypertension, phaa$pred_htn,
        levels = c("No", "Yes")))

Read the table. An OR of 2.0 for smokerYes means smokers have twice the odds of hypertension as non-smokers, holding age, gender, BMI, and depression score constant. Confounding check (a later lesson of an earlier course): if the crude OR from glm_1 differs meaningfully from the adjusted OR in glm_2, one or more of the added covariates is confounding the smoking-hypertension relationship.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From exp(coef(glm_1)) and exp(coef(glm_2)), what are the crude and adjusted odds ratios for smokerYes? By what percent did the OR change after adjustment? Does that change exceed a 10% rule-of-thumb threshold for confounding?

Model answer: exp(coef(glm_1)) typically gives a crude OR for smokerYes around 1.85, and exp(coef(glm_2)) after adjustment around 1.55, a roughly 16% reduction. That exceeds the 10% rule-of-thumb threshold and signals that one or more measured covariates (age, BMI, sex) was confounding the crude smoking-hypertension association. The interpretation: after adjustment, smokers have ~55% higher odds of hypertension than non-smokers, accounting for measured confounders, meaningfully smaller than the unadjusted gap but still substantial.

2. From the tidied table (round(cbind(OR, ci, p), 3)), which predictors have a 95% CI that excludes 1.0? Pick one and translate its OR into a one-sentence interpretation on the odds scale.

Model answer: Predictors with 95% CI excluding 1.0 typically include age, BMI, and smokerYes; sex may be borderline. Example interpretation: an OR of 1.55 for smokerYes (95% CI 1.21, 1.99) means smokers have 1.55-fold higher odds of hypertension compared with non-smokers, holding other covariates constant. The CI excludes 1, so the effect is statistically significant at α = 0.05.

3. Report the Hosmer-Lemeshow p-value, the McFadden pseudo R-squared, and the AUC from the ROC curve. Does the model fit well, and how good is its discrimination between people with and without hypertension?

Model answer: Hosmer-Lemeshow p-value typically around 0.30–0.50 (high p means no evidence of poor fit, which is good); McFadden pseudo-R² around 0.10–0.15 (modest, but interpreted differently from linear R²; values 0.2–0.4 indicate good fit on this scale); AUC around 0.75 (acceptable discrimination, conventionally 0.7–0.8 = acceptable, 0.8–0.9 = good). The model fits adequately and discriminates moderately well between people with and without hypertension.


Assessing Confounding and Interaction

Assessing Confounding

To assess whether a variable is a confounder, add it to the model and check whether the coefficient of the primary predictor of interest changes substantially. A common rule of thumb is a change of more than 10–20% in the coefficient (or OR). If the coefficient changes meaningfully, the variable should be retained as a confounder regardless of its statistical significance.
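The change-in-estimate check is simple arithmetic. The sketch below uses illustrative crude and adjusted odds ratios of 1.85 and 1.55 (not output from any real model) and computes the percentage change on both the OR scale and the coefficient (log OR) scale.

```r
# Illustrative crude and adjusted odds ratios (not real model output)
or_crude    <- 1.85
or_adjusted <- 1.55

# Percentage change on the OR scale and on the coefficient (log OR) scale
pct_change_or   <- 100 * (or_adjusted - or_crude) / or_crude
pct_change_beta <- 100 * (log(or_adjusted) - log(or_crude)) / log(or_crude)

round(c(OR_scale = pct_change_or, coefficient_scale = pct_change_beta), 1)

# Flag against a 10% change-in-estimate threshold
abs(pct_change_or) > 10
```

The two scales give different percentages for the same pair of models (roughly 16% on the OR scale here, and a larger change on the coefficient scale), so state which scale you used when you apply the rule of thumb.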

Assessing Interaction (Effect Modification)

Interaction is assessed by adding cross-product terms (e.g., x₁ × x₂) to the model. When an interaction is present, the odds ratio for one variable varies depending on the level of the interacting variable. For example, if smoking interacts with sex, the OR for smoking would differ between males and females. Test the interaction term using an LRT or Wald test. If significant, the main effects alone are insufficient to describe the relationship. Note that multiplicative interaction on the logit scale does not necessarily imply additive interaction on the risk scale, an important consideration for public-health interpretation (Knol et al., 2008).
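To make the smoking-by-sex example concrete, the sketch below fits a model with a cross-product term to hypothetical grouped counts and recovers the odds ratio for smoking separately in each sex. The counts and variable names are invented for illustration.

```r
# Hypothetical grouped counts: events out of n in each smoking-by-sex cell
d <- data.frame(
  smoker = c(0, 1, 0, 1),
  female = c(0, 0, 1, 1),
  events = c(20, 40, 30, 35),
  n      = c(100, 100, 100, 100)
)

fit_main <- glm(cbind(events, n - events) ~ smoker + female,
                data = d, family = binomial)
fit_int  <- glm(cbind(events, n - events) ~ smoker * female,
                data = d, family = binomial)

b <- coef(fit_int)
or_smoking_male   <- exp(b[["smoker"]])                          # female = 0
or_smoking_female <- exp(b[["smoker"]] + b[["smoker:female"]])   # female = 1

# Likelihood ratio test of the cross-product term (1 df)
lrt <- anova(fit_main, fit_int, test = "Chisq")

round(c(male = or_smoking_male, female = or_smoking_female), 2)
```

With an interaction in the model, the coefficient for smoker alone is the log OR only in the group coded 0; the OR in the other group requires adding the interaction coefficient before exponentiating. Here the two stratum-specific odds ratios match what you would get from the separate 2×2 tables for each sex.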

✎ Reflection

Imagine you are fitting a logistic regression model for a health outcome. How would you decide whether to report odds ratios per 1-unit increase or per a larger clinically meaningful increment for continuous predictors? Why does this matter for interpretation?

Model answer: For a continuous predictor like age, reporting OR per 1-unit increase makes the OR very close to 1 (e.g., OR = 1.04 per year of age) and hard to interpret clinically. Better practice: report OR per 10-year increase (OR = 1.48 = 1.04^10) or per IQR or per SD. The choice depends on the clinical/policy unit of intervention: BMI per 5-unit increment, BP per 10-mmHg increment, cholesterol per 1-mmol/L increment. This matters because (a) it makes the magnitude of association visible to a clinical reader, (b) it scales the OR to a clinically actionable change, and (c) it avoids the impression of a tiny effect that is actually large in standard units. Always state the unit increment explicitly in the table and report both the raw coefficient and the scaled OR.


Section 3

Evaluating Logistic Regression Models

⏱ Estimated time: 25 minutes

Section 3 of 4

Residuals, goodness-of-fit, discrimination and calibration, overdispersion, pseudo R-squared, and influential observations.

Before you evaluate

Building the model, and how many events you need

Same discipline as linear regression: start from a causal diagram, run univariable analyses, check linearity on the logit scale, and use automated selection with caution.

Minimum events (outcomes): 10(k + 1), where k is the number of predictors. Five predictors → at least 60 events.

Residuals

Where does the model fit badly?

A residual is the gap between the observed outcome and the predicted probability. Two flavours, scaled differently:

Pearson residuals

Standardise the observed-minus-expected gap by its expected variability.

Deviance residuals

Based on each pattern's contribution to the overall lack of fit.

Large residuals flag covariate patterns the model fits poorly. They are also the building blocks of goodness-of-fit tests (Eq 16.16).

Goodness of fit

The Hosmer-Lemeshow test

Sort everyone by predicted probability, split into ten groups (deciles of risk), and in each group compare observed outcomes to expected. Good calibration means they match across all groups.

A non-significant (high) p-value is the good result: no evidence of misfit. With large samples, though, it can reject trivial misfit, so interpret it alongside other measures (Hosmer & Lemeshow, 1980).

Discrimination

Sensitivity, specificity, and the ROC curve

Term	Meaning
Sensitivity	True cases correctly flagged (true positive rate)
Specificity	True non-cases correctly cleared (true negative rate)
Cutpoint	Probability threshold for calling someone positive

Lower the cutpoint: more sensitivity, less specificity. The ROC curve shows this trade-off across all cutpoints.

AUC and calibration

Two different questions

Discrimination (area under the curve)

Chance the model gives a true case a higher predicted risk than a true non-case. 0.5 = coin flip; 0.7–0.8 acceptable; 0.8–0.9 good; 1.0 perfect.

Calibration

When the model says 20% risk, do about 20% actually have the outcome? Checked with a calibration plot of predicted vs observed.

Hanley & McNeil (1982) on the ROC curve; Steyerberg and colleagues (2010) on calibration.

Overdispersion

More variability than the model expects

Apparent

A false alarm from many sparse covariate patterns. Not real extra variation; lean on the Hosmer-Lemeshow test instead.

Real

Genuine extra variation, often from clustering. Fix by adjusting standard errors or using models for clustered data.

Detect it by the ratio of Pearson chi-squared (or deviance) to its degrees of freedom: near 1 is fine, well above 1 signals overdispersion.

Pseudo R-squared and influence

Explained variation and outsized points

Pseudo R-squared

McFadden, Cox-Snell, Nagelkerke approximate explained variation. Not interchangeable; run lower than linear R-squared.

Influential observations

Outliers, high leverage, and delta-beta / delta-chi-squared / delta-deviance flag points that move a coefficient or the fit too much.

Birth weight example: an area under the curve of 0.623 indicates weak discrimination. The model can still be useful for studying risk factors, but the predictors explain little of the outcome.

Introduction and Overview

An earlier section covered model construction and interpretation. This section turns to model evaluation: residuals, formal goodness-of-fit tests (Hosmer & Lemeshow, 1980), predictive ability via discrimination (ROC curves; Hanley & McNeil, 1982) and calibration, the question of overdispersion, pseudo-R² statistics, and influential observations. Each gives a different angle on whether the model you've built is actually fit for the question you're asking (Steyerberg et al., 2010).

Learning Objectives

Distinguish Pearson and deviance residuals and use them to flag poorly fit covariate patterns.
Apply the Hosmer–Lemeshow test and interpret its result alongside other goodness-of-fit measures.
Quantify predictive ability using ROC curves, AUC, and calibration plots.
Recognise overdispersion in binomial data and explain how pseudo-R² statistics summarise model fit.
Identify influential observations using leverage and Cook's distance equivalents for logistic models.

Model-Building Process

The model-building process for logistic regression follows the same general principles as for linear regression (Chapter 15): develop a causal diagram, perform unconditional (univariable) analyses, evaluate linearity of continuous predictors on the logit scale, and use automated selection methods with caution. Subject matter knowledge should guide decisions at every step.

Covariate Patterns and Data Structure

A covariate pattern is a unique combination of predictor values. Whether the data are treated as binary (one observation per row) or binomial/grouped (multiple observations per covariate pattern) has implications for how residuals and goodness-of-fit statistics are computed and interpreted.

Sample Size Rule

A commonly used minimum sample size guideline for logistic regression is at least 10(k + 1) positive outcomes, where k is the number of predictors. For example, if you have 5 predictors, you need at least 10(5 + 1) = 60 positive outcomes (events). Having fewer events can lead to unreliable coefficient estimates and model instability, though simulation work has shown this rule can be relaxed in some scenarios (Vittinghoff & McCulloch, 2007).
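The rule is easy to turn into a planning calculation. In the sketch below, min_events() is a hypothetical helper name and the 20% prevalence is an assumed value chosen only to show how the event requirement translates into a total sample size.

```r
# Rule-of-thumb minimum number of events for k predictors: 10 * (k + 1)
min_events <- function(k) 10 * (k + 1)

min_events(5)          # 60

# Rough total sample size if the outcome prevalence is known (assumed 20% here)
prevalence <- 0.20
ceiling(min_events(5) / prevalence)   # 300
```

A categorical predictor with several levels uses one coefficient per indicator variable, so count estimated coefficients rather than variable names when choosing k.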

Residuals

Pearson residuals and deviance residuals are used to assess model fit at the level of individual covariate patterns (Eq 16.16). Both types compare observed outcomes to predicted probabilities, but they differ in how discrepancies are scaled. These residuals are the building blocks of several goodness-of-fit tests.
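As a sketch of how the two residual types are scaled, the code below computes them by hand for ungrouped binary data and compares them with R's residuals() function. The data are simulated; with a continuous predictor, each observation is effectively its own covariate pattern. The hand formulas are the standard ones for a binary outcome and are not copied from Eq 16.16.

```r
set.seed(410)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Pearson residual for binary data: (observed - expected) / sqrt(variance)
pearson_manual <- (y - p_hat) / sqrt(p_hat * (1 - p_hat))

# Deviance residual: signed square root of each observation's deviance contribution
dev_manual <- sign(y - p_hat) *
  sqrt(-2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat)))

pearson_r  <- residuals(fit, type = "pearson")
deviance_r <- residuals(fit, type = "deviance")

# Observations the model fits worst
head(order(abs(pearson_r), decreasing = TRUE))
```

Summing the squared Pearson residuals gives the Pearson χ² statistic, and summing the squared deviance residuals gives the residual deviance, which is why these residuals underlie the goodness-of-fit tests.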

Goodness-of-Fit Tests

Pearson χ²

Hosmer-Lemeshow

ROC Curve

Predictive Ability

Discrimination is most often summarised by the area under the receiver operating characteristic (ROC) curve (Hanley & McNeil, 1982), while calibration assesses agreement between predicted and observed risks (Steyerberg et al., 2010).

Figure: A receiver operating characteristic (ROC) curve rising above the diagonal chance line, with an area under the curve of about 0.62. The curve traces sensitivity against one minus specificity across all classification thresholds. The area under it (here about 0.62, as in the birth-weight example) is the probability that a randomly chosen case is ranked above a randomly chosen non-case; 0.5 is chance.

Concept	Definition	Also Known As
Sensitivity	Proportion of true positives correctly identified by the model	True positive rate
Specificity	Proportion of true negatives correctly identified by the model	True negative rate
Cutpoint	The predicted probability threshold above which subjects are classified as positive	Classification threshold

Selecting a cutpoint involves a trade-off between sensitivity and specificity. A lower cutpoint increases sensitivity but decreases specificity, and vice versa. The ROC curve provides a visual summary of this trade-off across all possible cutpoints.
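The trade-off can be checked by hand on a tiny made-up example. The sketch below uses only base R; sens_spec() and auc_rank() are hypothetical helper names, and the AUC is computed with the rank (Mann-Whitney) formula rather than a dedicated ROC package.

```r
# Tiny hand-checkable example: observed outcome and predicted probability
y <- c(0, 0, 1, 1)
p <- c(0.10, 0.40, 0.35, 0.80)

sens_spec <- function(y, p, cutpoint) {
  pred_pos <- p >= cutpoint
  c(sensitivity = sum(pred_pos & y == 1) / sum(y == 1),
    specificity = sum(!pred_pos & y == 0) / sum(y == 0))
}

sens_spec(y, p, cutpoint = 0.5)   # sensitivity 0.5, specificity 1.0
sens_spec(y, p, cutpoint = 0.3)   # sensitivity 1.0, specificity 0.5

# AUC via the rank (Mann-Whitney) formula: the proportion of case/non-case
# pairs in which the case has the higher predicted probability (ties count 1/2)
auc_rank <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_rank(y, p)                    # 0.75
```

Lowering the cutpoint from 0.5 to 0.3 raises sensitivity and lowers specificity, while the AUC does not depend on any single cutpoint: three of the four case/non-case pairs are ranked correctly, giving 0.75.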

Overdispersion

Apparent Overdispersion

Apparent overdispersion occurs when the Pearson χ² statistic is inflated, not because of true extra-binomial variation, but because there are many covariate patterns with very few observations each. This is especially common in binary data with continuous predictors. The Hosmer-Lemeshow test is more appropriate in this situation.

Real Overdispersion

Real overdispersion occurs when there is more variability in the data than the binomial model predicts. A common cause is clustering of observations, for example patients within the same hospital may have correlated outcomes. Real overdispersion can be addressed by adjusting standard errors using a dispersion parameter or by using models that account for clustering (e.g., GEE, mixed models).

Detecting Overdispersion

Overdispersion can be detected when the ratio of the Pearson χ² (or deviance) to its degrees of freedom substantially exceeds 1. For grouped data, this ratio should be close to 1 if the model fits well. Values much greater than 1 suggest overdispersion, while values much less than 1 may suggest underdispersion or a model that is too complex.
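A minimal sketch of this check on hypothetical grouped data follows. The counts are invented so that they scatter around the fitted dose-response line more than binomial variation alone would suggest.

```r
# Hypothetical grouped data: events out of 50 subjects at each of six dose levels
d <- data.frame(dose   = 0:5,
                events = c(2, 10, 9, 30, 28, 47),
                n      = 50)

fit <- glm(cbind(events, n - events) ~ dose, data = d, family = binomial)

pearson_chisq  <- sum(residuals(fit, type = "pearson")^2)
dispersion     <- pearson_chisq / df.residual(fit)   # near 1 if the binomial variance holds
deviance_ratio <- deviance(fit) / df.residual(fit)

# One common remedy: a quasi-binomial fit keeps the same coefficients and
# multiplies each standard error by sqrt(dispersion)
fit_quasi <- glm(cbind(events, n - events) ~ dose, data = d, family = quasibinomial)

c(pearson_ratio = dispersion, deviance_ratio = deviance_ratio)
```

The quasi-binomial adjustment is one way of "adjusting standard errors using a dispersion parameter"; when the extra variation comes from clustering, a model that represents the clusters directly is usually the more informative choice.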

Pseudo-R² and Influential Observations

Pseudo-R² measures (e.g., McFadden’s, Cox-Snell, Nagelkerke) give a rough indication of how much the model improves on the intercept-only model. They are analogues of R² in linear regression but are not directly comparable with it or with one another, and they should not be read as a literal proportion of variance explained. Values tend to be lower for logistic regression than for linear regression.
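The three measures named above can all be computed from the log-likelihoods of the fitted and intercept-only models. The sketch below uses simulated data and the usual textbook formulas; it is an illustration, not a replacement for a packaged routine.

```r
set.seed(410)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.9 * x))

fit      <- glm(y ~ x, family = binomial)
fit_null <- glm(y ~ 1, family = binomial)

ll_full <- as.numeric(logLik(fit))
ll_null <- as.numeric(logLik(fit_null))

# McFadden: 1 - lnL(full) / lnL(null)
r2_mcfadden <- 1 - ll_full / ll_null

# Cox-Snell, and its rescaled version (Nagelkerke), which can reach 1
r2_coxsnell   <- 1 - exp((2 / n) * (ll_null - ll_full))
r2_nagelkerke <- r2_coxsnell / (1 - exp((2 / n) * ll_null))

round(c(McFadden = r2_mcfadden, CoxSnell = r2_coxsnell, Nagelkerke = r2_nagelkerke), 3)
```

The three numbers differ for the same model, which is why a pseudo-R² should always be reported with its name.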

Influential observations can be identified using several diagnostic measures: outliers (large residuals), leverage (unusual covariate patterns), delta-betas (influence on individual coefficients), delta-χ², and delta-deviance (influence on overall fit). These diagnostics help identify observations that disproportionately affect the model.
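Base R provides close relatives of several of these diagnostics for glm objects. The sketch below uses simulated data; treating dfbeta() as the counterpart of the delta-beta and Cook's distance as an overall influence measure is an assumption about how the terms line up, and the cut-offs are informal conventions rather than rules from the text.

```r
set.seed(410)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

lev   <- hatvalues(fit)                     # leverage
cooks <- cooks.distance(fit)                # overall influence on the fit
dbeta <- dfbeta(fit)                        # approximate change in each coefficient when a row is dropped
std_r <- rstandard(fit, type = "pearson")   # standardised Pearson residuals

# Rows to inspect: high leverage (informal cut-off 2 * p / n) or large Cook's distance (4 / n)
p <- length(coef(fit))
flagged <- which(lev > 2 * p / n | cooks > 4 / n)
head(data.frame(row = flagged, leverage = lev[flagged], cooks = cooks[flagged]))
```

A flagged row is a prompt to look at the record, not a reason to delete it; refitting without the row and comparing coefficients shows how much it actually matters.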

📊 Example: Birth Weight Model Evaluation

Returning to the low birth weight example, suppose the fitted model has an AUC of 0.623. This indicates limited predictive ability: the model does only marginally better than chance at discriminating between low and normal birth weight infants. This does not necessarily mean the model is useless for understanding risk factors; it simply means the included predictors explain only a small portion of the variation in birth weight outcomes.

✎ Reflection

Consider a logistic regression model you have encountered (or might build). How would you evaluate whether the model has adequate goodness of fit and predictive ability? Which diagnostics would be most important to check?

Model answer: Goodness-of-fit evaluation: (a) Hosmer-Lemeshow test: partitions predictions into deciles and compares observed vs. expected counts; high p-value indicates good fit (no evidence of misfit); but note: large samples make the test reject even trivial misfit. (b) Calibration plot: plot observed proportion vs. predicted probability across deciles; should be on the 45° line. (c) Brier score. Predictive ability: (a) AUC (C-statistic): probability that a random case has higher predicted probability than a random non-case; 0.7–0.8 acceptable, > 0.8 good, > 0.9 excellent. (b) Sensitivity, specificity, PPV, NPV at the optimal threshold. (c) Decision-curve analysis for clinical utility. (d) Calibration-in-the-large (mean predicted vs. observed event rate). The most important: calibration plot (for use in clinical decision-making) and AUC (for discrimination).


Section 4

GLMs, Exact & Conditional Logistic Regression

⏱ Estimated time: 20 minutes

Section 4 of 4

Generalised Linear Models, Exact & Conditional Logistic Regression

Placing logistic regression in a wider family, and two specialised versions for small or matched data.

The unifying frame

Two choices define a generalised linear model

Data type	Distribution	Link	Model
Continuous	Normal	Identity	Linear
Binary	Binomial	Logit	Logistic
Count	Poisson	Log	Poisson
Count (overdispersed)	Negative binomial	Log	NB
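In R the two choices in this table correspond to the family argument of glm() and the link it carries. The sketch below inspects the default links and fits two of the rows to simulated data; the variable names are made up for illustration.

```r
# Each family object carries its default (canonical) link
binomial()$link   # "logit"
poisson()$link    # "log"
gaussian()$link   # "identity"

# The link maps the mean to the linear-predictor scale; its inverse maps back
logit <- binomial()$linkfun
expit <- binomial()$linkinv
logit(0.2)          # log(0.2 / 0.8)
expit(logit(0.2))   # 0.2

# The same glm() call fits different rows of the table by changing the family
set.seed(410)
x     <- rnorm(200)
y_bin <- rbinom(200, 1, plogis(0.5 * x))
y_cnt <- rpois(200, exp(0.3 * x))
fit_logistic <- glm(y_bin ~ x, family = binomial)
fit_poisson  <- glm(y_cnt ~ x, family = poisson)
```

Negative binomial regression is not a built-in glm() family; in R it is commonly fitted with glm.nb() from the MASS package.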

Refinements

Canonical links, alternatives, and fitting

The canonical link is the natural default for each distribution; the logit for binary data.
Other links exist: probit and complementary log-log for binary outcomes.
All are fitted by maximum likelihood, via iterative routines (Newton-Raphson, iteratively reweighted least squares).
Quasi-likelihood needs only the mean-variance relationship, not a full distribution.

Special case 1

Exact logistic regression for small or sparse data

Ordinary logistic regression assumes a large sample. When data are tiny, severely unbalanced, or show perfect prediction (a predictor separates the outcome and pushes a coefficient toward infinity), the usual estimates break down.

Exact logistic regression uses conditional maximum likelihood for exact p-values and intervals without large-sample theory. Firth's penalised-likelihood method reduces small-sample bias and handles separation gracefully (Firth, 1993).

When to use it

Reach for exact methods when…

the sample size is very small;
the data are severely unbalanced (very few events or non-events);
perfect prediction occurs and an estimate runs to infinity;
standard maximum likelihood fails to converge.

Trade-off: computationally intensive, often impractical with many predictors.

Special case 2

Conditional logistic regression for matched data

In a matched case-control study you pair each case with controls sharing features like age and sex. Handling that with ordinary logistic regression and one indicator per matched set causes two problems:

Too many parameters

Their number grows with the number of matched sets.

Biased estimates

Especially with small strata.

A conditional likelihood (Eq 16.17) removes the per-set intercepts, giving unbiased odds ratios for the predictors of interest.

Limitations and the choice

What conditional logistic regression gives up

No intercept is estimated.
No coefficient for variables constant within a set, including the matching factors.
Only outcome-varying matched sets contribute; concordant sets drop out.
Predicted probabilities cannot be computed directly.

Standard

Adequate sample, events not rare, unmatched.

Exact

Very small, unbalanced, or perfect prediction.

Conditional

Matched case-control designs.

Lesson recap

What we covered, and what's next

From the failure of linear regression on binary outcomes, to the log-odds model and odds ratios, to testing and interpreting coefficients, to evaluating fit and prediction, to the wider family of generalised linear models with exact and conditional variants.

Next: a short reflection on a logistic model in your own field, then the knowledge check to consolidate the lesson.

Introduction and Overview

Earlier sections covered standard logistic regression. This section places it in a wider context. Logistic regression is one example of the generalised linear model (GLM) family, which also includes the linear, Poisson, and other regressions you'll meet later in this course. The section closes with two specialised variants: exact logistic regression, the small-sample alternative when standard maximum-likelihood methods fail, and conditional logistic regression for matched data.

Learning Objectives

Define a generalised linear model in terms of its random component, link function, and linear predictor.
Place logistic, linear, Poisson, and negative binomial regression within the GLM family.
Identify situations (small samples, sparse cells, perfect prediction) where exact logistic regression is preferred.
Explain when conditional logistic regression should be used for matched case-control data.

Generalised Linear Models (GLMs)

Logistic regression is a member of the broader family of Generalised Linear Models (GLMs). A GLM models the outcome through a linear predictor (the linear combination of predictors) and is defined by two key choices: (1) a link function that relates the expected value of the outcome to the linear predictor, and (2) the distribution of the outcome variable (the random component).
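Written out, the definition above takes the following general form. The notation here (Y_i for the outcome, \mu_i for its expected value, \eta_i for the linear predictor, g for the link) is an assumption made for illustration and may differ from the textbook's.

```latex
% Random component: the outcome follows a chosen distribution with mean \mu_i
Y_i \sim \text{Distribution}(\mu_i), \qquad \mu_i = E(Y_i \mid x_{i1}, \dots, x_{ik})

% Linear predictor and link function
g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}

% Logistic regression as a special case: binomial outcome, logit link
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}
```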

Data Type	Distribution	Canonical Link	Example
Continuous	Gaussian (Normal)	Identity	Linear regression
Binary	Binomial	Logit	Logistic regression
Count	Poisson	Log	Poisson regression
Count (overdispersed)	Negative Binomial	Log	Negative binomial regression
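To make the "Link" column concrete, here is a minimal Python sketch of the link functions named in this section and their inverses. The function names are illustrative choices, not taken from the course materials or from any particular library.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal, used by the probit link


def logit(p):
    """Canonical link for binary data: log odds."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Alternative binary link: inverse standard normal CDF."""
    return _std_normal.inv_cdf(p)


def inv_probit(eta):
    return _std_normal.cdf(eta)


def cloglog(p):
    """Alternative binary link: complementary log-log."""
    return math.log(-math.log(1.0 - p))


def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))


def log_link(mu):
    """Link used for Poisson and negative binomial counts."""
    return math.log(mu)


# Each link maps a probability in (0, 1) onto the whole real line.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3), round(probit(p), 3), round(cloglog(p), 3))
```

All three binary links send probabilities to an unbounded scale; only with the logit do the coefficients exponentiate to odds ratios.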

The canonical link is the “natural” link function for each distribution. For binary data, the canonical link is the logit. Non-canonical links (e.g., probit, complementary log-log for binary data) can also be used. GLMs are estimated using maximum likelihood, often with iterative algorithms such as Newton-Raphson or iteratively reweighted least squares. Quasi-likelihood estimation can be used when the full distribution is not specified, requiring only the mean-variance relationship.
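The iterative fitting mentioned above can be sketched for the simplest case: a logistic model with an intercept and one predictor, fitted by Newton-Raphson (with the canonical logit link, Newton-Raphson and iteratively reweighted least squares take the same steps). The data are a made-up 2x2 table and the function name is hypothetical; real analyses would use a statistical package.

```python
import math


def fit_logistic_newton(x, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logit(p) = b0 + b1 * x with a single predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # score vector (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)    # binomial variance, the IRLS weight
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: 30/100 exposed and 10/100 unexposed subjects are cases.
x = [1] * 100 + [0] * 100
y = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90

b0, b1 = fit_logistic_newton(x, y)
print("intercept (log odds, unexposed):", round(b0, 4))
print("slope (log odds ratio):", round(b1, 4))
print("odds ratio:", round(math.exp(b1), 3))
```

With a single binary predictor the maximum likelihood estimates have a closed form (the log odds in the unexposed group and the log of the cross-product ratio), which gives a convenient check on the iteration.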

Exact Logistic Regression

Standard logistic regression relies on large-sample approximations. When the dataset is very small or severely unbalanced (e.g., very few events), these approximations may be poor, and maximum likelihood (ML) estimates can be biased or fail to converge. Exact logistic regression uses conditional maximum likelihood to produce exact P-values and confidence intervals without relying on large-sample theory. A widely used alternative to exact methods is the penalised-likelihood approach of Firth (1993), which reduces small-sample bias and handles separation gracefully.
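For a single binary predictor, the exact approach can be sketched by hand. Conditioning on the total number of cases removes the intercept, and under the null hypothesis the number of exposed cases follows a hypergeometric distribution, so the exact test coincides with Fisher's exact test. The table below is invented for illustration, and the function name is hypothetical.

```python
from math import comb


def exact_conditional_p(a, b, c, d):
    """Two-sided exact p-value for H0: beta = 0 in logit(p) = b0 + beta * x.

    a, b = exposed cases, exposed non-cases
    c, d = unexposed cases, unexposed non-cases
    """
    n_exposed, n_unexposed = a + b, c + d
    n_cases = a + c                  # conditioning on this removes the intercept
    total = n_exposed + n_unexposed
    t_min = max(0, n_cases - n_unexposed)
    t_max = min(n_exposed, n_cases)
    denom = comb(total, n_cases)
    # Null distribution of the number of exposed cases given the margins
    probs = {
        t: comb(n_exposed, t) * comb(n_unexposed, n_cases - t) / denom
        for t in range(t_min, t_max + 1)
    }
    p_obs = probs[a]
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))


# Hypothetical sparse table: all 5 exposed subjects are cases (a zero cell),
# so the ordinary ML odds ratio is infinite.
p_value = exact_conditional_p(5, 0, 1, 4)
print(round(p_value, 4))  # prints 0.0476
```

The enumeration is trivial here, but with several predictors the set of tables to enumerate grows quickly.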

When to Use Exact Logistic Regression

Exact logistic regression is preferred when:

The sample size is very small
The data are severely unbalanced (very few events or non-events)
Perfect prediction occurs (a predictor perfectly separates outcomes, causing ML estimates to be infinite)
Standard ML estimation fails to converge

The trade-off is that exact methods are computationally intensive and may not be feasible for models with many predictors.
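Where exact enumeration is too costly, the Firth penalised-likelihood approach mentioned earlier is one alternative. The following is a minimal sketch for an intercept and one predictor, using the modified score in which each residual y - p is adjusted by h(0.5 - p), where h is the observation's leverage. The data and function name are hypothetical, and the plain iteration below omits safeguards (such as step-halving) that dedicated software typically includes.

```python
import math


def fit_firth(x, y, n_iter=200, tol=1e-10):
    """Firth-type penalised fit of logit(p) = b0 + b1 * x (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        ws = [p * (1.0 - p) for p in ps]
        # Fisher information at the current estimate
        i00 = sum(ws)
        i01 = sum(w * xi for w, xi in zip(ws, x))
        i11 = sum(w * xi * xi for w, xi in zip(ws, x))
        det = i00 * i11 - i01 * i01
        u0 = u1 = 0.0
        for xi, yi, p, w in zip(x, y, ps, ws):
            # Leverage: diagonal element of the weighted hat matrix
            h = w * (i11 - 2.0 * i01 * xi + i00 * xi * xi) / det
            r = yi - p + h * (0.5 - p)   # penalised residual
            u0 += r
            u1 += r * xi
        # Scoring step: solve (information matrix) * step = modified score
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical separated data: all 5 exposed are cases, 1 of 5 unexposed is.
x = [1] * 5 + [0] * 5
y = [1] * 5 + [1] + [0] * 4

firth_b0, firth_b1 = fit_firth(x, y)
print("penalised log odds ratio:", round(firth_b1, 4))
print("penalised odds ratio:", round(math.exp(firth_b1), 2))
```

Ordinary maximum likelihood would push the slope toward infinity for these data, whereas the penalised estimate stays finite. For a saturated 2x2 table like this one, the penalised estimate equals the odds ratio obtained after adding 0.5 to every cell.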

Conditional Logistic Regression for Matched Data

Conditional logistic regression is used for matched case-control studies. In matched designs, using unconditional logistic regression with stratum (matched set) indicators is problematic because: (1) the number of parameters grows with the number of matched sets, and (2) coefficient estimates can be biased, especially with small strata.

Conditional logistic regression solves this by using a conditional likelihood (Eq 16.17) that eliminates the stratum-specific intercept parameters from the estimation. This produces unbiased estimates of the odds ratios for the predictors of interest without needing to estimate the matching parameters.
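As an illustration of the idea, consider 1:1 matched pairs with one exposure. Within a pair, the conditional likelihood contribution is exp(beta * x_case) / (exp(beta * x_case) + exp(beta * x_control)), which contains no intercept. The sketch below maximises that likelihood with a one-dimensional Newton iteration. The pair counts and function name are made up, and the formula is written in generic notation that may not match Eq 16.17 exactly.

```python
import math


def conditional_mle_pairs(pairs, n_iter=50, tol=1e-12):
    """Conditional ML estimate of beta for 1:1 matched case-control pairs.

    pairs: list of (x_case, x_control) exposure values, one tuple per set.
    """
    beta = 0.0
    for _ in range(n_iter):
        score = info = 0.0
        for x_case, x_control in pairs:
            diff = x_case - x_control
            # Conditional probability that the observed case is the case
            p = 1.0 / (1.0 + math.exp(-beta * diff))
            score += diff * (1.0 - p)
            info += diff * diff * p * (1.0 - p)
        if info == 0.0:
            raise ValueError("no within-set variation: beta is not estimable")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical study: 15 pairs with only the case exposed, 5 with only the
# control exposed, and 30 pairs where both members share the same exposure.
discordant = [(1, 0)] * 15 + [(0, 1)] * 5
same_exposure = [(1, 1)] * 20 + [(0, 0)] * 10

beta_all = conditional_mle_pairs(discordant + same_exposure)
beta_discordant = conditional_mle_pairs(discordant)
print("log odds ratio:", round(beta_all, 4))
print("odds ratio:", round(math.exp(beta_all), 3))
```

For a binary exposure the estimate reduces to the ratio of the two kinds of exposure-discordant pairs (15/5 here), and pairs in which case and control share the same exposure add nothing to the score, so both fits agree.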

Limitations of Conditional Logistic Regression

Conditional logistic regression has several limitations:

No intercept is estimated (it is conditioned out along with all stratum-specific effects)
Coefficients cannot be estimated for variables that are constant within matched sets (e.g., the matching factors themselves)
Only matched sets with variation in the outcome contribute to the likelihood (concordant sets are uninformative)
Predicted probabilities cannot be computed directly (since there is no intercept)

When to Use Which Approach

Standard logistic regression: Use when the sample size is adequate, events are not extremely rare, and data are not matched.
Exact logistic regression: Use when the sample is very small, data are severely unbalanced, or perfect prediction occurs.
Conditional logistic regression: Use for matched case-control studies where stratum-specific parameters would be problematic to estimate.

✎ Reflection

Think about a study design in your field that uses matching (e.g., matched case-control). Why would conditional logistic regression be more appropriate than unconditional logistic regression for analysing such data? What information would be lost by using the conditional approach?

Model answer: In a matched case-control study (e.g., 1:1 matching on age, sex, and admission date), each matched set is a stratum. Unconditional logistic regression that ignores the matched sets treats the observations as independent; this can give biased odds-ratio estimates, because matching forces the controls to resemble the cases on the matching factors and the analysis must account for that selection. Adding one indicator per matched set is not a remedy either: the number of parameters grows with the number of sets, and the estimates are biased when strata are small. Conditional logistic regression uses the within-stratum likelihood (conditioning on the matched-set total), effectively cancelling out any stratum-level effects, including the matching variables and any unmeasured stratum-level confounders. What's lost: you cannot estimate the effect of the matching variables themselves (they're absorbed into the strata), so if you wanted the main effect of age or sex on the outcome, conditional logistic regression is the wrong model. No intercept is estimated either, so predicted probabilities cannot be computed directly. Practical rule: match on what you don't want to estimate; analyse with conditional logistic regression.

HSCI 410 · Lesson 5

Introduction & The Logistic Model

Logistic Regression

From continuous outcomes to yes-or-no outcomes

Linear regression