Logistic Regression

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

Understand log odds as a measure of disease and how it relates to a linear combination of predictors
Build and interpret logistic regression models
Compute and interpret odds ratios derived from a logistic regression model
Understand how logistic regression fits in the family of generalised linear models
Evaluate logistic regression models using goodness-of-fit tests, ROC curves, and residual analysis
Fit conditional logistic regression models for matched data

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Section 3

Evaluating Logistic Regression Models

⏱ Estimated time: 25 minutes

Model-Building Process

The model-building process for logistic regression follows the same general principles as for linear regression (Chapter 15): develop a causal diagram, perform unconditional (univariable) analyses, evaluate linearity of continuous predictors on the logit scale, and use automated selection methods with caution. Subject matter knowledge should guide decisions at every step.

Covariate Patterns and Data Structure

A covariate pattern is a unique combination of predictor values. Whether the data are treated as binary (one observation per row) or binomial/grouped (multiple observations per covariate pattern) has implications for how residuals and goodness-of-fit statistics are computed and interpreted.

Sample Size Rule

A commonly used minimum sample size guideline for logistic regression is at least 10(k + 1) positive outcomes, where k is the number of predictors. For example, if you have 5 predictors, you need at least 10(5 + 1) = 60 positive outcomes (events). Having fewer events can lead to unreliable coefficient estimates and model instability.

Residuals

Pearson residuals and deviance residuals are used to assess model fit at the level of individual covariate patterns (Eq 16.16). Both types compare observed outcomes to predicted probabilities, but they differ in how discrepancies are scaled. These residuals are the building blocks of several goodness-of-fit tests.

Goodness-of-Fit Tests

χ²

Pearson χ²

Click to learn more

📊

Hosmer-Lemeshow

Click to learn more

📈

ROC Curve

Click to learn more

Predictive Ability

Concept	Definition	Also Known As
Sensitivity	Proportion of true positives correctly identified by the model	True positive rate
Specificity	Proportion of true negatives correctly identified by the model	True negative rate
Cutpoint	The predicted probability threshold above which subjects are classified as positive	Classification threshold

Selecting a cutpoint involves a trade-off between sensitivity and specificity. A lower cutpoint increases sensitivity but decreases specificity, and vice versa. The ROC curve provides a visual summary of this trade-off across all possible cutpoints.

Overdispersion

Apparent Overdispersion

Apparent overdispersion occurs when the Pearson χ² statistic is inflated, not because of true extra-binomial variation, but because there are many covariate patterns with very few observations each. This is especially common in binary data with continuous predictors. The Hosmer-Lemeshow test is more appropriate in this situation.

Real Overdispersion

Real overdispersion occurs when there is more variability in the data than the binomial model predicts. A common cause is clustering of observations—for example, patients within the same hospital may have correlated outcomes. Real overdispersion can be addressed by adjusting standard errors using a dispersion parameter or by using models that account for clustering (e.g., GEE, mixed models).

Detecting Overdispersion

Overdispersion can be detected when the ratio of the Pearson χ² (or deviance) to its degrees of freedom substantially exceeds 1. For grouped data, this ratio should be close to 1 if the model fits well. Values much greater than 1 suggest overdispersion, while values much less than 1 may suggest underdispersion or a model that is too complex.

Pseudo-R² and Influential Observations

Pseudo-R² measures (e.g., McFadden’s, Cox-Snell, Nagelkerke) provide an indication of how much of the variation in the outcome is explained by the model. They are analogues of R² in linear regression but are not directly comparable. Values tend to be lower for logistic regression than for linear regression.

Influential observations can be identified using several diagnostic measures: outliers (large residuals), leverage (unusual covariate patterns), delta-betas (influence on individual coefficients), delta-χ², and delta-deviance (influence on overall fit). These diagnostics help identify observations that disproportionately affect the model.

📊 Example: Birth Weight Model Evaluation

Returning to the low birth weight example, suppose the fitted model has an AUC of 0.623. This indicates limited predictive ability—the model does only marginally better than chance at discriminating between low and normal birth weight infants. This does not necessarily mean the model is useless for understanding risk factors; it simply means the included predictors explain only a small portion of the variation in birth weight outcomes.

✎ Reflection

Consider a logistic regression model you have encountered (or might build). How would you evaluate whether the model has adequate goodness of fit and predictive ability? Which diagnostics would be most important to check?

✓ Reflection saved!

● Complete the quiz and reflection to continue.

Section 4

GLMs, Exact & Conditional Logistic Regression

⏱ Estimated time: 20 minutes

Generalised Linear Models (GLMs)

Logistic regression is a member of the broader family of Generalised Linear Models (GLMs). A GLM is defined by two key components: (1) a link function that relates the expected value of the outcome to the linear combination of predictors, and (2) the distribution of the outcome variable.

Data Type	Distribution	Canonical Link	Example
Continuous	Gaussian (Normal)	Identity	Linear regression
Binary	Binomial	Logit	Logistic regression
Count	Poisson	Log	Poisson regression
Count (overdispersed)	Negative Binomial	Log	NB regression

The canonical link is the “natural” link function for each distribution. For binary data, the canonical link is the logit. Non-canonical links (e.g., probit, complementary log-log for binary data) can also be used. GLMs are estimated using maximum likelihood, often with iterative algorithms such as Newton-Raphson or iteratively reweighted least squares. Quasi-likelihood estimation can be used when the full distribution is not specified, requiring only the mean-variance relationship.

Exact Logistic Regression

Standard logistic regression relies on large-sample approximations. When the dataset is very small or severely unbalanced (e.g., very few events), these approximations may be poor, and ML estimates can be biased or fail to converge. Exact logistic regression uses conditional maximum likelihood to produce exact P-values and confidence intervals without relying on large-sample theory.

When to Use Exact Logistic Regression

Exact logistic regression is preferred when:

The sample size is very small
The data are severely unbalanced (very few events or non-events)
Perfect prediction occurs (a predictor perfectly separates outcomes, causing ML estimates to be infinite)
Standard ML estimation fails to converge

The trade-off is that exact methods are computationally intensive and may not be feasible for models with many predictors.

Conditional Logistic Regression for Matched Data

Conditional logistic regression is used for matched case-control studies. In matched designs, using unconditional logistic regression with stratum (matched set) indicators is problematic because: (1) the number of parameters grows with the number of matched sets, and (2) coefficient estimates can be biased, especially with small strata.

Conditional logistic regression solves this by using a conditional likelihood (Eq 16.17) that eliminates the stratum-specific intercept parameters from the estimation. This produces unbiased estimates of the odds ratios for the predictors of interest without needing to estimate the matching parameters.

Limitations of Conditional Logistic Regression

Conditional logistic regression has several limitations:

No intercept is estimated (it is conditioned out along with all stratum-specific effects)
Coefficients cannot be estimated for variables that are constant within matched sets (e.g., the matching factors themselves)
Only matched sets with variation in the outcome contribute to the likelihood (concordant sets are uninformative)
Predicted probabilities cannot be computed directly (since there is no intercept)

🔗

Link Function

Click to learn more

📊

Distribution

Click to learn more

📑

Conditional Likelihood

Click to learn more

When to Use Which Approach

Standard logistic regression: Use when the sample size is adequate, events are not extremely rare, and data are not matched.<br>Exact logistic regression: Use when the sample is very small, data are severely unbalanced, or perfect prediction occurs.<br>Conditional logistic regression: Use for matched case-control studies where stratum-specific parameters would be problematic to estimate.

✎ Reflection

Think about a study design in your field that uses matching (e.g., matched case-control). Why would conditional logistic regression be more appropriate than unconditional logistic regression for analysing such data? What information would be lost by using the conditional approach?

✓ Reflection saved!

● Complete the quiz and reflection to continue.

HSCI 410 — Lesson 4

Exploratory Data Analysis For Epidemiology

Logistic Regression

Learning objectives for this lesson:

Introduction & The Logistic Model

Why We Cannot Use Linear Regression for Dichotomous Outcomes

The Logistic Model

Odds and Odds Ratios

Maximum Likelihood Estimation (MLE)

✔ Check Your Understanding

✎ Reflection

Interpreting Coefficients & Assessing Confounding

Assumptions of Logistic Regression

Testing the Overall Model

Interpreting Coefficients

Assessing Confounding and Interaction

Assessing Confounding

Assessing Interaction (Effect Modification)

✔ Check Your Understanding

✎ Reflection

Evaluating Logistic Regression Models

Model-Building Process

Covariate Patterns and Data Structure

Residuals

Goodness-of-Fit Tests

Predictive Ability

Overdispersion

Pseudo-R² and Influential Observations

✔ Check Your Understanding

✎ Reflection

GLMs, Exact & Conditional Logistic Regression

Generalised Linear Models (GLMs)

Exact Logistic Regression

✔ Check Your Understanding

✎ Reflection

Lesson 4 — Final Assessment

✎ Final Reflection

✔ Final Assessment

🏆 Congratulations!