HSCI 410 — Lesson 4

Logistic Regression

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Understand log odds as a measure of disease and how it relates to a linear combination of predictors
  • Build and interpret logistic regression models
  • Compute and interpret odds ratios derived from a logistic regression model
  • Understand how logistic regression fits in the family of generalised linear models
  • Evaluate logistic regression models using goodness-of-fit tests, ROC curves, and residual analysis
  • Fit conditional logistic regression models for matched data

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Section 1

Introduction & The Logistic Model

⏱ Estimated time: 20 minutes

Why We Cannot Use Linear Regression for Dichotomous Outcomes

When the outcome variable is dichotomous (e.g., disease present/absent), ordinary linear regression is inappropriate for three fundamental reasons:

  1. Non-normal errors: The residuals from a linear model with a binary outcome follow a binomial distribution, not a normal distribution, violating a key assumption of linear regression.
  2. Heteroscedasticity: The variance of the residuals depends on the predicted probability, so the assumption of constant variance is violated.
  3. Predictions outside 0–1: A linear model can produce predicted values less than 0 or greater than 1, which are nonsensical for probabilities.

Key Concept

Logistic regression solves all three problems by modelling the log odds (logit) of the outcome rather than the probability directly. The logit transformation maps probabilities from the bounded range (0, 1) to the entire real number line (−∞, +∞), making it suitable for linear modelling.


The Logistic Model

The logistic regression model expresses the log odds of the outcome as a linear combination of predictors:

The Logit Model (Eq 16.2)
ln(p / (1 − p)) = β0 + β1x1 + β2x2 + … + βjxj

The inverse logit (or logistic function) converts back to the probability scale:

Probability from the Logistic Model (Eq 16.3)
p = 1 / (1 + e−(β0 + Σβjxj))

Note that, unlike linear regression, the logistic model has no separate error term: the equation describes the log odds exactly, and the randomness enters through the binomial distribution of the outcome.
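The logit and its inverse are easy to sketch in a few lines of Python (a minimal illustration; the function names are mine, not from any particular package):

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the log-odds scale (-inf, +inf)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Logistic function: map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))

# A probability of 0.5 corresponds to log odds of 0.
print(logit(0.5))
# Round trip: inv_logit(logit(p)) recovers p (up to floating point).
print(inv_logit(logit(0.2)))
```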

Odds and Odds Ratios

The odds of the outcome are p / (1 − p). The odds ratio for the kth predictor is obtained by exponentiating its coefficient:

Odds Ratio (Eq 16.6)
ORk = eβk

For a dichotomous predictor, this is the odds ratio comparing the group coded 1 to the group coded 0, adjusted for all other variables in the model.

Maximum Likelihood Estimation (MLE)

Unlike linear regression, which uses least squares, logistic regression uses maximum likelihood estimation (MLE). MLE is an iterative process that finds the parameter values most likely to have produced the observed data. The algorithm starts with initial estimates and refines them until convergence—when the change in the log-likelihood between iterations falls below a specified criterion.
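The iterative idea can be sketched for the simplest possible case, an intercept-only model, using Newton-Raphson on made-up data (a toy illustration, not a general-purpose fitter):

```python
import math

# Made-up binary outcomes (1 = event, 0 = no event) for illustration.
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
n = len(y)

# Newton-Raphson for an intercept-only logistic model:
# score U(b0) = sum(y) - n*p and information I(b0) = n*p*(1 - p).
b0 = 0.0
for _ in range(50):
    p = 1 / (1 + math.exp(-b0))
    step = (sum(y) - n * p) / (n * p * (1 - p))
    b0 += step
    # Stop when the update is negligible (software typically monitors
    # the change in the log-likelihood instead).
    if abs(step) < 1e-10:
        break

# With no predictors, the MLE equals the logit of the observed proportion.
print(round(b0, 4), round(math.log((3 / 10) / (7 / 10)), 4))
```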

📋 Example: Low Birth Weight Study

Consider a study of low birth weight (<2500 g) as the outcome. Predictors include the mother’s smoking status, race, and number of prenatal visits. Because the outcome is dichotomous (low birth weight: yes/no), logistic regression is appropriate.

The model would be: ln(p / (1 − p)) = β0 + β1(smoking) + β2(race) + β3(prenatal visits). From the fitted model, eβ1 gives the adjusted odds ratio for smoking, comparing smokers to non-smokers while holding race and prenatal visits constant.

✔ Check Your Understanding

1. What does the logit function transform?

The logit function takes a probability p (bounded between 0 and 1) and transforms it to ln(p/(1−p)), the log of the odds, which ranges from −∞ to +∞. This allows probabilities to be modelled as a linear function of predictors.

2. In a logistic regression, how is the odds ratio for a dichotomous predictor computed?

The odds ratio is obtained by exponentiating the logistic regression coefficient: OR = eβ. For a dichotomous predictor, this gives the ratio of the odds of the outcome in the exposed group versus the unexposed group, adjusted for other covariates.

3. Why is maximum likelihood estimation (MLE) used instead of least squares for logistic regression?

Least squares estimation assumes normally distributed errors, which is violated when the outcome is binary (errors follow a binomial distribution). MLE finds parameter values that maximise the probability of observing the data, making it appropriate for binary outcomes.

✎ Reflection

Think about a dichotomous health outcome in your field. What predictors would you include in a logistic regression model? Why is modelling on the logit scale preferable to modelling the probability directly?

Section 2

Interpreting Coefficients & Assessing Confounding

⏱ Estimated time: 20 minutes

Assumptions of Logistic Regression

Logistic regression requires two key assumptions: (1) independence of observations, and (2) linearity on the logit scale—that is, the relationship between each continuous predictor and the log odds of the outcome is linear. Note that the relationship on the probability scale will be non-linear (S-shaped).

Testing the Overall Model

The likelihood ratio test (LRT) compares the fitted model to the null model (intercept only). The test statistic is:

Likelihood Ratio Test — Full vs Null Model (Eq 16.9)
G² = 2(ln L − ln L0)

This statistic follows an approximate chi-squared distribution with degrees of freedom equal to the number of predictors. It can also be used to compare any two nested models (Eq 16.10)—a full model versus a reduced model—to test whether the excluded variables contribute significantly.

The Wald test divides the coefficient by its standard error; the resulting statistic follows an approximate standard normal (Z) distribution, and it is the test most commonly reported by software. However, Wald tests can be unreliable when the true probability is near 0 or 1, or when the sample size is small.
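Both tests reduce to simple arithmetic once the fitted log-likelihoods (or a coefficient and its standard error) are in hand. The numbers below are hypothetical; for 1 degree of freedom, the chi-squared tail probability can be computed exactly via the error function:

```python
import math

def chi2_sf_1df(g):
    """Upper-tail probability of a chi-squared(1) statistic (exact via erfc)."""
    return math.erfc(math.sqrt(g / 2))

# Hypothetical fitted log-likelihoods for nested models (1 extra parameter).
lnL_full, lnL_reduced = -110.2, -113.4
G2 = 2 * (lnL_full - lnL_reduced)
print(round(G2, 2), round(chi2_sf_1df(G2), 4))

# Wald test for a hypothetical coefficient and standard error.
beta, se = 0.69, 0.25
z = beta / se
p_wald = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
print(round(z, 2), round(p_wald, 4))
```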

⚠ Wald Test Limitations

The Wald test can be unreliable when the estimated probability is near the boundary (0 or 1), because the coefficient estimate and its standard error may be poor approximations. In such cases, the likelihood ratio test is preferred as it has better statistical properties.

Interpreting Coefficients

Dichotomous Predictors

For a dichotomous predictor (coded 0/1), the coefficient β represents the log odds ratio comparing the group coded 1 to the group coded 0, adjusted for all other variables. The odds ratio is simply OR = eβ. For example, if βsmoking = 0.69, then OR = e0.69 = 2.0, meaning the odds of the outcome are twice as high for smokers compared to non-smokers.

Continuous Predictors

For a continuous predictor, β represents the change in the log odds for each 1-unit increase in the predictor. The OR = eβ gives the multiplicative change in odds per unit increase. To compute the OR for any arbitrary change from x1 to x2:

OR for a Change of (x2 − x1) Units (Eq 16.13)
OR = eβ(x2 − x1)

For example, if βage = 0.04, the OR per 10-year increase in age is e^(0.04 × 10) = e^0.4 ≈ 1.49.
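As a quick sketch of this calculation (using the same coefficient as the example above):

```python
import math

def or_for_increment(beta, delta):
    """Odds ratio for a change of `delta` units in a continuous predictor."""
    return math.exp(beta * delta)

# beta_age = 0.04, evaluated for a 10-year increase in age.
print(round(or_for_increment(0.04, 10), 2))   # 1.49
```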

Categorical Predictors (Multiple Levels)

Categorical predictors with more than two levels are represented using indicator (dummy) variables. One category serves as the baseline/reference, and each coefficient represents the log OR comparing that category to the reference. To evaluate the overall significance of the categorical variable, use a multi-degree-of-freedom Wald test or an LRT comparing models with and without the entire set of indicator variables.

Intercept Interpretation

The intercept (β0) represents the logit of the probability of the outcome when all predictors equal zero. On the probability scale, this is: p = 1/(1 + e−β0). The intercept is often not substantively meaningful (e.g., if age = 0 is not a plausible value), but it is essential for computing predicted probabilities. Note that effects on the probability scale are non-linear—the same change in a predictor produces different changes in probability depending on the baseline values of all predictors.
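This non-linearity on the probability scale is easy to demonstrate numerically; the coefficients below are hypothetical:

```python
import math

def prob(logit_value):
    """Convert a value on the logit scale to a probability."""
    return 1 / (1 + math.exp(-logit_value))

beta0, beta1 = -3.0, 0.69   # hypothetical intercept and coefficient

# The same 1-unit change in x shifts the probability by different
# amounts depending on where you start on the S-shaped curve.
low = prob(beta0 + beta1 * 1) - prob(beta0 + beta1 * 0)
mid = prob(beta0 + beta1 * 5) - prob(beta0 + beta1 * 4)
print(round(low, 3), round(mid, 3))
```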

Assessing Confounding and Interaction

Assessing Confounding

To assess whether a variable is a confounder, add it to the model and check whether the coefficient of the primary predictor of interest changes substantially. A common rule of thumb is a change of more than 10–20% in the coefficient (or OR). If the coefficient changes meaningfully, the variable should be retained as a confounder regardless of its statistical significance.
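A minimal sketch of this rule of thumb, with hypothetical crude and adjusted coefficients:

```python
crude_beta = 0.85       # hypothetical coefficient without the covariate
adjusted_beta = 0.60    # hypothetical coefficient after adding it

# Percent change in the coefficient of the primary predictor.
pct_change = abs(crude_beta - adjusted_beta) / crude_beta * 100
# A change beyond roughly 10-20% suggests meaningful confounding.
print(round(pct_change, 1), pct_change > 20)
```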

Assessing Interaction (Effect Modification)

Interaction is assessed by adding cross-product terms (e.g., x1 × x2) to the model. When an interaction is present, the odds ratio for one variable varies depending on the level of the interacting variable. For example, if smoking interacts with sex, the OR for smoking would differ between males and females. Test the interaction term using an LRT or Wald test. If significant, the main effects alone are insufficient to describe the relationship.
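A small numerical sketch, with hypothetical coefficients for a smoking × sex model:

```python
import math

# Hypothetical coefficients from a model with a smoking x sex interaction
# (sex coded 1 = male, 0 = female).
b_smoking, b_sex, b_interaction = 0.5, 0.2, 0.4

# With an interaction present, the OR for smoking depends on sex.
or_smoking_female = math.exp(b_smoking)                  # sex = 0
or_smoking_male = math.exp(b_smoking + b_interaction)    # sex = 1
print(round(or_smoking_female, 2), round(or_smoking_male, 2))
```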

✔ Check Your Understanding

1. The likelihood ratio test (LRT) compares models by:

The LRT statistic is G² = 2(ln Lfull − ln Lreduced), which follows a chi-squared distribution. It tests whether the additional parameters in the full model significantly improve the fit compared to the reduced model.

2. For a continuous predictor, what does the odds ratio represent?

For a continuous predictor, OR = eβ represents the factor by which the odds are multiplied for each 1-unit increase in the predictor, holding all other variables constant. To compute the OR for a larger change of (x2 − x1) units, raise the OR to that power: OR^(x2 − x1) = eβ(x2 − x1).

3. When assessing confounding in logistic regression, you should:

Confounding is assessed by adding the potential confounder to the model and checking whether the coefficient of the primary predictor changes substantially (often a threshold of 10–20%). If it does, the variable should be retained as a confounder regardless of its own P-value.

✎ Reflection

Imagine you are fitting a logistic regression model for a health outcome. How would you decide whether to report odds ratios per 1-unit increase or per a larger clinically meaningful increment for continuous predictors? Why does this matter for interpretation?

Section 3

Evaluating Logistic Regression Models

⏱ Estimated time: 25 minutes

Model-Building Process

The model-building process for logistic regression follows the same general principles as for linear regression (Chapter 15): develop a causal diagram, perform unconditional (univariable) analyses, evaluate linearity of continuous predictors on the logit scale, and use automated selection methods with caution. Subject matter knowledge should guide decisions at every step.

Covariate Patterns and Data Structure

A covariate pattern is a unique combination of predictor values. Whether the data are treated as binary (one observation per row) or binomial/grouped (multiple observations per covariate pattern) has implications for how residuals and goodness-of-fit statistics are computed and interpreted.

Sample Size Rule

A commonly used minimum sample size guideline for logistic regression is at least 10(k + 1) positive outcomes, where k is the number of predictors. For example, if you have 5 predictors, you need at least 10(5 + 1) = 60 positive outcomes (events). Having fewer events can lead to unreliable coefficient estimates and model instability.

Residuals

Pearson residuals and deviance residuals are used to assess model fit at the level of individual covariate patterns (Eq 16.16). Both types compare observed outcomes to predicted probabilities, but they differ in how discrepancies are scaled. These residuals are the building blocks of several goodness-of-fit tests.
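Both residuals can be computed directly from the observed events and predicted probability for a covariate pattern; the snippet below follows the standard definitions (the numbers are hypothetical):

```python
import math

def pearson_residual(y, n, p):
    """Pearson residual for a covariate pattern: y observed events in n trials."""
    return (y - n * p) / math.sqrt(n * p * (1 - p))

def deviance_residual(y, n, p):
    """Deviance residual for the same covariate pattern (0*ln(0) taken as 0)."""
    term = 0.0
    if y > 0:
        term += y * math.log(y / (n * p))
    if y < n:
        term += (n - y) * math.log((n - y) / (n * (1 - p)))
    return math.copysign(math.sqrt(2 * term), y - n * p)

# A pattern with 4 events in 20 trials where the model predicts p = 0.10.
print(round(pearson_residual(4, 20, 0.10), 3))
print(round(deviance_residual(4, 20, 0.10), 3))
```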

Goodness-of-Fit Tests

Three complementary tools are commonly used:

  • Pearson χ² — compares observed and expected frequencies across covariate patterns; requires a reasonable number of observations per pattern.
  • Hosmer-Lemeshow — groups observations by predicted probability (typically into deciles) and compares observed and expected outcomes within each group.
  • ROC curve — plots sensitivity against 1 − specificity across all possible cutpoints; the area under the curve (AUC) summarises discriminative ability.

Predictive Ability

Concept | Definition | Also known as
Sensitivity | Proportion of true positives correctly identified by the model | True positive rate
Specificity | Proportion of true negatives correctly identified by the model | True negative rate
Cutpoint | The predicted probability threshold above which subjects are classified as positive | Classification threshold

Selecting a cutpoint involves a trade-off between sensitivity and specificity. A lower cutpoint increases sensitivity but decreases specificity, and vice versa. The ROC curve provides a visual summary of this trade-off across all possible cutpoints.
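The trade-off, and the rank-based interpretation of the AUC, can be illustrated with a handful of hypothetical predictions:

```python
# Hypothetical predicted probabilities and true outcomes.
preds = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
truth = [1,   1,   0,   1,   0,   1,   0,   0]

def sens_spec(cutpoint):
    """Sensitivity and specificity when classifying at the given cutpoint."""
    tp = sum(1 for p, y in zip(preds, truth) if p >= cutpoint and y == 1)
    fn = sum(1 for p, y in zip(preds, truth) if p < cutpoint and y == 1)
    tn = sum(1 for p, y in zip(preds, truth) if p < cutpoint and y == 0)
    fp = sum(1 for p, y in zip(preds, truth) if p >= cutpoint and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Lowering the cutpoint raises sensitivity at the cost of specificity.
print(sens_spec(0.5))
print(sens_spec(0.25))

# AUC equals the probability that a randomly chosen positive is ranked
# above a randomly chosen negative.
pos = [p for p, y in zip(preds, truth) if y == 1]
neg = [p for p, y in zip(preds, truth) if y == 0]
auc = sum(1 for p1 in pos for p0 in neg if p1 > p0) / (len(pos) * len(neg))
print(auc)
```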

Overdispersion

Apparent Overdispersion

Apparent overdispersion occurs when the Pearson χ² statistic is inflated, not because of true extra-binomial variation, but because there are many covariate patterns with very few observations each. This is especially common in binary data with continuous predictors. The Hosmer-Lemeshow test is more appropriate in this situation.

Real Overdispersion

Real overdispersion occurs when there is more variability in the data than the binomial model predicts. A common cause is clustering of observations—for example, patients within the same hospital may have correlated outcomes. Real overdispersion can be addressed by adjusting standard errors using a dispersion parameter or by using models that account for clustering (e.g., GEE, mixed models).

Detecting Overdispersion

Overdispersion can be detected when the ratio of the Pearson χ² (or deviance) to its degrees of freedom substantially exceeds 1. For grouped data, this ratio should be close to 1 if the model fits well. Values much greater than 1 suggest overdispersion, while values much less than 1 may suggest underdispersion or a model that is too complex.

Pseudo-R² and Influential Observations

Pseudo-R² measures (e.g., McFadden’s, Cox-Snell, Nagelkerke) provide an indication of how much of the variation in the outcome is explained by the model. They are analogues of R² in linear regression but are not directly comparable. Values tend to be lower for logistic regression than for linear regression.
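McFadden's version is the simplest to compute; the log-likelihoods below are hypothetical:

```python
# Hypothetical log-likelihoods for the fitted and null (intercept-only) models.
lnL_model, lnL_null = -100.0, -120.0

# McFadden's pseudo-R²: 1 minus the ratio of log-likelihoods.
mcfadden_r2 = 1 - lnL_model / lnL_null
print(round(mcfadden_r2, 3))   # 0.167
```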

Influential observations can be identified using several diagnostic measures: outliers (large residuals), leverage (unusual covariate patterns), delta-betas (influence on individual coefficients), delta-χ², and delta-deviance (influence on overall fit). These diagnostics help identify observations that disproportionately affect the model.

📊 Example: Birth Weight Model Evaluation

Returning to the low birth weight example, suppose the fitted model has an AUC of 0.623. This indicates limited predictive ability—the model does only marginally better than chance at discriminating between low and normal birth weight infants. This does not necessarily mean the model is useless for understanding risk factors; it simply means the included predictors explain only a small portion of the variation in birth weight outcomes.

✔ Check Your Understanding

1. What does the Hosmer-Lemeshow test evaluate?

The Hosmer-Lemeshow test groups observations by predicted probability (typically into deciles) and compares observed and expected outcomes within each group. A non-significant P-value suggests the model fits the data adequately.

2. What does an ROC curve AUC of 0.5 indicate?

An AUC of 0.5 corresponds to the 45-degree diagonal on the ROC curve, meaning the model performs no better than random classification. An AUC of 1.0 would indicate perfect discrimination.

3. The minimum sample size rule for logistic regression suggests:

The rule of thumb is to have at least 10(k + 1) positive outcomes (events), where k is the number of predictors. This ensures stable coefficient estimates and reliable model performance. Having fewer events risks overfitting and biased estimates.

✎ Reflection

Consider a logistic regression model you have encountered (or might build). How would you evaluate whether the model has adequate goodness of fit and predictive ability? Which diagnostics would be most important to check?

Section 4

GLMs, Exact & Conditional Logistic Regression

⏱ Estimated time: 20 minutes

Generalised Linear Models (GLMs)

Logistic regression is a member of the broader family of Generalised Linear Models (GLMs). A GLM is defined by two key components: (1) a link function that relates the expected value of the outcome to the linear combination of predictors, and (2) the distribution of the outcome variable.

Data type | Distribution | Canonical link | Example
Continuous | Gaussian (Normal) | Identity | Linear regression
Binary | Binomial | Logit | Logistic regression
Count | Poisson | Log | Poisson regression
Count (overdispersed) | Negative binomial | Log | Negative binomial regression

The canonical link is the “natural” link function for each distribution. For binary data, the canonical link is the logit. Non-canonical links (e.g., probit, complementary log-log for binary data) can also be used. GLMs are estimated using maximum likelihood, often with iterative algorithms such as Newton-Raphson or iteratively reweighted least squares. Quasi-likelihood estimation can be used when the full distribution is not specified, requiring only the mean-variance relationship.
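The three binary-data links mentioned above differ in their inverse link, the function mapping the linear predictor to a probability; a minimal sketch:

```python
import math

# Inverse link functions for binary-data GLMs: each maps the linear
# predictor eta to a probability in (0, 1).
def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def inv_probit(eta):
    # Standard normal CDF, computed via the error function.
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))

def inv_cloglog(eta):
    # Complementary log-log link: ln(-ln(1 - p)) = eta.
    return 1 - math.exp(-math.exp(eta))

# All three are valid choices, but only the logit is the canonical link
# for the binomial distribution.
for f in (inv_logit, inv_probit, inv_cloglog):
    print(f.__name__, round(f(0.0), 3))
```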

Exact Logistic Regression

Standard logistic regression relies on large-sample approximations. When the dataset is very small or severely unbalanced (e.g., very few events), these approximations may be poor, and ML estimates can be biased or fail to converge. Exact logistic regression uses conditional maximum likelihood to produce exact P-values and confidence intervals without relying on large-sample theory.

When to Use Exact Logistic Regression

Exact logistic regression is preferred when:

  • The sample size is very small
  • The data are severely unbalanced (very few events or non-events)
  • Perfect prediction occurs (a predictor perfectly separates outcomes, causing ML estimates to be infinite)
  • Standard ML estimation fails to converge

The trade-off is that exact methods are computationally intensive and may not be feasible for models with many predictors.

Conditional Logistic Regression for Matched Data

Conditional logistic regression is used for matched case-control studies. In matched designs, using unconditional logistic regression with stratum (matched set) indicators is problematic because: (1) the number of parameters grows with the number of matched sets, and (2) coefficient estimates can be biased, especially with small strata.

Conditional logistic regression solves this by using a conditional likelihood (Eq 16.17) that eliminates the stratum-specific intercept parameters from the estimation. This produces unbiased estimates of the odds ratios for the predictors of interest without needing to estimate the matching parameters.
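For the special case of 1:1 matching with a binary exposure, the conditional likelihood takes a simple form and even has a closed-form MLE; the pair counts below are hypothetical:

```python
import math

# Hypothetical 1:1 matched case-control data with a binary exposure.
# Each pair records (case exposed?, control exposed?).
pairs = [(1, 0)] * 15 + [(0, 1)] * 5 + [(1, 1)] * 8 + [(0, 0)] * 12

def cond_log_lik(beta):
    """Conditional log-likelihood; concordant pairs contribute nothing."""
    ll = 0.0
    for x_case, x_ctrl in pairs:
        if x_case != x_ctrl:   # only discordant pairs are informative
            ll += beta * x_case - math.log(
                math.exp(beta * x_case) + math.exp(beta * x_ctrl))
    return ll

# Closed-form conditional MLE for this design:
# OR = (case-exposed discordant pairs) / (control-exposed discordant pairs).
n10 = sum(1 for c, k in pairs if c == 1 and k == 0)
n01 = sum(1 for c, k in pairs if c == 0 and k == 1)
beta_hat = math.log(n10 / n01)
print(round(math.exp(beta_hat), 2))   # OR = 15/5 = 3.0

# Check: the closed form maximises the conditional log-likelihood.
print(cond_log_lik(beta_hat) > cond_log_lik(beta_hat + 0.1))
print(cond_log_lik(beta_hat) > cond_log_lik(beta_hat - 0.1))
```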

Limitations of Conditional Logistic Regression

Conditional logistic regression has several limitations:

  • No intercept is estimated (it is conditioned out along with all stratum-specific effects)
  • Coefficients cannot be estimated for variables that are constant within matched sets (e.g., the matching factors themselves)
  • Only matched sets with variation in the outcome contribute to the likelihood (concordant sets are uninformative)
  • Predicted probabilities cannot be computed directly (since there is no intercept)
When to Use Which Approach

Standard logistic regression: Use when the sample size is adequate, events are not extremely rare, and data are not matched.
Exact logistic regression: Use when the sample is very small, data are severely unbalanced, or perfect prediction occurs.
Conditional logistic regression: Use for matched case-control studies where stratum-specific parameters would be problematic to estimate.

✔ Check Your Understanding

1. In the GLM framework, what are the two key components that must be specified?

A GLM requires specification of two components: (1) the link function, which relates the expected value of the outcome to the linear predictor, and (2) the distribution of the outcome variable (e.g., Gaussian, Binomial, Poisson).

2. When is exact logistic regression preferred over standard logistic regression?

Exact logistic regression is preferred when the sample is very small, the data are severely unbalanced (very few events), or when perfect prediction occurs. In these situations, standard ML estimates may be biased or fail to converge because large-sample approximations are unreliable.

3. In conditional logistic regression for matched data, why is the intercept not estimated?

The conditional likelihood conditions on the number of cases within each matched set, which eliminates the stratum-specific intercept parameters. This avoids the incidental parameter problem and produces unbiased OR estimates, but means predicted probabilities cannot be computed directly.

✎ Reflection

Think about a study design in your field that uses matching (e.g., matched case-control). Why would conditional logistic regression be more appropriate than unconditional logistic regression for analysing such data? What information would be lost by using the conditional approach?

Final Assessment

Lesson 4 — Final Assessment

15 questions • 100% required to pass

This assessment covers all sections of Lesson 4. You must answer all 15 questions correctly to complete the lesson. Read each question carefully and review the feedback for any incorrect answers before retrying.

✎ Final Reflection

Now that you have completed all four sections, summarise the key concepts of logistic regression. How does it differ from linear regression, what are the main tools for evaluating model fit, and when would you choose conditional or exact logistic regression over the standard approach?


✔ Final Assessment

1. Why can’t linear regression be used for dichotomous outcomes?

Linear regression is inappropriate for binary outcomes because: (1) errors follow a binomial rather than normal distribution, (2) the variance of errors depends on the predicted probability (heteroscedasticity), and (3) predicted values can fall outside the meaningful 0–1 range.

2. The logit function transforms probability to:

The logit function computes ln(p / (1 − p)), which is the natural logarithm of the odds. This transforms the bounded probability (0, 1) to an unbounded scale suitable for linear modelling.

3. In logistic regression, OR = eβ1 represents:

Exponentiating the coefficient β1 gives the odds ratio associated with a 1-unit increase in X1, holding all other predictors constant. For dichotomous predictors, this is the OR comparing the two groups.

4. Maximum likelihood estimation finds parameter values that:

MLE finds the parameter values that maximise the likelihood function—the probability of observing the data given the model. This is an iterative process that converges when the change in the log-likelihood between iterations falls below a specified threshold.

5. The LRT statistic G² follows approximately a:

The likelihood ratio test statistic G² = 2(ln Lfull − ln Lreduced) follows an approximate chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models.

6. Why is the Wald test sometimes unreliable?

The Wald test divides the coefficient by its standard error. When the true parameter is near a boundary (e.g., probability near 0 or 1), both the coefficient estimate and its SE can be poor approximations, making the Wald test unreliable. The likelihood ratio test is preferred in such cases.

7. For a continuous predictor, the OR for a change from x1 to x2 is:

For a continuous predictor, the OR for a change of (x2 − x1) units is eβ(x2 − x1), which equals (eβ)^(x2 − x1) = OR^(x2 − x1). This allows you to compute the OR for any meaningful increment, not just a 1-unit change.

8. Coefficients for categorical predictors represent effects compared to:

When a categorical variable is represented using indicator (dummy) variables, one category serves as the baseline or reference. Each coefficient represents the log OR comparing that specific category to the reference category, adjusted for other variables in the model.

9. The intercept in a logistic model represents:

The intercept β0 is the value of the logit (log odds) when all predictor variables are set to zero. On the probability scale, this translates to p = 1/(1 + e−β0). It may not always have a meaningful interpretation if zero is not a plausible value for all predictors.

10. The Pearson χ² goodness-of-fit test requires:

The Pearson χ² test compares observed and expected frequencies across covariate patterns. It requires a reasonable number of observations per covariate pattern to produce a reliable test statistic. When there are many patterns with few observations (common with continuous predictors), the Hosmer-Lemeshow test is preferred.

11. An ROC curve that closely follows the 45° diagonal indicates:

The 45-degree diagonal on an ROC curve represents random classification (AUC = 0.5). A curve that closely follows this diagonal means the model cannot discriminate between positive and negative outcomes any better than chance.

12. Overdispersion in logistic regression can be caused by:

Real overdispersion occurs when there is more variability than the binomial model predicts. A common cause is clustering—observations within groups (e.g., patients in the same hospital) may have correlated outcomes, leading to extra-binomial variation.

13. In the GLM framework, the canonical link for binary data is:

The canonical link for the binomial distribution (binary data) is the logit function: g(μ) = ln(μ/(1 − μ)). Other links such as probit or complementary log-log can also be used but are non-canonical.

14. Conditional logistic regression is used for:

Conditional logistic regression is designed for matched case-control studies. It uses a conditional likelihood that eliminates stratum-specific parameters, avoiding the bias that can occur when unconditional logistic regression is used with many matched strata.

15. A limitation of conditional logistic regression is:

Because conditional logistic regression conditions on the matched sets, any variable that is constant within all matched sets (including the matching factors themselves) cannot have its coefficient estimated. Only variables that vary within at least some matched sets contribute to the conditional likelihood.

🏆 Congratulations!

You have successfully completed Lesson 4: Logistic Regression.
