Modelling Count and Rate Data

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

Distinguish among simple counts, rates with person-time denominators, population rates, and area-based counts
Describe the Poisson distribution and its mean=variance property
Specify and interpret a Poisson regression model including the offset term
Interpret incidence rate ratios (IR) from exponentiated Poisson coefficients
Evaluate Poisson models using Pearson, deviance, and Anscombe residuals
Distinguish apparent from real overdispersion and apply appropriate corrections
Compare negative binomial regression models (NB-1, NB-2) to Poisson regression
Apply zero-inflated, hurdle, and zero-truncated models to handle excess zeros

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Section 3

Evaluating Poisson Models & Overdispersion

⏱ Estimated time: 20 minutes

Residuals for Poisson Models

Just as in linear regression, residuals are the primary tool for evaluating how well a Poisson model fits the observed data. However, because the variance of a Poisson variable depends on its mean, raw residuals (observed − expected) are not directly comparable across observations. Several types of standardised residuals have been developed:

Pearson Residuals

Pearson residuals standardise the raw residual by dividing by the square root of the expected value:

Pearson Residual (Eq 18.6)

r_P = (y − μ̂) / √μ̂

This accounts for the Poisson assumption that Var(Y) = μ. If the model fits well, Pearson residuals should have approximately mean 0 and variance 1. The sum of squared Pearson residuals follows an approximate χ² distribution and can be used as an overall goodness-of-fit test.

Deviance Residuals

Deviance residuals are based on the contribution of each observation to the overall model deviance (the log-likelihood ratio comparing the fitted model to a saturated model). They are defined as:

d_i = sign(y_i − μ̂_i) × √[2(y_i ln(y_i/μ̂_i) − (y_i − μ̂_i))]

Deviance residuals tend to be more normally distributed than Pearson residuals, especially when some expected counts are small. This makes them preferable for normal probability plots and other diagnostic displays.

Anscombe Residuals

Anscombe residuals use a transformation of the observed counts designed to make the residuals as close to normally distributed as possible. They apply a cube-root transformation to both the observed and expected values. Anscombe residuals are particularly useful when checking the normality assumption of residuals in Poisson models, and they complement Pearson and deviance residuals in a thorough model evaluation.

Goodness of Fit

The overall fit of a Poisson model can be assessed using the sum of squared Pearson residuals, which approximately follows a χ² distribution with (n − p) degrees of freedom, where n is the number of observations and p is the number of estimated parameters. A significant test statistic suggests the model does not fit the data adequately.

An important diagnostic is the dispersion parameter, estimated as the sum of squared Pearson residuals divided by the residual degrees of freedom:

Dispersion Parameter Estimate

φ̂ = Σr_P² / (n − p)

Under the Poisson assumption (mean = variance), φ should equal 1. Values substantially greater than 1 indicate overdispersion; values less than 1 indicate underdispersion.

Understanding Overdispersion

Overdispersion—when the observed variance exceeds the Poisson-assumed variance—is one of the most common problems in count data modelling. It is critical to distinguish between two types:

Warning: Interpreting Overdispersion

Before concluding that overdispersion is “real,” always investigate whether the model is correctly specified. Adding missing predictors, removing outliers, or modelling non-linear effects may resolve apparent overdispersion without needing to change the distributional assumptions. Applying overdispersion corrections to a misspecified model can mask important features of the data.

Apparent Overdispersion

Apparent overdispersion arises from problems with the model rather than the data-generating process itself. Common causes include:

Outliers: A few extreme observations can inflate the dispersion statistic dramatically.
Missing important predictors: If key covariates are omitted from the model, the unexplained variation appears as overdispersion.
Wrong model form: Using a linear predictor when the true relationship is non-linear.
Non-linear effects: Failing to include quadratic or other polynomial terms for predictors with curvilinear relationships.

Apparent overdispersion can be resolved by correcting the model specification—removing outliers, adding missing predictors, or using the correct functional form.

Real Overdispersion

Real overdispersion reflects genuine extra-Poisson variation in the data that cannot be explained by observable covariates. This often arises from:

Unobserved heterogeneity: Subject-level variation in the underlying rate that is not captured by measured predictors.
Clustering: Events within groups (e.g., animals within herds) are correlated, violating the independence assumption.
Biological variability: Inherent variation in susceptibility or exposure that exceeds what the Poisson model allows.

Real overdispersion requires statistical corrections such as scaling standard errors, using negative binomial regression, or employing random effects models.

Approaches to Handling Overdispersion

Approach	How It Works	When to Use
Scale SEs by √φ	Multiplies standard errors by the square root of the estimated dispersion parameter; coefficients unchanged	Mild to moderate overdispersion; quick fix when coefficient estimates are trusted
Negative binomial regression	Adds an extra parameter (α) to model the excess variance explicitly	Moderate to severe overdispersion; when a more principled model is desired
Random effects / GLMM	Includes subject- or group-level random intercepts to capture unobserved heterogeneity	Clustered data (e.g., animals within herds); hierarchical study designs
GEE (robust SEs)	Uses generalised estimating equations with an empirical (sandwich) variance estimator	Clustered data when marginal (population-averaged) estimates are of primary interest

Reflection

Why is it important to distinguish between apparent and real overdispersion before choosing a correction strategy? What could go wrong if you apply the wrong fix?

Reflection saved!

* Complete the quiz and reflection to continue.

Section 4

Negative Binomial & Zero-Adjusted Models

⏱ Estimated time: 20 minutes

The Negative Binomial Distribution

The negative binomial (NB) distribution extends the Poisson by adding an extra parameter α that captures the additional variation not accounted for by the Poisson assumption. Conceptually, the NB distribution arises when the Poisson rate itself varies randomly across individuals—each subject has their own λ, drawn from a Gamma distribution.

The NB distribution allows the variance to exceed the mean, making it the natural first choice when overdispersion is present. Two common parameterisations define how the variance relates to the mean:

NB-1: Linear Variance

NB-1 Variance Function (Eq 18.8)

Var(Y) = μ + αμ = μ(1 + α)

In the NB-1 parameterisation, the variance increases linearly with the mean. The overdispersion is proportional to the mean: doubling the expected count doubles the excess variance. The ratio Var(Y)/μ = (1 + α) is constant across all observations, making NB-1 similar to a quasi-Poisson model with a fixed dispersion parameter.

NB-1 is sometimes preferred when overdispersion is relatively constant across the range of predicted values. However, it is less commonly used in practice than NB-2.

NB-2: Quadratic Variance

NB-2 Variance Function (Eq 18.9)

Var(Y) = μ + αμ²

In the NB-2 parameterisation (the most commonly used form), the variance increases quadratically with the mean. Observations with higher expected counts have proportionally more overdispersion. This is often more realistic in biological settings where variability tends to grow faster than the average.

The NB-2 model is the default in most statistical software (e.g., Stata’s nbreg, R’s glm.nb()). When α = 0, the NB-2 model reduces to the Poisson model, making the Poisson a special (nested) case of NB-2.

Negative Binomial Regression

The NB regression model uses the same log-linear form as Poisson regression—the only difference is in the assumed distribution of the outcome:

NB Regression Model

ln(E(Y)) = ln(n) + β₀ + β₁X₁ + β₂X₂ + …

Coefficients are interpreted identically to Poisson regression: e^β gives the incidence rate ratio. The key advantage is that the NB model produces correct standard errors even when overdispersion is present, because the extra variation is explicitly modelled through α.

Testing Poisson vs. Negative Binomial

Since the Poisson model is nested within the NB model (when α = 0), a likelihood ratio test (LRT) can be used to determine whether the NB model provides a significantly better fit. A significant LRT indicates that overdispersion is present and the NB model is preferred. Note that this is a boundary test (testing α = 0 vs. α > 0), so the p-value from the standard χ² reference distribution is conservative.

Zero-Adjusted Models

Standard count models (Poisson and NB) may not adequately handle datasets with an unusual number of zeros. Three families of models have been developed to address different zero-related problems:

🔠

Zero-Inflated Models

Click to learn more

🏃

Hurdle Models

Click to learn more

⛔

Zero-Truncated Models

Click to learn more

Choosing Among Zero-Adjusted Models

The choice depends on the data-generating process:

If some zeros are “structural” (from a fundamentally different process) and others arise from the count process, use a zero-inflated model.
If the zero/non-zero distinction is a separate decision from the magnitude of the count, use a hurdle model.
If zeros are impossible by design, use a zero-truncated model.

Model	Source of Zeros	Key Feature	Test / Comparison
Zero-Inflated	Both components (structural + count)	Mixture of logistic + count model	Vuong test vs. standard model
Hurdle	Binary component only	Two-part: binary then truncated count	LRT or AIC/BIC comparison
Zero-Truncated	Zeros cannot occur	Conditional on Y > 0	Applied when sampling excludes zeros

Reflection

When might you choose a hurdle model over a zero-inflated model in practice? Think of an epidemiological example where the distinction matters.

Reflection saved!

* Complete the quiz and reflection to continue.

HSCI 410 — Lesson 6

Exploratory Data Analysis For Epidemiology

Modelling Count and Rate Data

Learning objectives for this lesson:

Introduction & The Poisson Distribution

Why Model Count and Rate Data?

Types of Count and Rate Data

The Poisson Distribution

Section 1 Knowledge Check

Reflection

Poisson Regression Model & Interpretation

The Expected Count

The Log-Linear Model

The Offset Term

Modelling Counts (Without Offset)

Modelling Rates (With Offset)

Interpreting Poisson Regression Coefficients

Poisson Regression for Relative Risk Estimation

Section 2 Knowledge Check

Reflection

Evaluating Poisson Models & Overdispersion

Residuals for Poisson Models

Goodness of Fit

Understanding Overdispersion

Apparent Overdispersion

Real Overdispersion

Approaches to Handling Overdispersion

Section 3 Knowledge Check

Reflection

Negative Binomial & Zero-Adjusted Models

The Negative Binomial Distribution

NB-1: Linear Variance

NB-2: Quadratic Variance

Negative Binomial Regression

Zero-Adjusted Models

Choosing Among Zero-Adjusted Models

Section 4 Knowledge Check

Reflection

Lesson 6 — Comprehensive Assessment

Final Reflection

Final Assessment (15 Questions)

Lesson 6 Complete!