Modelling Count and Rate Data

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Distinguish among simple counts, rates with person-time denominators, population rates, and area-based counts
Describe the Poisson distribution and its mean=variance property
Specify and interpret a Poisson regression model including the offset term
Interpret incidence rate ratios (IRR) from exponentiated Poisson coefficients
Evaluate Poisson models using Pearson, deviance, and Anscombe residuals
Distinguish apparent from real overdispersion and apply appropriate corrections
Compare negative binomial regression models (NB-1, NB-2) to Poisson regression
Apply zero-inflated, hurdle, and zero-truncated models to handle excess zeros

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Count Data Non-negative integer outcomes (0, 1, 2, …) representing the number of events. Examples: cases of disease, hospital visits, deaths.

Rate Data Counts standardised by exposure or population time-at-risk (e.g., cases per 1000 person-years). Modelled as a count outcome with log(person-time) as an offset.

Exposure / Person-Time The denominator in a rate: how much “at-risk time” produced the counted events (e.g., person-years, animal-months). Enters the model as an offset.

Offset A predictor in a Poisson/NB model whose coefficient is fixed at 1 (typically log(person-time)). Converts a count regression into a rate regression.

Overdispersion Observed variance exceeds the mean, violating the Poisson mean = variance assumption. Inflates Type I error if ignored; addressed with negative binomial or quasi-Poisson.

Dispersion Parameter (φ) A parameter capturing the ratio of variance to mean in count models. φ = 1 indicates Poisson; φ > 1 indicates overdispersion.

Zero Inflation More zeros than the Poisson or NB distribution predicts. Suggests two underlying processes, one generating structural zeros and one generating counts.

Incidence Rate The number of new cases per unit person-time (e.g., per 1000 person-years). The natural quantity estimated by Poisson models with an offset.

Methods & Statistical Concepts

Poisson Distribution A discrete probability distribution for counts of rare independent events with a constant rate. Mean equals variance (λ).

Poisson Regression A GLM with Poisson family and log link: log(λ) = Xβ. Used for count or rate outcomes when the Poisson assumption holds.

Log Link Function The canonical link for the Poisson GLM and the usual choice for other count models: log(λ) = Xβ. Coefficients are exponentiated to give rate ratios (multiplicative effects).

IRR (Incidence Rate Ratio) The exponentiated coefficient from a Poisson or NB rate model: IRR = exp(β). Interpreted as the ratio of incidence rates per unit increase in the predictor.

Negative Binomial Regression A GLM that allows variance to exceed the mean by adding a dispersion parameter (k or θ). Standard remedy for overdispersed counts.

Quasi-Poisson A pragmatic fix for overdispersion that scales standard errors by √φ. Keeps the Poisson point estimates but widens the standard errors; doesn't yield a true likelihood.

Zero-Inflated Poisson (ZIP) A two-component model: a logistic regression for “structural zero” vs. “at-risk” status, and a Poisson for counts among the at-risk. Handles excess zeros.

Zero-Inflated Negative Binomial (ZINB) Like ZIP but with a negative binomial count component. Handles both excess zeros and overdispersion.

Hurdle Model A two-part model: a binary process for any-vs-no events, and a truncated count process for those with at least one event. Conceptually distinct from zero-inflation.

Vuong Test A test for comparing non-nested models (e.g., Poisson vs. zero-inflated Poisson). Evaluates which fits the observed data better.

Deviance / Pearson Residuals Residuals scaled appropriately for GLMs. Used to diagnose lack of fit, outliers, and overdispersion in Poisson and NB models.

Key People

Siméon-Denis Poisson (1781–1840) French mathematician and physicist who derived the Poisson distribution as a limit of the binomial. The distribution and the regression bear his name.

Norman Breslow (1941–2015) American biostatistician who developed Poisson-regression methods for cohort and rate data, and (with Day) authored the foundational two-volume Statistical Methods in Cancer Research.

Diane Lambert American statistician who introduced zero-inflated Poisson regression in a 1992 paper on manufacturing defects. The method is now standard for excess-zero counts.

Section 2

Poisson Regression Model & Interpretation

⏱ Estimated time: 20 minutes

Section 2 of 4

Log-linear model, the offset for person-time, and rate ratios.

The starting point

Expected count = person-time times rate

Expected count (Eq 18.2)

\[ \color{#0B7B6B}{E(Y)} = \color{#C2410C}{n} \, \color{#6D28D9}{\lambda} \]

E(Y): expected count; n: person-time at risk; λ: event rate

Here $n$ is the accumulated person-time at risk and $\lambda$ is the underlying event rate. Subjects with longer follow-up have larger expected counts purely because of time, independent of any predictor.

The goal of the model is to let $\lambda$ vary with covariates while keeping $n$ as an accounting term, not an estimated coefficient.

The log-linear model

Poisson regression with an offset

Poisson regression model (Eq 18.3)

\[ \color{#0B7B6B}{\ln\bigl(E(Y)\bigr)} = \underbrace{\color{#C2410C}{\ln(n)}}_{\text{offset}} + \color{#6D28D9}{\beta_0} + \color{#1D4ED8}{\beta_1 X_1 + \beta_2 X_2 + \cdots} \]

ln E(Y): log expected count; ln(n): offset (log person-time); β₀: intercept; β_kX_k: predictor terms

Without offset

Predicts expected count. All observations assumed to have equal exposure time.

With offset $\ln(n)$

Predicts expected rate $E(Y)/n$. Coefficient fixed at 1; uses no degrees of freedom.

What the offset does

From count to rate: algebraic equivalence

Rate model (equivalent form)

\[ \ln\!\left(\frac{\color{#0B7B6B}{E(Y)}}{\color{#C2410C}{n}}\right) = \color{#6D28D9}{\beta_0} + \color{#1D4ED8}{\beta_1 X_1 + \beta_2 X_2 + \cdots} \]

E(Y): expected count; n: person-time; β₀: intercept; β_kX_k: predictors

Interpretation

Exponentiated coefficients as incidence rate ratios

Incidence Rate Ratio

\[ \text{IRR} = e^{\beta_1} \]

IRR > 1

Higher rate in the exposed group. $e^{0.30} = 1.35$ means a 35% rate increase per unit.

IRR = 1

No difference in rate between groups.

IRR < 1

Lower rate in the exposed group. $e^{-0.22} = 0.80$ means a 20% rate reduction.

Worked example

Mastitis in dairy herds

Example: herd size coefficient

\[ \hat{\beta}_{\text{herd size}} = 0.012 \implies \text{IRR} = e^{0.012} = 1.012 \]

For 100 additional cows:

\[ \text{IRR}_{+100} = e^{0.012 \times 100} = e^{1.2} \approx 3.32 \]

A herd with 100 more cows has a mastitis rate about 3.32 times that of the baseline herd, on the rate-per-cow-year scale.

Carry forward

What to take into the next section

Log-linear model: multiplicative effects on the rate scale, estimated on the log scale.
Offset: $\ln(n)$ with coefficient fixed at 1. Converts count model to rate model.
IRR: $e^\beta$ is the ratio of rates between groups differing by one unit of the predictor.

Introduction and Overview

An earlier section introduced the Poisson distribution as the probability model behind count data. This section turns the distribution into a regression. The log-linear formulation (Nelder & Wedderburn, 1972) lets us link counts to predictors on a multiplicative scale, and the offset term is the trick that converts a count model into a rate model, precisely what you need to handle person-time denominators from the rate-based cohort designs you saw in earlier courses and lessons.

Learning Objectives

Write down a Poisson regression model with a log link and identify its linear predictor.
Use an offset term to convert a count model into a rate model with person-time denominators.
Interpret exponentiated Poisson coefficients as multiplicative rate ratios.
Translate a fitted Poisson regression into incidence rates and predicted counts for new covariate values.

The Expected Count

The starting point for Poisson regression is the relationship between the expected number of events and the underlying rate. If an individual (or group) is observed for n units of person-time and the event rate is λ (lambda), the expected count is:

Expected count (Eq 18.2)

\[ \color{#0B7B6B}{E(Y)} = \color{#C2410C}{n}\,\color{#6D28D9}{\lambda} \]

The expected number of events equals the person-time at risk multiplied by the underlying event rate.

Here, n represents the person-time at risk (e.g., person-years of follow-up) and λ is the incidence rate. The expected count is simply the product of the time at risk and the rate at which events occur. Different subjects may contribute different amounts of person-time, which must be accounted for in the model.
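As a quick numerical illustration of Eq 18.2, the sketch below multiplies a few follow-up times by a common rate. The values of `person_years` and `rate` are hypothetical, chosen only to show how the expected count scales with time at risk.

```r
# Hypothetical follow-up times (person-years) and an assumed common event rate
person_years <- c(0.5, 1, 2, 4)
rate <- 0.3                               # events per person-year (made-up value)

# Eq 18.2: E(Y) = n * lambda
expected_count <- person_years * rate

data.frame(person_years, rate, expected_count)
```

Every subject shares the same rate here, yet the expected counts differ eightfold because follow-up differs eightfold. That difference is what the offset has to absorb.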

The Log-Linear Model

Poisson regression uses a log link function to relate the expected count to a linear combination of predictors. Taking the natural logarithm of both sides of the expected count equation and incorporating predictors gives us the Poisson regression model:

Poisson regression model (Eq 18.3)

\[ \color{#0B7B6B}{\ln\!\big(E(Y)\big)} = \color{#C2410C}{\ln(n)} + \color{#6D28D9}{\beta_0} + \color{#1D4ED8}{\beta_1 X_1 + \beta_2 X_2 + \cdots} \]

The log of the expected count equals the offset (log person-time) plus the intercept plus the predictor terms.

The term ln(n) is the offset, a fixed term in the model that is included, rather than estimated, to account for the fact that different observations may have different amounts of exposure (person-time). The offset's coefficient is fixed at exactly 1, which encodes a simple idea: a subject followed for twice as long is expected to accumulate about twice as many events, so the offset simply carries the denominator of the rate rather than being a quantity the data must estimate. The β coefficients describe how the log of the expected rate changes with the predictors.

The Offset Term

The offset is one of the most important concepts in Poisson regression. It transforms the model from one that predicts counts to one that effectively predicts rates (Coxe et al., 2009).

Modelling Counts (Without Offset)

When no offset is included, the model predicts the expected count directly:

ln(E(Y)) = β₀ + β₁X₁ + …

This is appropriate when all observations have the same amount of exposure or follow-up time. For instance, if all herds are observed for exactly one year, the count of disease cases directly reflects the rate. In practice, this situation is relatively uncommon, since most epidemiological studies have subjects with varying follow-up times.

Modelling Rates (With Offset)

When the offset ln(n) is included, the model effectively predicts the rate rather than the raw count:

ln(E(Y)) = ln(n) + β₀ + β₁X₁ + …

This is equivalent to modelling ln(E(Y)/n) = β₀ + β₁X₁ + …, where E(Y)/n is the expected rate. The offset accounts for the fact that subjects with longer follow-up times are expected to accumulate more events simply by virtue of being observed longer. This is the standard approach when follow-up times vary across subjects.
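The contrast between the two models can be sketched with simulated data. Everything below is an assumption made for illustration: the variable names, the baseline rate of 0.4 events per person-year, and the log rate ratio of 0.3 are invented, not taken from the course dataset.

```r
set.seed(410)
n_subj    <- 2000
fu_years  <- runif(n_subj, min = 0.5, max = 5)      # follow-up varies by subject
exposed   <- rbinom(n_subj, size = 1, prob = 0.5)
true_rate <- 0.4 * exp(0.3 * exposed)               # assumed events per person-year
events    <- rpois(n_subj, lambda = fu_years * true_rate)
sim       <- data.frame(events, exposed, fu_years)

# Rate model: log person-time enters with its coefficient fixed at 1
fit_offset <- glm(events ~ exposed + offset(log(fu_years)),
                  family = poisson, data = sim)

# Count model: ignores how long each subject was followed
fit_count  <- glm(events ~ exposed, family = poisson, data = sim)

exp(coef(fit_offset))   # intercept estimates the baseline rate per person-year
exp(coef(fit_count))    # intercept estimates the mean count per subject

# Predicted counts for new covariate values; dividing by fu_years gives the rate
new_subject <- data.frame(exposed = 1, fu_years = c(1, 3))
pred_count  <- predict(fit_offset, newdata = new_subject, type = "response")
pred_rate   <- pred_count / new_subject$fu_years
```

In the offset model the predicted count triples when follow-up triples, while the predicted rate stays the same. That is the behaviour a coefficient fixed at 1 is meant to produce.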

Interpreting Poisson Regression Coefficients

In Poisson regression, the exponentiated coefficient e^β is interpreted as an incidence rate ratio (IRR), consistent with the wider generalised linear model framework. This is analogous to the odds ratio in logistic regression but applies to rates rather than odds.

Incidence Rate Ratio (IRR)

For a one-unit increase in the predictor X₁, the incidence rate is multiplied by e^β₁. If β₁ = 0.30, then IRR = e^0.30 = 1.35, meaning the rate increases by 35% for each one-unit increase in X₁. An IRR > 1 indicates an increased rate; an IRR < 1 indicates a decreased rate; and an IRR = 1 indicates no association.

Epidemiological Example: Mastitis in Dairy Herds

Suppose we model the number of mastitis cases per herd over one year, with herd size as a predictor and cow-years at risk as the offset. The Poisson regression yields β_{herd size} = 0.012.

Interpretation: e^0.012 = 1.012, so for each additional cow in the herd, the incidence rate of mastitis increases by 1.2%. A herd with 100 more cows would have an expected rate ratio of e^(0.012 × 100) = e^1.2 ≈ 3.32 compared to the baseline, that is, a mastitis rate about 3.32 times as high. Notice that the scaling is done on the coefficient scale first: multiply the coefficient by 100, then exponentiate. Multiplying the rate ratio itself by 100 (1.012 times 100) would be wrong, because rate ratios compound by multiplication rather than adding up.
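The arithmetic in this example can be checked directly in R. Only the coefficient 0.012 comes from the example; the variable names are illustrative.

```r
beta_herd <- 0.012                      # coefficient from the mastitis example

irr_per_cow  <- exp(beta_herd)          # rate ratio for one additional cow
irr_100_cows <- exp(beta_herd * 100)    # scale on the log scale, then exponentiate

round(irr_per_cow, 3)                   # 1.012
round(irr_100_cows, 2)                  # 3.32

# Compounding the unrounded per-cow ratio 100 times gives the same answer
all.equal(irr_per_cow^100, irr_100_cows)   # TRUE

# The incorrect shortcut: multiplying the ratio itself by 100
irr_per_cow * 100                       # about 101.2, not a rate ratio for +100 cows
```

Compounding the rounded value 1.012 instead gives roughly 3.30 rather than 3.32, which is one more reason to scale the coefficient before exponentiating.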

R Activity: Poisson with an offset and a negative-binomial fall-back

The companion dataset phaa_followup.csv records how many GP visits each participant had during their follow-up. Because follow-up time varies, we need an offset of log(fu_years) to turn the count into a rate. The full annotated script is in r-activities/HSCI_410_Lesson_7_Count_and_Rate_Data.R.

library(MASS);  library(AER)
phaa <- read.csv("phaa_followup.csv", stringsAsFactors = FALSE)
phaa$smoker <- factor(phaa$smoker, levels = c("No","Yes"))

# 1. A peek at the count outcome
summary(phaa$gp_visits)
hist(phaa$gp_visits, breaks = 30,
     main = "GP visits during follow-up", xlab = "Visits")

# 2. Poisson regression with offset to model the RATE per person-year
fit_rate <- glm(gp_visits ~ age + smoker + hypertension
                            + offset(log(fu_years)),
                family = poisson, data = phaa)
summary(fit_rate)
exp(coef(fit_rate))                              # incidence-rate ratios
exp(confint(fit_rate))

# 3. Goodness of fit: Pearson chi^2 / df ~ 1 = good
sum(residuals(fit_rate, type = "pearson")^2) / fit_rate$df.residual

# 4. Formal overdispersion test
dispersiontest(fit_rate)

# 5. Negative binomial fall-back when Poisson is overdispersed
fit_nb <- glm.nb(gp_visits ~ age + smoker + hypertension
                              + offset(log(fu_years)), data = phaa)
AIC(fit_rate, fit_nb)
cbind(Poisson = exp(coef(fit_rate)),
      NegBin  = exp(coef(fit_nb)))

Why the offset isn't a predictor. Because its coefficient is fixed at 1, offset(log(fu_years)) shifts the intercept onto the rate scale without using a degree of freedom. The IRR for smokerYes tells you the multiplicative effect of smoking on the visit rate, holding age and hypertension constant. If dispersiontest() is significant, prefer glm.nb(); CIs widen but the point estimates are usually similar.

Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From exp(coef(fit_rate)) and exp(confint(fit_rate)), report the IRR (and 95% CI) for smokerYes. Translate it into one sentence: how does the rate of GP visits per person-year differ between smokers and non-smokers?

Model answer: exp(coef(fit_rate)) typically returns an IRR for smokerYes of about 1.40, 95% CI roughly (1.25, 1.55). Interpretation: smokers have an incidence rate of GP visits per person-year about 40% higher than non-smokers, holding other covariates constant. The CI excludes 1, so the effect is statistically significant.

2. Compute sum(residuals(fit_rate, type = "pearson")^2) / fit_rate$df.residual. Is it close to 1, or much larger? Also report the p-value from dispersiontest(fit_rate). What do those two pieces of evidence say about overdispersion?

Model answer: sum(residuals(fit_rate, type = "pearson")^2) / fit_rate$df.residual typically returns a value around 1.4–2.0, meaningfully larger than 1, suggesting overdispersion. dispersiontest(fit_rate) gives a small p-value (typically < 0.01), confirming statistically significant overdispersion. Both pieces of evidence point to the same conclusion: the Poisson assumption (variance = mean) is violated, and the standard errors from Poisson are likely too small (CIs too narrow, p-values too small).

3. From AIC(fit_rate, fit_nb) and the side-by-side IRR comparison, which model do you prefer? Are the point estimates similar between Poisson and NegBin? What typically changes when you switch (point estimates, CIs, or both)?

Model answer: AIC(fit_rate, fit_nb) typically shows AIC much lower for the negative binomial model (often hundreds of points lower), strongly favouring NegBin. The point estimates for IRRs are very similar between Poisson and NegBin, which is expected because both are estimating the conditional mean structure. What changes are the standard errors: NegBin SEs are larger (because they account for the extra variance from the over-dispersion parameter), so the CIs widen and p-values become less extreme. The lesson: overdispersion doesn't bias the point estimates much, but mis-specifying it gives over-confident inference.

Poisson Regression for Relative Risk Estimation

An important application of Poisson regression is estimating relative risks (RR) directly from binary outcome data. When the outcome is rare, the Poisson model can provide estimates of the RR that are more interpretable than the odds ratios from logistic regression. This approach typically uses robust (sandwich) standard errors to account for the fact that binary data do not truly follow a Poisson distribution.
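A minimal sketch of this approach on simulated binary data follows. The simulation settings (a baseline risk of 0.10 and a true relative risk of 2) are assumptions made for illustration. In practice the robust variance is usually obtained from a package function such as sandwich::vcovHC(); here the basic (HC0) sandwich is written out in base R so the calculation is visible.

```r
set.seed(410)
n_subj  <- 5000
exposed <- rbinom(n_subj, size = 1, prob = 0.5)
risk    <- 0.10 * 2^exposed                  # assumed: baseline risk 0.10, true RR = 2
disease <- rbinom(n_subj, size = 1, prob = risk)

# Poisson model with a log link fitted to a 0/1 outcome
fit_rr <- glm(disease ~ exposed, family = poisson)

# HC0 sandwich variance for the log-link Poisson model:
#   bread = (X' diag(mu) X)^-1,  meat = X' diag((y - mu)^2) X
X      <- model.matrix(fit_rr)
mu_hat <- fitted(fit_rr)
bread  <- solve(crossprod(X, X * mu_hat))
meat   <- crossprod(X * (disease - mu_hat))
robust_vcov <- bread %*% meat %*% bread

robust_se <- sqrt(diag(robust_vcov))
model_se  <- sqrt(diag(vcov(fit_rr)))

# Relative risk with a robust 95% Wald interval (coefficient 2 is 'exposed')
rr    <- exp(coef(fit_rr)[2])
rr_ci <- exp(coef(fit_rr)[2] + c(-1, 1) * qnorm(0.975) * robust_se[2])

rbind(model_se, robust_se)
c(RR = unname(rr), lower = rr_ci[1], upper = rr_ci[2])
```

For a binary outcome the model-based Poisson standard errors are too large, because the Poisson variance μ exceeds the Bernoulli variance μ(1 − μ). The sandwich estimator corrects for that, which is why it is paired with this model.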

Reflection

Consider a study where participants have very different follow-up times. How would using an offset term change your interpretation compared to simply modelling raw counts?

Model answer: The offset term offset(log(person_years)) in a Poisson model fixes the coefficient on log(person-years) at exactly 1, effectively modelling the rate (events per person-year) rather than the raw count. Without an offset, the model treats every participant as if they had the same exposure time, so differences in follow-up are absorbed into the intercept and the other coefficients, which become biased and hard to interpret. With the offset, the IRRs read as multipliers on the rate (for example, smokers have a rate of events per unit time 1.4 times that of non-smokers), which is the unit of inference most epidemiologic questions want. In a count model without an offset, a participant with 5 events in 1 year is treated the same as one with 5 events in 5 years, which is obviously wrong; the offset corrects this.

Section 3

Evaluating Poisson Models & Overdispersion

⏱ Estimated time: 20 minutes

Section 3 of 4

Residuals, goodness-of-fit, and the most common Poisson failure mode.

Residual types

Three kinds of Poisson residuals

Pearson residual (Eq 18.6)

\[ \color{#0B7B6B}{r_{P,i}} = \frac{\color{#C2410C}{y_i} - \color{#6D28D9}{\hat{\mu}_i}}{\color{#1D4ED8}{\sqrt{\hat{\mu}_i}}} \]

r_P,i: Pearson residual; y_i: observed count; μ̂_i: fitted mean; √μ̂_i: Poisson SD

Deviance residuals

Signed contribution of each observation to the total model deviance. More normally distributed; preferred for quantile-quantile plots.

Anscombe residuals

Cube-root transformation for near-normality. Complementary to Pearson and deviance residuals in a thorough diagnostic review.

Goodness of fit

The dispersion parameter

Estimated dispersion

\[ \color{#0B7B6B}{\hat{\phi}} = \frac{\color{#C2410C}{\sum r_{P,i}^{2}}}{\color{#6D28D9}{n - p}} \]

φ̂: dispersion estimate; Σr_P,i²: sum of squared Pearson residuals; n − p: residual df

$\hat{\phi} \approx 1$

Poisson fits well. Variance matches the mean as assumed.

$\hat{\phi} \gg 1$

Overdispersion. Observed variance exceeds Poisson prediction. Standard errors are too small, so confidence intervals are too narrow.

$\hat{\phi} < 1$

Underdispersion. Less common. The data vary less than the Poisson model predicts.

A critical distinction

Apparent versus real overdispersion

Apparent overdispersion

Source: model misspecification. Outliers, missing predictors, wrong functional form. Fix by correcting the model first, then re-check the dispersion statistic.

Real overdispersion

Source: genuine unobserved heterogeneity. Clustering within groups, biological variability, unmeasurable confounders. Requires a different distributional assumption.

Always investigate apparent causes before reaching for a distributional fix.

Responses to overdispersion

Four approaches

Scale SEs by $\sqrt{\hat{\phi}}$

Quick correction (quasi-Poisson). Coefficients unchanged; standard errors widened. Appropriate for mild to moderate overdispersion.

Negative binomial

Adds an explicit dispersion parameter $\alpha$. Standard errors are correct by design. Section 4 covers this fully.

GLMM / random effects

Models unobserved heterogeneity through subject-level or group-level random intercepts. Best for clustered designs.

GEE (robust SEs)

Uses sandwich variance estimators. Population-averaged interpretation; appropriate when within-cluster correlation is a nuisance rather than a target.

Carry forward

What to take into the next section

Dispersion statistic: $\hat{\phi} = \sum r_{P}^2 / (n-p)$. Target is 1.
Apparent vs. real: check model specification before choosing a distributional fix.
If real: negative binomial regression is the standard first response, and it is where a later section begins.

Introduction and Overview

An earlier section fit the Poisson model. This section turns to evaluating it. Poisson regression makes a strong assumption, that the variance equals the mean, which real count data frequently violate (Ver Hoef & Boveng, 2007). Overdispersion is the most common diagnosis you'll make on a count model, and addressing it is what a later section will be about. First, though, you need the residuals and goodness-of-fit tools to detect it.

Learning Objectives

Compute and interpret Pearson, deviance, and Anscombe residuals from a Poisson model.
Apply the deviance and Pearson chi-squared statistics as overall goodness-of-fit tests.
Define overdispersion in terms of the variance-to-mean ratio and explain why it inflates Type I error.
Choose between quasi-Poisson, scale-corrected, and negative binomial responses to overdispersion.

Residuals for Poisson Models

Just as in linear regression, residuals are the primary tool for evaluating how well a Poisson model fits the observed data. However, because the variance of a Poisson variable depends on its mean, raw residuals (observed − expected) are not directly comparable across observations. Several types of standardised residuals have been developed:

Pearson Residuals

Pearson residuals standardise the raw residual by dividing by the square root of the expected value:

Pearson residual (Eq 18.6)

\[ \color{#0B7B6B}{r_P} = \frac{\color{#C2410C}{y} - \color{#6D28D9}{\hat{\mu}}}{\color{#1D4ED8}{\sqrt{\hat{\mu}}}} \]

The Pearson residual is the observed count minus the fitted mean, divided by the square root of that fitted mean (the Poisson standard deviation).

This accounts for the Poisson assumption that Var(Y) = μ. If the model fits well, Pearson residuals should have approximately mean 0 and variance 1. The sum of squared Pearson residuals follows an approximate χ² distribution and can be used as an overall goodness-of-fit test.
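Eq 18.6 can be computed by hand and checked against R's built-in residuals. The data below are simulated under assumed coefficients (0.5 and 0.4), purely to have a fitted model to work with.

```r
set.seed(18)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.4 * x))   # simulated counts (assumed model)
fit <- glm(y ~ x, family = poisson)

# Eq 18.6: (observed - fitted) / sqrt(fitted)
mu_hat <- fitted(fit)
pearson_by_hand <- (y - mu_hat) / sqrt(mu_hat)

# Agreement with the built-in Pearson residuals
all.equal(unname(pearson_by_hand),
          unname(residuals(fit, type = "pearson")))

# Overall goodness of fit: sum of squared Pearson residuals vs chi-squared
pearson_chi2 <- sum(pearson_by_hand^2)
gof_p <- pchisq(pearson_chi2, df = df.residual(fit), lower.tail = FALSE)
c(chi2 = pearson_chi2, df = df.residual(fit), p = gof_p)
```

Because these data were generated from a Poisson model, the test would usually not reject. With real data a small p-value points to lack of fit, without saying what kind.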

Deviance Residuals

Deviance residuals are based on the contribution of each observation to the overall model deviance (the log-likelihood ratio comparing the fitted model to a saturated model). They are defined as:

\[ d_i = \operatorname{sign}(y_i - \hat{\mu}_i)\,\sqrt{2\left[\,y_i \ln\!\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]} \]

Deviance residuals tend to be more normally distributed than Pearson residuals, especially when some expected counts are small. This makes them preferable for normal probability plots and other diagnostic displays.
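The deviance residual formula can be verified the same way. One detail the formula leaves implicit is the case y = 0, where the term y ln(y/μ̂) is taken to be 0. The simulated data and coefficient values are assumptions for illustration.

```r
set.seed(18)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.4 * x))   # simulated counts (assumed model)
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# y * log(y / mu) is defined as 0 when y = 0
log_term <- ifelse(y == 0, 0, y * log(y / mu_hat))

# pmax() guards against tiny negative values caused by rounding error
unit_dev    <- pmax(2 * (log_term - (y - mu_hat)), 0)
dev_by_hand <- sign(y - mu_hat) * sqrt(unit_dev)

# Agreement with the built-in deviance residuals and the model deviance
all.equal(unname(dev_by_hand),
          unname(residuals(fit, type = "deviance")))
all.equal(sum(dev_by_hand^2), deviance(fit))

# Normal Q-Q plot, the display these residuals are preferred for
qqnorm(dev_by_hand); qqline(dev_by_hand)
```

The squared deviance residuals sum to the residual deviance reported by summary(), which is what "contribution of each observation to the overall model deviance" means in practice.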

Anscombe Residuals

Anscombe residuals use a transformation of the observed counts designed to make the residuals as close to normally distributed as possible. They apply a cube-root transformation to both the observed and expected values. Anscombe residuals are particularly useful when checking the normality assumption of residuals in Poisson models, and they complement Pearson and deviance residuals in a thorough model evaluation.
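Base R's residuals() has no Anscombe option for GLMs, so the sketch below computes them by hand. The lesson does not print the formula. The form used here, 1.5 (y^(2/3) − μ̂^(2/3)) / μ̂^(1/6), is the standard textbook expression for the Poisson case and should be read as an addition to the text. The data are simulated under assumed coefficients.

```r
set.seed(18)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.4 * x))   # simulated counts (assumed model)
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Standard Anscombe residual for the Poisson family (not given in the lesson)
anscombe <- 1.5 * (y^(2/3) - mu_hat^(2/3)) / mu_hat^(1/6)

# Compare with the other two residual types
dev_res     <- residuals(fit, type = "deviance")
pearson_res <- residuals(fit, type = "pearson")
cor(cbind(anscombe, deviance = dev_res, pearson = pearson_res))

plot(dev_res, anscombe,
     xlab = "Deviance residual", ylab = "Anscombe residual")
abline(0, 1, lty = 2)
```

Anscombe and deviance residuals usually track each other closely, so in practice the choice between them rarely changes a diagnostic conclusion.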

Goodness of Fit

The overall fit of a Poisson model can be assessed using the sum of squared Pearson residuals, which approximately follows a χ² distribution with (n − p) degrees of freedom, where n is the number of observations and p is the number of estimated parameters. A significant test statistic suggests the model does not fit the data adequately.

An important diagnostic is the dispersion parameter, estimated as the sum of squared Pearson residuals divided by the residual degrees of freedom:

Dispersion parameter estimate

\[ \color{#0B7B6B}{\hat{\phi}} = \frac{\color{#C2410C}{\sum r_P^{2}}}{\color{#6D28D9}{n - p}} \]

The estimated dispersion is the sum of squared Pearson residuals divided by the residual degrees of freedom. A value near one supports the Poisson variance assumption.

Under the Poisson assumption (mean = variance), φ should equal 1. Values substantially greater than 1 indicate overdispersion; values less than 1 indicate underdispersion.

Why this matters in practice: when overdispersion is ignored, the Poisson standard errors come out too small, so confidence intervals are too narrow and p-values too extreme. The practical risk is calling an association statistically significant when the data do not really support it.
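To see how the dispersion estimate behaves, the sketch below fits the same Poisson model to two simulated outcomes: one truly Poisson and one drawn from a negative binomial with the same mean. The helper name `dispersion_stat` and all simulation settings are hypothetical.

```r
set.seed(7)
n_obs <- 500
x  <- rnorm(n_obs)
mu <- exp(1 + 0.3 * x)                          # assumed true mean

y_pois <- rpois(n_obs, lambda = mu)             # variance equals the mean
y_over <- rnbinom(n_obs, mu = mu, size = 1)     # variance = mu + mu^2 (overdispersed)

# Sum of squared Pearson residuals divided by residual df
dispersion_stat <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

fit_pois <- glm(y_pois ~ x, family = poisson)
fit_over <- glm(y_over ~ x, family = poisson)

phi_pois <- dispersion_stat(fit_pois)   # expected to be close to 1
phi_over <- dispersion_stat(fit_over)   # expected to be well above 1
c(poisson_data = phi_pois, overdispersed_data = phi_over)

# Poisson standard errors look equally precise for both outcomes,
# which is the over-confidence described above
rbind(poisson_data       = coef(summary(fit_pois))[, "Std. Error"],
      overdispersed_data = coef(summary(fit_over))[, "Std. Error"])
```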

Understanding Overdispersion

Overdispersion, the situation where the observed variance exceeds the Poisson-assumed variance, is one of the most common problems in count data modelling (Ver Hoef & Boveng, 2007). It is critical to distinguish between two types:

Warning: Interpreting Overdispersion

Before concluding that overdispersion is “real,” always investigate whether the model is correctly specified. Adding missing predictors, removing outliers, or modelling non-linear effects may resolve apparent overdispersion without needing to change the distributional assumptions. Applying overdispersion corrections to a misspecified model can mask important features of the data.

Apparent Overdispersion

Apparent overdispersion arises from problems with the model rather than the data-generating process itself. Common causes include:

Outliers: A few extreme observations can inflate the dispersion statistic dramatically.
Missing important predictors: If key covariates are omitted from the model, the unexplained variation appears as overdispersion.
Wrong model form: Using a linear predictor when the true relationship is non-linear.
Non-linear effects: Failing to include quadratic or other polynomial terms for predictors with curvilinear relationships.

Apparent overdispersion can be resolved by correcting the model specification: removing outliers, adding missing predictors, or using the correct functional form.
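A small simulation can show apparent overdispersion appearing and then disappearing. The counts below are generated from a true Poisson model with two predictors; the coefficient values and variable names are invented for illustration.

```r
set.seed(99)
n_obs <- 1000
x1 <- rnorm(n_obs)
x2 <- rbinom(n_obs, size = 1, prob = 0.5)            # a strong binary predictor
y  <- rpois(n_obs, lambda = exp(0.5 + 0.3 * x1 + 1.5 * x2))

dispersion_stat <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

fit_missing <- glm(y ~ x1,      family = poisson)    # x2 omitted
fit_full    <- glm(y ~ x1 + x2, family = poisson)    # correctly specified

phi_missing <- dispersion_stat(fit_missing)
phi_full    <- dispersion_stat(fit_full)
c(x2_omitted = phi_missing, x2_included = phi_full)
```

The data are Poisson by construction, so the inflated statistic in the first model is entirely a product of the omitted predictor. Adding it brings the statistic back near 1 without any change to the distributional assumption.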

Real Overdispersion

Real overdispersion reflects genuine extra-Poisson variation in the data that cannot be explained by observable covariates. This often arises from:

Unobserved heterogeneity: Subject-level variation in the underlying rate that is not captured by measured predictors.
Clustering: Events within groups (e.g., animals within herds) are correlated, violating the independence assumption.
Biological variability: Inherent variation in susceptibility or exposure that exceeds what the Poisson model allows.

Real overdispersion requires statistical corrections such as scaling standard errors, using negative binomial regression, or employing random effects models.

Approaches to Handling Overdispersion

Approach	How It Works	When to Use
Scale SEs by √φ	Multiplies standard errors by the square root of the estimated dispersion parameter; coefficients unchanged	Mild to moderate overdispersion; quick fix when coefficient estimates are trusted
Negative binomial regression	Adds an extra parameter (α) to model the excess variance explicitly	Moderate to severe overdispersion; when a more principled model is desired
Random effects / GLMM	Includes subject- or group-level random intercepts to capture unobserved heterogeneity	Clustered data (e.g., animals within herds); hierarchical study designs
GEE (robust SEs)	Uses generalised estimating equations with an empirical (sandwich) variance estimator	Clustered data when marginal (population-averaged) estimates are of primary interest
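The first row of the table can be checked in base R, where family = quasipoisson applies the √φ scaling automatically. The simulated outcome is an assumption for illustration; GLMM and GEE fits need additional packages and are not shown.

```r
set.seed(7)
n_obs <- 500
x <- rnorm(n_obs)
y <- rnbinom(n_obs, mu = exp(1 + 0.3 * x), size = 1)   # overdispersed counts (assumed)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Estimated dispersion: Pearson chi-squared / residual df
phi_hat <- summary(fit_quasi)$dispersion

se_pois  <- coef(summary(fit_pois))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

all.equal(coef(fit_pois), coef(fit_quasi))      # point estimates are identical
all.equal(se_quasi, se_pois * sqrt(phi_hat))    # SEs scaled by sqrt(phi)
```

The quasi-Poisson fit changes nothing about the estimated rate ratios. It only widens every standard error by the same factor, which is why it is described as a quick correction rather than a different model for the data.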

Reflection

Why is it important to distinguish between apparent and real overdispersion before choosing a correction strategy? What could go wrong if you apply the wrong fix?

Model answer: Apparent overdispersion can arise from model misspecification (missing important covariates, missing interactions, or the wrong functional form for a continuous predictor) rather than from true heterogeneity in the rate. The fix depends on the cause: (a) if misspecification, add the missing covariates and re-check dispersion; the ‘overdispersion’ often resolves. (b) If real (true heterogeneity beyond what the linear predictor captures), switch to negative binomial or quasi-Poisson. Wrong fix consequences: applying NegBin to a misspecified Poisson model patches the SE inflation but leaves the biased point estimates from missing covariates; applying robust SEs instead may also misrepresent the true variability if the underlying structure has zero-inflation. Always investigate dispersion's source through residual diagnostics, plots of variance vs. mean, and inclusion of plausible missing covariates.

Section 4

Negative Binomial & Zero-Adjusted Models

⏱ Estimated time: 20 minutes

Section 4 of 4

Extending the Poisson for overdispersion and excess zeros.

Conceptual foundation

Negative binomial as a Gamma-mixed Poisson

When $\alpha = 0$, all subjects share the same $\lambda$, and the model reduces exactly to Poisson.

Two parameterisations

NB-1 and NB-2 variance functions

NB-1 variance (Eq 18.8)

\[ \color{#0B7B6B}{\text{Var}(Y)} = \color{#C2410C}{\mu}(1 + \color{#6D28D9}{\alpha}) \]

Var(Y): variance; μ: mean; α: dispersion parameter

NB-2 variance (Eq 18.9), the default in glm.nb()

\[ \color{#0B7B6B}{\text{Var}(Y)} = \color{#C2410C}{\mu} + \color{#6D28D9}{\alpha}\color{#C2410C}{\mu}^2 \]

Var(Y): variance; μ: mean; α: dispersion parameter

NB-2 is the most commonly used form. Its variance-to-mean ratio $1 + \alpha\mu$ grows with the mean, so the excess over the Poisson variance becomes more pronounced as expected counts get larger.

Choosing between Poisson and NB

Likelihood ratio test on the dispersion parameter

Null hypothesis for LRT

\[ H_0: \alpha = 0 \quad (\text{Poisson is adequate}) \]

Significant LRT

Overdispersion confirmed. Use negative binomial. Expect similar point estimates but wider standard errors than Poisson.

Non-significant LRT

No evidence against Poisson. Simpler model preferred. The dispersion parameter $\alpha$ is effectively zero.
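A sketch of the likelihood ratio comparison using MASS::glm.nb() on simulated data follows. The true dispersion is an assumed α = 0.5, and glm.nb() reports θ = 1/α. Halving the chi-squared p-value is a common convention because α = 0 lies on the boundary of the parameter space. It is noted here as an assumption rather than something the lesson specifies.

```r
library(MASS)

set.seed(5)
n_obs <- 400
x <- rnorm(n_obs)
y <- rnbinom(n_obs, mu = exp(1 + 0.3 * x), size = 2)   # size = theta = 1/alpha = 2

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# LRT for H0: alpha = 0 (Poisson is adequate)
lrt_stat <- as.numeric(2 * (logLik(fit_nb) - logLik(fit_pois)))
lrt_p    <- pchisq(lrt_stat, df = 1, lower.tail = FALSE) / 2   # boundary adjustment

alpha_hat <- 1 / fit_nb$theta
c(LRT = lrt_stat, p = lrt_p, alpha_hat = alpha_hat)

# Similar point estimates, wider standard errors under NB
cbind(poisson_se = coef(summary(fit_pois))[, "Std. Error"],
      negbin_se  = coef(summary(fit_nb))[, "Std. Error"])
```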

Excess zeros

Three zero-adjusted model families

Zero-inflated

Mixture: structural-zero group plus count process. Both components can have different predictors. The Vuong test compares it to the standard model.

Hurdle

Two stages: binary model for any event, then zero-truncated count for magnitude. All zeros come from the binary stage.

Zero-truncated

Zeros structurally impossible by study design. Conditions on Y greater than zero. Applied when sampling excludes non-events.
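Before fitting any zero-adjusted model, a simple first check is to compare the observed share of zeros with the share a fitted Poisson model predicts. The sketch below simulates a zero-inflated outcome with an assumed 30% structural zeros and a count mean of 2. Fitting the models themselves is usually done with functions such as pscl::zeroinfl() and pscl::hurdle(), which are not run here.

```r
set.seed(31)
n_obs <- 2000
structural_zero <- rbinom(n_obs, size = 1, prob = 0.3)   # assumed never-at-risk group
count_part      <- rpois(n_obs, lambda = 2)              # counts among the at-risk
y <- ifelse(structural_zero == 1, 0, count_part)

# Intercept-only Poisson fit to the mixed outcome
fit <- glm(y ~ 1, family = poisson)

observed_zero <- mean(y == 0)
expected_zero <- mean(dpois(0, lambda = fitted(fit)))    # Poisson-predicted P(Y = 0)
c(observed = observed_zero, poisson_expected = expected_zero)

# A zero-truncated analysis would condition on Y > 0
y_positive <- y[y > 0]
min(y_positive)
```

A large gap between the two proportions suggests excess zeros, although overdispersion alone can also produce more zeros than Poisson predicts, so the comparison is a screening step rather than a test.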

Lesson recap

The decision sequence

Introduction and Overview

An earlier section named overdispersion as the most common problem with Poisson regression. This section closes the lesson with the standard fixes: the negative binomial distribution (which adds a free dispersion parameter; Ver Hoef & Boveng, 2007), and zero-adjusted models for the special case where there are far more zeros than Poisson or negative binomial alone can accommodate (Lambert, 1992).

Learning Objectives

Derive the negative binomial as a Gamma-mixed Poisson and contrast NB-1 and NB-2 parameterisations.
Fit a negative binomial regression and compare it to a Poisson fit using a likelihood-ratio test on the dispersion parameter.
Identify zero-inflation versus zero-truncation and choose the appropriate zero-adjusted model.
Interpret a hurdle or zero-inflated model in terms of separate processes for the zero and the count components.

The Negative Binomial Distribution

The negative binomial (NB) distribution extends the Poisson by adding an extra parameter α that captures the additional variation not accounted for by the Poisson assumption. Conceptually, the NB distribution arises when the Poisson rate itself varies randomly across individuals, so each subject has their own λ, drawn from a Gamma distribution. Averaging over those rates produces the negative binomial distribution.
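The Gamma-mixing idea can be checked with a short simulation. The sketch below is illustrative only: the values μ = 6 and α = 0.5, the function names, and the sample size are assumptions chosen for the demonstration, not values from the lesson. Each simulated subject draws a personal rate λ from a Gamma distribution with mean μ, then a Poisson count at that rate.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def gamma_poisson_draws(mu, alpha, n, seed=1):
    """Simulate n counts whose Poisson rate is itself Gamma-distributed.

    The Gamma has shape 1/alpha and scale alpha*mu, so the rates have
    mean mu and variance alpha*mu**2.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        lam = rng.gammavariate(1.0 / alpha, alpha * mu)  # subject-specific rate
        draws.append(poisson_draw(lam, rng))
    return draws

mu, alpha = 6.0, 0.5  # assumed illustrative values
counts = gamma_poisson_draws(mu, alpha, n=50_000)
sample_mean = statistics.fmean(counts)
sample_var = statistics.variance(counts)

print(f"sample mean     = {sample_mean:.2f} (target {mu})")
print(f"sample variance = {sample_var:.2f} "
      f"(Poisson would give {mu}; mu + alpha*mu^2 = {mu + alpha * mu**2})")
```

Under these assumptions the sample variance should land near μ + αμ² rather than near μ, which is the quadratic (NB-2) variance function described later in this section.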

The NB distribution allows the variance to exceed the mean, making it the natural first choice when overdispersion is present (Ver Hoef & Boveng, 2007). Two common parameterisations define how the variance relates to the mean:

Figure: Bar chart comparing a Poisson distribution and a negative binomial distribution, both with a mean of 6. The negative binomial places more probability on very low and very high counts, reflecting its larger variance.

At the same mean (μ = 6), the negative binomial spreads probability further into the tails than the Poisson. That extra dispersion is what makes it the standard remedy when counts are overdispersed.
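The comparison in the figure can be reproduced numerically. The following is a minimal sketch: the figure does not state which dispersion value it used, so α = 0.5 here is an assumption, and the function names are hypothetical. The negative binomial is written in its NB-2 form, with mean μ and variance μ + αμ².

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of observing count k when the mean is mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def nb2_pmf(k, mu, alpha):
    """NB-2 probability of count k, with mean mu and variance mu + alpha*mu**2."""
    r = 1.0 / alpha  # Gamma shape parameter
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

mu, alpha = 6.0, 0.5  # alpha is an assumed value; the figure does not state it

print(" k  Poisson    NB-2")
for k in (0, 1, 6, 12, 18):
    print(f"{k:2d}   {poisson_pmf(k, mu):.4f}  {nb2_pmf(k, mu, alpha):.4f}")

# Upper-tail probability P(Y > 12) under each distribution
poisson_tail = 1 - sum(poisson_pmf(k, mu) for k in range(13))
nb2_tail = 1 - sum(nb2_pmf(k, mu, alpha) for k in range(13))
print(f"P(Y > 12): Poisson {poisson_tail:.4f}, NB-2 {nb2_tail:.4f}")
```

Both the probability of a zero count and the upper-tail probability are larger under the negative binomial, while the probability near the mean is smaller.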

NB-1: Linear Variance

NB-1 variance function (Eq 18.8)

\[ \color{#0B7B6B}{\text{Var}(Y)} = \color{#C2410C}{\mu} + \color{#6D28D9}{\alpha}\color{#C2410C}{\mu} = \color{#C2410C}{\mu}\,(1 + \color{#6D28D9}{\alpha}) \]

The variance equals the mean inflated by a constant factor set by the dispersion parameter, so the variance grows linearly with the mean.

In the NB-1 parameterisation, the variance increases linearly with the mean. The overdispersion is proportional to the mean: doubling the expected count doubles the excess variance. The ratio Var(Y)/μ = (1 + α) is constant across all observations, making NB-1 similar to a quasi-Poisson model with a fixed dispersion parameter.

NB-1 is sometimes preferred when overdispersion is relatively constant across the range of predicted values. However, it is less commonly used in practice than NB-2.

NB-2: Quadratic Variance

NB-2 variance function (Eq 18.9)

\[ \color{#0B7B6B}{\text{Var}(Y)} = \color{#C2410C}{\mu} + \color{#6D28D9}{\alpha}\color{#C2410C}{\mu}^{2} \]

The variance equals the mean plus the dispersion parameter times the mean squared, so the variance grows quadratically with the mean.

In the NB-2 parameterisation (the most commonly used form), the variance increases quadratically with the mean. Observations with higher expected counts have proportionally more overdispersion. This is often more realistic in biological settings where variability tends to grow faster than the average.

The NB-2 model is the default in most statistical software (e.g., Stata’s nbreg; R’s glm.nb() in the MASS package, which reports θ = 1/α rather than α). When α = 0, the NB-2 model reduces to the Poisson model, making the Poisson a special (nested) case of NB-2.
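To see how the two variance functions diverge, it can help to tabulate them side by side. This is a small arithmetic sketch of Eq 18.8 and Eq 18.9; the value α = 0.5 and the list of means are assumptions picked for illustration.

```python
def nb1_variance(mu, alpha):
    """NB-1 (Eq 18.8): variance grows linearly with the mean."""
    return mu * (1 + alpha)

def nb2_variance(mu, alpha):
    """NB-2 (Eq 18.9): variance grows quadratically with the mean."""
    return mu + alpha * mu ** 2

alpha = 0.5  # assumed illustrative dispersion value
for mu in (1, 2, 5, 10, 20):
    v1 = nb1_variance(mu, alpha)
    v2 = nb2_variance(mu, alpha)
    print(f"mu={mu:>2}  NB-1 var={v1:6.1f} (var/mean {v1 / mu:4.1f})  "
          f"NB-2 var={v2:6.1f} (var/mean {v2 / mu:4.1f})")
```

The NB-1 variance-to-mean ratio stays at 1 + α for every mean, whereas the NB-2 ratio is 1 + αμ and so increases with the mean. Setting α = 0 in either function returns the Poisson variance μ.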

Negative Binomial Regression

The NB regression model uses the same log-linear form as Poisson regression; the only difference is in the assumed distribution of the outcome:

Negative binomial regression model

\[ \color{#0B7B6B}{\ln\!\big(E(Y)\big)} = \color{#C2410C}{\ln(n)} + \color{#6D28D9}{\beta_0} + \color{#1D4ED8}{\beta_1 X_1 + \beta_2 X_2 + \cdots} \]

The log of the expected count uses the same log-linear form as Poisson: the offset plus the intercept plus the predictor terms. Only the assumed variance differs.

Coefficients are interpreted identically to Poisson regression: e^β gives the incidence rate ratio. The key advantage is that the NB model yields appropriately wider standard errors when overdispersion is present, because the extra variation is explicitly modelled through α. This holds to the extent that the chosen NB variance function describes that extra variation well.
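The log-linear form with an offset can be turned into expected counts directly. The sketch below uses hypothetical coefficients (a baseline rate of 0.02 events per unit of person-time and an exposure IRR of 1.5); they are invented for illustration and are not estimates from the lesson. The function name is also an assumption.

```python
import math

def expected_count(person_time, beta0, betas, xs):
    """E(Y) = n * exp(beta0 + sum(beta_j * x_j)).

    Multiplying by n is the same as adding the offset ln(n) with its
    coefficient fixed at 1 on the log scale.
    """
    linear_predictor = beta0 + sum(b * x for b, x in zip(betas, xs))
    return person_time * math.exp(linear_predictor)

# Hypothetical coefficients, for illustration only
beta0 = math.log(0.02)  # log baseline rate per unit of person-time
beta1 = math.log(1.5)   # log incidence rate ratio for a binary exposure

irr = math.exp(beta1)
unexposed = expected_count(100, beta0, [beta1], [0])
exposed = expected_count(100, beta0, [beta1], [1])

print(f"IRR = {irr:.2f}")
print(f"Expected events per 100 units of person-time: "
      f"unexposed {unexposed:.1f}, exposed {exposed:.1f}")
```

The ratio of the two expected counts at equal person-time equals the IRR, which is why the exponentiated coefficient reads the same way under the Poisson and negative binomial models.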

Testing Poisson vs. Negative Binomial

Since the Poisson model is nested within the NB model (when α = 0), a likelihood ratio test (LRT) can be used to determine whether the NB model provides a significantly better fit. A significant LRT indicates that overdispersion is present and the NB model is preferred. Note that this is a boundary test (testing α = 0 vs. α > 0), so the p-value from the standard χ² reference distribution with 1 degree of freedom is conservative. A common correction is to halve that p-value, which corresponds to a 50:50 mixture of χ² distributions with 0 and 1 degrees of freedom.
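The mechanics of the test can be sketched for the simplest case, an intercept-only model. Everything below is an illustration under stated assumptions: the counts are invented, the dispersion parameter is found by a crude grid search rather than a proper optimiser, and the halved p-value uses the usual 50:50 mixture argument for a parameter tested on its boundary. Statistical software would fit the full regression model instead.

```python
import math

def poisson_loglik(ys, mu):
    """Poisson log-likelihood for counts ys at a common mean mu."""
    return sum(-mu + y * math.log(mu) - math.lgamma(y + 1) for y in ys)

def nb2_loglik(ys, mu, alpha):
    """NB-2 log-likelihood for counts ys at mean mu and dispersion alpha."""
    r = 1.0 / alpha
    return sum(math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu))
               for y in ys)

# Hypothetical counts, invented for illustration
ys = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 11, 15, 21]
mu_hat = sum(ys) / len(ys)  # sample mean: the fitted mean in both intercept-only models

# Crude profile likelihood: grid of alpha values from 0.001 to about 10
grid = [0.001 * 1.05 ** i for i in range(190)]
alpha_hat = max(grid, key=lambda a: nb2_loglik(ys, mu_hat, a))

ll_pois = poisson_loglik(ys, mu_hat)
ll_nb = nb2_loglik(ys, mu_hat, alpha_hat)
lrt = max(0.0, 2 * (ll_nb - ll_pois))

# Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_naive = math.erfc(math.sqrt(lrt / 2))
# Boundary correction: halve the naive p-value (p = 1 when the statistic is 0)
p_boundary = 0.5 * p_naive if lrt > 0 else 1.0

print(f"alpha_hat = {alpha_hat:.3f}")
print(f"logLik Poisson = {ll_pois:.2f}, logLik NB-2 = {ll_nb:.2f}")
print(f"LRT = {lrt:.2f}, naive p = {p_naive:.3g}, boundary-corrected p = {p_boundary:.3g}")
```

For these invented counts the sample variance is several times the mean, so the NB-2 log-likelihood is much higher than the Poisson one and the test rejects α = 0.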

Zero-Adjusted Models

Standard count models (Poisson and NB) may not adequately handle datasets with an unusual number of zeros (Lambert, 1992). Three families of models have been developed to address different zero-related problems:

Zero-Inflated Models

Hurdle Models

Zero-Truncated Models

Choosing Among Zero-Adjusted Models

The choice depends on the data-generating process:

If some zeros are “structural” (from a fundamentally different process) and others arise from the count process, use a zero-inflated model.
If the zero/non-zero distinction is a separate decision from the magnitude of the count, use a hurdle model.
If zeros are impossible by design, use a zero-truncated model.

Model	Source of Zeros	Key Feature	Test / Comparison
Zero-Inflated	Both components (structural + count)	Mixture of logistic + count model	Vuong test vs. standard model
Hurdle	Binary component only	Two-part: binary then truncated count	LRT or AIC/BIC comparison
Zero-Truncated	Zeros cannot occur	Conditional on Y > 0	Applied when sampling excludes zeros
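The three families in the table differ only in how they assign probability to zero. A minimal sketch of the three probability functions follows, assuming a Poisson count component throughout (a negative binomial could be substituted). The mean μ = 2, the structural-zero probability 0.30, the hurdle zero probability 0.40, and all function names are assumptions for illustration.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of count k at mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def zero_truncated_pmf(k, mu):
    """Poisson conditioned on Y > 0, so zeros cannot occur."""
    if k == 0:
        return 0.0
    return poisson_pmf(k, mu) / (1 - poisson_pmf(0, mu))

def zero_inflated_pmf(k, mu, pi_structural):
    """Mixture: structural zero with probability pi_structural, else Poisson(mu)."""
    base = (1 - pi_structural) * poisson_pmf(k, mu)
    return pi_structural + base if k == 0 else base

def hurdle_pmf(k, mu, p_zero):
    """Binary stage fixes P(Y = 0) = p_zero; positives are zero-truncated Poisson."""
    if k == 0:
        return p_zero
    return (1 - p_zero) * zero_truncated_pmf(k, mu)

mu = 2.0  # assumed illustrative mean of the count component
print(f"Poisson         P(Y=0) = {poisson_pmf(0, mu):.3f}")
print(f"Zero-inflated   P(Y=0) = {zero_inflated_pmf(0, mu, 0.30):.3f} "
      "(structural zeros plus zeros from the count process)")
print(f"Hurdle          P(Y=0) = {hurdle_pmf(0, mu, 0.40):.3f} "
      "(all zeros from the binary stage)")
print(f"Zero-truncated  P(Y=0) = {zero_truncated_pmf(0, mu):.3f}")
```

In the zero-inflated case the zero probability is the structural share plus the count-process share, whereas in the hurdle case it is set entirely by the binary stage. This mirrors the "Source of Zeros" column of the table.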

Reflection

When might you choose a hurdle model over a zero-inflated model in practice? Think of an epidemiological example where the distinction matters.

Model answer: Hurdle and zero-inflated models both handle excess zeros, but through different mechanisms.

Hurdle: the zero process is modelled separately from the count process. A binary stage first determines whether any event occurs, and only if it does is the number of events modelled, typically with a zero-truncated Poisson or negative binomial.

Zero-inflated: zeros come from two sources, a structural-zero process (always zero, e.g., never susceptible) plus the count process, which can produce zero or non-zero counts among the susceptible.

Choose a hurdle model when the binary “any event” decision is conceptually separable from the “how many” count (e.g., GP visits: did you go at all this year, and if so, how many times?). Choose a zero-inflated model when some subjects are structurally unable to experience the event (e.g., contraception use among postmenopausal women, who are structural zeros, versus premenopausal users, for whom cycles are counted). The distinction matters because the two models give different interpretations of the “extra-zero” subjects.


HSCI 410, Lesson 7
