Model-Building Strategies
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Develop a full (maximal) model incorporating biological understanding of the system under study
- Carry out procedures to reduce a large number of predictors to a manageable subset
- Address issues related to the functional form of continuous predictors and missing values
- Build regression-type models using both statistical and non-statistical criteria
- Evaluate the reliability of a regression-type model
- Present the results from an analysis in a meaningful way
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Steps in Model Building
Why Model-Building Strategies Matter
When building a regression model, we need to decide on the goals of the analysis, incorporate both statistical considerations and subject matter knowledge, and balance the desire for parsimony (simplicity) with the desire for a model that “best fits” the data. The definition of “best fit” depends on the goal of the analysis, and the principles discussed in this chapter apply to all types of regression models.
Regression models are generally built to meet one of two broad objectives: (1) to build the best model for predicting future observations, or (2) to understand the causal relationship(s) between predictors and the outcome. The approach to model building differs depending on which goal you are pursuing.
Goals of the Analysis
If the goal is prediction, we want to retain even variables whose relationship with the dependent variable is questionable, because excluding them might lead to inaccurate predictions when future observations have extreme values for those variables. The details of specific predictors are of little consequence; overall predictive accuracy is what matters.
If the goal is understanding biological relationships, we want precise estimates of coefficients for the variables of interest. Careful attention must be paid to interaction and confounding effects. Factors likely to be confounders should be retained in the model regardless of statistical significance, while factors that are almost certainly not confounders should generally be excluded—especially if they are intervening variables, as their inclusion may bias results.
Steps in Building a Regression Model
The process of building a regression model follows a systematic set of steps. While statistical software handles the computation, the researcher must make many decisions along the way that require both subject matter expertise and statistical reasoning.
Step 1: Identify the outcome variable and determine whether it needs transformation (e.g., natural log). Then identify the full set of predictors to consider. The maximum model includes all possible predictors of interest. While a large model prevents overlooking important predictors, adding too many increases the risks of collinearity and spurious associations. Key sub-steps include: drawing a causal diagram, potentially reducing predictors, considering missing values, evaluating effects of continuous predictors, and deciding on interaction terms.
Step 2: Decide how you will determine which variables to retain. Criteria can be non-statistical (e.g., is it a primary predictor of interest? Is it a known confounder?) or statistical (e.g., partial F-tests, likelihood-ratio tests, information criteria like AIC or BIC). Both types of criteria should be considered together.
Step 3: Choose how to apply the criteria. Options include: examining all possible subsets, forward selection (adding variables one at a time), backward elimination (starting with all variables and removing them one at a time), or stepwise procedures (combining forward and backward). The strategy determines the order in which variables are evaluated.
Step 4: Conduct the analyses using your chosen strategy and criteria. Step 5: Evaluate the reliability of the chosen model using diagnostics and sensitivity analyses. Step 6: Present the results in a meaningful way, ensuring they are interpretable to your audience and that the model-building process is transparent.
Building a Causal Model
Before beginning the model-building process, it is imperative to have a causal model in place, usually presented as a causal diagram. The diagram identifies potential causal relationships among the predictors and the outcome of interest.
Suppose you want to study the effects of cigarette smoking on birth weight, and you also have data on the mother’s race, education level, total birth order, gestation length, number of babies born, and weight gain during pregnancy.
A causal diagram would show that gestation length and weight gain are intervening variables—they lie on the causal pathway between smoking and birth weight. If the objective is to quantify the total effect of smoking on birth weight, you would not include gestation length or weight gain in the model, because doing so would remove the effect of smoking that is mediated through them.
On the other hand, race and college education might be confounders and should be retained regardless of statistical significance. Building the causal diagram first helps ensure you do not accidentally adjust for intervening variables.
Confounders should be retained in the model to avoid bias. Intervening variables should generally be excluded when estimating total effects, because including them removes the indirect effect that passes through them. A causal diagram drawn before model building helps you distinguish between the two.
✔ Check Your Understanding
1. When the goal of a regression model is to understand biological relationships, which of the following is true?
2. What is the first step in building a regression model?
3. In a study of cigarette smoking’s effect on birth weight, why should gestation length generally NOT be included in the model?
✎ Reflection
Think about a research question in your own field. What would the causal diagram look like? How would you distinguish confounders from intervening variables?
Reducing Predictors & Missing Values
Reducing the Number of Predictors
It is sometimes necessary to reduce the number of predictors in the model-building process. Before undertaking any reduction, it is essential to identify the primary variables of interest and any variables that might be confounders or interacting variables—these should always be retained for consideration.
The most appropriate procedure for managing a large number of predictors is often to design a more focused study that collects high-quality data on fewer predictors. This greatly reduces the risk of identifying spurious associations.
Screening Predictors Based on Descriptive Statistics
Before starting any model building, become thoroughly familiar with your data using descriptive statistics (means, variances, percentiles for continuous variables; frequency tabulations for categorical variables). This helps identify variables of little value. Guidelines include:
- Avoid variables with large numbers of missing observations
- Select only variables with substantial variability (e.g., if almost all subjects are male, sex will not be a useful predictor)
- If a categorical variable has many categories with small counts, consider combining categories or eliminating the variable
Correlation Analysis
Examining pairwise correlations among predictor variables identifies pairs that contain essentially the same information. Highly correlated predictors (typically |r| > 0.9) produce multicollinearity, leading to unstable coefficient estimates and inflated standard errors.
If highly correlated pairs are found, select one based on criteria such as biological plausibility, ease of measurement, and fewer missing values. Note that pairwise screening will not detect multicollinearity arising from linear combinations of multiple predictors.
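As a sketch of this screening step, the pairwise check is easy to automate. The helper below uses hypothetical variable names and simulated data (numpy assumed available) to flag predictor pairs whose absolute correlation exceeds a chosen cutoff such as 0.9:

```python
import numpy as np

def flag_collinear_pairs(X, names, threshold=0.9):
    """Return predictor pairs whose absolute pairwise correlation exceeds threshold."""
    r = np.corrcoef(X, rowvar=False)              # correlation matrix of the columns
    flagged = []
    for i in range(r.shape[0]):
        for j in range(i + 1, r.shape[0]):
            if abs(r[i, j]) > threshold:
                flagged.append((names[i], names[j], round(float(r[i, j]), 3)))
    return flagged

# Simulated example: x2 is a near-duplicate of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
pairs = flag_collinear_pairs(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print(pairs)   # only the (x1, x2) pair should be flagged
```

Of each flagged pair, keep the member with the stronger biological rationale, easier measurement, or fewer missing values, as described above.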
Creation of Indices & Cronbach’s Alpha
Related predictors can sometimes be combined into a single index. For example, the Hamilton Rating Scale for Depression combines 22 characteristics into an overall depression score.
Cronbach’s alpha evaluates the internal consistency of such a scale—how well each predictor correlates with the overall scale. Interpretation guidelines:
- < 0.60: Unacceptable
- 0.60–0.65: Undesirable
- 0.65–0.70: Minimally acceptable
- 0.70–0.80: Respectable
- 0.80–0.90: Very good
- > 0.90: Consider shortening the scale
One drawback of indices is that they preclude evaluating the effects of the individual factors that were combined.
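Cronbach's alpha is straightforward to compute from item-level data. A minimal sketch (numpy only; the simulated items are hypothetical and all tap one latent construct) using the standard formula alpha = k/(k−1) × (1 − sum of item variances / variance of the summed scale):

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated scale: 4 items sharing one latent construct, plus item-specific noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
items = latent + rng.normal(scale=0.8, size=(300, 4))
alpha = float(cronbach_alpha(items))
print(round(alpha, 2))
```

With these simulation settings the items are fairly strongly related, so alpha should land in the "respectable" to "very good" range of the guidelines above.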
Screening Variables Based on Unconditional Associations
A common approach is to select only predictors with unconditional associations significant at a liberal P-value (e.g., 0.15 or 0.20). Simple univariable regression models are used for this screening.
One drawback: an important predictor might be excluded if its effect is masked by another variable (i.e., confounding is present). Using a liberal P-value helps prevent this. Another approach is to build the model with significant predictors, then add back excluded predictors one at a time to check if any become significant after adjusting for other variables.
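A sketch of this screening loop, assuming a continuous outcome and using a normal approximation so that |t| > 1.28 roughly corresponds to a two-sided P < 0.20; all variable names and effect sizes are illustrative:

```python
import numpy as np

def slope_t(y, x):
    """t-statistic for the slope in a simple (univariable) linear regression of y on x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(2)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)    # x3 has no true effect

# Keep predictors whose unconditional |t| exceeds ~1.28 (liberal two-sided P < 0.20)
kept = [name for name, x in [("x1", x1), ("x2", x2), ("x3", x3)]
        if abs(slope_t(y, x)) > 1.28]
print(kept)
```

After the multivariable model is built from the kept predictors, the excluded ones can be added back one at a time as described above.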
PCA, Factor Analysis & Correspondence Analysis
Principal Components Analysis (PCA) converts a set of k predictor variables into k orthogonal (uncorrelated) principal components, each containing a decreasing proportion of total variation. A small subset of components is then used as predictors, eliminating multicollinearity. Coefficients can be back-transformed to the original predictors, though interpretation is less direct.
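A compact way to carry out PCA is via the singular value decomposition of the centred predictor matrix. The sketch below (numpy; simulated data with two underlying dimensions driving five predictors) extracts the component scores and the proportion of total variation each component explains:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 5
Z = rng.normal(size=(n, 2))                    # two underlying dimensions
X = Z @ rng.normal(size=(2, k)) + 0.1 * rng.normal(size=(n, k))

Xc = X - X.mean(axis=0)                        # centre each predictor
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()                # proportion of total variation per component
scores = Xc @ Vt.T                             # principal component scores (uncorrelated)
print(np.round(explained, 3))
```

Because the scores are mutually uncorrelated, using only the first few components as predictors eliminates multicollinearity, at the cost of less direct interpretation.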
Factor analysis is similar but assumes factors with inherent meaning can be created from the original variables. Unlike PCA, the composition of factors varies as the number selected changes. Predictors with high “factor loadings” are identified as important determinants.
Correspondence analysis is designed for categorical variables. It produces a visual summary (2D scatterplot) of complex relationships, showing which clusters of predictors are associated with which clusters of outcome values.
The Problem of Missing Values
Missing data are common in observational studies. Statistical programs use complete case analysis by default—only observations with no missing values for any variable are included. Even a relatively low overall percentage of missing values can result in a substantial reduction of the sample if missing data are spread across observations.
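The attrition caused by complete case analysis is easy to underestimate. The short simulation below (hypothetical figures: 20 predictors, each with 5% of values missing completely at random) shows how modest per-variable missingness compounds:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p_missing = 1000, 20, 0.05            # 20 predictors, each 5% MCAR missingness
missing = rng.random((n, k)) < p_missing    # True where a value is missing
complete = ~missing.any(axis=1)             # rows with no missing values at all
print(complete.mean())                      # ≈ 0.95**20 ≈ 0.36
```

Even though only 5% of any one variable is missing, roughly two thirds of the observations are lost to a complete case analysis.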
Dealing with Missing Data: Imputation
The two main alternatives to complete case analysis are imputation and analysis methods where missing data are ignorable. Imputation involves replacing missing data points with values predicted from available data.
Single imputation derives one estimate for each missing value. However, analysis based on single imputed data does not account for the uncertainty of the estimated values. Multiple imputation generates multiple imputed datasets and combines results, properly accounting for this uncertainty. Multiple imputation is generally preferred over single imputation.
Maximum likelihood (ML) and Bayesian estimation are procedures that make missing values ignorable under the missing-at-random (MAR) assumption. ML requires specifying the distribution of missing values for predictors, but this is unnecessary for missing outcome values. These methods are closely linked to multiple imputation conceptually.
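The pooling step of multiple imputation follows Rubin's rules: the point estimate is the average across imputed datasets, and the total variance combines within- and between-imputation components. A toy sketch (numpy; a crude hot-deck draw from the observed values stands in for a proper imputation model) for estimating the mean of a variable with values missing completely at random:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=10, scale=2, size=500)
y[rng.random(500) < 0.2] = np.nan              # ~20% missing completely at random
obs = y[~np.isnan(y)]

m = 20                                         # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    y_imp = y.copy()
    miss = np.isnan(y_imp)
    # crude stand-in for a proper imputation model: draw from the observed values
    y_imp[miss] = rng.choice(obs, size=miss.sum(), replace=True)
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / len(y_imp))

q_bar = float(np.mean(estimates))              # pooled point estimate
W = float(np.mean(variances))                  # within-imputation variance
B = float(np.var(estimates, ddof=1))           # between-imputation variance
T = W + (1 + 1 / m) * B                        # Rubin's total variance
print(round(q_bar, 2), round(np.sqrt(T), 3))
```

The total variance T exceeds the naive within-imputation variance W, which is exactly the extra uncertainty that single imputation fails to capture.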
✔ Check Your Understanding
1. What does Cronbach’s alpha measure?
2. Under which missing data mechanism is complete case analysis most likely to produce biased results?
3. Why is multiple imputation generally preferred over single imputation?
✎ Reflection
Consider a dataset you have worked with (or imagine one). Which type of missing data mechanism (MCAR, MAR, MNAR) do you think was most likely present, and why? What approach would you take to handle it?
Effects of Continuous Predictors
Evaluating Continuous Predictor–Outcome Relationships
It is important to evaluate the structure of the relationship between a continuous predictor and the outcome before starting model building. The underlying assumption of linearity can be evaluated through diagnostics after fitting the model, but it is useful to explore the nature of the relationship beforehand.
Four main approaches to evaluating the effect of continuous predictors are: (1) scatterplots and smoothed line plots, (2) categorising the continuous variable, (3) exploring polynomial models, and (4) using splines.
Scatterplots & Smoothed Lines
Scatterplots are 2-way plots of the outcome (Y-axis) versus the continuous predictor (X-axis). They are most useful for continuous outcomes; scatterplots of dichotomous outcomes present as two lines of dots and are rarely informative by themselves.
Scatterplots can be greatly improved by adding a smoothed line through the centre of the data. All smoothed lines have a local-influence property: the position of the line at any value of x is influenced by nearby points but not by distant points.
There are several types of smoothed line functions:
- Running mean smoother: Computes a simple average of y values in the neighbourhood
- Running line smoother: Fits a simple linear regression through observations in the neighbourhood
- Lowess smoother: Fits a weighted linear regression in which points closer to the target value x_i receive larger weight (tricube weighting)
- Local polynomial smoother: Fits a weighted polynomial regression in the neighbourhood
The bandwidth controls the size of the neighbourhood. A bandwidth of 0.8 means 80% of the data is used for each point. Larger bandwidths produce smoother lines but may miss important features.
All smoothed line functions can have problems at the extreme values of the predictor distribution. This is because the neighbourhood at the tails is not symmetrical and contains relatively few data points. It is important not to pay much attention to the extremes of the fitted line. Vertical dashed lines marking the 2.5th and 97.5th percentiles can help delineate where most of the data falls.
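The running mean smoother described above is simple to implement directly. This sketch (numpy; simulated sinusoidal data) also makes the tail problem visible, since the neighbourhood is truncated and asymmetric at the extremes of x:

```python
import numpy as np

def running_mean_smoother(x, y, bandwidth=0.3):
    """Smoothed value at each x: mean of y over the nearest `bandwidth` share of points."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    half = max(1, int(bandwidth * n / 2))
    smoothed = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)   # window truncated at the tails
        smoothed[i] = y[lo:hi].mean()
    return x, smoothed

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
xs, ys = running_mean_smoother(x, y, bandwidth=0.2)

# In the interior the smoother tracks sin(x) closely; trust the tails much less
mid = (xs > 2) & (xs < 8)
err_mid = float(np.mean(np.abs(ys[mid] - np.sin(xs[mid]))))
print(round(err_mid, 3))
```

Increasing the bandwidth trades a smoother line for less ability to follow genuine features, which is the same trade-off described above.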
Categorising Continuous Predictors
The assumption of linearity can be avoided by converting the continuous predictor into categories. However, this is generally not advisable for three reasons:
- Categorisation involves the loss of information
- It is unlikely that biological processes have a step-function relationship (i.e., sudden changes at specific cutpoints)
- The choice of cutpoints is arbitrary and, if data-driven, may lead to biased results
That said, about 5 categories will usually suffice to control for confounding effects. A model with a categorised variable can be compared to one with a continuous (linear) variable using AIC or BIC.
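That comparison can be made concrete by computing AIC for a linear versus a quintile-categorised version of the predictor. In the sketch below (numpy; simulated data in which the true relationship really is linear) the linear model should win:

```python
import numpy as np

def ols_aic(X, y):
    """AIC of an OLS fit (Gaussian likelihood up to a constant): n*ln(RSS/n) + 2p."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, p = X.shape
    return n * np.log(rss / n) + 2 * p

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(0, 10, n)
y = 0.8 * x + rng.normal(size=n)               # the true relationship is linear

X_lin = np.column_stack([np.ones(n), x])
cats = np.digitize(x, np.quantile(x, [0.2, 0.4, 0.6, 0.8]))   # 5 quintile categories
X_cat = np.column_stack([np.ones(n)] + [(cats == j).astype(float) for j in range(1, 5)])

aic_lin, aic_cat = ols_aic(X_lin, y), ols_aic(X_cat, y)
print(round(aic_lin, 1), round(aic_cat, 1))    # the linear model should have the lower AIC
```

The categorised model pays twice: it uses more parameters and it cannot capture the within-category trend, so its residual variance is larger.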
Polynomial Models
Polynomials allow the regression line to follow a curve rather than a straight line. Power terms (e.g., x² or x³) are added to the model. Unlike smoothed lines, polynomial models have a global-influence property—the shape of the entire line is influenced by all the data.
The original variable (x) is often highly correlated with its squared term (x²), creating collinearity. The solution is to centre the variable by subtracting its mean before squaring. If the quadratic term is significant but curvature remains in the residuals, a cubic term (x³) can be added.
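The effect of centring on collinearity is easy to demonstrate: for an approximately symmetric predictor, the correlation between the centred variable and its square drops to near zero.

```python
import numpy as np

x = np.linspace(1, 10, 100)
r_raw = float(np.corrcoef(x, x**2)[0, 1])        # high: x and x^2 move together
xc = x - x.mean()
r_centred = float(np.corrcoef(xc, xc**2)[0, 1])  # near zero for a symmetric predictor
print(round(r_raw, 3), round(r_centred, 3))
```

With a skewed predictor the centred correlation will not vanish entirely, but it is still much smaller than the raw correlation.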
Fractional Polynomials
Fractional polynomials (FPs) extend the idea of polynomial models by allowing power terms that are not restricted to positive integers. The most common set of powers to consider is: −3, −2, −1, −0.5, 0 (= ln), 0.5, 1, 2, 3. A 2-degree FP can fit a wide range of non-linear shapes and may be the most parsimonious way to model non-linearity.
A quadratic model regressing birth weight on centred gestation length showed R² = 0.29. When fractional polynomials were explored, the best-fitting 2-degree FP used the repeated power (3, 3), giving terms in gest³ and gest³ × ln(gest), yielding R² = 0.30 and fitting significantly better than the linear, quadratic, or cubic models. The FP coefficients are not directly interpretable; the only way to make sense of such a model is to display the fitted function graphically.
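A brute-force search over the conventional power set is enough to find the best-fitting 2-degree FP by least squares. The sketch below (numpy; simulated data) follows the usual conventions that a power of 0 means ln(x) and that a repeated power (p, p) contributes x^p and x^p × ln(x); it is illustrative rather than the formal FP fitting algorithm:

```python
import numpy as np
from itertools import combinations_with_replacement

POWERS = [-3, -2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp_term(x, p):
    return np.log(x) if p == 0 else x ** p

def best_fp2(x, y):
    """Search all 2-degree fractional polynomials; return (powers, RSS) of the best fit."""
    n = len(x)
    best = None
    for p1, p2 in combinations_with_replacement(POWERS, 2):
        t1 = fp_term(x, p1)
        # a repeated power (p, p) uses x^p and x^p * ln(x) as its two terms
        t2 = fp_term(x, p2) * np.log(x) if p1 == p2 else fp_term(x, p2)
        X = np.column_stack([np.ones(n), t1, t2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        if best is None or rss < best[1]:
            best = ((p1, p2), rss)
    return best

rng = np.random.default_rng(8)
x = rng.uniform(0.5, 5, 400)                 # x must be positive for these powers
y = np.log(x) + 1 / x + rng.normal(scale=0.2, size=400)
powers, rss = best_fp2(x, y)
print(powers, round(rss, 1))
```

As the text notes, the selected powers are not interpretable on their own; plot the fitted function over the observed range of x to understand the model.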
Splines
An alternative to polynomial models is to fit a piecewise linear function. Points where the slope changes are called knot points. In the absence of prior evidence, knots may be chosen at percentiles of the predictor (e.g., 25th, 50th, 75th). Cubic splines allow for smoother transitions across knots compared to linear splines, producing more biologically plausible curves.
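Constructing a linear spline basis requires only the truncated terms (x − k)+ for each knot. A sketch (numpy; simulated data with a true slope change at x = 5, knots placed at quartiles for lack of prior evidence):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Design matrix for a linear spline: intercept, x, and (x - k)+ for each knot."""
    cols = [np.ones_like(x), x]
    cols += [np.clip(x - k, 0, None) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 500)
# simulated truth: the slope changes from 1 to 0.2 at x = 5 (a knot point)
y = np.where(x < 5, x, 5 + 0.2 * (x - 5)) + rng.normal(scale=0.3, size=500)

knots = np.quantile(x, [0.25, 0.5, 0.75])      # knots at percentiles, absent prior evidence
X = linear_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rmse = float(np.sqrt(resid @ resid / len(y)))
print(round(rmse, 3))
```

The coefficient on each (x − k)+ term is the change in slope at that knot; a cubic spline basis replaces these terms with smoother functions but is fitted the same way.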
| Approach | Influence | Strengths | Limitations |
|---|---|---|---|
| Smoothed lines | Local | Flexible; reveals non-linearity | Cannot be used in model itself; issues at extremes |
| Categorisation | N/A | Avoids linearity assumption | Loses information; arbitrary cutpoints |
| Polynomials | Global | Simple to implement; formal tests | May over-fit at extremes; collinearity |
| Fractional polynomials | Global | Very flexible with few terms | Coefficients not directly interpretable |
| Splines | Local | Flexible; smooth transitions | Sudden shifts at knots (linear splines) |
✔ Check Your Understanding
1. Why is categorising a continuous predictor generally not advisable?
2. What is a key difference between smoothed lines and polynomial models?
3. Why should you centre a continuous variable before adding its squared term to a regression model?
✎ Reflection
Think about a continuous predictor in your field. Would you expect the relationship with the outcome to be linear? If not, which approach (categorisation, polynomials, fractional polynomials, or splines) would you choose and why?
Interactions & Building the Model
Identifying Interaction Terms
It is important to consider including interaction terms when specifying the maximum model. There are five general strategies for creating and evaluating 2-way interactions:
Strategy 1: Create and test every possible pair of interaction terms. This is feasible only when the total number of predictors is small (e.g., ≤ 8).
Strategy 2: After building the final main-effects model, create interactions among all predictors that are statistically significant. This reduces the number of interactions to evaluate but may miss interactions involving non-significant main effects.
Strategy 3: Create interactions among all predictors that have a significant unconditional association with the outcome. This casts a wider net than Strategy 2.
Strategy 4: Only create interactions among pairs of variables you suspect (based on evidence from the literature or biological reasoning) might interact. This usually focuses on interactions involving the primary predictor(s) of interest and important confounders.
Strategy 5: Only create interactions that involve the exposure variable (primary predictor of interest). This is the most conservative approach but may miss important interactions among covariates.
If an interaction term is included in the model, the main effects that make it up must also be included. Evaluating many interactions increases the risk of identifying spurious associations, so a Bonferroni adjustment or similar correction may be warranted. Three-way interactions are usually very difficult to interpret and should be included only if there is strong a priori reason.
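Whichever strategy is used, a 2-way interaction term is simply the product of its two main effects, and both main effects stay in the model alongside it. A sketch (numpy; simulated data with a genuine exposure-by-covariate interaction) fitting the full model and computing a Wald-type t-statistic for the product term:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 600
exposure = rng.binomial(1, 0.5, n).astype(float)
covariate = rng.normal(size=n)
# simulated truth: the covariate modifies the exposure effect (interaction = 0.5)
y = 1.0 * exposure + 0.3 * covariate + 0.5 * exposure * covariate + rng.normal(size=n)

# the model keeps both main effects plus their product
X = np.column_stack([np.ones(n), exposure, covariate, exposure * covariate])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_interaction = float(beta[3] / se[3])
print(round(float(beta[3]), 2), round(t_interaction, 1))
```

If many such terms are screened, the significance threshold should be tightened (e.g., Bonferroni) as noted above.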
Building the Model: Selection Criteria
Once the maximum model has been specified, you need to decide how to determine which predictors to retain. Both non-statistical and statistical criteria should be considered.
Non-Statistical Considerations
Variables should be retained in the model if they:
- Are a primary predictor of interest
- Are thought a priori to be confounders for the primary predictor
- Show evidence of being a confounder (their removal causes a substantial change in the coefficient of interest)
- Are a component of an interaction term included in the model
Statistical Criteria for Nested Models
Models where one model’s predictors are a subset of another’s are called nested models. Tests for nested models include:
- Partial F-test (for linear regression)
- Wald test (most commonly used; can be unreliable if P-values are near 0.05 or SEs appear suspect)
- Likelihood-ratio test (LRT) (has the best statistical properties but requires fitting both models)
For categorical variables with multiple indicator terms, evaluate the overall significance of all indicators together, not individual terms.
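For nested Gaussian (linear regression) models, the LRT statistic can be computed from the two residual sums of squares. A sketch (numpy; the chi-squared critical value 3.84 for 1 df at alpha = 0.05 is hard-coded rather than looked up):

```python
import numpy as np

def gaussian_loglik(rss, n):
    """Maximised Gaussian log-likelihood of an OLS fit, as a function of its RSS."""
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

rng = np.random.default_rng(11)
n = 300
x1, x2 = rng.normal(size=(2, n))
y = 0.6 * x1 + 0.4 * x2 + rng.normal(size=n)

def rss(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

X_reduced = np.column_stack([np.ones(n), x1])          # nested within the full model
X_full = np.column_stack([np.ones(n), x1, x2])
lrt = 2 * (gaussian_loglik(rss(X_full), n) - gaussian_loglik(rss(X_reduced), n))
print(round(lrt, 1))   # compare to the chi-squared critical value 3.84 (1 df, alpha = 0.05)
```

The degrees of freedom for the test equal the number of parameters dropped, so a categorical variable with several indicator terms is tested with all of its indicators removed at once.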
Information Criteria (AIC & BIC)
For non-nested models, information criteria are used. The general formula is:
IC = −2 lnL + a × s
where s is the number of parameters, lnL is the log-likelihood, and a is a penalty constant. For AIC, a = 2. For BIC, a = ln(n). Smaller values indicate a better model. Because ln(n) exceeds 2 for n ≥ 8, BIC penalises extra parameters more heavily and tends to favour more parsimonious models.
Guidelines for interpreting BIC differences between models:
- 0–<2: Weak evidence
- 2–<6: Positive evidence
- 6–<10: Strong evidence
- ≥10: Very strong evidence
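The generic IC formula makes the AIC/BIC contrast concrete. In the sketch below the log-likelihoods are made-up illustrative numbers, chosen so that the larger model wins on AIC while BIC's heavier ln(n) penalty prefers the smaller one:

```python
import numpy as np

def information_criterion(loglik, n_params, penalty):
    """IC = -2*lnL + a*s, where a = 2 for AIC and a = ln(n) for BIC."""
    return -2 * loglik + penalty * n_params

# Hypothetical log-likelihoods for two candidate models fitted to n = 200 observations
n = 200
aic_small = information_criterion(-310.0, 3, 2)
aic_big = information_criterion(-303.0, 8, 2)
bic_small = information_criterion(-310.0, 3, np.log(n))
bic_big = information_criterion(-303.0, 8, np.log(n))

print(aic_small, aic_big)                        # AIC prefers the bigger model here
print(round(bic_small, 1), round(bic_big, 1))    # BIC prefers the smaller model
```

The disagreement is not an error: the two criteria encode different penalties for complexity, so the choice between them should reflect how strongly you value parsimony.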
Adjusted R² & Mallow’s Cp
Adjusted R² rewards variance explained while penalising unnecessary complexity; the model that maximises adjusted R² is preferred. Mallow's Cp compares each candidate model against the full model:
Cp = SSE_k / σ̂² − (n − 2(k + 1))
where k is the number of predictors in the candidate model, SSE_k is its error sum of squares, σ̂² is the MSE from the full model, and n is the sample size. Mallow's Cp is a special case of the AIC. The model with the lowest Cp is generally considered the best; a model that fits well has Cp close to k + 1.
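Both criteria are simple functions of residual sums of squares. The sketch below (numpy; simulated data in which x1 and x2 matter and x3 does not) computes adjusted R² and Cp for three nested candidates; note that Cp for the full model equals k + 1 by construction:

```python
import numpy as np

def adj_r2(rss, tss, n, k):
    """Adjusted R-squared for a model with k predictors."""
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

def mallows_cp(rss_k, mse_full, n, k):
    """Mallow's Cp = SSE_k / MSE_full - (n - 2(k + 1))."""
    return rss_k / mse_full - (n - 2 * (k + 1))

rng = np.random.default_rng(12)
n = 250
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.7 * x1 + 0.5 * x2 + rng.normal(size=n)   # x3 is irrelevant

def fit_rss(cols):
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

tss = float(np.sum((y - y.mean()) ** 2))
mse_full = fit_rss([x1, x2, x3]) / (n - 4)
results = {}
for label, cols in [("x1", [x1]), ("x1+x2", [x1, x2]), ("x1+x2+x3", [x1, x2, x3])]:
    k = len(cols)
    results[label] = (adj_r2(fit_rss(cols), tss, n, k),
                      mallows_cp(fit_rss(cols), mse_full, n, k))
print({lab: (round(a, 3), round(c, 1)) for lab, (a, c) in results.items()})
```

Dropping the important x2 inflates Cp dramatically, while dropping the irrelevant x3 leaves both criteria essentially unchanged, which is exactly the behaviour a selection criterion should have.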
Specifying the Selection Strategy
Once criteria are established, there are several strategies for selecting which variables to include in the final model.
Backward elimination is generally preferred over forward selection because each predictor is evaluated in the context of all others. However, the most important point is to combine statistical procedures with subject matter knowledge: retain known confounders and primary predictors regardless of statistical criteria, and always build a causal model first.
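A backward elimination loop can encode both kinds of criteria at once if the primary exposure and known confounders are forced to stay regardless of significance. A sketch (numpy; Wald-type |t| statistics with a normal-approximation threshold; all variable names are illustrative):

```python
import numpy as np

def backward_eliminate(cols, names, y, forced, t_threshold=1.96):
    """Repeatedly drop the weakest non-forced predictor (|t| below threshold).
    Predictors in `forced` (primary exposures, known confounders) are never removed."""
    cols, kept = list(cols), list(names)
    n = len(y)
    while True:
        X = np.column_stack([np.ones(n)] + cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        t = np.abs(beta[1:] / se[1:])                    # skip the intercept
        candidates = [(t[i], i) for i in range(len(kept)) if kept[i] not in forced]
        if not candidates:
            return kept
        t_min, idx = min(candidates)
        if t_min >= t_threshold:
            return kept
        cols.pop(idx)
        kept.pop(idx)

rng = np.random.default_rng(13)
n = 400
smoke, conf, noise = rng.normal(size=(3, n))
y = 0.5 * smoke + 0.4 * conf + rng.normal(size=n)
kept = backward_eliminate([smoke, conf, noise], ["smoke", "conf", "noise"],
                          y, forced={"smoke"})
print(kept)
```

Because each predictor is tested in the context of all remaining predictors, this avoids the masking problem that forward selection can suffer from; the `forced` set is where subject matter knowledge enters the algorithm.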
✔ Check Your Understanding
1. If an interaction term between variables A and B is included in a regression model, which of the following must also be true?
2. What is the key difference between AIC and BIC?
3. Why is backward elimination generally preferred over forward selection?
✎ Reflection
Reflect on the tension between statistical model selection (AIC, BIC, stepwise methods) and subject matter knowledge. Why might a model selected purely by statistical criteria fail to answer your research question?
Lesson 3 — Final Assessment
This assessment covers all sections of Lesson 3. You must answer all 15 questions correctly to complete the lesson. Read each question carefully and review the feedback for any incorrect answers before retrying.
✎ Final Reflection
Now that you have completed all four sections, summarise the key steps you would follow when building a regression model for a real-world epidemiological study. What role does subject matter knowledge play at each step?
✔ Final Assessment
1. What are the two broad objectives of building a regression model?
2. Why should parsimony guide model building?
3. What is the purpose of drawing a causal diagram before model building?
4. Which technique is used to evaluate whether related predictors can be combined into a single index?
5. What is the main difference between PCA and factor analysis?
6. Under the MCAR assumption, complete case analysis:
7. What does MNAR mean in the context of missing data?
8. What is the local-influence property of smoothed lines?
9. Why are fractional polynomials useful?
10. What are knot points in the context of splines?
11. Which of the following is NOT a non-statistical reason to retain a variable in the model?
12. In the formula IC = −2 lnL + a × s, what does ‘a’ equal for BIC?
13. Why is backward elimination generally preferred over forward selection?
14. A BIC difference of 8 between two non-nested models suggests:
15. When should three-way interaction terms be included in a model?