Model-Building Strategies

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Develop a full (maximal) model incorporating biological understanding of the system under study
Carry out procedures to reduce a large number of predictors to a manageable subset
Address issues related to the functional form of continuous predictors and missing values
Build regression-type models using both statistical and non-statistical criteria
Evaluate the reliability of a regression-type model
Present the results from an analysis in a meaningful way

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Model Building The iterative process of choosing predictors, interactions, and functional forms for a regression model. Should be guided by the research question, prior knowledge, and DAGs, not by data dredging.

Parsimony The principle that, all else equal, simpler models are preferred over more complex ones. Parsimonious models tend to generalise better to new data.

Overfitting When a model captures noise as if it were signal, performing well on the training data but poorly on new observations. Mitigated by parsimony, regularisation, and cross-validation.

Underfitting When a model is too simple to capture the systematic structure of the data, leading to high bias.

Confounding Mixing of effects between an exposure and an extraneous variable that is associated with both exposure and outcome. Adjusting for confounders is a primary motivation for multivariable models.

Mediator A variable on the causal pathway between exposure and outcome. Adjusting for a mediator removes the part of the effect that passes through it, which is undesirable when the target is the total effect.

Effect Modifier A variable across whose levels the effect of the exposure on the outcome differs. Modeled with an interaction term; reported by stratum rather than collapsed.

Hierarchical Well-Formulated Model A model that includes all lower-order terms implied by any included higher-order term (e.g., if X₁X₂ is in the model, both X₁ and X₂ should be).

Methods & Statistical Concepts

Forward Selection A stepwise procedure that starts with no predictors and adds them one at a time, selecting whichever improves a chosen criterion (e.g., AIC, p-value) the most until none qualify.

Backward Elimination A stepwise procedure that starts with all candidate predictors and removes them one at a time, at each step dropping the predictor whose removal most improves the criterion, until no further removal helps.

Stepwise Selection A combination of forward and backward steps. Widely used historically but criticised for inflated Type I error, biased coefficients, and irreproducibility.

AIC (Akaike Information Criterion) A model-fit criterion: AIC = −2·log-likelihood + 2k, where k is the number of estimated parameters. Lower is better. Balances fit against complexity; favors models that generalise (Akaike, 1974).

BIC (Bayesian Information Criterion) Like AIC but with a stiffer penalty for complexity: BIC = −2·log-likelihood + k·log(n), where n is the sample size. Because log(n) exceeds 2 once n ≥ 8, BIC tends to select more parsimonious models than AIC (Schwarz, 1978).

Likelihood Ratio Test (LRT) A test comparing two nested models via −2·log(L₀/L₁), which under the null hypothesis approximately follows a χ² distribution with degrees of freedom equal to the number of added parameters. Used to test whether added terms improve fit.

Change-in-Estimate A confounder-selection rule: include a variable if its addition changes the exposure coefficient by more than a threshold (often 10%). Less arbitrary than p-value selection.

Cross-Validation A resampling procedure (e.g., k-fold) that holds out part of the data, fits the model on the rest, and measures performance on the held-out subset. Estimates out-of-sample error.

Lasso Regression A penalised regression that adds an L1 penalty on coefficient magnitudes, shrinking some coefficients to exactly zero. Performs variable selection automatically (Tibshirani, 1996).

Ridge Regression A penalised regression with an L2 penalty on coefficient magnitudes. Shrinks coefficients toward zero (but not exactly to zero); helpful with multicollinearity.

Elastic Net A penalised regression combining L1 (lasso) and L2 (ridge) penalties. Balances variable selection with stability when predictors are correlated.

Purposeful Selection A model-building strategy (Hosmer-Lemeshow) combining substantive theory, screening, change-in-estimate checks, and assessment of confounding/interaction.

Key People

Hirotugu Akaike (1927–2009) Japanese statistician who introduced the Akaike Information Criterion (AIC) in 1973, providing an information-theoretic basis for model selection (Akaike, 1974).

Robert Tibshirani (1956– ) Canadian statistician who introduced the lasso (Tibshirani, 1996) and co-authored landmark texts on statistical learning. A leading figure in modern regularised regression.

Section 3

Effects of Continuous Predictors

⏱ Estimated time: 25 minutes

Section 3 of 4

Scatterplots, smoothers, categorisation, polynomials, fractional polynomials, and splines.

Visualising first

Scatterplots and smoothed lines

A smoothed line reveals the shape of the predictor–outcome relationship without imposing a parametric form.

Categorisation

When it helps, when it hurts

Potential use

Avoids the linearity assumption. About five categories suffice to control for confounding effects.

The costs

Discards within-category variation. Cutpoints are arbitrary. Data-driven cutpoints bias results. Biological relationships are rarely step-functions.

Polynomial models

Integer and fractional powers

Adding power terms allows the regression line to curve. Centring the predictor before squaring it reduces collinearity between the original variable and its powers.

Quadratic (centred)

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1}(x - \bar{x}) + \color{#047857}{\beta_2}(x - \bar{x})^2 + \color{#BE185D}{\varepsilon} \]

Y outcome β₀ intercept β₁ linear coefficient β₂ quadratic coefficient ε error

Fractional polynomials allow powers from the set \(\{-2,\,-1,\,-0.5,\,0,\,0.5,\,1,\,2,\,3\}\), where power 0 means the natural log. They are very flexible with few terms; because the coefficients are hard to read directly, present the fitted curve graphically.

Splines

Piecewise smooth functions

In the absence of prior evidence, knots are placed at the 25th, 50th, and 75th percentiles. Restricted cubic splines constrain the tails to be linear.

Carry forward

Which approach to choose

Inspect first

Always plot the predictor against the outcome with a smoother before modelling. The shape guides the choice.

Preferred in practice

Restricted cubic splines are well-behaved at the tails and flexible. Fractional polynomials are compact. Both beat naive categorisation.

Introduction and Overview

An earlier section reduced your candidate-predictor list. This section zooms into how to model the predictors that survived, specifically the continuous ones. Linear regression assumes a linear relationship between predictor and outcome by default, but real relationships are often curved. Categorisation, polynomials, and splines are three different ways to allow curvature; each has its own trade-offs.

Learning Objectives

Use scatterplots and smoothed lines (e.g., LOESS) to inspect the shape of predictor–outcome relationships before modelling.
Decide when categorising a continuous predictor helps interpretation and when it discards too much information.
Fit polynomial terms to capture simple curvature, and recognise their limits at the tails of the data.
Use splines (linear, restricted cubic) to model flexible non-linear relationships without losing interpretability.
Compare these approaches and choose one based on the analytic question, sample size, and audience.

Evaluating Continuous Predictor–Outcome Relationships

It is important to evaluate the structure of the relationship between a continuous predictor and the outcome before starting model building. The underlying assumption of linearity can be evaluated through diagnostics after fitting the model, but it is useful to explore the nature of the relationship beforehand.

Key Approaches

Four main approaches to evaluating the effect of continuous predictors are: (1) scatterplots and smoothed line plots, (2) categorising the continuous variable, (3) exploring polynomial models, and (4) using splines.

Scatterplots & Smoothed Lines

Scatterplots are 2-way plots of the outcome (Y-axis) versus the continuous predictor (X-axis). They are most useful for continuous outcomes; scatterplots of dichotomous outcomes present as two lines of dots and are rarely informative by themselves.

Scatterplots can be greatly improved by adding a smoothed line through the centre of the data. All smoothed lines have a local-influence property: the position of the line at any value of x is influenced by nearby points but not by distant points.

Types of Smoothed Lines

There are several types of smoothed line functions:

Running mean smoother: Computes a simple average of y values in the neighbourhood
Running line smoother: Fits a simple linear regression through observations in the neighbourhood
Lowess smoother: Fits a weighted linear regression where points closer to x_i receive larger weight (using tricube weighting)
Local polynomial smoother: Fits a weighted polynomial regression in the neighbourhood

The bandwidth controls the size of the neighbourhood. A bandwidth of 0.8 means that the nearest 80% of the observations are used to compute the smoothed value at each point. Larger bandwidths produce smoother lines but may miss important features.

Caution with Extreme Values

All smoothed line functions can have problems at the extreme values of the predictor distribution. This is because the neighbourhood at the tails is not symmetrical and contains relatively few data points. It is important not to pay much attention to the extremes of the fitted line. Vertical dashed lines marking the 2.5th and 97.5th percentiles can help delineate where most of the data falls.
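As a rough sketch of the ideas above, the following base-R code draws a scatterplot with a lowess line and marks the 2.5th and 97.5th percentiles of the predictor. The birthwt data (MASS package), the choice of lwt as the predictor, and the 0.8 bandwidth are illustrative assumptions rather than part of the course text.

```r
library(MASS)                      # provides the birthwt data
data("birthwt", package = "MASS")

# Lowess fit of birth weight on maternal weight; f is the bandwidth
# (the fraction of observations used for each local fit).
smooth_fit <- lowess(birthwt$lwt, birthwt$bwt, f = 0.8)

plot(birthwt$lwt, birthwt$bwt,
     xlab = "Maternal weight at last menstrual period (lb)",
     ylab = "Birth weight (g)")
lines(smooth_fit, lwd = 2)

# Dashed lines mark where the central 95% of the predictor values fall;
# outside them the smoothed line rests on few observations.
tails <- quantile(birthwt$lwt, probs = c(0.025, 0.975))
abline(v = tails, lty = 2)
```

Re-running the sketch with a smaller bandwidth (for example f = 0.3) shows how the line becomes more wiggly as the neighbourhood shrinks.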

Categorising Continuous Predictors

The assumption of linearity can be avoided by converting the continuous predictor into categories. However, this is generally not advisable for three reasons:

Categorisation involves the loss of information
It is unlikely that biological processes have a step-function relationship (i.e., sudden changes at specific cutpoints)
The choice of cutpoints is arbitrary and, if data-driven, may lead to biased results

That said, about 5 categories will usually suffice to control for confounding effects. A model with a categorised variable can be compared to one with a continuous (linear) variable using AIC or BIC.
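The comparison described above might look like the following sketch. It assumes the birthwt data from the MASS package, uses maternal weight (lwt) as the continuous predictor, and places the cutpoints at the quintiles of the predictor alone, so they are not tuned to the outcome. All variable names are illustrative.

```r
library(MASS)
data("birthwt", package = "MASS")
bw <- birthwt

# Five categories with cutpoints at the quintiles of maternal weight
cuts <- quantile(bw$lwt, probs = seq(0, 1, by = 0.2))
bw$lwt_cat <- cut(bw$lwt, breaks = cuts, include.lowest = TRUE)

m_linear <- lm(bwt ~ lwt,     data = bw)   # one slope
m_cat    <- lm(bwt ~ lwt_cat, data = bw)   # four indicator terms

# Same outcome and same rows, so the criteria are comparable; lower is better
ic_table <- cbind(AIC(m_linear, m_cat), BIC = BIC(m_linear, m_cat)$BIC)
ic_table
```

The categorised model spends four degrees of freedom where the linear model spends one, so it needs a clearly better fit to win on either criterion.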

Polynomial Models

Polynomials allow the regression line to follow a curve rather than a straight line. Power terms (e.g., x² or x³) are added to the model. Unlike smoothed lines, polynomial models have a global-influence property: the shape of the entire line is influenced by all the data.

Quadratic model

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1} \color{#1D4ED8}{x} + \color{#047857}{\beta_2} \color{#1D4ED8}{x}^2 + \color{#BE185D}{\varepsilon} \]

The outcome is the intercept plus a linear term in the predictor plus a quadratic term in its square, allowing a curved relationship, plus error.

⚠ Centring to Avoid Collinearity

The original variable (x) is often highly correlated with its squared term (x²), creating collinearity. The solution is to centre the variable by subtracting the mean before squaring. If a quadratic model is insufficient (i.e., the quadratic term is significant but the fit is still poor), a cubic term (x³) can be added.
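A small check of the centring argument, sketched here with maternal age from the birthwt data (MASS package) as an assumed example. It compares the correlation between the predictor and its square before and after centring, and confirms that centring only reparameterises the model.

```r
library(MASS)
data("birthwt", package = "MASS")
age   <- birthwt$age
age_c <- age - mean(age)            # centred predictor

cor_raw     <- cor(age,   age^2)    # raw variable vs its square
cor_centred <- cor(age_c, age_c^2)  # centred variable vs its square
c(raw = cor_raw, centred = cor_centred)

# Centring is a reparameterisation: the fitted curve does not change
m_raw     <- lm(birthwt$bwt ~ age   + I(age^2))
m_centred <- lm(birthwt$bwt ~ age_c + I(age_c^2))
all.equal(unname(fitted(m_raw)), unname(fitted(m_centred)))
```

What changes is the meaning and stability of the coefficients: in the centred model the linear coefficient is the slope at the mean age.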

Fractional Polynomials

Fractional polynomials (FPs) extend the idea of polynomial models by allowing power terms that are not restricted to positive integers. The most common set of powers to consider is: −2, −1, −0.5, 0 (= ln), 0.5, 1, 2, 3. A 2-degree FP (one with two power terms) can fit a wide range of non-linear shapes and may be the most parsimonious way to model non-linearity.

📊 Example: Birth Weight vs Gestation Length

A quadratic model regressing birth weight on centred gestation length showed R² = 0.29. When fractional polynomials were explored, the best-fitting 2-degree FP used the power 3 twice, giving the terms gest³ and gest³ × ln(gest); it yielded R² = 0.30 and fit significantly better than the linear, quadratic, or cubic models. The FP coefficients are not directly interpretable; the only way to make sense of such a model is to display the function graphically.
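To make the search over powers concrete, here is a minimal sketch of a one-term (degree-1) fractional polynomial search in base R. It is not the gestation-length analysis quoted above: the data (birthwt from MASS), the predictor (lwt), the rescaling, and the helper fp_term() are assumptions for illustration, and the candidate powers are held in one vector so the set can be edited. A 2-degree search would also loop over pairs of powers, including repeated ones; dedicated packages such as mfp automate that.

```r
library(MASS)
data("birthwt", package = "MASS")

# Candidate powers (assumed set; power 0 stands for the natural log)
powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)

# Transform a strictly positive predictor by power p
fp_term <- function(x, p) if (p == 0) log(x) else x^p

y <- birthwt$bwt
x <- birthwt$lwt / 100   # rescaled to keep the powers numerically tame

# Fit one single-term model per candidate power and record its AIC
aic_by_power <- sapply(powers, function(p) AIC(lm(y ~ fp_term(x, p))))
names(aic_by_power) <- powers

best_power <- powers[which.min(aic_by_power)]
round(aic_by_power, 1)
best_power
```

Because the power is chosen from the data, the selected model's p-values are somewhat optimistic, which is one more reason to report the fitted curve rather than the coefficients.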

Splines

An alternative to polynomial models is to fit a piecewise linear function. Points where the slope changes are called knot points. In the absence of prior evidence, knots may be chosen at percentiles of the predictor (e.g., 25th, 50th, 75th). Cubic splines allow for smoother transitions across knots compared to linear splines, producing more biologically plausible curves; the same logic extends to generalised additive models (Hastie & Tibshirani, 1986).
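A restricted cubic spline can be sketched with the splines package that ships with R: ns() builds a natural cubic spline basis, which is linear beyond the boundary knots. The data set (birthwt from MASS), the predictor (lwt), and the quartile knots are assumptions chosen to mirror the text.

```r
library(MASS)
library(splines)                   # ships with R
data("birthwt", package = "MASS")

# Interior knots at the quartiles of the predictor
knots <- quantile(birthwt$lwt, probs = c(0.25, 0.50, 0.75))

m_linear <- lm(bwt ~ lwt, data = birthwt)
m_spline <- lm(bwt ~ ns(lwt, knots = knots), data = birthwt)

# The straight line is nested in the spline model, so a partial F-test
# asks whether the extra flexibility is needed
spline_test <- anova(m_linear, m_spline)
spline_test

# Plot the fitted curve rather than reading the spline coefficients
grid <- data.frame(lwt = seq(min(birthwt$lwt), max(birthwt$lwt),
                             length.out = 100))
plot(birthwt$lwt, birthwt$bwt,
     xlab = "Maternal weight (lb)", ylab = "Birth weight (g)")
lines(grid$lwt, predict(m_spline, newdata = grid), lwd = 2)
```

With three interior knots the spline costs four coefficients plus the intercept, a modest price for a curve that stays well-behaved in the tails.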

Approach	Influence	Strengths	Limitations
Smoothed lines	Local	Flexible; reveals non-linearity	Cannot be used in model itself; issues at extremes
Categorisation	N/A	Avoids linearity assumption	Loses information; arbitrary cutpoints
Polynomials	Global	Simple to implement; formal tests	May over-fit at extremes; collinearity
Fractional polynomials	Global	Very flexible with few terms	Coefficients not directly interpretable
Splines	Local	Flexible; smooth transitions	Sudden shifts at knots (linear splines)

✎ Reflection

Think about a continuous predictor in your field. Would you expect the relationship with the outcome to be linear? If not, which approach (categorisation, polynomials, fractional polynomials, or splines) would you choose and why?

Model answer: For age vs. all-cause mortality, the relationship is decidedly non-linear: risk accelerates with age. Choice: restricted cubic splines (3–5 knots placed at percentiles of the age distribution) are typically best because they (a) capture non-linearity flexibly without forcing a specific functional form, (b) are stable at the edges of the data (unlike polynomials, which can have wild tails), and (c) yield interpretable plots of the dose-response. Categorisation is the worst choice (information loss, threshold effects, instability of effect at the cut-point); polynomials work for simple curves but extrapolate badly; fractional polynomials are a compromise (data-driven choice of exponents) but less interpretable than splines. The right defence is to plot the spline fit overlaid on the data and present the dose-response curve as a deliverable rather than only regression coefficients.

Section 4

Interactions & Building the Model

⏱ Estimated time: 25 minutes

Section 4 of 4

Interaction specification, moderation, model-selection criteria, and selection strategies.

Interaction specification

Five strategies for two-way interactions

All pairs (few predictors)

Significant main effects only

Significant unconditional associations

Biologically motivated pairs

Exposure × covariate only

Rule: if an interaction is in the model, both constituent main effects must be included. Bonferroni adjustment is warranted when many interactions are tested.

Moderation

Statistical interaction with a substantive story

Mediation (from DAG)

X acts through M to reach Y. Answers: through what?

Moderation (interaction)

W changes the strength of the X → Y relationship. Answers: for whom, or under what conditions?

A p-value detects an interaction. Only subject-matter knowledge supplies the story that makes it worth keeping.

Moderation in R

Does maternal age modify the smoking effect?

The test

lm(bwt ~ smoke * age) versus the main-effects model, compared with a partial F-test: F = 5.19, p = 0.024. The interaction earns its place.

The follow-up

interact_plot() draws two non-parallel age slopes; sim_slopes() with Johnson-Neyman shows the range of maternal age over which the smoking effect is distinguishable from zero.

Selection criteria

AIC, BIC, and tests for nested models

Information criterion (general form)

\[ \color{#0B7B6B}{\text{IC}} = -2\ln\color{#C2410C}{\hat{L}} + \color{#6D28D9}{a} \times \color{#1D4ED8}{s} \]

IC information criterion L̂ maximised likelihood a penalty per parameter s number of parameters

Where \(s\) is the number of parameters and \(a\) is the penalty constant. For AIC, \(a = 2\) (Akaike, 1974). For BIC, \(a = \ln(n)\) (Schwarz, 1978). Smaller IC indicates a better model.

BIC differences: 0–2 = weak evidence; 2–6 = positive; 6–10 = strong; ≥10 = very strong.

Additional criteria

Adjusted R² and Mallows' Cₚ

Adjusted R-squared

\[ \color{#0B7B6B}{R^2_{\text{adj}}} = 1 - \frac{(\color{#C2410C}{n}-1)}{(\color{#C2410C}{n}-\color{#1D4ED8}{k}-1)}(1 - \color{#6D28D9}{R^2}) \]

R²_adj adjusted R-squared n sample size k number of predictors R² unadjusted R-squared

Where \(k\) is the number of predictors. Adjusted R² penalises complexity; maximise it. Mallows' Cₚ is a special case of AIC; minimise it.

Selection strategy

Backward, forward, and the dangers of stepwise

Backward elimination

Start with all predictors; remove one at a time. Each variable is evaluated in context. Preferred by Dohoo, Martin & Stryhn.

Avoid pure stepwise

Data-driven search overfits, produces optimistic standard errors, and yields models that fail to replicate (Babyak, 2004).

Combine any automated strategy with a protected list: primary predictors and known confounders are never removed by the algorithm.

Introduction and Overview

Earlier sections settled which predictors enter the model and how each one is shaped. This section closes the lesson with the most consequential remaining design choices: which interaction terms (if any) to include, how to think about moderation as a substantive question, and which selection criterion to use when several plausible models compete.

Learning Objectives

Identify interaction terms worth testing using subject-matter knowledge before searching the data.
Distinguish statistical interaction from substantive moderation and explain why the latter requires a substantive story rather than only a p-value.
Compare model-selection criteria (adjusted R², AIC, BIC, cross-validation) and explain what each rewards.
Specify a model-building strategy (backward, forward, all-subsets, or theory-driven) and defend the choice.
Recognise the dangers of data-driven selection (overfitting, optimistic standard errors) and how to mitigate them (Babyak, 2004).

Identifying Interaction Terms

It is important to consider including interaction terms when specifying the maximum model. There are five general strategies for creating and evaluating 2-way interactions:

Strategy 1: Evaluate All Possible 2-Way Interactions

This is feasible only when the total number of predictors is small (e.g., ≤ 8). You create and test an interaction term for every possible pair of predictors; with 8 predictors that is already 28 terms.

Strategy 2: Interactions Among Significant Main Effects

After building the final main-effects model, create interactions among all predictors that are statistically significant. This reduces the number of interactions to evaluate but may miss interactions with non-significant main effects.

Strategy 3: Interactions Among Unconditionally Associated Predictors

Create interactions among all predictors that have a significant unconditional association with the outcome. This casts a wider net than Strategy 2.

Strategy 4: Theory-Driven Interactions

Only create interactions among pairs of variables you suspect (based on evidence from the literature or biological reasoning) might interact. This usually focuses on interactions involving the primary predictor(s) of interest and important confounders.

Strategy 5: Exposure-Only Interactions

Only create interactions that involve the exposure variable (primary predictor of interest). This is the most conservative approach but may miss important interactions among covariates.

⚠ Important Rules for Interactions

If an interaction term is included in the model, the main effects that make it up must also be included. Evaluating many interactions increases the risk of identifying spurious associations, so a Bonferroni adjustment or similar correction may be warranted. Three-way interactions are usually very difficult to interpret and should be included only if there is strong a priori reason.
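The arithmetic behind these rules is easy to check in R. The sketch below counts the 2-way interactions among 8 predictors (an illustrative number) and applies a Bonferroni correction; the three p-values are made up purely to show the mechanics.

```r
n_predictors <- 8                          # illustrative value
n_pairs <- choose(n_predictors, 2)         # number of 2-way interactions
n_pairs                                    # 28

alpha <- 0.05
bonferroni_threshold <- alpha / n_pairs    # per-test threshold
bonferroni_threshold

# Equivalent view: inflate the p-values and compare them with alpha.
# These p-values are invented for illustration only.
p_raw <- c(0.001, 0.004, 0.030)
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_pairs)
p_adj < alpha
```

Only the smallest of the three invented p-values survives the correction, which illustrates why screening every pair demands much stronger evidence per interaction.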

Moderation: Interactions With a Substantive Story

In the regression literature, an interaction is often described as moderation when it captures a substantive claim that the effect of one variable depends on another. A moderator W changes the slope of X → Y. Mathematically it is identical to an interaction (Y ~ X * W); the difference is conceptual.

Mediation vs. moderation, one more time

In an earlier section we used a DAG to set up mediation (X → M → Y, where M is on the causal pathway). Moderation is structurally different: W does not sit between X and Y, it changes the strength of the X → Y arrow. Mediation answers through what?; moderation answers for whom, or under what conditions?

R Activity: detect and visualise moderation in R

Using the built-in birthwt dataset (MASS package), we ask: does maternal age modify the effect of smoking on birth weight? If yes, smoking matters more (or less) for younger versus older mothers, a question about for-whom, not through-what.

# install.packages(c("MASS", "ggplot2", "interactions"))
library(MASS);  library(ggplot2);  library(interactions)
data("birthwt", package = "MASS")
bw <- birthwt
bw$smoke <- factor(bw$smoke, levels = c(0, 1),
                   labels = c("non-smoker", "smoker"))

# 1. Main-effects model (no moderation)
m_main <- lm(bwt ~ smoke + age, data = bw)

# 2. Moderation model: smoke * age
m_mod  <- lm(bwt ~ smoke * age, data = bw)

# 3. Is the moderation needed? Compare the nested models. For linear models,
#    anova() reports the partial F-test.
anova(m_main, m_mod)

summary(m_mod)

# 4. Plot the moderation: birth weight against age, one line per smoking group
interact_plot(m_mod, pred = age, modx = smoke,
              interval = TRUE) +
  labs(y = "Birth weight (g)",
       title = "Effect of maternal age on birth weight, by smoking status")

# 5. Simple-slopes / Johnson-Neyman: across the range of maternal age,
#    where is the effect of smoking on birth weight distinguishable from zero?
sim_slopes(m_mod, pred = smoke, modx = age, johnson_neyman = TRUE)

Console output (truncated)

Analysis of Variance Table

Model 1: bwt ~ smoke + age
Model 2: bwt ~ smoke * age
  Res.Df      RSS Df Sum of Sq    F Pr(>F)
1    186 95672288
2    185 93062605  1   2609683 5.19 0.0239 *

Coefficients (m_mod):
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       2406.1      292.2    8.23   <0.001
smokesmoker        798.2      484.3    1.65    0.101
age                 27.7       12.1    2.28    0.024
smokesmoker:age    -46.6       20.4   -2.28    0.024

Reading the moderation. The interaction term smokesmoker:age is negative and significant: as maternal age rises, smoking is linked to a larger drop in birth weight, so the smoker-versus-non-smoker deficit grows rather than shrinks with age. interact_plot() shows this as two non-parallel birth-weight-versus-age lines, one per smoking group; sim_slopes() with pred = smoke reports the estimated smoking effect at chosen maternal ages, each with a confidence interval.

Activity: change the moderator from age to lwt (mother’s weight at last menstrual period). Does the effect of smoking on birth weight depend on maternal weight? Defend your answer with both the partial F-test and the plot.

Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From anova(m_main, m_mod), what is the F statistic and p-value for adding the smoke:age interaction? Does adding the interaction significantly improve fit, and what does that tell you about whether age moderates the smoking-birthweight relationship?

Model answer: anova(m_main, m_mod) gives F = 5.19 on 1 and 185 degrees of freedom, p = 0.024. Because p is below 0.05, adding the smoke:age interaction significantly improves fit. That is evidence of effect modification: the smoking effect on birth weight depends on maternal age. Retaining the interaction is supported by both the significant test and the substantive plausibility that smoking-related risk differs across maternal age.

2. In summary(m_mod), what is the smokesmoker:age coefficient and its sign? In one plain-English sentence, describe how the smoking-vs-non-smoking gap in birthweight changes as maternal age increases.

Model answer: The smokesmoker:age coefficient in m_mod is about −46.6 g per additional year of maternal age (p = 0.024). The non-smoker age slope is about +27.7 g/year, so the smoker slope is roughly 27.7 − 46.6 ≈ −19 g/year. Plain English: smoking's harm to birth weight is more pronounced in older mothers than in younger mothers, so the smoker-versus-non-smoker deficit grows as maternal age increases.

3. From sim_slopes(..., johnson_neyman = TRUE), identify the range of maternal age over which the effect of smoking on birthweight is statistically significant. Outside that region, what would you conclude about smoking's effect for that subgroup?

Model answer: With sim_slopes(pred = smoke, modx = age, johnson_neyman = TRUE), the Johnson-Neyman boundary falls near age 22: the effect of smoking on birth weight is statistically significant for mothers older than about 22 and is not distinguishable from zero below that. For younger mothers we cannot confirm a smoking effect in these data, but we also cannot rule one out; the boundary reflects statistical power (confidence intervals widen where smokers are sparse), not a biological threshold.
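The coefficients in the console output can also be turned into a worked example. With smoking coded 0/1, the fitted interaction model implies a smoking effect that is a linear function of maternal age. The ages 20 and 30 are chosen purely for illustration, and the coefficients are the rounded values from the output, so the results are approximate point estimates.

\[ \widehat{\text{bwt}} = 2406.1 + 798.2\,\text{smoke} + 27.7\,\text{age} - 46.6\,(\text{smoke} \times \text{age}) \]

\[ \Delta_{\text{smoke}}(\text{age}) = 798.2 - 46.6 \times \text{age} \]

\[ \Delta_{\text{smoke}}(20) = 798.2 - 932.0 = -133.8 \text{ g}, \qquad \Delta_{\text{smoke}}(30) = 798.2 - 1398.0 = -599.8 \text{ g} \]

The point estimate changes sign near age 798.2 / 46.6 ≈ 17.1, but a point estimate says nothing about precision; the confidence intervals reported by sim_slopes() are what determine where the effect is distinguishable from zero.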

Building the Model: Selection Criteria

Once the maximum model has been specified, you need to decide how to determine which predictors to retain. Both non-statistical and statistical criteria should be considered.

Non-Statistical Considerations

Variables should be retained in the model if they:

Are a primary predictor of interest
Are thought a priori to be confounders for the primary predictor
Show evidence of being a confounder (their removal causes a substantial change in the coefficient of interest)
Are a component of an interaction term included in the model

Statistical Criteria for Nested Models

Models where one model’s predictors are a subset of another’s are called nested models. Tests for nested models include:

Partial F-test (for linear regression)
Wald test (most commonly used; can be unreliable if P-values are near 0.05 or SEs appear suspect)
Likelihood-ratio test (LRT) (has the best statistical properties but requires fitting both models)

For categorical variables with multiple indicator terms, evaluate the overall significance of all indicators together, not individual terms.

In plain terms, each of these tests asks the same question: do the extra terms in the larger model explain enough additional variation to be worth the degrees of freedom they cost? A small p-value says yes.
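For models fitted by maximum likelihood, the likelihood-ratio test can be computed by hand and checked against anova(). The sketch below assumes the birthwt data (MASS package) and two nested logistic models for the low-birth-weight indicator; the choice of covariates is illustrative only.

```r
library(MASS)
data("birthwt", package = "MASS")

# Nested logistic models for low birth weight (low = 1 if under 2.5 kg)
m_small <- glm(low ~ smoke + lwt,       family = binomial, data = birthwt)
m_large <- glm(low ~ smoke + lwt + age, family = binomial, data = birthwt)

# Likelihood-ratio statistic by hand: twice the gain in log-likelihood
lrt_stat <- as.numeric(2 * (logLik(m_large) - logLik(m_small)))
lrt_df   <- attr(logLik(m_large), "df") - attr(logLik(m_small), "df")
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

# The same test from anova()
lrt_table <- anova(m_small, m_large, test = "Chisq")
lrt_table
c(statistic = lrt_stat, df = lrt_df, p = lrt_p)
```

The Wald test for the same term is the z statistic for age in summary(m_large); the two usually agree closely, and the LRT is the one to trust when they do not.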

Information Criteria (AIC & BIC)

For non-nested models, information criteria are used. The general formula is:

Information criterion (Eq 15.1)

\[ \color{#0B7B6B}{\text{IC}} = -2\ln\color{#C2410C}{\hat{L}} + \color{#6D28D9}{a} \times \color{#1D4ED8}{s} \]

An information criterion trades off model fit (via the maximised likelihood) against complexity: a penalty per parameter times the number of parameters. Lower is better.

Where s is the number of parameters, lnL is the log-likelihood, and a is a penalty constant. For AIC, a = 2 (Akaike, 1974). For BIC, a = ln(n) (Schwarz, 1978). Smaller values indicate a better model. BIC tends to favour more parsimonious models.

Reading the numbers. An AIC or BIC value means nothing on its own; a single value such as 1,240 is neither good nor bad. Only the difference between models carries information, and the comparison is valid only when the models are fit to the same outcome, measured the same way, on the same set of observations. That last condition catches beginners: you cannot compare the AIC of a model for Y against one for log(Y), and quietly dropping rows with missing values changes the sample, which makes the criteria incomparable.

Guidelines for interpreting BIC differences between models:

0–<2: Weak evidence
2–<6: Positive evidence
6–<10: Strong evidence
≥10: Very strong evidence
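To connect the general formula to software output, this sketch computes AIC and BIC by hand from the log-likelihood and compares them with R's built-in functions. The two models reuse the smoking and age example on the birthwt data (MASS package); the helper ic() is a hypothetical name introduced here.

```r
library(MASS)
data("birthwt", package = "MASS")

m1 <- lm(bwt ~ smoke + age, data = birthwt)
m2 <- lm(bwt ~ smoke * age, data = birthwt)

# IC = -2 ln(L) + a * s, with s the number of estimated parameters
ic <- function(model, a) {
  ll <- logLik(model)
  s  <- attr(ll, "df")     # for lm() this count includes the residual variance
  -2 * as.numeric(ll) + a * s
}

n <- nobs(m1)
aic_manual <- c(m1 = ic(m1, a = 2),      m2 = ic(m2, a = 2))
bic_manual <- c(m1 = ic(m1, a = log(n)), m2 = ic(m2, a = log(n)))

# Should agree with the built-in functions
all.equal(unname(aic_manual), AIC(m1, m2)$AIC)
all.equal(unname(bic_manual), BIC(m1, m2)$BIC)

# Only the difference between models is interpretable
diff(bic_manual)   # m2 minus m1; negative values favour m2
```

Both models are fitted to the same outcome on the same rows, which is what makes the difference meaningful.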

Adjusted R² & Mallows' C_p

Adjusted R² measures the proportion of variance explained while penalising unnecessary complexity. The model that maximises adjusted R² is preferred.
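Assuming the usual definition, adjusted R² = 1 − (n − 1)/(n − k − 1) × (1 − R²) with k predictors, the value can be reproduced by hand. The model below (birthwt data, MASS package) is an arbitrary example chosen for illustration.

```r
library(MASS)
data("birthwt", package = "MASS")

m <- lm(bwt ~ smoke + age + lwt, data = birthwt)

n  <- nobs(m)                      # sample size
k  <- length(coef(m)) - 1          # predictors, excluding the intercept
r2 <- summary(m)$r.squared

adj_r2_manual <- 1 - (n - 1) / (n - k - 1) * (1 - r2)
c(manual = adj_r2_manual, built_in = summary(m)$adj.r.squared)
```

Unlike R², the adjusted version can fall when a predictor that explains little is added, because the (n − 1)/(n − k − 1) factor grows with k.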

Mallows' C_p (Eq 15.2)

\[ \color{#0B7B6B}{C_p} = \frac{\sum (\color{#C2410C}{Y} - \color{#6D28D9}{\hat{Y}})^2}{\color{#1D4ED8}{\sigma^2}} - \color{#BE185D}{n} + 2\color{#047857}{k} \]

Mallows' C_p compares the observed and predicted values, scaled by the error variance, then adjusts for the sample size and the number of predictors. Values near k indicate a well-fitting subset.

Where k is the number of predictors in the candidate model, σ² is the MSE from the full model, and n is the sample size. Mallows' C_p is a special case of the AIC for linear regression in which the error variance is treated as known. The model with the lowest C_p is generally considered the best.

Specifying the Selection Strategy

Once criteria are established, there are several strategies for selecting which variables to include in the final model.

All Possible Subsets

Forward Selection

Backward Elimination

Best Practice Summary

Backward elimination is generally preferred over forward selection because each predictor is evaluated in the context of all others. However, the most important point is to combine statistical procedures with subject matter knowledge: retain known confounders and primary predictors regardless of statistical criteria, and always build a causal model first (Heinze, Wallisch, & Dunkler, 2018).
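One way to combine backward elimination with a protected list is the scope argument of step() in base R: terms named in lower can never be dropped. In this sketch the data (birthwt, MASS package), the outcome, and the decision to protect smoke as the primary predictor are all assumptions for illustration; in a real analysis the protected terms should come from the causal model.

```r
library(MASS)
data("birthwt", package = "MASS")

# Full model: smoking is treated as the primary predictor (an assumption
# made for this example); the rest are candidate covariates
m_full <- glm(low ~ smoke + age + lwt + ht + ui + ptl + ftv,
              family = binomial, data = birthwt)

# Backward elimination by AIC; terms in `lower` can never be dropped
m_final <- step(m_full,
                scope = list(lower = ~ smoke, upper = formula(m_full)),
                direction = "backward",
                trace = 0)

formula(m_final)
"smoke" %in% attr(terms(m_final), "term.labels")   # TRUE: protected term kept
```

The standard errors of the selected model still ignore the selection step, so the caveats about data-driven selection apply to the unprotected terms.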

✎ Reflection

Reflect on the tension between statistical model selection (AIC, BIC, stepwise methods) and subject matter knowledge. Why might a model selected purely by statistical criteria fail to answer your research question?

Model answer: Pure statistical model selection (stepwise, lasso (Tibshirani, 1996), AIC/BIC) optimises a goodness-of-fit criterion that does not encode the causal question. The model that minimises AIC may include a mediator (because it predicts the outcome well) or exclude a confounder (because it doesn't add much to the fit), both of which produce biased causal estimates. Stepwise procedures are especially problematic: they over-fit to the sample, produce non-reproducible models, and yield p-values and CIs that are demonstrably wrong (the model itself depends on the data being analysed). Subject-matter knowledge encoded in a pre-specified DAG produces an adjustment set that statistically may not be "the best" by AIC but is the right one for the causal question. The right role for AIC/BIC is in comparing pre-specified models with the same identification strategy, not in choosing which variables to include.

HSCI 410 · Lesson 4
