HSCI 410 — Lesson 4

Model-Building Strategies

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Develop a full (maximal) model incorporating biological understanding of the system under study
  • Carry out procedures to reduce a large number of predictors to a manageable subset
  • Address issues related to the functional form of continuous predictors and missing values
  • Build regression-type models using both statistical and non-statistical criteria
  • Evaluate the reliability of a regression-type model
  • Present the results from an analysis in a meaningful way

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary — Key Terms, People & Concepts

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas
Model Building: The iterative process of choosing predictors, interactions, and functional forms for a regression model. Should be guided by the research question, prior knowledge, and DAGs, not by data dredging.
Parsimony: The principle that, all else equal, simpler models are preferred over more complex ones. Parsimonious models tend to generalise better to new data.
Overfitting: When a model captures noise as if it were signal, performing well on the training data but poorly on new observations. Mitigated by parsimony, regularisation, and cross-validation.
Underfitting: When a model is too simple to capture the systematic structure of the data, leading to high bias.
Confounding: Mixing of effects between an exposure and an extraneous variable that is associated with both exposure and outcome. Adjusting for confounders is a primary motivation for multivariable models.
Mediator: A variable on the causal pathway between exposure and outcome. Adjusting for a mediator removes part of the total effect, which is usually undesirable when estimating a total effect.
Effect Modifier: A variable across whose levels the effect of the exposure on the outcome differs. Modeled with an interaction term; reported by stratum rather than collapsed.
Hierarchical Well-Formulated Model: A model that includes all lower-order terms implied by any included higher-order term (e.g., if X₁X₂ is in the model, both X₁ and X₂ should be).
Methods & Statistical Concepts
Forward Selection: A stepwise procedure that starts with no predictors and adds them one at a time, selecting whichever improves a chosen criterion (e.g., AIC, p-value) the most until none qualify.
Backward Elimination: A stepwise procedure that starts with all candidate predictors and removes them one at a time, dropping whichever predictor's removal most improves the criterion until none can be dropped.
Stepwise Selection: A combination of forward and backward steps. Widely used historically but criticised for inflated Type I error, biased coefficients, and irreproducibility.
AIC (Akaike Information Criterion): A model-fit criterion: AIC = −2·log-likelihood + 2k, where k is the number of estimated parameters. Lower is better. Balances fit against complexity; favors models that generalise (Akaike, 1974).
BIC (Bayesian Information Criterion): Like AIC but with a stiffer penalty for complexity: BIC = −2·log-likelihood + k·log(n), where n is the sample size. Tends to select more parsimonious models than AIC (Schwarz, 1978).
Likelihood Ratio Test (LRT): A test comparing two nested models via −2·log(L₀/L₁), which asymptotically follows a χ² distribution under the null, with degrees of freedom equal to the number of added parameters. Used to test whether added terms improve fit.
Change-in-Estimate: A confounder-selection rule: include a variable if its addition changes the exposure coefficient by more than a threshold (often 10%). Less arbitrary than p-value selection.
Cross-Validation: A resampling procedure (e.g., k-fold) that holds out part of the data, fits the model on the rest, and measures performance on the held-out subset. Estimates out-of-sample error.
Lasso Regression: A penalised regression that adds an L1 penalty on coefficient magnitudes, shrinking some coefficients to exactly zero. Performs variable selection automatically (Tibshirani, 1996).
Ridge Regression: A penalised regression with an L2 penalty on coefficient magnitudes. Shrinks coefficients toward zero (but not exactly to zero); helpful with multicollinearity.
Elastic Net: A penalised regression combining L1 (lasso) and L2 (ridge) penalties. Balances variable selection with stability when predictors are correlated.
Purposeful Selection: A model-building strategy (Hosmer-Lemeshow) combining substantive theory, screening, change-in-estimate checks, and assessment of confounding/interaction.
Key People
Hirotugu Akaike (1927–2009): Japanese statistician who introduced the Akaike Information Criterion (AIC) in 1973, providing an information-theoretic basis for model selection (Akaike, 1974).
Robert Tibshirani (1956– ): Canadian statistician who introduced the lasso (Tibshirani, 1996) and co-authored landmark texts on statistical learning. A leading figure in modern regularised regression.
Section 1

Introduction & Steps in Model Building

⏱ Estimated time: 20 minutes

Introduction and Overview

Lesson 3 walked through linear regression with a fixed set of predictors. Real data rarely arrive with a clean “here are the four predictors you should use” instruction. Lesson 4 turns to the broader question of how to build a defensible model when you have many candidate predictors and need to decide which to include. The four content sections move from goals and frameworks (Section 1), to reducing predictors and handling missing values (Section 2), to modelling continuous predictor–outcome relationships flexibly with categorisation, polynomials, and splines (Section 3), and finally to interactions, moderation, and selection criteria (Section 4). Throughout, the central tension is between data-driven flexibility and theory-driven discipline — and the answer is usually closer to theory than the data alone would suggest (Babyak, 2004; Heinze, Wallisch, & Dunkler, 2018).

Learning Objectives

  • Distinguish prediction-focused from causal-explanation model-building goals and explain how each shapes the strategy.
  • Outline the steps in building a regression model from candidate predictors through final reporting.
  • Use a causal diagram to identify the predictors that must be retained regardless of statistical significance.
  • Recognise when subject-matter knowledge should override what the data alone seem to suggest.

Why Model-Building Strategies Matter

When building a regression model, we need to decide on the goals of the analysis, incorporate both statistical considerations and subject matter knowledge, and balance the desire for parsimony (simplicity) with the desire for a model that “best fits” the data. The definition of “best fit” depends on the goal of the analysis, and the principles discussed in this chapter apply to all types of regression models.

Key Concept

Regression models are generally built to meet one of two broad objectives: (1) to build the best model for predicting future observations, or (2) to understand the causal relationship(s) between predictors and the outcome. The approach to model building differs depending on which goal you are pursuing.

Goals of the Analysis

If the goal is prediction, we generally retain variables even when their relationship with the dependent variable is uncertain, because excluding them might lead to inaccurate predictions when future observations have extreme values for those variables. The coefficients of specific predictors are of little consequence; what matters is overall predictive accuracy. Reporting guidelines such as TRIPOD (Collins, Reitsma, Altman, & Moons, 2015) provide a checklist for transparently documenting prediction models.

If the goal is understanding biological relationships, we want precise estimates of coefficients for the variables of interest. Careful attention must be paid to interaction and confounding effects. Factors likely to be confounders should be retained in the model regardless of statistical significance, while factors that are almost certainly not confounders should generally be excluded—especially if they are intervening variables, as their inclusion may bias results.


Steps in Building a Regression Model

The process of building a regression model follows a systematic set of steps. While statistical software handles the computation, the researcher must make many decisions along the way that require both subject matter expertise and statistical reasoning.

Step 1: Specify the Maximum Model

Identify the outcome variable and determine whether it needs transformation (e.g., natural log). Then identify the full set of predictors to consider. The maximum model includes all possible predictors of interest. While a large model prevents overlooking important predictors, adding too many increases the risks of collinearity and spurious associations. Key sub-steps include: drawing a causal diagram, potentially reducing predictors, considering missing values, evaluating effects of continuous predictors, and deciding on interaction terms.

Step 2: Specify the Selection Criteria

Decide how you will determine which variables to retain. Criteria can be non-statistical (e.g., is it a primary predictor of interest? Is it a known confounder?) or statistical (e.g., partial F-tests, likelihood-ratio tests, information criteria like AIC or BIC). Both types of criteria should be considered together.
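As a rough illustration of the statistical criteria named above, the sketch below computes AIC and BIC from a log-likelihood and runs a likelihood-ratio test for two nested models that differ by one parameter. The log-likelihoods, parameter counts, and sample size are made-up numbers, and the function names are hypothetical.

```python
import math

def aic(log_lik, k):
    """AIC = -2 * log-likelihood + 2k, where k is the number of parameters."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """BIC = -2 * log-likelihood + k * log(n), where n is the sample size."""
    return -2.0 * log_lik + k * math.log(n)

def lrt_one_df(log_lik_reduced, log_lik_full):
    """Likelihood-ratio test for nested models differing by ONE parameter.

    For 1 degree of freedom the chi-square tail probability equals
    erfc(sqrt(stat / 2)), so no external library is needed.
    """
    stat = 2.0 * (log_lik_full - log_lik_reduced)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical fitted models (values invented for illustration only).
n = 300
ll_reduced, k_reduced = -520.0, 4   # model without the candidate predictor
ll_full, k_full = -517.0, 5         # model with the candidate predictor

stat, p_value = lrt_one_df(ll_reduced, ll_full)
print("AIC reduced/full:", aic(ll_reduced, k_reduced), aic(ll_full, k_full))
print("BIC reduced/full:", bic(ll_reduced, k_reduced, n), bic(ll_full, k_full, n))
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4f}")
```

With these invented numbers AIC and the LRT both favour the larger model, while BIC's heavier penalty makes the two models nearly indistinguishable. This is one reason to weigh statistical criteria together with non-statistical ones.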

Step 3: Specify the Selection Strategy

Choose how to apply the criteria. Options include: examining all possible subsets, forward selection (adding variables one at a time), backward elimination (starting with all variables and removing), or stepwise procedures (combining forward and backward). The strategy determines the order in which variables are evaluated.

Steps 4–6: Conduct, Evaluate, and Present

Step 4: Conduct the analyses using your chosen strategy and criteria. Step 5: Evaluate the reliability of the chosen model using diagnostics and sensitivity analyses. Step 6: Present the results in a meaningful way, ensuring they are interpretable to your audience and that the model-building process is transparent.

Building a Causal Model

Before beginning the model-building process, it is imperative to have a causal model in place, usually presented as a causal diagram. The diagram identifies potential causal relationships among the predictors and the outcome of interest.

📋 Example: Cigarette Smoking and Birth Weight

Suppose you want to study the effects of cigarette smoking on birth weight, and you also have data on the mother’s race, education level, total birth order, gestation length, number of babies born, and weight gain during pregnancy.

A causal diagram would show that gestation length and weight gain are intervening variables—they lie on the causal pathway between smoking and birth weight. If the objective is to quantify the total effect of smoking on birth weight, you would not include gestation length or weight gain in the model, because doing so would remove the effect of smoking that is mediated through them.

On the other hand, race and college education might be confounders and should be retained regardless of statistical significance. Building the causal diagram first helps ensure you do not accidentally adjust for intervening variables.

⚠ Important Distinction

Confounders should be retained in the model to avoid bias. Intervening variables should generally be excluded when estimating total effects, because including them removes the indirect effect that passes through them. A causal diagram drawn before model building helps you distinguish between the two.
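One way to make this distinction concrete is to encode the causal diagram as a small graph and classify each variable by the direction of its arrows. The sketch below does this for the smoking example. The arrows are illustrative assumptions rather than established results, the function names are hypothetical, and the rule is a simplification of formal DAG criteria.

```python
# Hypothetical causal diagram for the smoking and birth-weight example.
# Each key is a cause; each value lists the variables it points to.
dag = {
    "race": ["smoking", "birth_weight"],
    "education": ["smoking", "birth_weight"],
    "smoking": ["gestation_length", "weight_gain", "birth_weight"],
    "gestation_length": ["birth_weight"],
    "weight_gain": ["birth_weight"],
}

def descendants(graph, node, skip=None):
    """Variables reachable from `node`; paths are not continued through `skip`."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current != skip:
            stack.extend(graph.get(current, []))
    return seen

def classify(graph, exposure, outcome):
    """Label every other variable as intervening, confounder, or other."""
    variables = set(graph) | {v for kids in graph.values() for v in kids}
    caused_by_exposure = descendants(graph, exposure)
    roles = {}
    for var in sorted(variables - {exposure, outcome}):
        causes_exposure = exposure in descendants(graph, var)
        # Does var reach the outcome by a path that avoids the exposure?
        causes_outcome = outcome in descendants(graph, var, skip=exposure)
        if var in caused_by_exposure and outcome in descendants(graph, var):
            roles[var] = "intervening"
        elif causes_exposure and causes_outcome:
            roles[var] = "confounder"
        else:
            roles[var] = "other"
    return roles

roles = classify(dag, "smoking", "birth_weight")
adjust_for = [v for v, role in roles.items() if role == "confounder"]
print(roles)
print("Adjust for when estimating the total effect:", adjust_for)
```

Under the assumed arrows, race and education are flagged for adjustment, while gestation length and weight gain are flagged as intervening and left out of a total-effect model.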

Knowledge Check — Section 1

1. When the goal of a regression model is to understand biological relationships, which of the following is true?

When the goal is causal understanding, confounders should be retained regardless of significance to avoid biased estimates. Intervening variables should generally be excluded when estimating total effects.

2. What is the first step in building a regression model?

The first step is to specify the maximum model—identify the outcome variable and the full set of predictors to consider. This includes drawing a causal diagram and determining whether transformations are needed.

3. In a study of cigarette smoking’s effect on birth weight, why should gestation length generally NOT be included in the model?

Gestation length is an intervening variable—smoking affects gestation length, which in turn affects birth weight. Including it would remove the indirect effect of smoking that is mediated through gestation length, underestimating the total effect.

✎ Reflection

Think about a research question in your own field. What would the causal diagram look like? How would you distinguish confounders from intervening variables?

Model answer: For shift work and metabolic syndrome, confounders are causes of both exposure (shift work) and outcome (metabolic syndrome) that exist temporally before shift work was chosen: age, SES, education, occupational sector, family history. Intervening variables are on the causal pathway from shift work to metabolic syndrome: circadian disruption, sleep duration, dietary timing, physical activity during shifts, weight gain during the employment period. The DAG distinguishes them visually: a variable with an arrow from itself into both exposure and outcome is a confounder; a variable with arrows from exposure into it and then into outcome is a mediator. Confounders must be adjusted for to estimate the causal effect; mediators must NOT be adjusted for to estimate the total effect (they should be examined separately in mediation analysis if the indirect path is the question).
Section 2

Reducing Predictors & Missing Values

⏱ Estimated time: 25 minutes

Introduction and Overview

Section 1 settled the conceptual goals and the workflow. Section 2 turns to two practical issues that determine which predictors actually make it into the final model. First: when you have more candidate predictors than your sample can support, how do you reduce the set without introducing bias? Second: missing values reduce your effective sample further — how should you handle them?

Learning Objectives

  • Apply principled techniques (variable clustering, prior knowledge, dimension reduction) to shrink a long candidate-predictor list.
  • Critique purely automated (stepwise) selection and explain why it produces fragile models.
  • Distinguish complete-case analysis, single imputation, and multiple imputation, and choose between them based on the missing-data mechanism.
  • Document predictor reductions and missing-value decisions so the final model is reproducible.

Reducing the Number of Predictors

It is sometimes necessary to reduce the number of predictors in the model-building process. Before undertaking any reduction, it is essential to identify the primary variables of interest and any variables that might be confounders or interacting variables—these should always be retained for consideration.

Practical Tip

The most appropriate procedure for managing a large number of predictors is often to design a more focused study that collects high-quality data on fewer predictors. This greatly reduces the risk of identifying spurious associations.

🎲 Interactive: Overfitting & the Bias–Variance Tradeoff (Babyak, 2004)

Fit polynomials of increasing degree to a small training sample, then evaluate on a fresh hold-out sample drawn from the same true relationship. In-sample R² always rises with more flexibility — out-of-sample error has a U-shape. The minimum is the sweet spot.

[Interactive figure. Panels: training data with fitted curve; in-sample R² vs out-of-sample MSE by degree. Readouts: training R², test MSE, training MSE, optimal degree.]
Try this: with n=20 and σ=0.5, slide the degree from 1 to 14. Watch training R² climb monotonically, while test MSE drops, hits a minimum, then explodes. The model with the lowest test MSE is the right answer — not the one with the highest R².

Screening Predictors Based on Descriptive Statistics

Before starting any model building, become thoroughly familiar with your data using descriptive statistics (means, variances, percentiles for continuous variables; frequency tabulations for categorical variables). This helps identify variables of little value. Guidelines include:

  • Avoid variables with large numbers of missing observations
  • Select only variables with substantial variability (e.g., if almost all subjects are male, sex will not be a useful predictor)
  • If a categorical variable has many categories with small counts, consider combining categories or eliminating the variable

Correlation Analysis

Examining pairwise correlations among predictor variables identifies pairs that contain essentially the same information. Highly correlated predictors (typically |r| > 0.9) produce multicollinearity, leading to unstable coefficient estimates and inflated standard errors.

If highly correlated pairs are found, select one based on criteria such as biological plausibility, ease of measurement, and fewer missing values. Note that pairwise screening will not detect multicollinearity arising from linear combinations of multiple predictors.
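The pairwise screen described above can be sketched in a few lines. The predictor names and values below are invented for illustration, the helper names are hypothetical, and the 0.9 cutoff is simply the rule of thumb quoted in the text.

```python
import math
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear_pairs(data, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical predictors measured on eight subjects.
predictors = {
    "weight_kg": [60, 72, 81, 55, 90, 67, 77, 84],
    "bmi": [22.0, 25.1, 27.9, 20.3, 30.5, 23.6, 26.4, 28.8],
    "age": [34, 51, 29, 45, 38, 62, 27, 56],
}

flagged = flag_collinear_pairs(predictors)
print(flagged)
```

A flagged pair is only a prompt to choose between the two variables on substantive grounds. As the text notes, a clean pairwise screen does not rule out collinearity involving three or more predictors.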

Creation of Indices & Cronbach’s Alpha

Related predictors can sometimes be combined into a single index. For example, the Hamilton Rating Scale for Depression combines 22 characteristics into an overall depression score.

Cronbach’s alpha evaluates the internal consistency of such a scale—how well each predictor correlates with the overall scale. Interpretation guidelines:

  • < 0.60: Unacceptable
  • 0.60–0.65: Undesirable
  • 0.66–0.70: Minimally acceptable
  • 0.71–0.80: Respectable
  • 0.81–0.90: Very good
  • > 0.90: Consider shortening the scale

One drawback of indices is that they preclude evaluating the effects of the individual factors that were combined.
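For readers who want to see the calculation, a minimal sketch of Cronbach's alpha follows, using the usual formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The formula is standard but is not spelled out in the lesson, and the item scores are invented.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    `items` is a list of item-score lists: one list per item, with the
    same subjects in the same order in every list.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]      # total score per subject
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical 3-item scale answered by five subjects.
items = [
    [3, 4, 2, 5, 4],
    [2, 4, 3, 5, 3],
    [3, 5, 2, 4, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The resulting value can then be read against the interpretation guidelines listed above.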

Screening Variables Based on Unconditional Associations

A common approach is to select only predictors with unconditional associations significant at a liberal P-value (e.g., 0.15 or 0.20). Simple univariable regression models are used for this screening.

One drawback: an important predictor might be excluded if its effect is masked by another variable (i.e., confounding is present). Using a liberal P-value helps prevent this. Another approach is to build the model with significant predictors, then add back excluded predictors one at a time to check if any become significant after adjusting for other variables.

PCA, Factor Analysis & Correspondence Analysis

Principal Components Analysis (PCA) converts a set of k predictor variables into k orthogonal (uncorrelated) principal components, each containing a decreasing proportion of total variation. A small subset of components is then used as predictors, eliminating multicollinearity. Coefficients can be back-transformed to the original predictors, though interpretation is less direct.

Factor analysis is similar but assumes factors with inherent meaning can be created from the original variables. Unlike PCA, the composition of factors varies as the number selected changes. Predictors with high “factor loadings” are identified as important determinants.

Correspondence analysis is designed for categorical variables. It produces a visual summary (2D scatterplot) of complex relationships, showing which clusters of predictors are associated with which clusters of outcome values.

The Problem of Missing Values

Missing data are common in observational studies. Statistical programs use complete case analysis by default—only observations with no missing values for any variable are included. Even a relatively low overall percentage of missing values can result in a substantial reduction of the sample if missing data are spread across observations.
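A small worked example shows how quickly complete-case analysis erodes the sample. It assumes, purely for illustration, that missingness is independent across variables; the rates, rows, and function names are hypothetical.

```python
def expected_complete_fraction(missing_rates):
    """Expected share of complete cases if missingness is independent across variables."""
    fraction = 1.0
    for rate in missing_rates:
        fraction *= 1.0 - rate
    return fraction

def complete_cases(rows):
    """Keep only the rows (dicts) that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Ten variables, each with only 5% missing values.
share = expected_complete_fraction([0.05] * 10)
print(f"{share:.1%} of observations are expected to be complete")

# Tiny hypothetical dataset: two of the three rows have one missing value each.
rows = [
    {"age": 34, "bmi": 22.0, "smoker": 0},
    {"age": None, "bmi": 25.1, "smoker": 1},
    {"age": 51, "bmi": None, "smoker": 0},
]
print(len(complete_cases(rows)), "of", len(rows), "rows are complete")
```

Under this independence assumption, a 5% missing rate per variable leaves only about 60% of observations usable once ten variables are in the model.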

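Three missing-data mechanisms are distinguished: MCAR (missing completely at random), MAR (missing at random, conditional on observed variables), and MNAR (missing not at random).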

Dealing with Missing Data: Imputation

The two main alternatives to complete case analysis are imputation and analysis methods where missing data are ignorable. Imputation involves replacing missing data points with values predicted from available data.

Single vs Multiple Imputation

Single imputation derives one estimate for each missing value. However, analysis based on single imputed data does not account for the uncertainty of the estimated values. Multiple imputation generates multiple imputed datasets and combines results, properly accounting for this uncertainty. Multiple imputation is generally preferred over single imputation.
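The "combines results" step is usually done with Rubin's rules, which the lesson does not name; they are assumed here. The sketch pools a coefficient estimated in several imputed datasets: the pooled estimate is the average, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The estimates and standard errors are invented.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                    # variance across imputations
    total_variance = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total_variance)

# Hypothetical coefficient and standard error from m = 3 imputed datasets.
estimates = [1.0, 1.2, 0.8]
std_errors = [0.5, 0.5, 0.5]

pooled, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled:.2f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the 0.5 reported within any single imputed dataset. That extra width is the imputation uncertainty that single imputation ignores.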

Maximum Likelihood & Bayesian Estimation

Maximum likelihood (ML) and Bayesian estimation are procedures that make missing values ignorable under the MAR assumption. ML requires specification of the distribution of missing values for predictors, but this is unnecessary for outcome missing values. These methods are closely linked to multiple imputation conceptually.

Knowledge Check — Section 2

1. What does Cronbach’s alpha measure?

Cronbach’s alpha evaluates the internal consistency of a scale—how well each individual predictor (item) correlates with the overall scale. It is used to assess the reliability of combining related predictors into a single index.

2. Under which missing data mechanism is complete case analysis most likely to produce biased results?

Under MCAR, complete case analysis produces unbiased estimates (though with reduced power). Under MAR and MNAR, complete cases may not be representative of the full sample, leading to biased estimates. MNAR is the most problematic because the missingness depends on the unobserved data itself.

3. Why is multiple imputation generally preferred over single imputation?

Multiple imputation generates multiple completed datasets and combines results, which properly reflects the uncertainty associated with the imputed values. Single imputation treats imputed values as if they were known, underestimating the true variability in the data.

✎ Reflection

Consider a dataset you have worked with (or imagine one). Which type of missing data mechanism (MCAR, MAR, MNAR) do you think was most likely present, and why? What approach would you take to handle it?

Model answer: For a cohort study with 15% missing covariates, the mechanism is most likely MAR (missing at random conditional on observed variables): missingness is related to observed factors like age, education, or visit type but, after conditioning on them, unrelated to the unobserved values. Less commonly it is MCAR (truly random, which is rare in real data). MNAR (missing not at random) is possible if dropout is correlated with unmeasured factors like worsening health. Approach: (a) characterise the missingness pattern by tabulating missing-vs-observed against observed covariates; (b) under MAR, use multiple imputation by chained equations (MICE) with 20–50 imputations, incorporating all observed analysis-relevant variables in the imputation model; (c) run sensitivity analyses under plausible MNAR scenarios (delta adjustment); (d) compare complete-case analysis to imputed estimates, since large discrepancies suggest MAR/MNAR concerns. Never use mean imputation.
Section 3

Effects of Continuous Predictors

⏱ Estimated time: 25 minutes

Introduction and Overview

Section 2 reduced your candidate-predictor list. Section 3 zooms into how to model the predictors that survived — specifically, the continuous ones. Linear regression assumes a linear relationship between predictor and outcome by default, but real relationships are often curved. Categorisation, polynomials, and splines are three different ways to allow curvature; each has its own trade-offs.

Learning Objectives

  • Use scatterplots and smoothed lines (e.g., LOESS) to inspect the shape of predictor–outcome relationships before modelling.
  • Decide when categorising a continuous predictor helps interpretation and when it discards too much information.
  • Fit polynomial terms to capture simple curvature, and recognise their limits at the tails of the data.
  • Use splines (linear, restricted cubic) to model flexible non-linear relationships without losing interpretability.
  • Compare these approaches and choose one based on the analytic question, sample size, and audience.

Evaluating Continuous Predictor–Outcome Relationships

It is important to evaluate the structure of the relationship between a continuous predictor and the outcome before starting model building. The underlying assumption of linearity can be evaluated through diagnostics after fitting the model, but it is useful to explore the nature of the relationship beforehand.

Key Approaches

Four main approaches to evaluating the effect of continuous predictors are: (1) scatterplots and smoothed line plots, (2) categorising the continuous variable, (3) exploring polynomial models, and (4) using splines.

Scatterplots & Smoothed Lines

Scatterplots are 2-way plots of the outcome (Y-axis) versus the continuous predictor (X-axis). They are most useful for continuous outcomes; scatterplots of dichotomous outcomes present as two lines of dots and are rarely informative by themselves.

Scatterplots can be greatly improved by adding a smoothed line through the centre of the data. All smoothed lines have a local-influence property: the position of the line at any value of x is influenced by nearby points but not by distant points.

Types of Smoothed Lines

There are several types of smoothed line functions:

  • Running mean smoother: Computes a simple average of y values in the neighbourhood
  • Running line smoother: Fits a simple linear regression through observations in the neighbourhood
  • Lowess smoother: Fits a weighted linear regression where points closer to xᵢ receive larger weight (using tricube weighting)
  • Local polynomial smoother: Fits a weighted polynomial regression in the neighbourhood

The bandwidth controls the size of the neighbourhood. A bandwidth of 0.8 means 80% of the data is used for each point. Larger bandwidths produce smoother lines but may miss important features.
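To show how the bandwidth and tricube weights interact, here is a simplified lowess sketch. It omits the robustness iterations of the full algorithm, assumes distinct x values, and uses a hypothetical function name; it is meant for intuition rather than as a replacement for a statistical package.

```python
def lowess(x, y, bandwidth=0.8):
    """Tricube-weighted local linear fit evaluated at each observed x.

    `bandwidth` is the fraction of the data forming each neighbourhood.
    """
    n = len(x)
    k = max(2, int(round(bandwidth * n)))        # neighbourhood size
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12                # distance to k-th nearest point
        # Tricube weights: largest near x0, zero at or beyond distance h.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Sanity check on exactly linear data: the smoother should reproduce the line.
x = list(range(10))
y = [2 * xi + 1 for xi in x]
smoothed = lowess(x, y, bandwidth=0.5)
print([round(v, 3) for v in smoothed])
```

Rerunning with a larger `bandwidth` on noisy data widens each neighbourhood, which is what produces the smoother but less detailed line described above.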

Caution with Extreme Values

All smoothed line functions can have problems at the extreme values of the predictor distribution. This is because the neighbourhood at the tails is not symmetrical and contains relatively few data points. It is important not to pay much attention to the extremes of the fitted line. Vertical dashed lines marking the 2.5th and 97.5th percentiles can help delineate where most of the data falls.

Categorising Continuous Predictors

The assumption of linearity can be avoided by converting the continuous predictor into categories. However, this is generally not advisable for three reasons:

  1. Categorisation involves the loss of information
  2. It is unlikely that biological processes have a step-function relationship (i.e., sudden changes at specific cutpoints)
  3. The choice of cutpoints is arbitrary and, if data-driven, may lead to biased results

That said, about 5 categories will usually suffice to control for confounding effects. A model with a categorised variable can be compared to one with a continuous (linear) variable using AIC or BIC.

Polynomial Models

Polynomials allow the regression line to follow a curve rather than a straight line. Power terms (e.g., x² or x³) are added to the model. Unlike smoothed lines, polynomial models have a global-influence property—the shape of the entire line is influenced by all the data.

Quadratic Model
Y = β₀ + β₁x + β₂x² + ε

⚠ Centring to Avoid Collinearity

The original variable (x) is often highly correlated with its squared term (x²), creating collinearity. The solution is to centre the variable by subtracting the mean before squaring. If a quadratic model is insufficient (i.e., the quadratic term is significant but the fit is still poor), a cubic term (x³) can be added.
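The effect of centring is easy to check numerically. The sketch below uses a hypothetical age variable running from 30 to 49 and compares the correlation between the variable and its square before and after subtracting the mean.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictor: ages 30 to 49.
age = list(range(30, 50))
age_sq = [a ** 2 for a in age]

# Centre first, then square.
age_mean = mean(age)
age_c = [a - age_mean for a in age]
age_c_sq = [a ** 2 for a in age_c]

r_raw = pearson(age, age_sq)
r_centred = pearson(age_c, age_c_sq)
print(f"corr(age, age^2) before centring: {r_raw:.4f}")
print(f"corr(age, age^2) after centring:  {r_centred:.4f}")
```

Because these ages are symmetric around their mean, centring removes the correlation entirely. With skewed predictors the reduction is usually substantial but not complete.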

Fractional Polynomials

Fractional polynomials (FPs) extend the idea of polynomial models by allowing power terms that are not restricted to positive integers. The most common set of powers to consider is: −2, −1, −0.5, 0 (= ln), 0.5, 1, 2, 3. A second-degree FP (two power terms) can fit a wide range of non-linear shapes and may be the most parsimonious way to model non-linearity.

📊 Example: Birth Weight vs Gestation Length

A quadratic model regressing birth weight on centred gestation length showed R² = 0.29. When fractional polynomials were explored, the best-fitting 2-degree FP used the repeated power 3, that is, the terms gest³ and gest³ × ln(gest), yielding R² = 0.30 and fitting significantly better than the linear, quadratic, or cubic models. The FP coefficients are not directly interpretable; the only way to make sense of such a model is to display the function graphically.
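The construction of FP terms can be sketched as follows, using the usual fractional-polynomial conventions: a power of 0 denotes the natural log, and when the two powers of a second-degree FP are equal, the second term is the first multiplied by ln(x). The function names are hypothetical and the predictor must be positive.

```python
import math

def fp_term(x, power):
    """Single FP transformation of a positive x; power 0 means ln(x)."""
    return math.log(x) if power == 0 else x ** power

def fp2_terms(x, p1, p2):
    """The two terms of a second-degree FP with powers (p1, p2).

    For a repeated power (p1 == p2) the second term is the first
    multiplied by ln(x), so the two terms are not identical.
    """
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second

# Repeated power 3: terms are x^3 and x^3 * ln(x).
print(fp2_terms(2.0, 3, 3))
# Distinct powers -1 and 0: terms are 1/x and ln(x).
print(fp2_terms(2.0, -1, 0))
```

In practice each candidate pair of powers is fitted in turn and the pairs are compared on model deviance, which is why the chosen function is best understood from a plot rather than from its coefficients.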

Splines

An alternative to polynomial models is to fit a piecewise linear function. Points where the slope changes are called knot points. In the absence of prior evidence, knots may be chosen at percentiles of the predictor (e.g., 25th, 50th, 75th). Cubic splines allow for smoother transitions across knots compared to linear splines, producing more biologically plausible curves; the same logic extends to generalised additive models (Hastie & Tibshirani, 1986).
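A linear spline can be fitted with ordinary regression once the predictor is expanded into "hinge" terms, one per knot. The sketch below builds that basis with knots at the quartiles of a hypothetical gestation-length variable; the data and function name are invented for illustration.

```python
from statistics import quantiles

def linear_spline_basis(x, knots):
    """Columns for a piecewise linear fit: x itself plus one hinge term per knot.

    Each hinge term max(0, x - knot) is zero below its knot, so its
    regression coefficient is the CHANGE in slope at that knot.
    """
    return [float(x)] + [max(0.0, x - k) for k in knots]

# Hypothetical gestation lengths in weeks.
gestation_weeks = [34, 36, 37, 38, 38, 39, 39, 40, 40, 41, 42]

# Knots at the 25th, 50th, and 75th percentiles.
knots = quantiles(gestation_weeks, n=4)
print("knots:", knots)

# Basis row for one subject at 41 weeks.
row = linear_spline_basis(41, knots)
print("basis row:", row)
```

Entering these columns as predictors in a regression yields a fitted line whose slope changes only at the knots. Cubic splines use the same idea with smoother basis functions.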

Approach | Influence | Strengths | Limitations
Smoothed lines | Local | Flexible; reveals non-linearity | Cannot be used in model itself; issues at extremes
Categorisation | N/A | Avoids linearity assumption | Loses information; arbitrary cutpoints
Polynomials | Global | Simple to implement; formal tests | May over-fit at extremes; collinearity
Fractional polynomials | Global | Very flexible with few terms | Coefficients not directly interpretable
Splines | Local | Flexible; smooth transitions | Sudden shifts at knots (linear splines)
Knowledge Check — Section 3

1. Why is categorising a continuous predictor generally not advisable?

Categorisation loses information about the continuous variable, assumes an unlikely step-function relationship between the predictor and outcome, and involves arbitrary cutpoint choices. However, about 5 categories can suffice to control for confounding.

2. What is a key difference between smoothed lines and polynomial models?

Smoothed lines have a local-influence property—the line at any point is influenced primarily by nearby data. Polynomial models have a global-influence property—the entire curve is influenced by all data points. This means polynomials can be heavily influenced by extreme values.

3. Why should you centre a continuous variable before adding its squared term to a regression model?

The original variable (x) is often highly correlated with its squared term (x²), leading to collinearity. Centring (subtracting the mean) before squaring reduces this correlation and produces more stable coefficient estimates.

✎ Reflection

Think about a continuous predictor in your field. Would you expect the relationship with the outcome to be linear? If not, which approach (categorisation, polynomials, fractional polynomials, or splines) would you choose and why?

Model answer: For age vs. all-cause mortality, the relationship is decidedly non-linear: risk accelerates with age. Choice: restricted cubic splines (3–5 knots placed at percentiles of the age distribution) are typically best because they (a) capture non-linearity flexibly without forcing a specific functional form, (b) are stable at the edges of the data (unlike polynomials, which can have wild tails), and (c) yield interpretable plots of the dose-response. Categorisation is the worst choice (information loss, threshold effects, instability of effect at the cut-point); polynomials work for simple curves but extrapolate badly; fractional polynomials are a compromise (data-driven choice of exponents) but less interpretable than splines. The right defence is to plot the spline fit overlaid on the data and present the dose-response curve as a deliverable, not just regression coefficients.
Section 4

Interactions & Building the Model

⏱ Estimated time: 25 minutes

Introduction and Overview

Sections 1–3 settled which predictors enter the model and how each one is shaped. Section 4 closes the lesson with the most consequential remaining design choices: which interaction terms (if any) to include, how to think about moderation as a substantive question, and which selection criterion to use when several plausible models compete.

Learning Objectives

  • Identify interaction terms worth testing using subject-matter knowledge before searching the data.
  • Distinguish statistical interaction from substantive moderation and explain why the latter requires a story, not just a p-value.
  • Compare model-selection criteria (adjusted R², AIC, BIC, cross-validation) and explain what each rewards.
  • Specify a model-building strategy — backward, forward, all-subsets, or theory-driven — and defend the choice.
  • Recognise the dangers of data-driven selection (overfitting, optimistic standard errors) and how to mitigate them (Babyak, 2004).

Identifying Interaction Terms

It is important to consider including interaction terms when specifying the maximum model. There are five general strategies for creating and evaluating 2-way interactions:

Strategy 1: Evaluate All Possible 2-Way Interactions

This is feasible only when the total number of predictors is small (e.g., ≤ 8). You create and test an interaction term for every possible pair of predictors.

Strategy 2: Interactions Among Significant Main Effects

After building the final main-effects model, create interactions among all predictors that are statistically significant. This reduces the number of interactions to evaluate but may miss interactions with non-significant main effects.

Strategy 3: Interactions Among Unconditionally Associated Predictors

Create interactions among all predictors that have a significant unconditional association with the outcome. This casts a wider net than Strategy 2.

Strategy 4: Theory-Driven Interactions

Only create interactions among pairs of variables you suspect (based on evidence from the literature or biological reasoning) might interact. This usually focuses on interactions involving the primary predictor(s) of interest and important confounders.

Strategy 5: Exposure-Only Interactions

Only create interactions that involve the exposure variable (primary predictor of interest). This is the most conservative approach but may miss important interactions among covariates.

⚠ Important Rules for Interactions

If an interaction term is included in the model, the main effects that make it up must also be included. Evaluating many interactions increases the risk of identifying spurious associations, so a Bonferroni adjustment or similar correction may be warranted. Three-way interactions are usually very difficult to interpret and should be included only if there is strong a priori reason.
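To see why screening many interactions calls for a correction, it helps to count them. The sketch below assumes 8 candidate predictors (the upper limit suggested for Strategy 1) and uses made-up p-values purely for illustration.

```r
# Number of possible 2-way interactions among p predictors
p <- 8
n_int <- choose(p, 2)          # 28 pairs

# Bonferroni per-test threshold for a family-wise alpha of 0.05
alpha_bonf <- 0.05 / n_int

# Hypothetical raw p-values for four of the 28 interaction tests
p_raw <- c(0.001, 0.019, 0.044, 0.20)

# Bonferroni-adjusted p-values, correcting for all 28 tests
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_int)

data.frame(p_raw, p_adj, keep = p_adj < 0.05)
```

Bonferroni is conservative; other methods available in p.adjust() (for example "holm") follow the same calling pattern.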

Moderation: Interactions With a Substantive Story

In the regression literature, an interaction is often called moderation when it captures a substantive claim that the effect of one variable depends on another. A moderator W changes the slope of X → Y. Mathematically it is identical to an interaction (Y ~ X * W); the difference is conceptual.
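Written out, the R formula Y ~ X * W corresponds to the model below. The β symbols are notation introduced here for illustration; they are not taken from the lesson.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X + \beta_2 W + \beta_3 (X \times W) + \varepsilon \\
\frac{\partial Y}{\partial X} &= \beta_1 + \beta_3 W
\end{aligned}
```

The second line is the moderation claim in symbols: the slope of X is not a single number but a function of W. When β₃ = 0 the slope is the same at every value of W, and the model reduces to the main-effects model.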

Mediation vs. Moderation — One More Time

In Section 1 we used a DAG to set up mediation (X → M → Y — M is on the causal pathway). Moderation is structurally different: W does not sit between X and Y, it changes the strength of the X → Y arrow. Mediation answers through what?; moderation answers for whom, or under what conditions?

R Activity — Detect and visualise moderation in R

Using the built-in birthwt dataset (MASS package), we ask: does maternal age modify the effect of smoking on birth weight? If yes, smoking matters more (or less) for younger versus older mothers — a question about for-whom, not through-what.

# install.packages(c("MASS", "ggplot2", "interactions"))
library(MASS);  library(ggplot2);  library(interactions)
data("birthwt", package = "MASS")
bw <- birthwt
bw$smoke <- factor(bw$smoke, levels = c(0, 1),
                   labels = c("non-smoker", "smoker"))

# 1. Main-effects model (no moderation)
m_main <- lm(bwt ~ smoke + age, data = bw)

# 2. Moderation model: smoke * age
m_mod  <- lm(bwt ~ smoke * age, data = bw)

# 3. Is the moderation needed? Compare nested models with a likelihood-ratio
#    test (here, the partial F-test from anova() because models are linear).
anova(m_main, m_mod)

summary(m_mod)

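# 4. Plot the moderation: fitted age slopes within each smoking group
interact_plot(m_mod, pred = age, modx = smoke,
              interval = TRUE) +
  labs(y = "Birth weight (g)",
       title = "Effect of maternal age on birth weight, by smoking status")

# 5. Simple slopes: the slope of age (with CI) within each smoking group
sim_slopes(m_mod, pred = age, modx = smoke, johnson_neyman = FALSE)

# 6. Johnson-Neyman: over which maternal ages does the smoking effect differ
#    from zero? The interval is defined for a continuous moderator, so recode
#    smoking as a 0/1 indicator (same fit as m_mod) and treat age as the moderator.
bw$smoke01 <- as.numeric(bw$smoke == "smoker")
m_jn <- lm(bwt ~ smoke01 * age, data = bw)
sim_slopes(m_jn, pred = smoke01, modx = age, johnson_neyman = TRUE)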
Console output (truncated)
Analysis of Variance Table

Model 1: bwt ~ smoke + age
Model 2: bwt ~ smoke * age
  Res.Df      RSS Df Sum of Sq    F Pr(>F)
1    186 93827912
2    185 91790048  1   2037864 4.11 0.0440 *

Coefficients (m_mod):
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)          2406        335    7.18   <0.001
smokesmoker          1023        508    2.01    0.046
age                    27         14    1.94    0.054
smokesmoker:age       -52         22   -2.36    0.019

Reading the moderation. The interaction term smokesmoker:age is negative and significant. In the output above, birth weight rises with maternal age among non-smokers (about 27 g per year) but falls among smokers (27 − 52 = −25 g per year), so the birth-weight deficit associated with smoking widens as maternal age rises. interact_plot() shows this visually as two non-parallel lines; sim_slopes() reports the “simple slope” of age within each smoking group, with confidence intervals.

Activity: replace age with lwt (mother’s weight in pounds at last menstrual period) as the variable interacting with smoking. Does the effect of smoking on birth weight depend on maternal weight? Defend your answer with both the partial F-test and the plot.
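The coefficients can also be turned into the quantity of interest directly: the smoker-minus-non-smoker difference in expected birth weight at a given maternal age. The sketch below refits the moderation model so that it runs on its own; the ages in the grid are arbitrary choices for illustration.

```r
data("birthwt", package = "MASS")
bw <- birthwt
bw$smoke <- factor(bw$smoke, levels = c(0, 1),
                   labels = c("non-smoker", "smoker"))
m_mod <- lm(bwt ~ smoke * age, data = bw)
b <- coef(m_mod)

# Smoking effect at age a = b_smoke + b_interaction * a
ages <- c(15, 20, 25, 30, 35)   # illustrative ages
gap <- b[["smokesmoker"]] + b[["smokesmoker:age"]] * ages
data.frame(age = ages, smoker_minus_nonsmoker_g = round(gap))

# Age slope within each smoking group
slopes <- c(non_smoker = b[["age"]],
            smoker     = b[["age"]] + b[["smokesmoker:age"]])
round(slopes, 1)
```

These are point estimates only; sim_slopes() adds the standard errors and confidence intervals needed to judge them.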

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From anova(m_main, m_mod), what is the F statistic and p-value for adding the smoke:age interaction? Does adding the interaction significantly improve fit, and what does that tell you about whether age moderates the smoking-birthweight relationship?

Model answer: In the console output above, anova(m_main, m_mod) gives F = 4.11 on 1 and 185 degrees of freedom, p = 0.044, so adding the smoke:age interaction improves fit at the 5% level, though not by a wide margin. That signals effect modification: the smoking effect on birthweight depends on maternal age. The decision to retain the interaction should rest on both the statistical evidence and the substantive biological plausibility that older mothers may have different smoking-related risk profiles.

2. In summary(m_mod), what is the smokesmoker:age coefficient and its sign? In one plain-English sentence, describe how the smoking-vs-non-smoking gap in birthweight changes as maternal age increases.

Model answer: In the console output above, the smokesmoker:age coefficient is negative (about −52 g per year of maternal age), meaning the smoking-vs-non-smoking gap in birthweight widens as maternal age increases. Plain English: smoking's negative effect on birthweight is more pronounced in older mothers than in younger mothers, perhaps because older mothers have accumulated more smoking-related damage or because age and smoking jointly stress placental function.

3. From sim_slopes(..., johnson_neyman = TRUE), identify the range of maternal age over which the effect of smoking on birthweight is statistically significant. Outside that region, what would you conclude about smoking's effect for that subgroup?

Model answer: From sim_slopes(..., johnson_neyman = TRUE), the effect of smoking on birthweight is typically statistically significant for maternal ages above ~26 (the J-N boundary). Below age 26, the CI for the smoking effect crosses zero, meaning that at those young ages the data do not have enough power (or the effect is genuinely smaller) to detect a smoking effect. Conclusion for young mothers: while we cannot confirm a smoking effect with these data, we also cannot rule one out; the boundary is a statistical-power statement, not a biological one.

Building the Model: Selection Criteria

Once the maximum model has been specified, you need to decide how to determine which predictors to retain. Both non-statistical and statistical criteria should be considered.

Non-Statistical Considerations

Variables should be retained in the model if they:

  • Are a primary predictor of interest
  • Are thought a priori to be confounders for the primary predictor
  • Show evidence of being a confounder (their removal causes a substantial change in the coefficient of interest)
  • Are a component of an interaction term included in the model
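The third criterion above (a substantial change in the coefficient of interest when a variable is removed) can be checked with a simple change-in-estimate calculation. The sketch assumes the birthwt data, with smoking as the primary predictor and race as the candidate confounder; the commonly quoted 10% threshold is a rule of thumb, not something stated in this lesson.

```r
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = 1:3,
                  labels = c("white", "black", "other"))

# Smoking coefficient with and without the candidate confounder
m_adj   <- lm(bwt ~ smoke + race, data = bw)
m_crude <- lm(bwt ~ smoke, data = bw)

b_adj   <- coef(m_adj)[["smoke"]]
b_crude <- coef(m_crude)[["smoke"]]

# Percentage change in the smoking coefficient when race is dropped
pct_change <- 100 * (b_crude - b_adj) / b_adj
round(c(adjusted = b_adj, crude = b_crude, pct_change = pct_change), 1)
```

A large change suggests race should stay in the model even if its own p-value is unimpressive. A small change does not prove the absence of confounding, which is why a priori reasoning comes first in the list.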

Statistical Criteria for Nested Models

Models where one model’s predictors are a subset of another’s are called nested models. Tests for nested models include:

  • Partial F-test (for linear regression)
  • Wald test (most commonly used; can be unreliable if P-values are near 0.05 or SEs appear suspect)
  • Likelihood-ratio test (LRT) (has the best statistical properties but requires fitting both models)

For categorical variables with multiple indicator terms, evaluate the overall significance of all indicators together, not individual terms.
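As a sketch of testing all indicators of a categorical predictor together, the example below adds a three-level race factor (two indicator terms) to a model and tests both terms at once. The choice of birthwt variables is an assumption made for illustration.

```r
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = 1:3,
                  labels = c("white", "black", "other"))

# Linear regression: partial F-test on 2 df for race as a whole
m_without <- lm(bwt ~ smoke + age, data = bw)
m_with    <- lm(bwt ~ smoke + age + race, data = bw)
f_tab <- anova(m_without, m_with)
f_tab

# Logistic regression (low birth weight yes/no): likelihood-ratio test
g_without <- glm(low ~ smoke + age, family = binomial, data = bw)
g_with    <- glm(low ~ smoke + age + race, family = binomial, data = bw)
lrt_tab <- anova(g_without, g_with, test = "LRT")
lrt_tab
```

The individual Wald tests for raceblack and raceother in summary() answer a narrower question (each level versus the reference level) and can disagree with the overall test.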

Information Criteria (AIC & BIC)

For non-nested models, information criteria are used. The general formula is:

Information Criterion (Eq 15.1)
IC = −2 lnL + a × s

Where s is the number of parameters, lnL is the log-likelihood, and a is a penalty constant. For AIC, a = 2 (Akaike, 1974). For BIC, a = ln(n) (Schwarz, 1978). Smaller values indicate a better model. BIC tends to favour more parsimonious models.
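The formula can be verified by hand against R's built-in functions. This is a minimal sketch using an arbitrary model on the birthwt data; note that for a linear model R counts the residual variance as a parameter, so s is one more than the number of regression coefficients.

```r
data("birthwt", package = "MASS")
m <- lm(bwt ~ smoke + age, data = birthwt)

ll <- logLik(m)
lnL <- as.numeric(ll)     # log-likelihood
s <- attr(ll, "df")       # parameters: 3 coefficients + residual variance
n <- nobs(m)

aic_manual <- -2 * lnL + 2 * s
bic_manual <- -2 * lnL + log(n) * s

# Both comparisons should return TRUE
all.equal(aic_manual, AIC(m))
all.equal(bic_manual, BIC(m))
```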

Guidelines for interpreting BIC differences between models:

  • 0–<2: Weak evidence
  • 2–<6: Positive evidence
  • 6–<10: Strong evidence
  • ≥10: Very strong evidence
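The guideline bands can be encoded as a small helper. The function name bic_evidence() and the two non-nested models compared here are hypothetical choices for illustration.

```r
# Map an absolute BIC difference onto the evidence categories listed above
bic_evidence <- function(delta) {
  cut(abs(delta),
      breaks = c(0, 2, 6, 10, Inf),
      labels = c("weak", "positive", "strong", "very strong"),
      right = FALSE)          # intervals are [0,2), [2,6), [6,10), [10,Inf)
}

data("birthwt", package = "MASS")
m_a <- lm(bwt ~ age + lwt, data = birthwt)    # non-nested candidates
m_b <- lm(bwt ~ smoke + ht, data = birthwt)

delta <- BIC(m_a) - BIC(m_b)   # positive values favour m_b
delta
bic_evidence(delta)
```

Both models must be fitted to exactly the same observations for the comparison to be meaningful, which matters when candidate predictors have different patterns of missing values.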

Adjusted R² & Mallows’ Cp

Adjusted R² rewards variance explained while penalising unnecessary complexity. The model that maximises adjusted R² is preferred.

Mallows’ Cp (Eq 15.2)
Cp = Σ (Y − Ŷ)² / σ² − n + 2k

Where Σ (Y − Ŷ)² is the residual sum of squares of the candidate model, k is the number of predictors in the candidate model, σ² is the MSE from the full model, and n is the sample size. Mallows’ Cp is a special case of the AIC. The model with the lowest Cp is generally considered the best.
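A minimal sketch of the Cp calculation follows, assuming the birthwt data and an arbitrary full model. The helper mallows_cp() is a hypothetical name, and k follows the definition given above (predictors, excluding the intercept); many texts count the intercept as well, which shifts every Cp by the same constant and leaves the ranking of models unchanged.

```r
# Cp = residual SS of candidate / MSE of full model - n + 2k
mallows_cp <- function(candidate, full) {
  n <- nobs(full)
  sigma2 <- summary(full)$sigma^2          # MSE from the full model
  k <- length(coef(candidate)) - 1         # predictors, excluding intercept
  sum(resid(candidate)^2) / sigma2 - n + 2 * k
}

data("birthwt", package = "MASS")
full  <- lm(bwt ~ age + lwt + smoke + ht + ui, data = birthwt)
cand1 <- lm(bwt ~ lwt + smoke + ht + ui, data = birthwt)
cand2 <- lm(bwt ~ smoke + ui, data = birthwt)

cp <- c(full  = mallows_cp(full, full),
        cand1 = mallows_cp(cand1, full),
        cand2 = mallows_cp(cand2, full))
round(cp, 2)

# Adjusted R-squared for the same models, for comparison
sapply(list(full = full, cand1 = cand1, cand2 = cand2),
       function(m) summary(m)$adj.r.squared)
```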

Specifying the Selection Strategy

Once criteria are established, there are several strategies for selecting which variables to include in the final model.

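  • All possible subsets: fit every combination of candidate predictors and compare them on the chosen criterion (feasible only for a modest number of predictors).
  • Forward selection: start from an empty model and add, one at a time, the predictor that most improves the criterion.
  • Backward elimination: start from the full model and remove, one at a time, the predictor that contributes least.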

Best Practice Summary

Backward elimination is generally preferred over forward selection because each predictor is evaluated in the context of all others. However, the most important point is to combine statistical procedures with subject matter knowledge: retain known confounders and primary predictors regardless of statistical criteria, and always build a causal model first (Heinze, Wallisch, & Dunkler, 2018).
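One way to combine backward elimination with subject-matter knowledge is to force the primary predictor and known confounders to stay in the model. The sketch below uses step() from base R, which selects on AIC; treating smoke as the primary predictor and age as a known confounder is an assumption made for illustration.

```r
data("birthwt", package = "MASS")
bw <- birthwt
bw$race <- factor(bw$race, levels = 1:3,
                  labels = c("white", "black", "other"))

# Maximum model
full <- lm(bwt ~ smoke + age + race + lwt + ht + ui + ptl + ftv, data = bw)

# Backward elimination by AIC; terms in 'lower' can never be dropped
final <- step(full,
              scope = list(lower = ~ smoke + age, upper = formula(full)),
              direction = "backward",
              trace = 0)

formula(final)
```

The standard errors and p-values reported by summary(final) do not account for the selection process and are therefore optimistic, so they are best read as descriptive.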

Knowledge Check — Section 4

1. If an interaction term between variables A and B is included in a regression model, which of the following must also be true?

If an interaction term (A × B) is included in the model, the main effects for both A and B must also be included. This is required for the interaction term to be properly interpretable and is a standard rule in regression modelling.

2. What is the key difference between AIC and BIC?

Both AIC and BIC use the formula IC = −2 lnL + a × s, but the penalty constant a differs. AIC uses a = 2 while BIC uses a = ln(n), which exceeds 2 whenever n ≥ 8 (because e² ≈ 7.39). BIC therefore imposes a heavier penalty for adding parameters, tending to favour simpler models.

3. Why is backward elimination generally preferred over forward selection?

Backward elimination starts with all predictors in the model, so each variable is evaluated in the presence of all others. This is better at identifying important predictors whose individual effect may be masked or suppressed by confounding. Forward selection examines each predictor in isolation first, potentially missing such effects.

✎ Reflection

Reflect on the tension between statistical model selection (AIC, BIC, stepwise methods) and subject matter knowledge. Why might a model selected purely by statistical criteria fail to answer your research question?

Model answer: Pure statistical model selection (stepwise, lasso (Tibshirani, 1996), AIC/BIC) optimises a goodness-of-fit criterion that does not encode the causal question. The model that minimises AIC may include a mediator (because it predicts the outcome well) or exclude a confounder (because it doesn't add much to the fit), both of which produce biased causal estimates. Stepwise procedures are especially problematic: they over-fit to the sample, produce non-reproducible models, and yield p-values and CIs that are demonstrably wrong (the model itself depends on the data being analysed). Subject-matter knowledge encoded in a pre-specified DAG produces an adjustment set that statistically may not be "the best" by AIC but is the right one for the causal question. The right role for AIC/BIC is in comparing pre-specified models with the same identification strategy, not in choosing which variables to include.
Final Assessment

Lesson 4 — Final Assessment

15 questions • 100% required to pass

Bringing It All Together

Lesson 4 sat between the mechanics of linear regression (Lesson 3) and the more specialised models that follow it (logistic, count, survival, mixed models). The unifying question was: given a clean dataset and a working regression engine, which predictors actually belong in the final model, and in what form? Section 1 distinguished prediction goals from causal-explanation goals and showed why the model-building strategy must follow the goal, not the other way around. Section 2 confronted two practical realities (too many candidate predictors and too many missing values) and laid out principled responses to each. Section 3 took the predictors that survived and asked how to model curvature: scatterplots and smoothers, categorisation, polynomials, splines. Section 4 closed with interactions, moderation, and the selection criteria (adjusted R², AIC, BIC, cross-validation) that adjudicate between competing candidate models.

The recurring lesson is that model building is a chain of small, defensible decisions, each one made before looking at the next p-value. Stepwise procedures and other purely data-driven shortcuts are dangerous precisely because they replace those decisions with an automated search that overfits and misreports its uncertainty (Babyak, 2004; Heinze, Wallisch, & Dunkler, 2018). With this lesson complete, the rest of HSCI 410 can extend the same disciplined logic into models for non-continuous outcomes and clustered designs.

This assessment covers all sections of Lesson 4. You must answer all 15 questions correctly to complete the lesson. Read each question carefully and review the feedback for any incorrect answers before retrying.

Key Takeaways from Lesson 4

  • Model-building strategy depends on the goal: prediction tolerates noisy predictors for accuracy; causal explanation retains confounders and excludes mediators regardless of significance.
  • Reducing the predictor set should be driven by subject-matter knowledge and causal structure, not by automated stepwise procedures.
  • Missing-data handling (complete case, single imputation, multiple imputation) must match the missingness mechanism — defaulting to complete case is rarely defensible.
  • Continuous predictors deserve a functional-form check; categorisation, polynomials, and splines each have a place but solve different problems.
  • Interaction terms should be specified before looking at the data; statistical interaction without a substantive moderation story is rarely worth keeping.
  • Model-selection criteria (adjusted R², AIC, BIC, cross-validation) reward different things; pick the criterion that matches your analytic goal and report it transparently.

✎ Final Reflection

Now that you have completed all four sections, summarise the key steps you would follow when building a regression model for a real-world epidemiological study. What role does subject matter knowledge play at each step?

Model answer: Steps in regression model building, with subject-matter knowledge throughout: (1) Specify the question as a single sentence (causal effect of X on Y, with target population). (2) Draw the DAG, on paper or in dagitty, and identify the minimal sufficient adjustment set. (3) Pre-register the analysis: outcome, covariates, functional form for each (categorical, continuous, spline), interaction terms hypothesised in advance, missing-data approach. (4) Variable cleaning as in Lesson 2. (5) Univariate and bivariate diagnostics from Lesson 2. (6) Fit the main model as specified; verify assumptions (residual plots, VIF, influential observations). (7) Sensitivity analyses: alternative functional forms, complete-case vs. imputation, different adjustment sets corresponding to alternative DAGs. (8) Report all of the above, distinguishing pre-registered from exploratory analyses. Subject-matter knowledge anchors each step: it defines the question (step 1), the DAG (step 2), the protocol and variable specifications (step 3), the model that is fitted (step 6), and the alternative scenarios (step 7). Statistical tools serve the research question; they do not set it.
Final Assessment — Lesson 4 (15 Questions)

1. What are the two broad objectives of building a regression model?

Regression models are generally built either to predict future observations or to understand the (potentially causal) relationships between predictors and the outcome.

2. Why should parsimony guide model building?

Simple models are more robust, less influenced by idiosyncrasies of the specific dataset, and will generally perform better if applied to new data. However, parsimony should not override biological reasoning for including variables.

3. What is the purpose of drawing a causal diagram before model building?

A causal diagram maps potential causal relationships among predictors and the outcome. It helps distinguish confounders (which should be retained) from intervening variables (which should generally be excluded when estimating total effects).

4. Which technique is used to evaluate whether related predictors can be combined into a single index?

Cronbach’s alpha evaluates the internal consistency of a scale formed from related predictors. It measures how well each item correlates with the overall scale, helping determine whether combining the variables into an index is justified.

5. What is the main difference between PCA and factor analysis?

A key difference is that in PCA, the composition of each principal component does not change regardless of how many components are retained. In factor analysis, the composition of factors varies as the number of factors selected changes.

6. Under the MCAR assumption, complete case analysis:

Under MCAR, the complete cases are a random subset of the full data, so estimates are unbiased. However, because fewer observations are used, statistical power is reduced. Multiple imputation can improve efficiency even under MCAR.

7. What does MNAR mean in the context of missing data?

MNAR (Missing Not at Random) means the probability of a value being missing depends on the unobserved value itself. For example, sicker patients may be less likely to attend follow-up, so the missing health data would be systematically different from the observed data.

8. What is the local-influence property of smoothed lines?

The local-influence property means that the position of the smoothed line at any value xi is determined primarily by data points close to xi, not by points far away. This allows the line to capture local features of the data.

9. Why are fractional polynomials useful?

Fractional polynomials use power terms that can take non-integer values (e.g., −2, −0.5, 0.5), allowing them to fit diverse non-linear shapes. A 2-degree FP may be the most parsimonious way to model complex non-linear relationships.

10. What are knot points in the context of splines?

In spline models, knot points are the values of the predictor where the slope of the fitted line is allowed to change. Between knots, the relationship is assumed to be linear (for linear splines) or follow a polynomial curve (for cubic splines).

11. Which of the following is NOT a non-statistical reason to retain a variable in the model?

Having P < 0.05 in univariable analysis is a statistical criterion, not a non-statistical one. Non-statistical reasons include being a primary predictor, a suspected confounder, or part of an interaction term already in the model.

12. In the formula IC = −2 lnL + a × s, what does ‘a’ equal for BIC?

For BIC (Bayesian Information Criterion), the penalty constant a = ln(n), where n is the sample size. For AIC, a = 2. Because ln(n) > 2 for any sample size greater than 7, BIC imposes a heavier penalty and favours more parsimonious models.

13. Why is backward elimination generally preferred over forward selection?

Backward elimination starts with all predictors, so each one is evaluated while controlling for all others. This is better at identifying important predictors whose individual effect might be suppressed or masked by confounding when examined in isolation (as occurs in forward selection).

14. A BIC difference of 8 between two non-nested models suggests:

According to Raftery’s guidelines: 0–<2 = weak, 2–<6 = positive, 6–<10 = strong, ≥10 = very strong. A BIC difference of 8 falls in the “strong” evidence range (6–<10).

15. When should three-way interaction terms be included in a model?

Three-way interactions are very difficult to interpret and should only be included if there is strong prior reasoning to suspect the effect exists, or if the component variables have significant two-way interactions. They also require all component main effects and two-way interactions to be included.

🏆 Congratulations!

Lesson 5 — Logistic Regression — takes the model-building toolkit you just assembled and applies it to binary outcomes. Almost every concept from Lessons 3 and 4 carries over: the only fundamental change is the link function that lets us model probabilities on a 0–1 scale.

You have successfully completed Lesson 4: Model-Building Strategies.
