Model-Building Strategies
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Develop a full (maximal) model incorporating biological understanding of the system under study
- Carry out procedures to reduce a large number of predictors to a manageable subset
- Address issues related to the functional form of continuous predictors and missing values
- Build regression-type models using both statistical and non-statistical criteria
- Evaluate the reliability of a regression-type model
- Present the results from an analysis in a meaningful way
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Steps in Model Building
Introduction and Overview
Lesson 3 walked through linear regression with a fixed set of predictors. Real data rarely arrive with a clean “here are the four predictors you should use” instruction. Lesson 4 turns to the broader question of how to build a defensible model when you have many candidate predictors and need to decide which to include. The four content sections move from goals and frameworks (Section 1), to reducing predictors and handling missing values (Section 2), to modelling continuous predictor–outcome relationships flexibly with categorisation, polynomials, and splines (Section 3), and finally to interactions, moderation, and selection criteria (Section 4). Throughout, the central tension is between data-driven flexibility and theory-driven discipline — and the answer is usually closer to theory than the data alone would suggest (Babyak, 2004; Heinze, Wallisch, & Dunkler, 2018).
Learning Objectives
- Distinguish prediction-focused from causal-explanation model-building goals and explain how each shapes the strategy.
- Outline the steps in building a regression model from candidate predictors through final reporting.
- Use a causal diagram to identify the predictors that must be retained regardless of statistical significance.
- Recognise when subject-matter knowledge should override what the data alone seem to suggest.
Why Model-Building Strategies Matter
When building a regression model, we need to decide on the goals of the analysis, incorporate both statistical considerations and subject matter knowledge, and balance the desire for parsimony (simplicity) with the desire for a model that “best fits” the data. The definition of “best fit” depends on the goal of the analysis, and the principles discussed in this chapter apply to all types of regression models.
Key Concept
Regression models are generally built to meet one of two broad objectives: (1) to build the best model for predicting future observations, or (2) to understand the causal relationship(s) between predictors and the outcome. The approach to model building differs depending on which goal you are pursuing.
Goals of the Analysis
If the goal is prediction, we may want to keep even those variables whose relationship with the outcome is questionable, because excluding them might lead to inaccurate predictions when future observations have extreme values for those variables. The details of specific predictors are of little consequence; we just want overall accuracy. Reporting guidelines such as TRIPOD (Collins, Reitsma, Altman, & Moons, 2015) provide a checklist for transparently documenting prediction models.
If the goal is understanding biological relationships, we want precise estimates of coefficients for the variables of interest. Careful attention must be paid to interaction and confounding effects. Factors likely to be confounders should be retained in the model regardless of statistical significance, while factors that are almost certainly not confounders should generally be excluded—especially if they are intervening variables, as their inclusion may bias results.
Steps in Building a Regression Model
The process of building a regression model follows a systematic set of steps. While statistical software handles the computation, the researcher must make many decisions along the way that require both subject matter expertise and statistical reasoning.
Step 1: Identify the outcome variable and determine whether it needs transformation (e.g., natural log). Then identify the full set of predictors to consider. The maximum model includes all possible predictors of interest. While a large model prevents overlooking important predictors, adding too many increases the risks of collinearity and spurious associations. Key sub-steps include: drawing a causal diagram, potentially reducing predictors, considering missing values, evaluating effects of continuous predictors, and deciding on interaction terms.
Step 2: Decide how you will determine which variables to retain. Criteria can be non-statistical (e.g., is it a primary predictor of interest? Is it a known confounder?) or statistical (e.g., partial F-tests, likelihood-ratio tests, information criteria like AIC or BIC). Both types of criteria should be considered together.
Step 3: Choose how to apply the criteria. Options include: examining all possible subsets, forward selection (adding variables one at a time), backward elimination (starting with all variables and removing), or stepwise procedures (combining forward and backward). The strategy determines the order in which variables are evaluated.
Step 4: Conduct the analyses using your chosen strategy and criteria.
Step 5: Evaluate the reliability of the chosen model using diagnostics and sensitivity analyses.
Step 6: Present the results in a meaningful way, ensuring they are interpretable to your audience and that the model-building process is transparent.
Building a Causal Model
Before beginning the model-building process, it is imperative to have a causal model in place, usually presented as a causal diagram. The diagram identifies potential causal relationships among the predictors and the outcome of interest.
Suppose you want to study the effects of cigarette smoking on birth weight, and you also have data on the mother’s race, education level, total birth order, gestation length, number of babies born, and weight gain during pregnancy.
A causal diagram would show that gestation length and weight gain are intervening variables—they lie on the causal pathway between smoking and birth weight. If the objective is to quantify the total effect of smoking on birth weight, you would not include gestation length or weight gain in the model, because doing so would remove the effect of smoking that is mediated through them.
On the other hand, race and college education might be confounders and should be retained regardless of statistical significance. Building the causal diagram first helps ensure you do not accidentally adjust for intervening variables.
⚠ Important Distinction
Confounders should be retained in the model to avoid bias. Intervening variables should generally be excluded when estimating total effects, because including them removes the indirect effect that passes through them. A causal diagram drawn before model building helps you distinguish between the two.
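The difference between a total and a direct effect can be checked with a small simulation. The sketch below is illustrative only: the variable names mirror the smoking example, but every effect size is invented (an assumption, not an estimate from any study). Smoking shortens gestation, and gestation in turn raises birth weight.

```r
# Simulated pathway: smoking -> gestation -> birth weight, plus a direct effect.
# All coefficients below are made up for illustration.
set.seed(410)
n     <- 5000
smoke <- rbinom(n, 1, 0.3)                        # exposure (0/1)
gest  <- 39 - 0.5 * smoke + rnorm(n, sd = 1.5)    # intervening variable (weeks)
bwt   <- 3400 - 150 * smoke + 120 * (gest - 39) + rnorm(n, sd = 400)

# By construction: total effect = direct (-150) + indirect (-0.5 * 120 = -60) = -210 g
m_total  <- lm(bwt ~ smoke)          # mediator left out: targets the total effect
m_direct <- lm(bwt ~ smoke + gest)   # mediator adjusted: only the direct effect remains

total_est  <- unname(coef(m_total)["smoke"])
direct_est <- unname(coef(m_direct)["smoke"])
round(c(total = total_est, direct = direct_est), 1)
```

Adjusting for gestation moves the smoking coefficient toward the direct effect, so the part of the effect that runs through gestation no longer appears in the estimate.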
1. When the goal of a regression model is to understand biological relationships, which of the following is true?
2. What is the first step in building a regression model?
3. In a study of cigarette smoking’s effect on birth weight, why should gestation length generally NOT be included in the model?
✎ Reflection
Think about a research question in your own field. What would the causal diagram look like? How would you distinguish confounders from intervening variables?
Reducing Predictors & Missing Values
Introduction and Overview
Section 1 settled the conceptual goals and the workflow. Section 2 turns to two practical issues that determine which predictors actually make it into the final model. First: when you have more candidate predictors than your sample can support, how do you reduce the set without introducing bias? Second: missing values reduce your effective sample further — how should you handle them?
Learning Objectives
- Apply principled techniques (variable clustering, prior knowledge, dimension reduction) to shrink a long candidate-predictor list.
- Critique purely automated (stepwise) selection and explain why it produces fragile models.
- Distinguish complete-case analysis, single imputation, and multiple imputation, and choose between them based on the missing-data mechanism.
- Document predictor reductions and missing-value decisions so the final model is reproducible.
Reducing the Number of Predictors
It is sometimes necessary to reduce the number of predictors in the model-building process. Before undertaking any reduction, it is essential to identify the primary variables of interest and any variables that might be confounders or interacting variables—these should always be retained for consideration.
Practical Tip
The most appropriate procedure for managing a large number of predictors is often to design a more focused study that collects high-quality data on fewer predictors. This greatly reduces the risk of identifying spurious associations.
🎲 Interactive: Overfitting & the Bias–Variance Tradeoff (Babyak, 2004)
Fit polynomials of increasing degree to a small training sample, then evaluate on a fresh hold-out sample drawn from the same true relationship. In-sample R² always rises with more flexibility — out-of-sample error has a U-shape. The minimum is the sweet spot.
Figure panels: training data with fitted curve; in-sample R² versus out-of-sample MSE by polynomial degree.
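For readers without access to the interactive panel, the same experiment can be approximated in a few lines of R. This is a minimal sketch: the "true" relationship, the noise level, and the sample sizes are all assumptions chosen for illustration.

```r
set.seed(2004)
true_f <- function(x) 2 + 1.5 * x - 0.4 * x^2     # assumed true relationship
make_data <- function(n) {
  x <- runif(n, -3, 3)
  data.frame(x = x, y = true_f(x) + rnorm(n, sd = 1.5))
}
train <- make_data(25)      # small training sample
test  <- make_data(1000)    # fresh hold-out sample from the same relationship

degrees <- 1:8
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)          # polynomial of degree d
  c(degree  = d,
    r2_in   = summary(fit)$r.squared,              # in-sample fit
    mse_out = mean((test$y - predict(fit, newdata = test))^2))  # hold-out error
}))
results <- as.data.frame(results)
print(round(results, 3))
```

In-sample R² cannot fall as the degree increases because the models are nested; the hold-out MSE column is the one to watch for the point at which extra flexibility stops helping.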
Screening Predictors Based on Descriptive Statistics
Before starting any model building, become thoroughly familiar with your data using descriptive statistics (means, variances, percentiles for continuous variables; frequency tabulations for categorical variables). This helps identify variables of little value. Guidelines include:
- Avoid variables with large numbers of missing observations
- Select only variables with substantial variability (e.g., if almost all subjects are male, sex will not be a useful predictor)
- If a categorical variable has many categories with small counts, consider combining categories or eliminating the variable
Correlation Analysis
Examining pairwise correlations among predictor variables identifies pairs that contain essentially the same information. Highly correlated predictors (typically |r| > 0.9) produce multicollinearity, leading to unstable coefficient estimates and inflated standard errors.
If highly correlated pairs are found, select one based on criteria such as biological plausibility, ease of measurement, and fewer missing values. Note that pairwise screening will not detect multicollinearity arising from linear combinations of multiple predictors.
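The limitation of pairwise screening can be seen with simulated predictors. In this sketch (all variables are invented for illustration), x2 is nearly a copy of x1, while x5 is close to the sum of x3 and x4. The pairwise screen catches the first problem but not the second; a variance inflation factor, computed here by hand as 1 / (1 − R²) from regressing each predictor on the others, catches both.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)        # nearly a copy of x1
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- x3 + x4 + rnorm(n, sd = 0.2)   # close to a linear combination of x3 and x4
X  <- data.frame(x1, x2, x3, x4, x5)

# Pairwise screen: list each pair once if |r| exceeds the 0.9 guideline
r    <- cor(X)
high <- which(abs(r) > 0.9 & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(r)[high[, "row"]],
                      var2 = colnames(r)[high[, "col"]],
                      r    = r[high],
                      stringsAsFactors = FALSE)
print(flagged)

# VIF for predictor j = 1 / (1 - R^2 from regressing j on all other predictors)
vif <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 1))
```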
Creation of Indices & Cronbach’s Alpha
Related predictors can sometimes be combined into a single index. For example, the Hamilton Rating Scale for Depression combines multiple rated characteristics (17 items in its most widely used version) into an overall depression score.
Cronbach’s alpha evaluates the internal consistency of such a scale—how well each predictor correlates with the overall scale. Interpretation guidelines:
- < 0.60: Unacceptable
- 0.60–0.65: Undesirable
- 0.66–0.70: Minimally acceptable
- 0.71–0.80: Respectable
- 0.81–0.90: Very good
- > 0.90: Consider shortening the scale
One drawback of indices is that they preclude evaluating the effects of the individual factors that were combined.
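Cronbach's alpha can be computed directly from the item variances and the variance of the summed scale: alpha = k / (k − 1) × (1 − Σ item variances / variance of total). The sketch below applies that formula to simulated items; the helper name cronbach_alpha and the simulated data are assumptions for illustration, not part of any package.

```r
set.seed(17)
n <- 300
k <- 6
trait <- rnorm(n)                                        # unobserved construct
items <- sapply(1:k, function(j) trait + rnorm(n, sd = 1))   # k noisy measures of it
colnames(items) <- paste0("item", 1:k)

# Hypothetical helper implementing the standard alpha formula
cronbach_alpha <- function(items) {
  k         <- ncol(items)
  item_var  <- apply(items, 2, var)     # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed scale
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

alpha <- cronbach_alpha(items)
round(alpha, 2)
```

Raising the item noise (the sd argument) lowers alpha, and adding more items of the same quality raises it, which is why a very high alpha can indicate that a scale is longer than it needs to be.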
Screening Variables Based on Unconditional Associations
A common approach is to select only predictors with unconditional associations significant at a liberal P-value (e.g., 0.15 or 0.20). Simple univariable regression models are used for this screening.
One drawback: an important predictor might be excluded if its effect is masked by another variable (i.e., confounding is present). Using a liberal P-value helps prevent this. Another approach is to build the model with significant predictors, then add back excluded predictors one at a time to check if any become significant after adjusting for other variables.
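A screening pass of this kind might look as follows, using the birthwt data from the MASS package (the same dataset used later in this lesson). The candidate list and the 0.20 cut-off are choices made for this sketch. An overall F-test is used so that a multi-level factor such as race receives a single P-value.

```r
library(MASS)   # provides the birthwt data
bw <- birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

candidates <- c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv")
fit0 <- lm(bwt ~ 1, data = bw)                     # intercept-only reference model

p_screen <- sapply(candidates, function(v) {
  fit1 <- lm(reformulate(v, response = "bwt"), data = bw)   # univariable model
  anova(fit0, fit1)[2, "Pr(>F)"]                   # overall F-test for this predictor
})
round(sort(p_screen), 3)

keep <- names(p_screen)[p_screen < 0.20]           # liberal screening threshold
keep
```

Predictors that fail the screen should still be added back one at a time once a multivariable model is in place, for the reason given above.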
PCA, Factor Analysis & Correspondence Analysis
Principal Components Analysis (PCA) converts a set of k predictor variables into k orthogonal (uncorrelated) principal components, each containing a decreasing proportion of total variation. A small subset of components is then used as predictors, eliminating multicollinearity. Coefficients can be back-transformed to the original predictors, though interpretation is less direct.
Factor analysis is similar but assumes factors with inherent meaning can be created from the original variables. Unlike PCA, the composition of factors varies as the number selected changes. Predictors with high “factor loadings” are identified as important determinants.
Correspondence analysis is designed for categorical variables. It produces a visual summary (2D scatterplot) of complex relationships, showing which clusters of predictors are associated with which clusters of outcome values.
The Problem of Missing Values
Missing data are common in observational studies. Statistical programs use complete case analysis by default—only observations with no missing values for any variable are included. Even a relatively low overall percentage of missing values can result in a substantial reduction of the sample if missing data are spread across observations.
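The size of this loss is easy to underestimate. The sketch below simulates ten variables, each with 5% of values missing independently (assumed values chosen for illustration), and counts how many rows survive a complete case analysis.

```r
set.seed(90)
n <- 1000
p <- 10
X <- as.data.frame(matrix(rnorm(n * p), n, p))
miss <- matrix(runif(n * p) < 0.05, n, p)    # each value missing with probability 0.05
X[miss] <- NA

overall_missing <- mean(is.na(X))            # proportion of all cells that are missing
complete_frac   <- mean(complete.cases(X))   # proportion of rows with no missing values
round(c(overall_missing = overall_missing, complete_frac = complete_frac), 3)
```

With independent missingness the expected proportion of complete rows is 0.95^10, or about 0.60, so roughly 40% of the sample is discarded even though only 5% of the values are missing.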
Dealing with Missing Data: Imputation
The two main alternatives to complete case analysis are imputation and analysis methods where missing data are ignorable. Imputation involves replacing missing data points with values predicted from available data.
Single imputation derives one estimate for each missing value. However, analysis based on single imputed data does not account for the uncertainty of the estimated values. Multiple imputation generates multiple imputed datasets and combines results, properly accounting for this uncertainty. Multiple imputation is generally preferred over single imputation.
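The logic of multiple imputation can be sketched in base R. This is a hand-rolled illustration on simulated data, not the algorithm of any particular package (in practice a dedicated package would normally be used). Each round bootstraps the observed rows to reflect uncertainty in the imputation model, imputes with added residual noise, fits the analysis model, and the results are pooled with Rubin's rules.

```r
set.seed(93)
n <- 400
z <- rnorm(n)                              # fully observed covariate
x <- 0.6 * z + rnorm(n)                    # predictor that will have missing values
y <- 1 + 2 * x + 1 * z + rnorm(n)          # outcome; the true coefficient of x is 2
d <- data.frame(y, x, z)
d$x[runif(n) < plogis(-1 + z)] <- NA       # missingness depends only on observed z

m   <- 20                                  # number of imputed datasets
est <- se2 <- numeric(m)
obs <- which(!is.na(d$x))
mis <- which(is.na(d$x))

for (i in seq_len(m)) {
  boot    <- d[sample(obs, replace = TRUE), ]        # bootstrap the observed rows
  imp_fit <- lm(x ~ y + z, data = boot)              # imputation model includes the outcome
  d_i     <- d
  d_i$x[mis] <- predict(imp_fit, newdata = d[mis, ]) +
    rnorm(length(mis), sd = summary(imp_fit)$sigma)  # prediction plus residual noise
  fit    <- lm(y ~ x + z, data = d_i)                # analysis model on the completed data
  est[i] <- coef(fit)["x"]
  se2[i] <- vcov(fit)["x", "x"]
}

# Rubin's rules: average the estimates; total variance = within + (1 + 1/m) * between
pooled_est <- mean(est)
within     <- mean(se2)
between    <- var(est)
pooled_se  <- sqrt(within + (1 + 1 / m) * between)
round(c(estimate = pooled_est, se = pooled_se), 3)
```

The between-imputation term is what single imputation leaves out: it widens the pooled standard error to reflect the fact that the imputed values are estimates rather than observations.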
Maximum likelihood (ML) and Bayesian estimation are procedures that make missing values ignorable under the missing at random (MAR) assumption, that is, when the probability that a value is missing depends only on observed data. ML requires specification of the distribution of missing values for predictors, but this is unnecessary for outcome missing values. These methods are closely linked to multiple imputation conceptually.
1. What does Cronbach’s alpha measure?
2. Under which missing data mechanism is complete case analysis most likely to produce biased results?
3. Why is multiple imputation generally preferred over single imputation?
✎ Reflection
Consider a dataset you have worked with (or imagine one). Which type of missing data mechanism (MCAR, MAR, MNAR) do you think was most likely present, and why? What approach would you take to handle it?
Effects of Continuous Predictors
Introduction and Overview
Section 2 reduced your candidate-predictor list. Section 3 zooms into how to model the predictors that survived — specifically, the continuous ones. Linear regression assumes a linear relationship between predictor and outcome by default, but real relationships are often curved. Categorisation, polynomials, and splines are three different ways to allow curvature; each has its own trade-offs.
Learning Objectives
- Use scatterplots and smoothed lines (e.g., LOESS) to inspect the shape of predictor–outcome relationships before modelling.
- Decide when categorising a continuous predictor helps interpretation and when it discards too much information.
- Fit polynomial terms to capture simple curvature, and recognise their limits at the tails of the data.
- Use splines (linear, restricted cubic) to model flexible non-linear relationships without losing interpretability.
- Compare these approaches and choose one based on the analytic question, sample size, and audience.
Evaluating Continuous Predictor–Outcome Relationships
It is important to evaluate the structure of the relationship between a continuous predictor and the outcome before starting model building. The underlying assumption of linearity can be evaluated through diagnostics after fitting the model, but it is useful to explore the nature of the relationship beforehand.
Key Approaches
Four main approaches to evaluating the effect of continuous predictors are: (1) scatterplots and smoothed line plots, (2) categorising the continuous variable, (3) exploring polynomial models, and (4) using splines.
Scatterplots & Smoothed Lines
Scatterplots are 2-way plots of the outcome (Y-axis) versus the continuous predictor (X-axis). They are most useful for continuous outcomes; scatterplots of dichotomous outcomes present as two lines of dots and are rarely informative by themselves.
Scatterplots can be greatly improved by adding a smoothed line through the centre of the data. All smoothed lines have a local-influence property: the position of the line at any value of x is influenced by nearby points but not by distant points.
There are several types of smoothed line functions:
- Running mean smoother: Computes a simple average of y values in the neighbourhood
- Running line smoother: Fits a simple linear regression through observations in the neighbourhood
- Lowess smoother: Fits a weighted linear regression where points closer to x_i receive larger weight (using tricube weighting)
- Local polynomial smoother: Fits a weighted polynomial regression in the neighbourhood
The bandwidth controls the size of the neighbourhood. A bandwidth of 0.8 means 80% of the data is used for each point. Larger bandwidths produce smoother lines but may miss important features.
All smoothed line functions can have problems at the extreme values of the predictor distribution. This is because the neighbourhood at the tails is not symmetrical and contains relatively few data points. It is important not to pay much attention to the extremes of the fitted line. Vertical dashed lines marking the 2.5th and 97.5th percentiles can help delineate where most of the data falls.
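As an illustration, the base R function lowess() can be used to overlay smoothed lines with two different bandwidths (its f argument). The choice of the birthwt data and of lwt as the predictor is an assumption made for this sketch.

```r
library(MASS)   # provides the birthwt data
bw <- birthwt

smooth_wide   <- lowess(bw$lwt, bw$bwt, f = 0.8)   # 80% of the data in each neighbourhood
smooth_narrow <- lowess(bw$lwt, bw$bwt, f = 0.3)   # smaller neighbourhood, wigglier line

plot(bw$lwt, bw$bwt, col = "grey50",
     xlab = "Mother's weight at last menstrual period (lb)",
     ylab = "Birth weight (g)")
lines(smooth_wide, lwd = 2)
lines(smooth_narrow, lwd = 2, lty = 2)
# Dotted lines at the 2.5th and 97.5th percentiles mark where most of the data falls
abline(v = quantile(bw$lwt, c(0.025, 0.975)), lty = 3)
legend("topright", legend = c("f = 0.8", "f = 0.3"), lty = 1:2, lwd = 2)
```

Outside the dotted lines each neighbourhood is one-sided and sparse, so the shape of either curve there should be read with caution.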
Categorising Continuous Predictors
The assumption of linearity can be avoided by converting the continuous predictor into categories. However, this is generally not advisable for three reasons:
- Categorisation involves the loss of information
- It is unlikely that biological processes have a step-function relationship (i.e., sudden changes at specific cutpoints)
- The choice of cutpoints is arbitrary and, if data-driven, may lead to biased results
That said, about 5 categories will usually suffice to control for confounding effects. A model with a categorised variable can be compared to one with a continuous (linear) variable using AIC or BIC.
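A comparison of this kind could be set up as follows. The sketch cuts lwt from the birthwt data into roughly five quantile-based categories (both the variable and the cutpoints are choices made for illustration) and compares the categorised and linear versions by AIC.

```r
library(MASS)   # provides the birthwt data
bw <- birthwt

# Quintile cutpoints; unique() guards against tied quantiles
cuts <- quantile(bw$lwt, probs = seq(0, 1, by = 0.2))
bw$lwt_cat <- cut(bw$lwt, breaks = unique(cuts), include.lowest = TRUE)
table(bw$lwt_cat)

m_linear <- lm(bwt ~ lwt, data = bw)        # one slope for the continuous predictor
m_cat    <- lm(bwt ~ lwt_cat, data = bw)    # one indicator per category above the first

aic_tab <- AIC(m_linear, m_cat)             # smaller AIC indicates the preferred model
aic_tab
```

The categorised model spends several degrees of freedom where the linear model spends one, so it needs a clearly better fit to win on AIC.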
Polynomial Models
Polynomials allow the regression line to follow a curve rather than a straight line. Power terms (e.g., x² or x³) are added to the model. Unlike smoothed lines, polynomial models have a global-influence property—the shape of the entire line is influenced by all the data.
⚠ Centring to Avoid Collinearity
The original variable (x) is often highly correlated with its squared term (x²), creating collinearity. The solution is to centre the variable by subtracting the mean before squaring. If a quadratic model is insufficient (i.e., the quadratic term is significant but the fit is still poor), a cubic term (x³) can be added.
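The effect of centring can be checked numerically. The sketch below uses maternal age from the birthwt data (an arbitrary choice for illustration). Centring typically lowers the correlation between a variable and its square substantially, although some correlation can remain when the variable is skewed. The fitted curve itself is unchanged.

```r
library(MASS)   # provides the birthwt data
age   <- birthwt$age
age_c <- age - mean(age)                 # centred at the sample mean

r_raw     <- cor(age, age^2)             # raw variable versus its square
r_centred <- cor(age_c, age_c^2)         # centred variable versus its square
round(c(raw = r_raw, centred = r_centred), 3)

# Same quadratic curve, different parameterisation
m_raw <- lm(bwt ~ age + I(age^2), data = birthwt)
m_cen <- lm(bwt ~ age_c + I(age_c^2), data = birthwt)
same_fit <- isTRUE(all.equal(unname(fitted(m_raw)), unname(fitted(m_cen))))
same_fit
```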
Fractional Polynomials
Fractional polynomials (FPs) extend the idea of polynomial models by allowing power terms that are not restricted to positive integers. The most common set of powers to consider is: −3, −2, −1, −0.5, 0 (= ln), 0.5, 1, 2, 3. A 2-degree FP can fit a wide range of non-linear shapes and may be the most parsimonious way to model non-linearity.
A quadratic model regressing birth weight on centred gestation length showed R² = 0.29. When fractional polynomials were explored, the best-fitting 2-degree FP used the repeated power 3, that is, the terms gest³ and gest³ × ln(gest), yielding R² = 0.30 and fitting significantly better than the linear, quadratic, or cubic models. The FP coefficients are not directly interpretable; the only way to make sense of such a model is to display the function graphically.
Splines
An alternative to polynomial models is to fit a piecewise linear function. Points where the slope changes are called knot points. In the absence of prior evidence, knots may be chosen at percentiles of the predictor (e.g., 25th, 50th, 75th). Cubic splines allow for smoother transitions across knots compared to linear splines, producing more biologically plausible curves; the same logic extends to generalised additive models (Hastie & Tibshirani, 1986).
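Both kinds of spline can be fitted with ordinary lm(). In this sketch (the choice of lwt in the birthwt data and of quartile knots is an assumption for illustration), the linear spline is built from hinge terms of the form max(0, x − knot), and the cubic version uses ns() from the splines package, which fits a natural cubic spline that is constrained to be linear beyond the boundary knots.

```r
library(MASS)      # provides the birthwt data
library(splines)   # provides ns()
bw <- birthwt
knots <- quantile(bw$lwt, c(0.25, 0.50, 0.75))   # knots at the quartiles

# Linear spline: the slope is allowed to change at each knot
m_lin_spline <- lm(bwt ~ lwt + pmax(lwt - knots[1], 0) +
                     pmax(lwt - knots[2], 0) +
                     pmax(lwt - knots[3], 0), data = bw)

# Natural cubic spline with the same interior knots (smooth at the knots)
m_ns <- lm(bwt ~ ns(lwt, knots = knots), data = bw)

m_linear <- lm(bwt ~ lwt, data = bw)
AIC(m_linear, m_lin_spline, m_ns)

# Fitted curves over a grid of predictor values
grid <- data.frame(lwt = seq(min(bw$lwt), max(bw$lwt), length.out = 200))
plot(bw$lwt, bw$bwt, col = "grey60", xlab = "lwt (lb)", ylab = "Birth weight (g)")
lines(grid$lwt, predict(m_lin_spline, grid), lty = 2, lwd = 2)
lines(grid$lwt, predict(m_ns, grid), lwd = 2)
```

As with fractional polynomials, the spline coefficients are not meant to be read individually; the plotted curve is the interpretable output.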
| Approach | Influence | Strengths | Limitations |
|---|---|---|---|
| Smoothed lines | Local | Flexible; reveals non-linearity | Cannot be used in model itself; issues at extremes |
| Categorisation | N/A | Avoids linearity assumption | Loses information; arbitrary cutpoints |
| Polynomials | Global | Simple to implement; formal tests | May over-fit at extremes; collinearity |
| Fractional polynomials | Global | Very flexible with few terms | Coefficients not directly interpretable |
| Splines | Local | Flexible; smooth transitions | Sudden shifts at knots (linear splines) |
1. Why is categorising a continuous predictor generally not advisable?
2. What is a key difference between smoothed lines and polynomial models?
3. Why should you centre a continuous variable before adding its squared term to a regression model?
✎ Reflection
Think about a continuous predictor in your field. Would you expect the relationship with the outcome to be linear? If not, which approach (categorisation, polynomials, fractional polynomials, or splines) would you choose and why?
Interactions & Building the Model
Introduction and Overview
Sections 1–3 settled which predictors enter the model and how each one is shaped. Section 4 closes the lesson with the most consequential remaining design choices: which interaction terms (if any) to include, how to think about moderation as a substantive question, and which selection criterion to use when several plausible models compete.
Learning Objectives
- Identify interaction terms worth testing using subject-matter knowledge before searching the data.
- Distinguish statistical interaction from substantive moderation and explain why the latter requires a story, not just a p-value.
- Compare model-selection criteria (adjusted R², AIC, BIC, cross-validation) and explain what each rewards.
- Specify a model-building strategy — backward, forward, all-subsets, or theory-driven — and defend the choice.
- Recognise the dangers of data-driven selection (overfitting, optimistic standard errors) and how to mitigate them (Babyak, 2004).
Identifying Interaction Terms
It is important to consider including interaction terms when specifying the maximum model. There are five general strategies for creating and evaluating 2-way interactions:
Strategy 1: Create and test every possible pair of interaction terms. This is feasible only when the total number of predictors is small (e.g., ≤ 8).
Strategy 2: After building the final main-effects model, create interactions among all predictors that are statistically significant. This reduces the number of interactions to evaluate but may miss interactions with non-significant main effects.
Strategy 3: Create interactions among all predictors that have a significant unconditional association with the outcome. This casts a wider net than Strategy 2.
Strategy 4: Only create interactions among pairs of variables you suspect (based on evidence from the literature or biological reasoning) might interact. This usually focuses on interactions involving the primary predictor(s) of interest and important confounders.
Strategy 5: Only create interactions that involve the exposure variable (primary predictor of interest). This is the most conservative approach but may miss important interactions among covariates.
⚠ Important Rules for Interactions
If an interaction term is included in the model, the main effects that make it up must also be included. Evaluating many interactions increases the risk of identifying spurious associations, so a Bonferroni adjustment or similar correction may be warranted. Three-way interactions are usually very difficult to interpret and should be included only if there is strong a priori reason.
Moderation: Interactions With a Substantive Story
In the regression literature, an interaction is often described as moderation when it captures a substantive claim that the effect of one variable depends on another. A moderator W changes the slope of X → Y. Mathematically it is identical to an interaction (Y ~ X * W); the difference is conceptual.
Mediation vs. Moderation — One More Time
In Section 1 we used a DAG to set up mediation (X → M → Y — M is on the causal pathway). Moderation is structurally different: W does not sit between X and Y, it changes the strength of the X → Y arrow. Mediation answers through what?; moderation answers for whom, or under what conditions?
Using the built-in birthwt dataset (MASS package), we ask: does maternal age modify the effect of smoking on birth weight? If yes, smoking matters more (or less) for younger versus older mothers — a question about for-whom, not through-what.
# install.packages(c("MASS", "ggplot2", "interactions"))
library(MASS); library(ggplot2); library(interactions)
data("birthwt", package = "MASS")
bw <- birthwt
bw$smoke <- factor(bw$smoke, levels = c(0, 1),
                   labels = c("non-smoker", "smoker"))
# 1. Main-effects model (no moderation)
m_main <- lm(bwt ~ smoke + age, data = bw)
# 2. Moderation model: smoke * age expands to smoke + age + smoke:age
m_mod <- lm(bwt ~ smoke * age, data = bw)
# 3. Is the moderation needed? Compare nested models with a likelihood-ratio
#    test (here, the partial F-test from anova() because models are linear).
anova(m_main, m_mod)
summary(m_mod)
# 4. Plot the moderation: fitted age slopes drawn separately for
#    smokers and non-smokers
interact_plot(m_mod, pred = age, modx = smoke,
              interval = TRUE) +
  labs(y = "Birth weight (g)",
       title = "Effect of maternal age on birth weight, by smoking status")
# 5. Simple-slopes / Johnson-Neyman: where does the effect of one variable
#    become statistically meaningful across levels of the other?
sim_slopes(m_mod, pred = age, modx = smoke, johnson_neyman = TRUE)
Reading the moderation. The interaction term smokesmoker:age is negative and significant: the fitted age slope is lower for smokers than for non-smokers, so as maternal age rises, the gap between smokers and non-smokers in birth weight widens. interact_plot() shows this visually as two non-parallel lines; sim_slopes() tells you the “simple slope” of age within each smoking group, with confidence intervals. Activity: change the moderator from age to lwt (mother’s weight at last menstrual period). Does the effect of smoking on birth weight depend on maternal weight? Defend your answer with both the partial F-test and the plot.
Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console / plot before answering.
1. From anova(m_main, m_mod), what is the F statistic and p-value for adding the smoke:age interaction? Does adding the interaction significantly improve fit, and what does that tell you about whether age moderates the smoking-birthweight relationship?
anova(m_main, m_mod) returns an F statistic of roughly 4 with p ≈ 0.04, so adding the smoke:age interaction improves fit at the conventional 0.05 level. That signals effect modification: the smoking effect on birthweight depends on maternal age. The decision to retain the interaction is supported by both statistical significance and the substantive biological plausibility that older mothers may have different smoking-related risk profiles.
2. In summary(m_mod), what is the smokesmoker:age coefficient and its sign? In one plain-English sentence, describe how the smoking-vs-non-smoking gap in birthweight changes as maternal age increases.
The smokesmoker:age coefficient in summary(m_mod) is negative (approximately −47 g per year of maternal age), meaning the smoking-vs-non-smoking gap in birthweight widens as maternal age increases. Plain English: smoking's negative effect on birthweight is more pronounced in older mothers than in younger mothers, perhaps because older mothers have accumulated more cumulative smoking damage or because age and smoking jointly stress placental function.
3. From sim_slopes(..., johnson_neyman = TRUE), identify the range of maternal age over which the effect of smoking on birthweight is statistically significant. Outside that region, what would you conclude about smoking's effect for that subgroup?
A Johnson-Neyman interval describes the slope of the focal predictor (pred) across values of the moderator (modx), so this question treats smoking as the focal predictor and maternal age as the moderator. The output marks a boundary age above which the effect of smoking on birthweight is statistically significant; read the exact boundary from your own output. Below that boundary, the CI for the smoking effect crosses zero, meaning that at those young ages the data do not have enough power (or the effect is genuinely smaller) to detect a smoking effect. Conclusion for young mothers: while we cannot confirm a smoking effect with these data, we also cannot rule one out; the boundary is a statistical-power statement, not a biological one.
Building the Model: Selection Criteria
Once the maximum model has been specified, you need to decide how to determine which predictors to retain. Both non-statistical and statistical criteria should be considered.
Non-Statistical Considerations
Variables should be retained in the model if they:
- Are a primary predictor of interest
- Are thought a priori to be confounders for the primary predictor
- Show evidence of being a confounder (their removal causes a substantial change in the coefficient of interest)
- Are a component of an interaction term included in the model
Statistical Criteria for Nested Models
Models where one model’s predictors are a subset of another’s are called nested models. Tests for nested models include:
- Partial F-test (for linear regression)
- Wald test (most commonly used; can be unreliable if P-values are near 0.05 or SEs appear suspect)
- Likelihood-ratio test (LRT) (has the best statistical properties but requires fitting both models)
For categorical variables with multiple indicator terms, evaluate the overall significance of all indicators together, not individual terms.
Information Criteria (AIC & BIC)
For non-nested models, information criteria are used. The general formula is:
$$\mathrm{IC} = -2\ln L + a \cdot s$$
Where s is the number of parameters, lnL is the log-likelihood, and a is a penalty constant. For AIC, a = 2 (Akaike, 1974). For BIC, a = ln(n), where n is the sample size (Schwarz, 1978). Smaller values indicate a better model. BIC tends to favour more parsimonious models because its penalty per parameter exceeds that of AIC whenever n ≥ 8.
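Both criteria can be computed by hand as −2 lnL + a × s and checked against R's built-in AIC() and BIC(). The model below is an arbitrary example on the birthwt data. Note that for a linear model R counts the residual variance as a parameter, so s is the number of coefficients plus one.

```r
library(MASS)   # provides the birthwt data
fit <- lm(bwt ~ smoke + age + lwt, data = birthwt)

ll <- logLik(fit)
s  <- attr(ll, "df")     # parameters as counted by R: 4 coefficients + residual variance
n  <- nobs(fit)

aic_hand <- -2 * as.numeric(ll) + 2 * s        # penalty constant a = 2
bic_hand <- -2 * as.numeric(ll) + log(n) * s   # penalty constant a = ln(n)

c(AIC = aic_hand, BIC = bic_hand)
c(AIC = AIC(fit), BIC = BIC(fit))              # built-in values for comparison
```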
Guidelines for interpreting BIC differences between models:
- 0–<2: Weak evidence
- 2–<6: Positive evidence
- 6–<10: Strong evidence
- ≥10: Very strong evidence
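Adjusted R² & Mallows' Cp
Adjusted R² rewards variance explained while penalising unnecessary complexity:
$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$
The model that maximises adjusted R² is preferred.
Mallows' Cp compares the residual sum of squares of a candidate model with the error variance of the full model:
$$C_p = \frac{\mathrm{SSE}_k}{\sigma^2} - n + 2(k + 1)$$
Where SSE_k is the residual sum of squares of the candidate model, k is the number of predictors in the candidate model, σ² is the MSE from the full model, and n is the sample size. Mallows' Cp is a special case of the AIC. The model with the lowest Cp is generally considered the best.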
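Both quantities can be computed by hand for a pair of models. This sketch assumes the standard definitions: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), and Cp = SSE/σ² − n + 2(k + 1) with σ² taken from the full model. The full and candidate models on the birthwt data are arbitrary choices for illustration.

```r
library(MASS)   # provides the birthwt data
full <- lm(bwt ~ smoke + age + lwt + ht + ui, data = birthwt)   # stand-in "full" model
cand <- lm(bwt ~ smoke + lwt, data = birthwt)                   # candidate, k = 2

n  <- nobs(cand)
k  <- length(coef(cand)) - 1                  # predictors, excluding the intercept
r2 <- summary(cand)$r.squared
adj_r2_hand <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

sigma2_full <- summary(full)$sigma^2          # MSE from the full model
cp_cand <- deviance(cand) / sigma2_full - n + 2 * (k + 1)
cp_full <- deviance(full) / sigma2_full - n + 2 * length(coef(full))

round(c(adj_r2 = adj_r2_hand, Cp_candidate = cp_cand, Cp_full = cp_full), 3)
```

For the full model Cp always equals its own number of parameters, so the comparison of interest is whether a smaller candidate achieves a Cp close to or below its own k + 1.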
Specifying the Selection Strategy
Once criteria are established, there are several strategies for selecting which variables to include in the final model.
Best Practice Summary
Backward elimination is generally preferred over forward selection because each predictor is evaluated in the context of all others. However, the most important point is to combine statistical procedures with subject matter knowledge: retain known confounders and primary predictors regardless of statistical criteria, and always build a causal model first (Heinze, Wallisch, & Dunkler, 2018).
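One way to combine backward elimination with subject-matter constraints in R is the scope argument of step(), whose lower component lists terms that may never be dropped. In the sketch below, treating race as an a priori confounder of the smoking effect is an assumption made purely for illustration, as is the choice of AIC as the criterion.

```r
library(MASS)   # provides the birthwt data
bw <- birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Maximum model: exposure (smoke) plus all candidate covariates
full <- lm(bwt ~ smoke + age + lwt + race + ht + ui + ptl + ftv, data = bw)

# Backward elimination by AIC; 'lower' forces smoke and race to stay in the model
final <- step(full, scope = list(lower = ~ smoke + race),
              direction = "backward", trace = 0)
formula(final)
summary(final)$coefficients
```

Standard errors reported after any data-driven selection are optimistic, so the final model should still be checked with the reliability steps described earlier.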
1. If an interaction term between variables A and B is included in a regression model, which of the following must also be true?
2. What is the key difference between AIC and BIC?
3. Why is backward elimination generally preferred over forward selection?
✎ Reflection
Reflect on the tension between statistical model selection (AIC, BIC, stepwise methods) and subject matter knowledge. Why might a model selected purely by statistical criteria fail to answer your research question?
Lesson 4 — Final Assessment
Bringing It All Together
Lesson 4 sat between the mechanics of linear regression (Lesson 3) and the more specialised models that follow it (logistic, count, survival, mixed models). The unifying question was: given a clean dataset and a working regression engine, which predictors actually belong in the final model, and in what form? Section 1 distinguished prediction goals from causal-explanation goals and showed why the model-building strategy must follow the goal, not the other way around. Section 2 confronted two practical realities (too many candidate predictors and too many missing values) and laid out principled responses to each. Section 3 took the predictors that survived and asked how to model curvature: scatterplots and smoothers, categorisation, polynomials, splines. Section 4 closed with interactions, moderation, and the selection criteria (adjusted R², AIC, BIC, cross-validation) that adjudicate between competing candidate models.
The recurring lesson is that model building is a chain of small, defensible decisions, each one made before looking at the next p-value. Stepwise procedures and other purely data-driven shortcuts are dangerous precisely because they replace those decisions with an automated search that overfits and misreports its uncertainty (Babyak, 2004; Heinze, Wallisch, & Dunkler, 2018). With this lesson complete, the rest of HSCI 410 can extend the same disciplined logic into models for non-continuous outcomes and clustered designs.
This assessment covers all sections of Lesson 4. You must answer all 15 questions correctly to complete the lesson. Read each question carefully and review the feedback for any incorrect answers before retrying.
Key Takeaways from Lesson 4
- Model-building strategy depends on the goal: prediction tolerates noisy predictors for accuracy; causal explanation retains confounders and excludes mediators regardless of significance.
- Reducing the predictor set should be driven by subject-matter knowledge and causal structure, not by automated stepwise procedures.
- Missing-data handling (complete case, single imputation, multiple imputation) must match the missingness mechanism — defaulting to complete case is rarely defensible.
- Continuous predictors deserve a functional-form check; categorisation, polynomials, and splines each have a place but solve different problems.
- Interaction terms should be specified before looking at the data; statistical interaction without a substantive moderation story is rarely worth keeping.
- Model-selection criteria (adjusted R², AIC, BIC, cross-validation) reward different things; pick the criterion that matches your analytic goal and report it transparently.
✎ Final Reflection
Now that you have completed all four sections, summarise the key steps you would follow when building a regression model for a real-world epidemiological study. What role does subject matter knowledge play at each step?
1. What are the two broad objectives of building a regression model?
2. Why should parsimony guide model building?
3. What is the purpose of drawing a causal diagram before model building?
4. Which technique is used to evaluate whether related predictors can be combined into a single index?
5. What is the main difference between PCA and factor analysis?
6. Under the MCAR assumption, complete case analysis:
7. What does MNAR mean in the context of missing data?
8. What is the local-influence property of smoothed lines?
9. Why are fractional polynomials useful?
10. What are knot points in the context of splines?
11. Which of the following is NOT a non-statistical reason to retain a variable in the model?
12. In the formula IC = −2 lnL + a × s, what does ‘a’ equal for BIC?
13. Why is backward elimination generally preferred over forward selection?
14. A BIC difference of 8 between two non-nested models suggests:
15. When should three-way interaction terms be included in a model?