Model-Building Strategies
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Develop a full (maximal) model incorporating biological understanding of the system under study
- Carry out procedures to reduce a large number of predictors to a manageable subset
- Address issues related to the functional form of continuous predictors and missing values
- Build regression-type models using both statistical and non-statistical criteria
- Evaluate the reliability of a regression-type model
- Present the results from an analysis in a meaningful way
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Steps in Model Building
Why Model-Building Strategies Matter
When building a regression model, we need to decide on the goals of the analysis, incorporate both statistical considerations and subject matter knowledge, and balance the desire for parsimony (simplicity) with the desire for a model that “best fits” the data. The definition of “best fit” depends on the goal of the analysis, and the principles discussed in this chapter apply to all types of regression models.
Regression models are generally built to meet one of two broad objectives: (1) to build the best model for predicting future observations, or (2) to understand the causal relationship(s) between predictors and the outcome. The approach to model building differs depending on which goal you are pursuing.
Goals of the Analysis
If the goal is prediction, we want to retain even variables whose relationship with the dependent variable is questionable, because excluding them might lead to inaccurate predictions when future observations have extreme values for those variables. The details of specific predictors are of little consequence; overall predictive accuracy is what matters.
If the goal is understanding biological relationships, we want precise estimates of coefficients for the variables of interest. Careful attention must be paid to interaction and confounding effects. Factors likely to be confounders should be retained in the model regardless of statistical significance, while factors that are almost certainly not confounders should generally be excluded—especially if they are intervening variables, as their inclusion may bias results.
Steps in Building a Regression Model
The process of building a regression model follows a systematic set of steps. While statistical software handles the computation, the researcher must make many decisions along the way that require both subject matter expertise and statistical reasoning.
Step 1: Identify the outcome variable and determine whether it needs transformation (e.g., natural log). Then identify the full set of predictors to consider. The maximum model includes all possible predictors of interest. While a large model prevents overlooking important predictors, adding too many increases the risks of collinearity and spurious associations. Key sub-steps include: drawing a causal diagram, potentially reducing predictors, considering missing values, evaluating effects of continuous predictors, and deciding on interaction terms.
Step 2: Decide how you will determine which variables to retain. Criteria can be non-statistical (e.g., is it a primary predictor of interest? Is it a known confounder?) or statistical (e.g., partial F-tests, likelihood-ratio tests, information criteria like AIC or BIC). Both types of criteria should be considered together.
Step 3: Choose how to apply the criteria. Options include: examining all possible subsets, forward selection (adding variables one at a time), backward elimination (starting with all variables and removing them one at a time), or stepwise procedures (combining forward and backward). The strategy determines the order in which variables are evaluated.
Step 4: Conduct the analyses using your chosen strategy and criteria. Step 5: Evaluate the reliability of the chosen model using diagnostics and sensitivity analyses. Step 6: Present the results in a meaningful way, ensuring they are interpretable to your audience and that the model-building process is transparent.
Building a Causal Model
Before beginning the model-building process, it is imperative to have a causal model in place, usually presented as a causal diagram. The diagram identifies potential causal relationships among the predictors and the outcome of interest.
Suppose you want to study the effects of cigarette smoking on birth weight, and you also have data on the mother’s race, education level, total birth order, gestation length, number of babies born, and weight gain during pregnancy.
A causal diagram would show that gestation length and weight gain are intervening variables—they lie on the causal pathway between smoking and birth weight. If the objective is to quantify the total effect of smoking on birth weight, you would not include gestation length or weight gain in the model, because doing so would remove the effect of smoking that is mediated through them.
On the other hand, race and college education might be confounders and should be retained regardless of statistical significance. Building the causal diagram first helps ensure you do not accidentally adjust for intervening variables.
Confounders should be retained in the model to avoid bias. Intervening variables should generally be excluded when estimating total effects, because including them removes the indirect effect that passes through them. A causal diagram drawn before model building helps you distinguish between the two.
✔ Check Your Understanding
1. When the goal of a regression model is to understand biological relationships, which of the following is true?
2. What is the first step in building a regression model?
3. In a study of cigarette smoking’s effect on birth weight, why should gestation length generally NOT be included in the model?
✎ Reflection
Think about a research question in your own field. What would the causal diagram look like? How would you distinguish confounders from intervening variables?
Reducing Predictors & Missing Values
Reducing the Number of Predictors
It is sometimes necessary to reduce the number of predictors in the model-building process. Before undertaking any reduction, it is essential to identify the primary variables of interest and any variables that might be confounders or interacting variables—these should always be retained for consideration.
The most appropriate procedure for managing a large number of predictors is often to design a more focused study that collects high-quality data on fewer predictors. This greatly reduces the risk of identifying spurious associations.
Screening Predictors Based on Descriptive Statistics
Before starting any model building, become thoroughly familiar with your data using descriptive statistics (means, variances, percentiles for continuous variables; frequency tabulations for categorical variables). This helps identify variables of little value. Guidelines include:
- Avoid variables with large numbers of missing observations
- Select only variables with substantial variability (e.g., if almost all subjects are male, sex will not be a useful predictor)
- If a categorical variable has many categories with small counts, consider combining categories or eliminating the variable
Correlation Analysis
Examining pairwise correlations among predictor variables identifies pairs that contain essentially the same information. Highly correlated predictors (typically |r| > 0.9) produce multicollinearity, leading to unstable coefficient estimates and inflated standard errors.
If highly correlated pairs are found, select one based on criteria such as biological plausibility, ease of measurement, and fewer missing values. Note that pairwise screening will not detect multicollinearity arising from linear combinations of multiple predictors.
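As a sketch of this screening step, the pairwise check is easy to automate. The helper below uses hypothetical variable names and simulated data (numpy assumed available) to flag predictor pairs whose absolute correlation exceeds a chosen cutoff such as 0.9:

```python
import numpy as np

def flag_collinear_pairs(X, names, threshold=0.9):
    """Return predictor pairs whose absolute pairwise correlation exceeds threshold."""
    r = np.corrcoef(X, rowvar=False)              # correlation matrix of the columns
    flagged = []
    for i in range(r.shape[0]):
        for j in range(i + 1, r.shape[0]):
            if abs(r[i, j]) > threshold:
                flagged.append((names[i], names[j], round(float(r[i, j]), 3)))
    return flagged

# Simulated example: x2 is a near-duplicate of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
pairs = flag_collinear_pairs(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print(pairs)   # only the (x1, x2) pair should be flagged
```

Of each flagged pair, keep the member with the stronger biological rationale, easier measurement, or fewer missing values, as described above.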
Creation of Indices & Cronbach’s Alpha
Related predictors can sometimes be combined into a single index. For example, the Hamilton Rating Scale for Depression combines 22 characteristics into an overall depression score.
Cronbach’s alpha evaluates the internal consistency of such a scale—how well each predictor correlates with the overall scale. Interpretation guidelines:
- < 0.60: Unacceptable
- 0.60–0.65: Undesirable
- 0.65–0.70: Minimally acceptable
- 0.70–0.80: Respectable
- 0.80–0.90: Very good
- > 0.90: Consider shortening the scale
One drawback of indices is that they preclude evaluating the effects of the individual factors that were combined.
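Cronbach's alpha is straightforward to compute from item-level data. A minimal sketch (numpy only; the simulated items are hypothetical and all tap one latent construct) using the standard formula alpha = k/(k−1) × (1 − sum of item variances / variance of the summed scale):

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated scale: 4 items sharing one latent construct, plus item-specific noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
items = latent + rng.normal(scale=0.8, size=(300, 4))
alpha = float(cronbach_alpha(items))
print(round(alpha, 2))
```

With these simulation settings the items are fairly strongly related, so alpha should land in the "respectable" to "very good" range of the guidelines above.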
Screening Variables Based on Unconditional Associations
A common approach is to select only predictors with unconditional associations significant at a liberal P-value (e.g., 0.15 or 0.20). Simple univariable regression models are used for this screening.
One drawback: an important predictor might be excluded if its effect is masked by another variable (i.e., confounding is present). Using a liberal P-value helps prevent this. Another approach is to build the model with significant predictors, then add back excluded predictors one at a time to check if any become significant after adjusting for other variables.
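A sketch of this screening loop, assuming a continuous outcome and using a normal approximation so that |t| > 1.28 roughly corresponds to a two-sided P < 0.20; all variable names and effect sizes are illustrative:

```python
import numpy as np

def slope_t(y, x):
    """t-statistic for the slope in a simple (univariable) linear regression of y on x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(2)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)    # x3 has no true effect

# Keep predictors whose unconditional |t| exceeds ~1.28 (liberal two-sided P < 0.20)
kept = [name for name, x in [("x1", x1), ("x2", x2), ("x3", x3)]
        if abs(slope_t(y, x)) > 1.28]
print(kept)
```

After the multivariable model is built from the kept predictors, the excluded ones can be added back one at a time as described above.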
PCA, Factor Analysis & Correspondence Analysis
Principal Components Analysis (PCA) converts a set of k predictor variables into k orthogonal (uncorrelated) principal components, each containing a decreasing proportion of total variation. A small subset of components is then used as predictors, eliminating multicollinearity. Coefficients can be back-transformed to the original predictors, though interpretation is less direct.
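A compact way to carry out PCA is via the singular value decomposition of the centred predictor matrix. The sketch below (numpy; simulated data with two underlying dimensions driving five predictors) extracts the component scores and the proportion of total variation each component explains:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 5
Z = rng.normal(size=(n, 2))                    # two underlying dimensions
X = Z @ rng.normal(size=(2, k)) + 0.1 * rng.normal(size=(n, k))

Xc = X - X.mean(axis=0)                        # centre each predictor
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()                # proportion of total variation per component
scores = Xc @ Vt.T                             # principal component scores (uncorrelated)
print(np.round(explained, 3))
```

Because the scores are mutually uncorrelated, using only the first few components as predictors eliminates multicollinearity, at the cost of less direct interpretation.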
Factor analysis is similar but assumes factors with inherent meaning can be created from the original variables. Unlike PCA, the composition of factors varies as the number selected changes. Predictors with high “factor loadings” are identified as important determinants.
Correspondence analysis is designed for categorical variables. It produces a visual summary (2D scatterplot) of complex relationships, showing which clusters of predictors are associated with which clusters of outcome values.
The Problem of Missing Values
Missing data are common in observational studies. Statistical programs use complete case analysis by default—only observations with no missing values for any variable are included. Even a relatively low overall percentage of missing values can result in a substantial reduction of the sample if missing data are spread across observations.
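The attrition caused by complete case analysis is easy to underestimate. The short simulation below (hypothetical figures: 20 predictors, each with 5% of values missing completely at random) shows how modest per-variable missingness compounds:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p_missing = 1000, 20, 0.05            # 20 predictors, each 5% MCAR missingness
missing = rng.random((n, k)) < p_missing    # True where a value is missing
complete = ~missing.any(axis=1)             # rows with no missing values at all
print(complete.mean())                      # ≈ 0.95**20 ≈ 0.36
```

Even though only 5% of any one variable is missing, roughly two thirds of the observations are lost to a complete case analysis.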
Dealing with Missing Data: Imputation
The two main alternatives to complete case analysis are imputation and analysis methods where missing data are ignorable. Imputation involves replacing missing data points with values predicted from available data.
Single imputation derives one estimate for each missing value. However, analysis based on single imputed data does not account for the uncertainty of the estimated values. Multiple imputation generates multiple imputed datasets and combines results, properly accounting for this uncertainty. Multiple imputation is generally preferred over single imputation.
Maximum likelihood (ML) and Bayesian estimation are procedures that make missing values ignorable under the missing-at-random (MAR) assumption. ML requires specifying the distribution of missing values for predictors, but this is unnecessary for missing outcome values. These methods are closely linked to multiple imputation conceptually.
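The pooling step of multiple imputation follows Rubin's rules: the point estimate is the average across imputed datasets, and the total variance combines within- and between-imputation components. A toy sketch (numpy; a crude hot-deck draw from the observed values stands in for a proper imputation model) for estimating the mean of a variable with values missing completely at random:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=10, scale=2, size=500)
y[rng.random(500) < 0.2] = np.nan              # ~20% missing completely at random
obs = y[~np.isnan(y)]

m = 20                                         # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    y_imp = y.copy()
    miss = np.isnan(y_imp)
    # crude stand-in for a proper imputation model: draw from the observed values
    y_imp[miss] = rng.choice(obs, size=miss.sum(), replace=True)
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / len(y_imp))

q_bar = float(np.mean(estimates))              # pooled point estimate
W = float(np.mean(variances))                  # within-imputation variance
B = float(np.var(estimates, ddof=1))           # between-imputation variance
T = W + (1 + 1 / m) * B                        # Rubin's total variance
print(round(q_bar, 2), round(np.sqrt(T), 3))
```

The total variance T exceeds the naive within-imputation variance W, which is exactly the extra uncertainty that single imputation fails to capture.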
✔ Check Your Understanding
1. What does Cronbach’s alpha measure?
2. Under which missing data mechanism is complete case analysis most likely to produce biased results?
3. Why is multiple imputation generally preferred over single imputation?
✎ Reflection
Consider a dataset you have worked with (or imagine one). Which type of missing data mechanism (MCAR, MAR, MNAR) do you think was most likely present, and why? What approach would you take to handle it?
Effects of Continuous Predictors
Evaluating Continuous Predictor–Outcome Relationships
It is important to evaluate the structure of the relationship between a continuous predictor and the outcome before starting model building. The underlying assumption of linearity can be evaluated through diagnostics after fitting the model, but it is useful to explore the nature of the relationship beforehand.
Four main approaches to evaluating the effect of continuous predictors are: (1) scatterplots and smoothed line plots, (2) categorising the continuous variable, (3) exploring polynomial models, and (4) using splines.
Scatterplots & Smoothed Lines
Scatterplots are 2-way plots of the outcome (Y-axis) versus the continuous predictor (X-axis). They are most useful for continuous outcomes; scatterplots of dichotomous outcomes present as two lines of dots and are rarely informative by themselves.
Scatterplots can be greatly improved by adding a smoothed line through the centre of the data. All smoothed lines have a local-influence property: the position of the line at any value of x is influenced by nearby points but not by distant points.
There are several types of smoothed line functions:
- Running mean smoother: Computes a simple average of y values in the neighbourhood
- Running line smoother: Fits a simple linear regression through observations in the neighbourhood
- Lowess smoother: Fits a weighted linear regression in which points closer to the target value x_i receive larger weight (tricube weighting)
- Local polynomial smoother: Fits a weighted polynomial regression in the neighbourhood
The bandwidth controls the size of the neighbourhood. A bandwidth of 0.8 means 80% of the data is used for each point. Larger bandwidths produce smoother lines but may miss important features.
All smoothed line functions can have problems at the extreme values of the predictor distribution. This is because the neighbourhood at the tails is not symmetrical and contains relatively few data points. It is important not to pay much attention to the extremes of the fitted line. Vertical dashed lines marking the 2.5th and 97.5th percentiles can help delineate where most of the data falls.
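The running mean smoother described above is simple to implement directly. This sketch (numpy; simulated sinusoidal data) also makes the tail problem visible, since the neighbourhood is truncated and asymmetric at the extremes of x:

```python
import numpy as np

def running_mean_smoother(x, y, bandwidth=0.3):
    """Smoothed value at each x: mean of y over the nearest `bandwidth` share of points."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    half = max(1, int(bandwidth * n / 2))
    smoothed = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)   # window truncated at the tails
        smoothed[i] = y[lo:hi].mean()
    return x, smoothed

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
xs, ys = running_mean_smoother(x, y, bandwidth=0.2)

# In the interior the smoother tracks sin(x) closely; trust the tails much less
mid = (xs > 2) & (xs < 8)
err_mid = float(np.mean(np.abs(ys[mid] - np.sin(xs[mid]))))
print(round(err_mid, 3))
```

Increasing the bandwidth trades a smoother line for less ability to follow genuine features, which is the same trade-off described above.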
Categorising Continuous Predictors
The assumption of linearity can be avoided by converting the continuous predictor into categories. However, this is generally not advisable for three reasons:
- Categorisation involves the loss of information
- It is unlikely that biological processes have a step-function relationship (i.e., sudden changes at specific cutpoints)
- The choice of cutpoints is arbitrary and, if data-driven, may lead to biased results
That said, about 5 categories will usually suffice to control for confounding effects. A model with a categorised variable can be compared to one with a continuous (linear) variable using AIC or BIC.
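That comparison can be made concrete by computing AIC for a linear versus a quintile-categorised version of the predictor. In the sketch below (numpy; simulated data in which the true relationship really is linear) the linear model should win:

```python
import numpy as np

def ols_aic(X, y):
    """AIC of an OLS fit (Gaussian likelihood up to a constant): n*ln(RSS/n) + 2p."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, p = X.shape
    return n * np.log(rss / n) + 2 * p

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(0, 10, n)
y = 0.8 * x + rng.normal(size=n)               # the true relationship is linear

X_lin = np.column_stack([np.ones(n), x])
cats = np.digitize(x, np.quantile(x, [0.2, 0.4, 0.6, 0.8]))   # 5 quintile categories
X_cat = np.column_stack([np.ones(n)] + [(cats == j).astype(float) for j in range(1, 5)])

aic_lin, aic_cat = ols_aic(X_lin, y), ols_aic(X_cat, y)
print(round(aic_lin, 1), round(aic_cat, 1))    # the linear model should have the lower AIC
```

The categorised model pays twice: it uses more parameters and it cannot capture the within-category trend, so its residual variance is larger.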
Polynomial Models
Polynomials allow the regression line to follow a curve rather than a straight line. Power terms (e.g., x² or x³) are added to the model. Unlike smoothed lines, polynomial models have a global-influence property—the shape of the entire line is influenced by all the data.
The original variable (x) is often highly correlated with its squared term (x²), creating collinearity. The solution is to centre the variable by subtracting its mean before squaring. If the quadratic term is significant but curvature remains in the residuals, a cubic term (x³) can be added.
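The effect of centring on collinearity is easy to demonstrate: for an approximately symmetric predictor, the correlation between the centred variable and its square drops to near zero.

```python
import numpy as np

x = np.linspace(1, 10, 100)
r_raw = float(np.corrcoef(x, x**2)[0, 1])        # high: x and x^2 move together
xc = x - x.mean()
r_centred = float(np.corrcoef(xc, xc**2)[0, 1])  # near zero for a symmetric predictor
print(round(r_raw, 3), round(r_centred, 3))
```

With a skewed predictor the centred correlation will not vanish entirely, but it is still much smaller than the raw correlation.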
Fractional Polynomials
Fractional polynomials (FPs) extend the idea of polynomial models by allowing power terms that are not restricted to positive integers. The most common set of powers to consider is: −3, −2, −1, −0.5, 0 (= ln), 0.5, 1, 2, 3. A 2-degree FP can fit a wide range of non-linear shapes and may be the most parsimonious way to model non-linearity.
A quadratic model regressing birth weight on centred gestation length showed R² = 0.29. When fractional polynomials were explored, the best-fitting 2-degree FP used the repeated power (3, 3), giving terms in gest³ and gest³ × ln(gest), yielding R² = 0.30 and fitting significantly better than the linear, quadratic, or cubic models. The FP coefficients are not directly interpretable; the only way to make sense of such a model is to display the fitted function graphically.
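A brute-force search over the conventional power set is enough to find the best-fitting 2-degree FP by least squares. The sketch below (numpy; simulated data) follows the usual conventions that a power of 0 means ln(x) and that a repeated power (p, p) contributes x^p and x^p × ln(x); it is illustrative rather than the formal FP fitting algorithm:

```python
import numpy as np
from itertools import combinations_with_replacement

POWERS = [-3, -2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp_term(x, p):
    return np.log(x) if p == 0 else x ** p

def best_fp2(x, y):
    """Search all 2-degree fractional polynomials; return (powers, RSS) of the best fit."""
    n = len(x)
    best = None
    for p1, p2 in combinations_with_replacement(POWERS, 2):
        t1 = fp_term(x, p1)
        # a repeated power (p, p) uses x^p and x^p * ln(x) as its two terms
        t2 = fp_term(x, p2) * np.log(x) if p1 == p2 else fp_term(x, p2)
        X = np.column_stack([np.ones(n), t1, t2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        if best is None or rss < best[1]:
            best = ((p1, p2), rss)
    return best

rng = np.random.default_rng(8)
x = rng.uniform(0.5, 5, 400)                 # x must be positive for these powers
y = np.log(x) + 1 / x + rng.normal(scale=0.2, size=400)
powers, rss = best_fp2(x, y)
print(powers, round(rss, 1))
```

As the text notes, the selected powers are not interpretable on their own; plot the fitted function over the observed range of x to understand the model.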
Splines
An alternative to polynomial models is to fit a piecewise linear function. Points where the slope changes are called knot points. In the absence of prior evidence, knots may be chosen at percentiles of the predictor (e.g., 25th, 50th, 75th). Cubic splines allow for smoother transitions across knots compared to linear splines, producing more biologically plausible curves.
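Constructing a linear spline basis requires only the truncated terms (x − k)+ for each knot. A sketch (numpy; simulated data with a true slope change at x = 5, knots placed at quartiles for lack of prior evidence):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Design matrix for a linear spline: intercept, x, and (x - k)+ for each knot."""
    cols = [np.ones_like(x), x]
    cols += [np.clip(x - k, 0, None) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 500)
# simulated truth: the slope changes from 1 to 0.2 at x = 5 (a knot point)
y = np.where(x < 5, x, 5 + 0.2 * (x - 5)) + rng.normal(scale=0.3, size=500)

knots = np.quantile(x, [0.25, 0.5, 0.75])      # knots at percentiles, absent prior evidence
X = linear_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rmse = float(np.sqrt(resid @ resid / len(y)))
print(round(rmse, 3))
```

The coefficient on each (x − k)+ term is the change in slope at that knot; a cubic spline basis replaces these terms with smoother functions but is fitted the same way.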
| Approach | Influence | Strengths | Limitations |
|---|---|---|---|
| Smoothed lines | Local | Flexible; reveals non-linearity | Cannot be used in model itself; issues at extremes |
| Categorisation | N/A | Avoids linearity assumption | Loses information; arbitrary cutpoints |
| Polynomials | Global | Simple to implement; formal tests | May over-fit at extremes; collinearity |
| Fractional polynomials | Global | Very flexible with few terms | Coefficients not directly interpretable |
| Splines | Local | Flexible; smooth transitions | Sudden shifts at knots (linear splines) |
✔ Check Your Understanding
1. Why is categorising a continuous predictor generally not advisable?
2. What is a key difference between smoothed lines and polynomial models?
3. Why should you centre a continuous variable before adding its squared term to a regression model?
✎ Reflection
Think about a continuous predictor in your field. Would you expect the relationship with the outcome to be linear? If not, which approach (categorisation, polynomials, fractional polynomials, or splines) would you choose and why?
Interactions & Building the Model
Identifying Interaction Terms
It is important to consider including interaction terms when specifying the maximum model. There are five general strategies for creating and evaluating 2-way interactions:
Strategy 1: Create and test every possible pair of interaction terms. This is feasible only when the total number of predictors is small (e.g., ≤ 8).
Strategy 2: After building the final main-effects model, create interactions among all predictors that are statistically significant. This reduces the number of interactions to evaluate but may miss interactions involving non-significant main effects.
Strategy 3: Create interactions among all predictors that have a significant unconditional association with the outcome. This casts a wider net than Strategy 2.
Strategy 4: Only create interactions among pairs of variables you suspect (based on evidence from the literature or biological reasoning) might interact. This usually focuses on interactions involving the primary predictor(s) of interest and important confounders.
Strategy 5: Only create interactions that involve the exposure variable (primary predictor of interest). This is the most conservative approach but may miss important interactions among covariates.
If an interaction term is included in the model, the main effects that make it up must also be included. Evaluating many interactions increases the risk of identifying spurious associations, so a Bonferroni adjustment or similar correction may be warranted. Three-way interactions are usually very difficult to interpret and should be included only if there is strong a priori reason.
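Whichever strategy is used, a 2-way interaction term is simply the product of its two main effects, and both main effects stay in the model alongside it. A sketch (numpy; simulated data with a genuine exposure-by-covariate interaction) fitting the full model and computing a Wald-type t-statistic for the product term:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 600
exposure = rng.binomial(1, 0.5, n).astype(float)
covariate = rng.normal(size=n)
# simulated truth: the covariate modifies the exposure effect (interaction = 0.5)
y = 1.0 * exposure + 0.3 * covariate + 0.5 * exposure * covariate + rng.normal(size=n)

# the model keeps both main effects plus their product
X = np.column_stack([np.ones(n), exposure, covariate, exposure * covariate])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_interaction = float(beta[3] / se[3])
print(round(float(beta[3]), 2), round(t_interaction, 1))
```

If many such terms are screened, the significance threshold should be tightened (e.g., Bonferroni) as noted above.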
Building the Model: Selection Criteria
Once the maximum model has been specified, you need to decide how to determine which predictors to retain. Both non-statistical and statistical criteria should be considered.
Non-Statistical Considerations
Variables should be retained in the model if they:
- Are a primary predictor of interest
- Are thought a priori to be confounders for the primary predictor
- Show evidence of being a confounder (their removal causes a substantial change in the coefficient of interest)
- Are a component of an interaction term included in the model
Statistical Criteria for Nested Models
Models where one model’s predictors are a subset of another’s are called nested models. Tests for nested models include:
- Partial F-test (for linear regression)
- Wald test (most commonly used; can be unreliable if P-values are near 0.05 or SEs appear suspect)
- Likelihood-ratio test (LRT) (has the best statistical properties but requires fitting both models)
For categorical variables with multiple indicator terms, evaluate the overall significance of all indicators together, not individual terms.
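For nested Gaussian (linear regression) models, the LRT statistic can be computed from the two residual sums of squares. A sketch (numpy; the chi-squared critical value 3.84 for 1 df at alpha = 0.05 is hard-coded rather than looked up):

```python
import numpy as np

def gaussian_loglik(rss, n):
    """Maximised Gaussian log-likelihood of an OLS fit, as a function of its RSS."""
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

rng = np.random.default_rng(11)
n = 300
x1, x2 = rng.normal(size=(2, n))
y = 0.6 * x1 + 0.4 * x2 + rng.normal(size=n)

def rss(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

X_reduced = np.column_stack([np.ones(n), x1])          # nested within the full model
X_full = np.column_stack([np.ones(n), x1, x2])
lrt = 2 * (gaussian_loglik(rss(X_full), n) - gaussian_loglik(rss(X_reduced), n))
print(round(lrt, 1))   # compare to the chi-squared critical value 3.84 (1 df, alpha = 0.05)
```

The degrees of freedom for the test equal the number of parameters dropped, so a categorical variable with several indicator terms is tested with all of its indicators removed at once.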
Information Criteria (AIC & BIC)
For non-nested models, information criteria are used. The general formula is:
IC = −2 lnL + a × s
where s is the number of parameters, lnL is the log-likelihood, and a is a penalty constant. For AIC, a = 2. For BIC, a = ln(n). Smaller values indicate a better model. Because ln(n) exceeds 2 for n ≥ 8, BIC penalises extra parameters more heavily and tends to favour more parsimonious models.
Guidelines for interpreting BIC differences between models:
- 0–<2: Weak evidence
- 2–<6: Positive evidence
- 6–<10: Strong evidence
- ≥10: Very strong evidence
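The generic IC formula makes the AIC/BIC contrast concrete. In the sketch below the log-likelihoods are made-up illustrative numbers, chosen so that the larger model wins on AIC while BIC's heavier ln(n) penalty prefers the smaller one:

```python
import numpy as np

def information_criterion(loglik, n_params, penalty):
    """IC = -2*lnL + a*s, where a = 2 for AIC and a = ln(n) for BIC."""
    return -2 * loglik + penalty * n_params

# Hypothetical log-likelihoods for two candidate models fitted to n = 200 observations
n = 200
aic_small = information_criterion(-310.0, 3, 2)
aic_big = information_criterion(-303.0, 8, 2)
bic_small = information_criterion(-310.0, 3, np.log(n))
bic_big = information_criterion(-303.0, 8, np.log(n))

print(aic_small, aic_big)                        # AIC prefers the bigger model here
print(round(bic_small, 1), round(bic_big, 1))    # BIC prefers the smaller model
```

The disagreement is not an error: the two criteria encode different penalties for complexity, so the choice between them should reflect how strongly you value parsimony.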
Adjusted R² & Mallow’s Cp
Adjusted R² rewards variance explained while penalising unnecessary complexity; the model that maximises adjusted R² is preferred. Mallow's Cp compares each candidate model against the full model:
Cp = SSE_k / σ̂² − (n − 2(k + 1))
where k is the number of predictors in the candidate model, SSE_k is its error sum of squares, σ̂² is the MSE from the full model, and n is the sample size. Mallow's Cp is a special case of the AIC. The model with the lowest Cp is generally considered the best; a model that fits well has Cp close to k + 1.
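Both criteria are simple functions of residual sums of squares. The sketch below (numpy; simulated data in which x1 and x2 matter and x3 does not) computes adjusted R² and Cp for three nested candidates; note that Cp for the full model equals k + 1 by construction:

```python
import numpy as np

def adj_r2(rss, tss, n, k):
    """Adjusted R-squared for a model with k predictors."""
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

def mallows_cp(rss_k, mse_full, n, k):
    """Mallow's Cp = SSE_k / MSE_full - (n - 2(k + 1))."""
    return rss_k / mse_full - (n - 2 * (k + 1))

rng = np.random.default_rng(12)
n = 250
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.7 * x1 + 0.5 * x2 + rng.normal(size=n)   # x3 is irrelevant

def fit_rss(cols):
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

tss = float(np.sum((y - y.mean()) ** 2))
mse_full = fit_rss([x1, x2, x3]) / (n - 4)
results = {}
for label, cols in [("x1", [x1]), ("x1+x2", [x1, x2]), ("x1+x2+x3", [x1, x2, x3])]:
    k = len(cols)
    results[label] = (adj_r2(fit_rss(cols), tss, n, k),
                      mallows_cp(fit_rss(cols), mse_full, n, k))
print({lab: (round(a, 3), round(c, 1)) for lab, (a, c) in results.items()})
```

Dropping the important x2 inflates Cp dramatically, while dropping the irrelevant x3 leaves both criteria essentially unchanged, which is exactly the behaviour a selection criterion should have.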
Specifying the Selection Strategy
Once criteria are established, there are several strategies for selecting which variables to include in the final model.
Backward elimination is generally preferred over forward selection because each predictor is evaluated in the context of all others. However, the most important point is to combine statistical procedures with subject matter knowledge: retain known confounders and primary predictors regardless of statistical criteria, and always build a causal model first.
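A backward elimination loop can encode both kinds of criteria at once if the primary exposure and known confounders are forced to stay regardless of significance. A sketch (numpy; Wald-type |t| statistics with a normal-approximation threshold; all variable names are illustrative):

```python
import numpy as np

def backward_eliminate(cols, names, y, forced, t_threshold=1.96):
    """Repeatedly drop the weakest non-forced predictor (|t| below threshold).
    Predictors in `forced` (primary exposures, known confounders) are never removed."""
    cols, kept = list(cols), list(names)
    n = len(y)
    while True:
        X = np.column_stack([np.ones(n)] + cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        t = np.abs(beta[1:] / se[1:])                    # skip the intercept
        candidates = [(t[i], i) for i in range(len(kept)) if kept[i] not in forced]
        if not candidates:
            return kept
        t_min, idx = min(candidates)
        if t_min >= t_threshold:
            return kept
        cols.pop(idx)
        kept.pop(idx)

rng = np.random.default_rng(13)
n = 400
smoke, conf, noise = rng.normal(size=(3, n))
y = 0.5 * smoke + 0.4 * conf + rng.normal(size=n)
kept = backward_eliminate([smoke, conf, noise], ["smoke", "conf", "noise"],
                          y, forced={"smoke"})
print(kept)
```

Because each predictor is tested in the context of all remaining predictors, this avoids the masking problem that forward selection can suffer from; the `forced` set is where subject matter knowledge enters the algorithm.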
✔ Check Your Understanding
1. If an interaction term between variables A and B is included in a regression model, which of the following must also be true?
2. What is the key difference between AIC and BIC?
3. Why is backward elimination generally preferred over forward selection?
✎ Reflection
Reflect on the tension between statistical model selection (AIC, BIC, stepwise methods) and subject matter knowledge. Why might a model selected purely by statistical criteria fail to answer your research question?
Lesson 3 — Final Assessment
This assessment covers all sections of Lesson 3. You must answer all 15 questions correctly to complete the lesson. Read each question carefully and review the feedback for any incorrect answers before retrying.
✎ Final Reflection
Now that you have completed all four sections, summarise the key steps you would follow when building a regression model for a real-world epidemiological study. What role does subject matter knowledge play at each step?
✔ Final Assessment
1. What are the two broad objectives of building a regression model?
2. Why should parsimony guide model building?
3. What is the purpose of drawing a causal diagram before model building?
4. Which technique is used to evaluate whether related predictors can be combined into a single index?
5. What is the main difference between PCA and factor analysis?
6. Under the MCAR assumption, complete case analysis:
7. What does MNAR mean in the context of missing data?
8. What is the local-influence property of smoothed lines?
9. Why are fractional polynomials useful?
10. What are knot points in the context of splines?
11. Which of the following is NOT a non-statistical reason to retain a variable in the model?
12. In the formula IC = −2 lnL + a × s, what does ‘a’ equal for BIC?
13. Why is backward elimination generally preferred over forward selection?
14. A BIC difference of 8 between two non-nested models suggests:
15. When should three-way interaction terms be included in a model?