Linear Regression
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Identify when least squares regression is an appropriate analytical tool
- Construct a linear model with control of confounding and identification of interaction
- Interpret regression coefficients from both technical and causal perspectives
- Convert nominal, ordinal, or continuous predictors into indicator variables
- Assess model assumptions including linearity, homoscedasticity, and normality of residuals
- Detect and address collinearity among predictor variables
- Identify study designs that require a time-series approach to analysis
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Regression Analysis
Introduction and Overview
Lessons 1 and 2 produced a clean, descriptive view of the data. Lesson 3 takes the next step from description to inference: linear regression is the workhorse model for explaining or predicting a continuous outcome from one or more predictors (Stigler, 1981; Wikipedia, 2026). Across five content sections we walk through this in order: the simple and multivariable model and what its coefficients mean (Section 1), the ANOVA decomposition and how to test the model and its individual coefficients (Section 2), how to handle different types of predictor variables and detect collinearity (Section 3), how to detect and model interactions and how to give a regression a defensible causal interpretation (Section 4), and finally diagnostics and reporting (Section 5).
Learning Objectives
- State when linear regression is the appropriate modelling choice for a public-health outcome.
- Write down and interpret the simple linear regression equation, including the intercept and slope.
- Extend the simple model to a multivariable model and explain what each coefficient now represents.
- Distinguish predictive from causal interpretations of regression coefficients.
Why Linear Regression?
Up to this point, most examples of relating an outcome to an exposure have been based on qualitative outcome variables—that is, variables that are categorical or dichotomous. Linear regression is suitable for modelling the outcome when it is measured on a continuous or near-continuous scale. Examples include birth weight, blood pressure, body mass index, and disease frequency at a regional level.
Key Concept
In regression analysis, the relationship between the outcome and the predictors is asymmetric—we think the value of the outcome is caused by (or we wish to predict it by) the value of another variable (the predictor). Using X-variables to predict Y does not necessarily imply causation; we might just be estimating predictive associations.
The Simple Regression Model
When only one predictor variable is used, the model is called a simple regression model. The term “model” denotes the formal statistical formula that describes the relationship between the predictor and the outcome.
Interactive visualization (6 scenes): springs, residuals, and the line that minimizes them. The scenes show scattered observations, a wobbling candidate line, residuals as physical springs, and the line settling into the unique position that minimizes the sum of squared errors.
The simple regression model is Y = β0 + β1X1 + ε. In this equation, β0 is the intercept (or constant), β1 is the regression coefficient (slope) for X1, and ε is the error term. The errors are assumed to be independent and normally distributed with mean 0 and constant variance (ε ~ N(0, σ²)). We estimate these errors by residuals: the difference between each observed value and the value predicted by the model.
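The least-squares estimates for the simple model can be computed directly from the data. The sketch below uses simulated data (all names and numbers are illustrative assumptions, not values from the course dataset) and checks the closed-form slope and intercept against lm().

```r
# Simulated data; every name and number here is illustrative only.
set.seed(1)
n <- 200
x <- rnorm(n, mean = 50, sd = 10)
y <- 5 + 0.8 * x + rnorm(n, sd = 4)   # assumed true intercept 5, slope 0.8

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)              # slope
b0 <- mean(y) - b1 * mean(x)          # intercept

# Residuals: observed value minus value predicted by the line
res <- y - (b0 + b1 * x)

fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = coef(fit))
sum(res^2)                            # the sum of squared errors that OLS minimises
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on any hand calculation.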
Interactive: OLS Line-of-Best-Fit Sandbox
In this sandbox, points can be added to or removed from the chart, and the least-squares line, residuals, R², and standard error of the slope update live. Adding a single extreme outlier shows how one observation can drag the entire line (Cook, 1977; Belsley, Kuh, & Welsch, 1980).
The Multivariable Model
Almost without exception, the regression models used by epidemiologists will contain more than one predictor variable. These are known as multiple regression or multivariable models.
Terminology Note
Multivariate indicates 2 or more outcome variables; multivariable denotes more than 1 predictor. In epidemiology, we almost always mean multivariable models.
With two predictors, the model becomes Y = β0 + β1X1 + β2X2 + ε. A major difference from simple regression is that, in the multivariable model, β1 is an estimate of the effect of X1 on Y after controlling for the effects of X2. This is the key advantage of multivariable analysis: it can account for confounding by extraneous variables that are measured and included in the model.
In observational studies, incorporating more than one predictor almost always leads to a more complete understanding of how the outcome varies, and it decreases the chance that the regression coefficients for exposures of interest are biased by confounding variables. The βs are not biased by any variable included in the equation, but they can be biased if confounding variables are omitted from the equation.
Assuming we have not included intervening variables or effects of the outcome in our model, the βs are not confounded by any variable in the regression equation. However, from a causal perspective, if intervening variables are included, the coefficients do not estimate the causal effect. One can never be sure that there are no important unmeasured confounders that were omitted from the model.
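A small simulation can make the omitted-confounder point concrete. This is a minimal sketch under stated assumptions: x2 is a hypothetical confounder that influences both the exposure x1 and the outcome y, and the true effect of x1 is set to 2 by construction.

```r
# Simulated confounding; all variables and effect sizes are assumptions.
set.seed(2)
n  <- 5000
x2 <- rnorm(n)                          # confounder
x1 <- 0.7 * x2 + rnorm(n)               # exposure depends on the confounder
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true effect of x1 on y is 2

crude    <- coef(lm(y ~ x1))["x1"]      # confounder omitted
adjusted <- coef(lm(y ~ x1 + x2))["x1"] # confounder included

c(crude = unname(crude), adjusted = unname(adjusted))
```

The crude coefficient absorbs part of the effect of x2, while the adjusted coefficient recovers a value close to the simulated truth. The same logic does not protect against confounders that were never measured.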
A major trade-off in model-building is to avoid omitting necessary confounding variables while not including variables of little importance. Including too many unimportant variables increases the number of βs estimated and may lead to poor performance of the equation on future datasets. Also, having to measure unnecessary variables increases the cost of future work.
Picking up the cleaned phaa_survey_clean.csv from Lesson 2, we will (1) test bivariate correlations, (2) inspect a correlation matrix for the numeric variables we plan to include in a model, and (3) fit a multivariable linear regression for systolic BP. The full annotated script is in r-activities/HSCI_410_Lesson_3_Linear_Regression.R.
# 0. Load the cleaned data + packages we will use ---------------------------
library(corrplot)   # corrplot()
library(regclass)   # VIF()
library(caret)      # varImp()
phaa <- read.csv("phaa_survey_clean.csv", stringsAsFactors = FALSE)
# 1. Bivariate correlation between two numeric variables --------------------
# cor.test() drops incomplete pairs itself; `use` is an argument of cor(),
# not of cor.test(), so it is not passed here.
cor.test(phaa$age, phaa$systolic_bp,
         method = "pearson")
# 2. Correlation matrix + visual --------------------------------------------
keep_num <- c("age", "bmi", "systolic_bp", "diastolic_bp",
              "phys_act_min", "discrimination_score",
              "social_support_score", "dep_score", "anx_score")
cor_mat <- cor(phaa[, keep_num], use = "complete.obs")
round(cor_mat, 2)
corrplot(cor_mat, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black", tl.srt = 45)
# 3. Set the reference level on a factor before fitting lm() ----------------
phaa$gender <- as.factor(phaa$gender)
phaa$gender <- relevel(phaa$gender, ref = "Woman")
# 4. Multivariable linear model for systolic BP -----------------------------
model_3 <- lm(systolic_bp ~ age + gender + smoker + bmi
              + dep_score + phys_act_min,
              data = phaa)
summary(model_3)   # coefficients, SEs, t-tests, R-squared, overall F-test
confint(model_3)   # 95% confidence intervals for the coefficients
# 5. Diagnostics: linearity, equal variance, normal residuals, outliers -----
par(mfrow = c(2, 2)); plot(model_3); par(mfrow = c(1, 1))
VIF(model_3)       # multicollinearity
varImp(model_3)    # variable importance
How to read the output. Each coefficient in summary(model_3) is the average change in systolic BP per one-unit increase in that predictor, holding the other predictors constant. The (Intercept) is the predicted BP when every numeric predictor is 0 and every factor is at its reference level. That value is not always meaningful, which is why we centre age in Lesson 4. VIF values > 5 indicate that a predictor shares most of its information with the other predictors in the model.
Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console / plot before answering.
1. From cor.test(phaa$age, phaa$systolic_bp), what is the Pearson r and its 95% CI? Does the CI exclude zero? Translate the magnitude into plain English (small / moderate / strong).
2. In summary(model_3), what is the coefficient on age and its p-value? In one sentence, state the adjusted association of age and systolic BP, and whether the 95% CI from confint() excludes zero.
summary(model_3) typically shows an age coefficient of ~0.5 mmHg per year (range 0.3–0.7 depending on covariate inclusion) with p < 0.001 in a sample of n > 500. The 95% CI from confint() excludes zero. Adjusted association: each additional year of age is associated with roughly 0.5 mmHg higher systolic BP, after accounting for the other covariates in the model. The effect accumulates over decades of age, which is consistent with the ~15–20 mmHg average rise from age 30 to 70.
3. Look at VIF(model_3). Which predictor has the highest VIF? Is it above 5 or 10? If you removed it, how would you expect the SE on a correlated predictor to change?
VIF(model_3) typically shows the highest VIF for one of the BP-related variables or BMI, usually around 2–3, below the conventional 5/10 thresholds. If a predictor with VIF > 5 were removed, the SE on its correlated counterpart would drop noticeably (typically by 15–30%), and the point estimate would shift slightly as the model re-attributes shared variance. Multicollinearity doesn't bias coefficients, but it inflates their SEs and CIs, which can make true effects appear non-significant.
1. What type of outcome variable is linear regression most suitable for?
2. In the equation Y = β0 + β1X1 + ε, what does β1 represent?
3. What is the key advantage of a multivariable regression model over a simple regression model?
Reflection
Think of a continuous outcome variable in your field of interest. What predictors would you include in a regression model? How would you decide which variables are confounders versus intervening variables?
Hypothesis Testing & Effect Estimation
Introduction and Overview
Section 1 set up the regression model. Section 2 turns to the question of whether the model is doing useful work: how much of the variation in the outcome does it actually explain, and which individual coefficients are meaningfully different from zero? The ANOVA decomposition and the formal tests of model significance are how those questions get answered.
Learning Objectives
- Decompose the variability of Y using the ANOVA sums-of-squares table.
- Use the overall F-test to assess whether a regression model explains useful variation.
- Test individual coefficients with t-tests and report effect sizes with 95% confidence intervals.
- Interpret R² and adjusted R² as measures of model fit, and recognise their limits.
The ANOVA Table
The idea behind regression is that information in the X-variables can be used to predict the value of Y. The formal way this is approached is to ascertain how much of the sums of squares (SS) of Y we can explain with knowledge of the X-variable(s).
| Source | Sums of Squares | df | Mean Square | F-test |
|---|---|---|---|---|
| Model (regression) | SSM = Σ(Ŷi − Ȳ)² | dfM = k | MSM = SSM/dfM | MSM/MSE |
| Error (residual) | SSE = Σ(Yi − Ŷi)² | dfE = n−(k+1) | MSE = SSE/dfE | |
| Total | SST = Σ(Yi − Ȳ)² | dfT = n−1 | MST = SST/dfT | |
Here, k is the number of predictor variables in the model (not counting the intercept). When each SS is divided by its degrees of freedom (df), the result is a mean square, denoted MSM (model), MSE (error), and MST (total). The MSE is our estimate of the error variance σ², and the square root of the MSE is called the root MSE or the standard error of prediction.
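The entries of the ANOVA table can be reproduced by hand from any fitted model. The following sketch uses simulated data (the variables and coefficients are assumptions for illustration) and compares the hand-computed F-statistic and root MSE with what summary() reports.

```r
# Simulated data; names and coefficients are illustrative assumptions.
set.seed(3)
n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 10 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
k   <- 2                                  # predictors, not counting the intercept

SSM <- sum((fitted(fit) - mean(y))^2)     # model sum of squares
SSE <- sum(residuals(fit)^2)              # error sum of squares
SST <- sum((y - mean(y))^2)               # total sum of squares

MSM      <- SSM / k
MSE      <- SSE / (n - (k + 1))
F_stat   <- MSM / MSE
root_mse <- sqrt(MSE)

c(SSM = SSM, SSE = SSE, SST = SST, F = F_stat, root_MSE = root_mse)
summary(fit)$fstatistic   # value, numerator df, denominator df
summary(fit)$sigma        # R labels the root MSE "residual standard error"
```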
Assessing the Significance of a Linear Regression Model
We use the F-test from the ANOVA table to assess whether the predictors collectively have a statistically significant relationship with the outcome. The null hypothesis is H0: β1 = β2 = … = βk = 0.
A simple linear regression model with birth weight (-bwt-) as the outcome and gestation length (-gest-) as the sole predictor was fit using the bw5k dataset (n = 5,000).
Results: F(1, 4998) = 1,790.09, P < 0.0001, R² = 0.2637. The coefficient for -gest- is 124.5 gm per week (95% CI: 118.7–130.3), meaning for each additional week of gestation, birth weight increases by approximately 124.5 gm.
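As a consistency check, the reported R² can be recovered from the F-statistic and its degrees of freedom, using only the definitions in the ANOVA table (here k = 1 and n − (k+1) = 4998):

```latex
\[
F = \frac{SSM/k}{SSE/\bigl(n-(k+1)\bigr)}
\quad\Longrightarrow\quad
R^2 = \frac{SSM}{SSM + SSE}
    = \frac{kF}{kF + \bigl(n-(k+1)\bigr)}
    = \frac{1790.09}{1790.09 + 4998}
    \approx 0.2637
\]
```

The result agrees with the R² reported for the model.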
Testing Individual Regression Coefficients
A t-test with n−(k+1) degrees of freedom is used to evaluate the significance of any individual regression coefficient. The usual null hypothesis is H0: βj = 0.
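When the fitted model is used to predict Y at a chosen value x* of X1, there are 2 types of variation in play: (1) from estimation of the regression parameters (the usual SE), and (2) from a new observation about the regression line. A confidence interval for the mean of Y at x* involves only the first source; the prediction interval for a new observation involves both. The further the value x* is from the mean of X1, the greater the variability in the prediction. The 95% interval is calculated as Ŷ ± t0.05 × SE, where t0.05 is the two-sided 5% critical value of the t-distribution with n−(k+1) degrees of freedom and SE is the standard error appropriate to the interval.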
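Both kinds of interval are available from predict() in base R. The sketch below is illustrative only: the data are simulated to loosely resemble a birth weight on gestation regression, and the two values of x were chosen to show the effect of distance from the mean.

```r
# Simulated data; the numbers are assumptions, not the bw5k dataset.
set.seed(5)
n <- 100
x <- runif(n, 30, 44)
y <- -1500 + 125 * x + rnorm(n, sd = 400)
fit <- lm(y ~ x)

# Predict at the sample mean of x and at a value far from it
new <- data.frame(x = c(mean(x), 43))

conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]

cbind(x = new$x, conf_width, pred_width)
```

The prediction interval is wider than the confidence interval at every x, and both widen as x moves away from its mean.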
R² (the coefficient of determination) describes the amount of variance in the outcome “explained” by the predictor variables. One formula: R² = SSM/SST = 1 − (SSE/SST). Unfortunately, R² never decreases when variables are added to the model, even if they are unrelated to the outcome. The adjusted R² = 1 − (MSE/MST) adjusts for the number of predictors and is useful for comparing models with different numbers of variables.
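The behaviour of R² and adjusted R² can be seen by adding a predictor that is pure noise. This is a sketch with simulated data; the variable named noise is unrelated to the outcome by construction.

```r
# Simulated data; all names are illustrative assumptions.
set.seed(6)
n     <- 80
x     <- rnorm(n)
noise <- rnorm(n)                 # unrelated to y by construction
y     <- 2 + x + rnorm(n)

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + noise)

r2     <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)

# Adjusted R-squared for m2 from the mean squares: 1 - MSE/MST
MSE <- sum(residuals(m2)^2) / m2$df.residual
MST <- sum((y - mean(y))^2) / (n - 1)
adj_by_hand <- 1 - MSE / MST

rbind(r2, adj_r2)
adj_by_hand
```

R² for the larger model cannot be lower than for the smaller one, whereas adjusted R² can move in either direction because it penalises the extra degree of freedom.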
Sometimes it is necessary to simultaneously evaluate the significance of a group of X-variables (e.g., a set of indicator variables for a nominal variable). We compare the SSE of the full model with the SSE of the reduced model (without the group) using a partial F-test. This tells us whether the set of variables as a group contributes significantly to the model.
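A partial F-test can be computed from the two SSE values or obtained from anova() on the nested models. The sketch below assumes a hypothetical 3-level factor called race and a continuous covariate called age, both simulated.

```r
# Simulated data; variable names and effect sizes are assumptions.
set.seed(7)
n    <- 300
race <- factor(sample(c("1", "2", "3"), n, replace = TRUE))
age  <- rnorm(n, mean = 28, sd = 5)
shift <- unname(c("1" = 0, "2" = -80, "3" = 60)[as.character(race)])
y    <- 3300 + 10 * age + shift + rnorm(n, sd = 450)

reduced <- lm(y ~ age)            # without the group of indicators
full    <- lm(y ~ age + race)     # with the 2 indicators for race

SSE_r <- sum(residuals(reduced)^2)
SSE_f <- sum(residuals(full)^2)
q     <- reduced$df.residual - full$df.residual   # number of terms tested

F_partial <- ((SSE_r - SSE_f) / q) / (SSE_f / full$df.residual)
p_value   <- pf(F_partial, q, full$df.residual, lower.tail = FALSE)

tab <- anova(reduced, full)       # same test, computed by R
tab
c(F_by_hand = F_partial, p_by_hand = p_value)
```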
The F-test has a straightforward interpretation only when the X-variables are manipulated treatments in a controlled experiment. In observational studies, the F-statistic is influenced by the number of variables available, their correlations, the total number of subjects, and the method used for variable selection. Most variable selection methods tend to maximise F, meaning the observed F overestimates the actual significance of the model.
Interactive: What Does a p-Value Actually Mean?
This simulation runs hundreds of studies. Each study fits a regression of Y on X with a chosen true effect and sample size, and the distribution of p-values builds up across studies. With no real effect, p-values are uniform on [0,1]. With a real effect, p-values pile up near zero. Power is the proportion of p-values below α.
Panels: (1) the most recent simulated study, shown as a scatter of n points with the OLS fit, t-statistic, and p-value; (2) a histogram of all p-values run so far, with the region p < α (significant) highlighted.
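The same experiment can be approximated offline. The sketch below is a simplified stand-in for the interactive, not its actual implementation: sim_p() is a hypothetical helper, and the effect size (0.4) and sample size (50) are arbitrary choices.

```r
# Simulated studies; sim_p() and all settings are illustrative assumptions.
set.seed(8)

# One study: simulate data, fit the regression, return the slope's p-value
sim_p <- function(beta, n) {
  x <- rnorm(n)
  y <- beta * x + rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}

n_sims <- 1000
alpha  <- 0.05
p_null <- replicate(n_sims, sim_p(beta = 0,   n = 50))   # no real effect
p_alt  <- replicate(n_sims, sim_p(beta = 0.4, n = 50))   # real effect

type_I_rate <- mean(p_null < alpha)   # should be close to alpha
power       <- mean(p_alt  < alpha)   # proportion of studies detecting the effect

c(type_I_rate = type_I_rate, power = power)
# hist(p_null); hist(p_alt)   # uncomment to see the two distributions
```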
1. What does the F-test in the ANOVA table assess?
2. What does R² (the coefficient of determination) measure?
3. Why is adjusted R² preferred over R² when comparing models with different numbers of predictors?
Reflection
Consider a regression model you have seen in published research or coursework. How would you interpret the R² value? What does a low R² mean practically, and does it necessarily indicate a poor model?
Nature of X-Variables & Collinearity
Introduction and Overview
Section 2 evaluated the model's overall fit. Section 3 turns to a practical question that often determines whether your model gives sensible answers: are your predictors entered correctly? Continuous, categorical, indicator, and polynomial predictors all need different handling, and highly correlated predictors (multicollinearity) can destabilize coefficient estimates without obvious warning signs.
Learning Objectives
- Choose appropriate scaling for continuous predictors so that coefficients are interpretable.
- Convert nominal and ordinal categorical predictors into indicator variables (dummy coding).
- Recognise hierarchical indicator structures and code them correctly.
- Detect collinearity using correlation matrices and the variance inflation factor (VIF).
- Decide when collinear predictors should be dropped, combined, or kept with caution.
Types of Predictor Variables
The X-variables can be continuous or categorical. Categorical variables can be either nominal (levels with no meaningful numerical representation, e.g., race or city of residence) or ordinal (ordered levels, e.g., severity: low, medium, high). Nominal and ordinal variables with more than 2 levels must be converted to indicator variables before entering the regression.
Scaling Variables
Often the predictor variables have a limited range of possible or sensible values. For example, if gestation length is a predictor, the intercept reflects birth weight at 0 weeks—which is meaningless. It is useful to scale these variables by subtracting the lowest possible sensible value (or the average) before entering them into the model. This makes the intercept interpretable without changing the regression coefficient or its SE.
Example: Subtracting 39 weeks (the average gestation length) from -gest- gives gest39 = gest − 39. Now β0 reflects birth weight for a 39-week gestation (3,341 gm), a much more meaningful value than the original constant of −1,514 gm.
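The shift in the intercept follows from rewriting the same line around 39 weeks. Assuming the slope is the 124.5 gm per week reported for the simple model of -bwt- on -gest-, the arithmetic is:

```latex
\[
\hat{Y} = \beta_0 + \beta_1\,\mathrm{gest}
        = (\beta_0 + 39\,\beta_1) + \beta_1\,(\mathrm{gest} - 39)
\]
\[
\beta_0 + 39\,\beta_1 = -1514 + 39 \times 124.5 = -1514 + 4855.5 = 3341.5 \approx 3341 \text{ gm}
\]
```

The slope is untouched; only the point at which the intercept is evaluated moves. The half-gram difference reflects rounding in the reported coefficients.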
Regular (Disjoint) Indicator Variables
Indicator variables (also called dummy variables) are created variables whose values have no direct physical relationship to the characteristic being described. For a nominal variable with j levels, we need j − 1 indicator variables. The omitted level becomes the referent (comparison) category.
Example: For mother’s race with 3 categories, we create 2 indicator variables (X1 and X2). Race 3 (with both indicators = 0) becomes the referent. β1 estimates the difference in outcome between races 1 and 3, while β2 estimates the difference between races 2 and 3.
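In R, indicator variables are created automatically when a factor enters a model, and the referent is whichever level comes first. The sketch below uses a toy 3-level factor named race (hypothetical values) and sets level 3 as the referent to mirror the example.

```r
# Toy data; the factor values are illustrative assumptions.
race <- factor(c("1", "2", "3", "3", "1", "2"), levels = c("1", "2", "3"))
race <- relevel(race, ref = "3")   # make level 3 the referent category

# Design matrix that lm() would build: intercept + (3 - 1) indicators
X <- model.matrix(~ race)
X
```

Rows for level 3 have both indicators equal to 0, so the coefficients on race1 and race2 are contrasts against level 3.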
Hierarchical (Incremental) Indicator Variables
If the predictor variables are ordinal in type (reflecting relative changes in an underlying characteristic), hierarchical indicator variables are often preferred. These contrast the outcome in each level against the level immediately preceding it (assuming all hierarchical variables are in the model).
Example: For mother’s education (4 levels), the disjoint indicators compare each level to the lowest (baseline). With hierarchical indicators, each coefficient instead reflects the step from the preceding level: the coefficient for level 4 is the difference between level 3 (some college) and level 4 (university degree), so the coefficients show the incremental effect of each step up in education.
| Variable | Indicator Coding | Hierarchical Coding |
|---|---|---|
| meduc_c4=2 (high school diploma) | 20.046 | 20.046 |
| meduc_c4=3 (some college) | 53.270 | 33.224 |
| meduc_c4=4 (university degree) | 80.599 | 27.329 |
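The relationship between the two columns of the table (each hierarchical coefficient is the difference between adjacent regular coefficients) can be checked with a short simulation. The data and the variable educ below are assumptions; the group shifts were chosen only to resemble the table.

```r
# Simulated data; names and values are illustrative assumptions.
set.seed(11)
n    <- 400
educ <- sample(1:4, n, replace = TRUE)
y    <- 3300 + c(0, 20, 53, 80)[educ] + rnorm(n, sd = 300)

# Regular (disjoint) indicators: each level versus level 1
d2 <- as.numeric(educ == 2)
d3 <- as.numeric(educ == 3)
d4 <- as.numeric(educ == 4)

# Hierarchical (incremental) indicators: "at or above" each level
h2 <- as.numeric(educ >= 2)
h3 <- as.numeric(educ >= 3)
h4 <- as.numeric(educ >= 4)

disjoint <- coef(lm(y ~ d2 + d3 + d4))[-1]
hier     <- coef(lm(y ~ h2 + h3 + h4))[-1]

# Accumulating the incremental steps reproduces the disjoint contrasts
rbind(disjoint = disjoint, cumulative_hier = cumsum(hier))
```

Both codings fit the same four group means, so the choice affects only how the coefficients are read, not the fitted values.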
Detecting Highly Correlated (Collinear) Variables
If the predictor variables are too highly correlated, a number of problems arise. The estimated effect of each variable depends on the other predictors in the model. With highly correlated predictors, the βs will be highly and negatively correlated, and in extreme cases none of the individual coefficients will be significantly different from zero despite a significant overall F-test.
Collinearity Example
When a quadratic term (-gest_sq-) was added to a model already containing -gest-, the correlation between the two was 0.99, giving a VIF of 131. The SE of -gest- increased over 11 times (from 2.94 to 32.99). Centring -gest- by subtracting 39 (the mean) reduced the VIF from 131 to just 1.54 and the SE back down to 3.58.
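The effect of centring on the collinearity between a variable and its square can be reproduced with simulated data. This sketch assumes a symmetric, normally distributed gestation length, so the numbers will differ from the example above; vif_two() is a hypothetical helper that applies VIF = 1/(1 − r²), which holds when the model contains only these two predictors.

```r
# Simulated gestation lengths; distribution and helper are assumptions.
set.seed(12)
gest <- rnorm(1000, mean = 39, sd = 2)

# VIF for a model with exactly two predictors a and b
vif_two <- function(a, b) 1 / (1 - cor(a, b)^2)

vif_raw <- vif_two(gest, gest^2)          # uncentred linear and quadratic terms

gest_c      <- gest - 39                  # centre at 39 weeks
vif_centred <- vif_two(gest_c, gest_c^2)  # centred linear and quadratic terms

c(vif_raw = vif_raw, vif_centred = vif_centred)
```

Centring leaves the fitted curve unchanged; it only removes the artificial correlation between the linear and quadratic terms.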
1. For a nominal variable with 4 categories, how many indicator (dummy) variables are needed?
2. What does a VIF value greater than 10 suggest?
3. What is the primary purpose of centring a continuous variable before adding it to a regression model?
Reflection
Why might highly correlated predictor variables cause problems in a multivariable regression model? What strategies would you use to detect and address collinearity in your own analyses?
Interaction & Causal Interpretation
Introduction and Overview
Sections 1–3 set up a model with main effects only. Section 4 takes two final design steps: testing whether the effect of one predictor depends on another (interaction) and giving the resulting coefficients a defensible causal interpretation. Both push linear regression beyond a curve-fitting exercise into a tool for answering causal questions, anchored in the DAG-based framework you met in HSCI 341 Lesson 1.
Learning Objectives
- Specify and test interaction terms between two predictors.
- Interpret a model with interactions correctly — main effects no longer have a single overall meaning.
- Use a DAG to decide which covariates belong in the model for a causal question.
- Distinguish confounders from mediators and explain why adjusting for a mediator can mislead.
- Translate a fitted regression into a defensible causal claim, with explicit assumptions.
Detecting and Modelling Interaction
Given the component cause model, we might expect to see interaction when 2 factors act synergistically or antagonistically. In previous sections, models contained only main effects—assuming the association of X1 to Y is the same at all levels of X2. An interaction term tests whether the effect of one variable depends on the level of another.
In the model Y = β0 + β1X1 + β2X2 + β3(X1 × X2) + ε, we assess interaction by testing whether β3 = 0. If the interaction is absent (i.e., β3 is not significantly different from 0), the main effects (additive) model is deemed adequate. If the interaction is needed, centring becomes useful because it allows us to interpret β1 and β2 as linear effects when the centred version of the other variable is zero.
Example 14.9: The dichotomous versions of maternal weight gain (wtgain_c2: <30 lb vs ≥30 lb) and total birth order (tbo_c2: primiparous vs multiparous) were evaluated. The main effects model showed both factors were significant. Adding the interaction term (wg_c2*tbo_c2) revealed a significant interaction (β3 = −88.4, P = 0.010).
This means the positive effect of multiparous birth on birth weight is present if weight gain is low, but is negligible if weight gain is high. Similarly, high weight gain has a bigger effect in primiparous births (227 gm) than in multiparous births (139 gm).
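The stratum-specific effects quoted above follow from the interaction model. Assuming the coding is 0 = low weight gain / primiparous and 1 = high weight gain / multiparous (an assumption; the coding is not stated here), and writing X1 for weight gain and X2 for birth order:

```latex
\[
\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2
\]
\[
\text{Effect of high weight gain when } X_2 = 0 \text{ (primiparous)}: \quad \beta_1 = 227 \text{ gm}
\]
\[
\text{Effect of high weight gain when } X_2 = 1 \text{ (multiparous)}: \quad \beta_1 + \beta_3 = 227 - 88.4 = 138.6 \approx 139 \text{ gm}
\]
```

By the same logic, the effect of multiparous birth is β2 when weight gain is low and β2 + β3 when weight gain is high.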
Interactions involving categorical variables (with more than 2 levels) are modelled by including products between all indicator variables needed in the main effects model. For example, the interaction between a 3-level and a 4-level categorical variable requires (3−1) × (4−1) = 6 product variables. These 6 variables should be tested and explored as a group using the partial F-test.
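The count of product variables and the group test can both be checked in R. The sketch below uses two hypothetical factors, a (3 levels) and b (4 levels), with an outcome simulated to have no true effects, so the point is the structure of the test rather than its result.

```r
# Simulated data; factor names and levels are illustrative assumptions.
set.seed(14)
n <- 600
a <- factor(sample(c("a1", "a2", "a3"), n, replace = TRUE))
b <- factor(sample(c("b1", "b2", "b3", "b4"), n, replace = TRUE))
y <- rnorm(n)                      # no true main effects or interaction

main  <- lm(y ~ a + b)             # main effects only
inter <- lm(y ~ a * b)             # main effects plus all product terms

# Extra coefficients added by the interaction: (3 - 1) x (4 - 1)
n_products <- length(coef(inter)) - length(coef(main))
n_products

# Partial F-test of the product terms as a group
tab <- anova(main, inter)
tab
```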
In many multivariable analyses, the number of possibilities for interaction is large and there is no single correct way to assess if interaction is present. Unless the potential number of interactions is small, interactions should be limited to those of biological relevance. It is generally recommended that 3- and 4-way interactions only be investigated when there are good, biologically sound reasons for doing so.
Causal Interpretation of a Multivariable Linear Model
So far, we have focused on the technical interpretation of regression coefficients. When making causal inferences, extra care is needed to ensure that only the appropriate variables are included in the analysis. A causal diagram is very helpful in this regard.
Key Causal Principle
If a variable is an intervening variable (on the causal pathway between exposure and outcome), including it in the model will change the interpretation (Greenland, Pearl, & Robins, 1999). For example, if gestation length is an intervening variable between cigarette smoking and birth weight, including -gest- in the model adjusts away part of the causal effect of smoking. The total effect of smoking would be obtained from a model without -gest-, while the direct effect (not mediated through gestation) would require including it.
Our objective is to evaluate the effects of cigarette smoking (-cig-) on birth weight (-bwt-). The causal diagram indicates that gestation length (-gest-) is an intervening variable between -cig- and -bwt-. Consequently, -gest- and -wtgain- should be excluded from the model when estimating the total causal effect of smoking on birth weight.
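The difference between the total and direct effects can be illustrated with a simulation in which the causal structure is known. This is a sketch under strong assumptions: the variables smoke, gest, and bwt are simulated, the effect sizes are invented, and there is no unmeasured confounding by construction.

```r
# Simulated causal structure: smoke -> gest -> bwt, plus smoke -> bwt directly.
# All effect sizes are assumptions chosen for illustration.
set.seed(15)
n     <- 10000
smoke <- rbinom(n, 1, 0.2)
gest  <- 39 - 0.5 * smoke + rnorm(n, sd = 1.5)          # smoking shortens gestation
bwt   <- 3400 - 150 * smoke + 120 * (gest - 39) + rnorm(n, sd = 400)

# Total effect: the intervening variable is left out of the model
total  <- coef(lm(bwt ~ smoke))["smoke"]

# Direct effect: the intervening variable is included
direct <- coef(lm(bwt ~ smoke + gest))["smoke"]

# Simulated truth: direct = -150, total = -150 + 120 * (-0.5) = -210
c(total = unname(total), direct = unname(direct))
```

Including the intervening variable removes the part of the effect that travels through gestation length, so the two models answer different causal questions.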
The model includes: -white- (potential confounder), -college- (potential confounder), and -cig_2- (the exposure of interest). The interaction between -cig_2- and -white- was assessed.
1. What does a significant interaction term (β3) in a regression model indicate?
2. When estimating the total causal effect of an exposure, what should you do with intervening variables?
3. What tool is recommended before building a multivariable model to help distinguish confounders from intervening variables?
Reflection
Consider an exposure–outcome relationship you are interested in. Draw (or describe) a causal diagram identifying potential confounders and intervening variables. How would the choice of which variables to include affect your estimate of the causal effect?
Lesson 3 — Comprehensive Assessment
Bringing It All Together
Lesson 3 took the dataset you cleaned in Lesson 2 and built a working linear regression around it. Section 1 introduced the simple and multivariable model and made clear how each coefficient should be read. Section 2 used the ANOVA decomposition to ask whether the model is doing useful work, and the t-tests and confidence intervals to ask the same question of each coefficient. Section 3 dealt with the messiness of real predictors — scaling continuous variables, dummy-coding categorical ones, building hierarchical indicators, and using VIF to spot the collinearity that quietly destabilises coefficients. Section 4 closed the loop by adding interactions and giving the fitted model a defensible causal reading via a DAG.
The arc of the lesson is that linear regression is not just a line through points. It is an inferential tool whose coefficients carry meaning only when the predictors have been entered correctly, the assumptions have been checked, and the causal structure has been laid out in advance. Lesson 4 picks up directly from here with the question of which predictors should enter the model in the first place — the model-building strategies that turn the machinery of this lesson into a defensible final analysis.
The final assessment below covers all material from this lesson. You must answer all 15 questions correctly (100%) and complete the final reflection to finish the lesson.
Key Takeaways from Lesson 3
- Linear regression models a continuous outcome as a weighted sum of predictors plus normally distributed error.
- The ANOVA decomposition tells you how much variation the model explains; F-tests and t-tests assess overall fit and individual coefficients.
- Categorical predictors must be entered as indicator variables; the choice of reference category changes how every coefficient is read.
- Multicollinearity inflates standard errors without changing predictions — check VIFs whenever predictors are correlated.
- An interaction term means the effect of one predictor depends on another; main effects must then be interpreted at specific levels.
- A regression earns a causal interpretation only when the DAG, adjustment set, and assumptions are stated explicitly — not by default.
Final Reflection
Reflect on the full chapter. How does linear regression differ from the categorical outcome methods you have previously studied? In what situations would you choose linear regression, and what are the key assumptions you need to verify before trusting your results?
1. Linear regression is most appropriate when the outcome variable is:
2. In the simple regression model Y = β0 + β1X1 + ε, what does β0 represent?
3. Residuals in a regression model are:
4. In a multivariable model, β1 represents the effect of X1 on Y:
5. The root MSE (root mean square error) in a regression model is:
6. The null hypothesis for the overall F-test in regression is:
7. If R² = 0.26 in a regression model, what can we conclude?
8. Why should we use adjusted R² rather than R² when comparing models?
9. To code a nominal variable with 5 categories for regression, you would create:
10. What is the purpose of scaling a predictor variable (e.g., subtracting the mean)?
11. A VIF of 1.0 for a predictor indicates that:
12. In the model Y = β0 + β1X1 + β2X2 + β3X1*X2 + ε, a significant β3 indicates:
13. Including an intervening variable in a causal regression model will:
14. The regression coefficient for a dichotomous predictor (coded 0/1) represents:
15. Ignoring measurement errors in predictor variables tends to: