Linear Regression

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Identify when least squares regression is an appropriate analytical tool
Construct a linear model with control of confounding and identification of interaction
Interpret regression coefficients from both technical and causal perspectives
Convert nominal, ordinal, or continuous predictors into indicator variables
Assess model assumptions including linearity, homoscedasticity, and normality of residuals
Detect and address collinearity among predictor variables
Identify study designs that require a time-series approach to analysis

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Linear Regression A model that expresses a continuous outcome as a linear combination of predictors plus error. The workhorse method for relating Y to one or more X's when Y is roughly continuous.

Simple Linear Regression A linear regression with a single predictor: Y = β₀ + β₁X + ε. Fits a line through the data minimizing squared residuals.

Multiple Regression Linear regression with two or more predictors. Each β is interpreted as the change in Y per unit change in that predictor, holding the others constant.

Intercept (β₀) The expected value of Y when all predictors equal zero. Often without a sensible interpretation if zero is outside the data range; centering predictors helps.

Beta Coefficient (Slope) A regression coefficient (β) representing the expected change in the outcome per one-unit increase in that predictor, with other predictors held fixed.

Residual The difference between an observed Y and the value predicted by the model (Y − Ŷ). Patterns in residuals diagnose model misspecification.

R² (Coefficient of Determination) The proportion of variance in the outcome explained by the model, ranging from 0 to 1. Adjusted R² penalises adding predictors.

Interaction (Effect Modification) A situation where the effect of one predictor on Y depends on the level of another. Modeled by including a product term (X₁ × X₂) in the regression.

Dummy / Indicator Variable A 0/1 variable used to encode a categorical predictor in a regression. A k-level factor needs k−1 dummies plus a reference category.

Methods & Statistical Concepts

OLS (Ordinary Least Squares) The estimation method for linear regression that finds β's minimizing the sum of squared residuals. Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov assumptions (Stigler, 1981).

Homoscedasticity Constant variance of residuals across the range of fitted values. Violated when residual spread fans out (heteroscedasticity); biases standard errors (White, 1980).

Normality of Residuals The assumption that residuals are normally distributed. Required for valid t- and F-tests in small samples; checked via Q-Q plots.

Independence of Errors The assumption that residuals are uncorrelated. Violated by clustered or time-series data; addressed with mixed models or GEE.

Leverage A measure of how unusual a data point's predictor values are. High-leverage points have the potential to strongly influence fitted coefficients.

Cook's Distance A diagnostic combining leverage and residual to flag observations whose deletion materially changes the fitted regression. Values > 1 (or > 4/n) warrant investigation (Cook, 1977).

Multicollinearity High correlation among predictors, which inflates standard errors and destabilises coefficient estimates. Diagnosed with VIF.

VIF (Variance Inflation Factor) A diagnostic for multicollinearity: VIF > 5–10 signals problematic correlation between a predictor and the others.

Standard Error of β The estimated standard deviation of a regression coefficient. Used to form confidence intervals (β ± 1.96·SE) and t-statistics for significance testing.

F-test (Overall Model) A test of the joint null that all slopes equal zero. Reported in the ANOVA table; significance indicates the model explains variance better than the mean alone.

Key People

Francis Galton (1822–1911) English polymath who introduced the concept of regression (“regression toward the mean”) while studying inheritance of stature in parents and children.

Karl Pearson (1857–1936) English statistician who formalised the correlation coefficient (Pearson's r) and many of the foundations of regression and biometrics.

Ronald A. Fisher (1890–1962) British statistician who developed maximum likelihood estimation, ANOVA, and the F-test, and unified linear models within the analysis of variance framework.

Section 1

Introduction & Regression Analysis

⏱ Estimated time: 15 minutes

Lesson 3 · HSCI 410

Linear Regression

The workhorse model for continuous outcomes, built from first principles to causal reading.

Why this model

From categorical to continuous outcomes

Linear regression models a continuous outcome as a weighted sum of predictors plus a normally distributed error. Ordinary least squares finds the best-fitting coefficients.

The simple model

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1} \color{#1D4ED8}{X_1} + \color{#BE185D}{\varepsilon} \]

Y: outcome; β₀: intercept; β₁: slope; X₁: predictor; ε: error

The applied context

Formulas grounded in a real dataset

The R activity uses the cleaned survey data from an earlier lesson. Throughout, the running example is a multivariable model for systolic blood pressure with age, BMI, depression score, smoking, and physical activity as predictors.

Each formula in the narration corresponds to something you can fit in R.
The diagnostics in a later section are checks you will run on your own model output.
No need to run R now. The written lesson has the full annotated code below.

Section 1 of 4

Introduction & Regression Analysis

The simple and multivariable models, and what their coefficients actually mean.

The simple model

Three components, one equation

Simple linear regression (Eq 14.1)

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1} \color{#1D4ED8}{X_1} + \color{#BE185D}{\varepsilon}, \quad \color{#BE185D}{\varepsilon} \sim N(0,\,\sigma^2) \]

Y: outcome; β₀: intercept; β₁: slope; X₁: predictor; ε: error (normal, mean 0)

$\beta_0$

Intercept: predicted Y when X is 0.

$\beta_1$

Slope: change in Y per one-unit increase in X.

$\varepsilon$

Error: normal with mean 0, constant variance sigma-squared.

Least squares

Minimizing the sum of squared residuals

OLS objective

\[ \color{#0B7B6B}{\hat{\boldsymbol{\beta}}} = \underset{\boldsymbol{\beta}}{\arg\min} \sum_{i=1}^{n} \left(\color{#C2410C}{Y_i} - \color{#6D28D9}{\hat{Y}_i}\right)^2 \]

β̂: fitted coefficients; Yᵢ: observed value; Ŷᵢ: predicted value

The multivariable model

Each coefficient is an adjusted estimate

Multivariable linear regression (Eq 14.3)

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1} \color{#1D4ED8}{X_1} + \color{#6D28D9}{\beta_2} \color{#1D4ED8}{X_2} + \cdots + \color{#BE185D}{\varepsilon} \]

Y: outcome; β₀: intercept; βⱼ: slope for predictor j; Xⱼ: predictor j; ε: error

$\beta_1$ estimates the effect of $X_1$ on $Y$ holding $X_2, \ldots, X_k$ constant.

The betas are not biased by any variable included in the equation, but they can be biased if confounders are omitted from the model. (Dohoo, Martin & Stryhn, 2012)

Two cautions

Terminology and model parsimony

Terminology

Multivariable: more than one predictor. Multivariate: more than one outcome. In epidemiology, almost every regression model is multivariable.

Parsimony

Including unnecessary predictors inflates estimation error and reduces performance on new data. Include all important confounders; remove variables that add nothing.

Carry forward

What to take into the next section

$\beta_1$ is a slope: the expected change in $Y$ per one-unit increase in $X_1$.
In a multivariable model, every $\beta$ is an adjusted estimate holding other predictors constant.
Predictive association and causation are different things. A later section handles the distinction explicitly.

Introduction and Overview

Earlier lessons produced a clean, descriptive view of the data. This lesson takes the next step from description to inference: linear regression is the workhorse model for explaining or predicting a continuous outcome from one or more predictors, fit by ordinary least squares (Stigler, 1981). Across four content sections we walk through this in order: the simple and multivariable model and what its coefficients mean (this section), the ANOVA decomposition and how to test the model and its individual coefficients (a later section), how to handle different types of predictor variables and detect collinearity (a later section), and how to detect and model interactions and give a regression a defensible causal interpretation (a later section). Model diagnostics, the residual and influence checks that show whether the fit can be trusted, run alongside the R work throughout.

Learning Objectives

State when linear regression is the appropriate modelling choice for a public-health outcome.
Write down and interpret the simple linear regression equation, including the intercept and slope.
Extend the simple model to a multivariable model and explain what each coefficient now represents.
Distinguish predictive from causal interpretations of regression coefficients.

Why Linear Regression?

Up to this point, most examples of relating an outcome to an exposure have been based on qualitative outcome variables, that is, variables that are categorical or dichotomous. Linear regression is suitable for modelling the outcome when it is measured on a continuous or near-continuous scale. Examples include birth weight, blood pressure, body mass index, and disease frequency at a regional level.

Key Concept

In regression analysis, the relationship between the outcome and the predictors is asymmetric: we think the value of the outcome is caused by (or we wish to predict it by) the value of another variable (the predictor). Using X-variables to predict Y does not necessarily imply causation; we might just be estimating predictive associations.

The Simple Regression Model

When only one predictor variable is used, the model is called a simple regression model. The term “model” denotes the formal statistical formula that describes the relationship between the predictor and the outcome.

▸ INTERACTIVE STORY: THE BEST-FIT TUG OF WAR
Springs, residuals, and the line that minimizes them. Next ▶ advances scenes.

A 6-scene visualization of OLS: scattered observations, a wobbling candidate line, residuals as physical springs, and the line settling into the unique position that minimizes the sum of squared errors.

Simple linear regression (Eq 14.1)

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1} \color{#1D4ED8}{X_1} + \color{#BE185D}{\varepsilon} \]

The outcome equals the intercept plus the slope times the predictor, plus the error term.

In this equation, β₀ is the intercept (or constant), β₁ is the regression coefficient, and ε is the error term. The errors are assumed to be normally and independently distributed (ε ~ N(0, σ²)). We estimate these errors by residuals, the difference between the observed value and the value predicted by the model.
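As a minimal sketch of what "fitting by least squares" means in practice, the R code below simulates a small dataset and computes the slope and intercept two ways: from the closed-form one-predictor formulas and from lm(). The simulated values (intercept 100, slope 0.5, error SD 10) and the variable names are illustrative assumptions, not the course dataset.

```r
# Simulated data; the "true" values below are illustrative assumptions.
set.seed(410)
n   <- 200
age <- runif(n, 20, 80)                      # predictor X1
sbp <- 100 + 0.5 * age + rnorm(n, sd = 10)   # outcome Y = b0 + b1 * X1 + error

# Closed-form least-squares estimates for a single predictor
b1_hat <- cov(age, sbp) / var(age)           # slope
b0_hat <- mean(sbp) - b1_hat * mean(age)     # intercept

# The same estimates from lm()
fit <- lm(sbp ~ age)
coef(fit)
c(b0_hat, b1_hat)

# Residuals estimate the errors: observed minus predicted
res <- sbp - (b0_hat + b1_hat * age)
sum(res^2)   # the sum of squared residuals that OLS minimizes
sum(res)     # essentially zero when the model has an intercept
```

The two sets of estimates should agree to numerical precision, which is a quick way to convince yourself that lm() is doing nothing more mysterious than minimizing squared residuals.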

Reading a coefficient in plain words

Suppose the outcome Y is systolic blood pressure in mmHg and the predictor X₁ is age in years, and the fitted line is Ŷ = 100 + 0.5 × age. The intercept of 100 is the model’s predicted blood pressure at age 0, which is only a mathematical anchor rather than a real value for a newborn. The slope of 0.5 is usually the part you care about: comparing two people whose ages differ by one year, the model predicts the older one has, on average, a systolic blood pressure about 0.5 mmHg higher, roughly 20 mmHg across a 40-year span. Read it as a comparison of averages between groups that differ in the predictor; it does not by itself describe what happens inside any one person over time.
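To see why the intercept is only an anchor, the sketch below refits the same kind of model after centring age at its mean. It uses simulated data generated from the illustrative line Ŷ = 100 + 0.5 × age (the error SD and sample size are assumptions). Centring leaves the slope untouched and turns the intercept into the predicted blood pressure at the average age.

```r
# Simulated data from the illustrative line; not the course dataset.
set.seed(410)
n   <- 200
age <- runif(n, 20, 80)
sbp <- 100 + 0.5 * age + rnorm(n, sd = 10)

fit_raw <- lm(sbp ~ age)        # intercept = predicted SBP at age 0
age_c   <- age - mean(age)      # centred age: 0 now means "average age"
fit_ctr <- lm(sbp ~ age_c)      # intercept = predicted SBP at the mean age

coef(fit_raw)
coef(fit_ctr)

# Predicted difference in mean SBP across a 40-year age gap
unname(coef(fit_raw)["age"]) * 40
```

With a single centred predictor, the intercept equals the sample mean of the outcome, which is usually a far more interpretable number than a prediction at age 0.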

✏ Interactive: OLS Line-of-Best-Fit Sandbox

Click anywhere on the chart to add a point. Click on an existing point to remove it. The least-squares line, residuals, R², and standard error of the slope update live. Add an extreme outlier and watch one observation drag the entire line (Cook, 1977; Belsley, Kuh, & Welsch, 1980).

Try this: load a random sample, then click "Add an outlier" to drop a point at (10, 1). One leverage point can pull the slope by half a unit, the visible reason why diagnostic plots, influence statistics, and robust estimators matter (Huber, 1964; Cook, 1977).
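The same experiment can be sketched offline in R. The code below simulates points around a line with slope 1 (an assumption mirroring the sandbox's "true" line), adds one point at (10, 1), and uses the base R functions hatvalues() and cooks.distance() to show how that single observation is flagged. The sample size and noise level are arbitrary choices.

```r
# Simulated data around a line with slope 1; values are illustrative.
set.seed(1)
x <- runif(30, 0, 5)
y <- x + rnorm(30, sd = 0.5)
fit_clean <- lm(y ~ x)

# Add a single extreme point at (10, 1)
x_out   <- c(x, 10)
y_out   <- c(y, 1)
fit_out <- lm(y_out ~ x_out)

# Slope before and after adding the point
coef(fit_clean)["x"]
coef(fit_out)["x_out"]

# Influence diagnostics for the contaminated fit
lev   <- hatvalues(fit_out)        # leverage: how unusual each x value is
cooks <- cooks.distance(fit_out)   # combines leverage and residual size
which.max(lev)
which.max(cooks)
cooks[31] > 4 / length(cooks)      # the 4/n screening rule from the glossary
```

The added point has both the highest leverage and the largest Cook's distance, so either diagnostic would direct you to inspect it before trusting the fitted slope.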

Figure: two residuals-versus-fitted plots. Left: an even horizontal band (homoscedastic). Right: a fan that widens with the fitted value (heteroscedastic). A residuals-versus-fitted plot is the standard check for constant variance. An even band supports homoscedasticity; a widening fan signals heteroscedasticity, which leaves coefficients unbiased but distorts their standard errors.

The Multivariable Model

Almost without exception, the regression models used by epidemiologists will contain more than one predictor variable. These are known as multiple regression or multivariable models.

Terminology Note

Multivariate indicates 2 or more outcome variables; multivariable denotes more than 1 predictor. In epidemiology, we almost always mean multivariable models.

Multivariable linear regression (Eq 14.3)

\[ \color{#0B7B6B}{Y} = \color{#C2410C}{\beta_0} + \color{#6D28D9}{\beta_1} \color{#1D4ED8}{X_1} + \color{#6D28D9}{\beta_2} \color{#1D4ED8}{X_2} + \color{#BE185D}{\varepsilon} \]

The outcome equals the intercept plus each slope times its predictor, summed over all predictors, plus the error.

A major difference from simple regression is that in the multivariable model, β₁ is an estimate of the effect of X₁ on Y after controlling for the effects of X₂. This is the key advantage of multivariable analysis: it accounts for confounding by extraneous variables.
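A small simulation makes "controlling for X₂" concrete. In the sketch below, x2 is a confounder that drives both the exposure x1 and the outcome y; the true effect of x1 is set to 0.5. All coefficients and the sample size are assumptions chosen for illustration.

```r
# Simulated confounding; every number here is an illustrative assumption.
set.seed(2024)
n  <- 5000
x2 <- rnorm(n)                           # confounder
x1 <- 0.8 * x2 + rnorm(n)                # exposure, partly driven by x2
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n)   # true effect of x1 on y is 0.5

crude    <- coef(lm(y ~ x1))["x1"]       # simple model: x2 omitted
adjusted <- coef(lm(y ~ x1 + x2))["x1"]  # multivariable model: x2 included

c(crude = unname(crude), adjusted = unname(adjusted))
```

The crude slope lands well above 0.5 because it absorbs part of the effect of x2, while the adjusted slope is close to the value used to generate the data. In real data the true value is unknown, so the comparison only shows how much the estimate moves when the confounder is added.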

Why use multivariable models?

In observational studies, incorporating more than one predictor almost always leads to a more complete understanding of how the outcome varies, and it decreases the chance that the regression coefficients for exposures of interest are biased by confounding variables. The βs are not biased by any variable included in the equation, but they can be biased if confounding variables are omitted from the equation.

Confounders vs intervening variables

Assuming we have not included intervening variables or effects of the outcome in our model, the βs are not confounded by any variable in the regression equation. However, from a causal perspective, if intervening variables are included, the coefficients do not estimate the causal effect. One can never be sure that there are no important unmeasured confounders that were omitted from the model.
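The effect of adjusting for an intervening variable can also be sketched by simulation. Below, x affects y both directly (0.3) and through a mediator m (0.6 × 0.5 = 0.3), so the total effect is 0.6. These path values are assumptions chosen to keep the arithmetic easy.

```r
# Simulated mediation; path coefficients are illustrative assumptions.
set.seed(2024)
n <- 5000
x <- rnorm(n)                         # exposure
m <- 0.6 * x + rnorm(n)               # intervening variable on the path x -> m -> y
y <- 0.3 * x + 0.5 * m + rnorm(n)     # direct effect 0.3; total effect 0.3 + 0.6 * 0.5 = 0.6

total  <- coef(lm(y ~ x))["x"]        # mediator left out: estimates the total effect
direct <- coef(lm(y ~ x + m))["x"]    # mediator included: estimates only the direct effect

c(total = unname(total), direct = unname(direct))
```

Neither number is wrong; they answer different causal questions. Including m when the total effect is the target removes the part of the effect that flows through m.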

Trade-offs in model building

A major trade-off in model-building is to avoid omitting necessary confounding variables while not including variables of little importance. Including too many unimportant variables increases the number of βs estimated and may lead to poor performance of the equation on future datasets. Also, having to measure unnecessary variables increases the cost of future work.

R Activity: correlation and a first multivariable linear model

Picking up the cleaned phaa_survey_clean.csv from an earlier lesson, we will (1) test bivariate correlations, (2) inspect a correlation matrix for the numeric variables we plan to include in a model, and (3) fit a multivariable linear regression for systolic BP. The full annotated script is in r-activities/HSCI_410_Lesson_3_Linear_Regression.R.

# 0. Load the cleaned data + packages we will use ---------------------------
library(corrplot);  library(regclass);  library(caret)
phaa <- read.csv("phaa_survey_clean.csv", stringsAsFactors = FALSE)

# 1. Bivariate correlation between two numeric variables --------------------
cor.test(phaa$age, phaa$systolic_bp,
         method = "pearson")

# 2. Correlation matrix + visual --------------------------------------------
keep_num <- c("age", "bmi", "systolic_bp", "diastolic_bp",
              "phys_act_min", "discrimination_score",
              "social_support_score", "dep_score", "anx_score")
cor_mat <- cor(phaa[, keep_num], use = "complete.obs")
round(cor_mat, 2)
corrplot(cor_mat, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black", tl.srt = 45)

# 3. Set the reference level on a factor before fitting lm() ----------------
phaa$gender <- as.factor(phaa$gender)
phaa$gender <- relevel(phaa$gender, ref = "Woman")

# 4. Multivariable linear model for systolic BP -----------------------------
model_3 <- lm(systolic_bp ~ age + gender + smoker + bmi
                          + dep_score + phys_act_min,
              data = phaa)
summary(model_3)
confint(model_3)

# 5. Diagnostics: linearity, equal variance, normal residuals, outliers -----
par(mfrow = c(2, 2));  plot(model_3);  par(mfrow = c(1, 1))
VIF(model_3)            # multicollinearity
varImp(model_3)         # variable importance

How to read the output. Each coefficient in summary(model_3) is the average change in systolic BP per one-unit increase in that predictor, holding the other predictors constant. The (Intercept) is the predicted BP when every numeric predictor is 0 and every factor is at its reference level. That combination is often outside the data range (nobody has an age or BMI of 0), which is why we centre age in a later lesson. A VIF above 5 means that at least 80% of the variation in that predictor is already explained by the other predictors (VIF = 1/(1 − R²)), so its standard error is inflated.
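The VIF reported by VIF(model_3) can be reproduced by hand for numeric predictors: regress each predictor on the others and compute 1/(1 − R²). The sketch below does this on simulated data in base R. The variable waist is hypothetical (it is not in the course dataset) and is constructed to be strongly correlated with bmi; vif_by_hand() is a helper written for this example, not a package function.

```r
# Simulated data; `waist` is a hypothetical variable built to track bmi closely.
set.seed(7)
n     <- 500
age   <- runif(n, 20, 80)
bmi   <- rnorm(n, 27, 4)
waist <- 2.2 * bmi + rnorm(n, sd = 3)
sbp   <- 90 + 0.5 * age + 0.8 * bmi + rnorm(n, sd = 10)
d     <- data.frame(sbp, age, bmi, waist)

# Helper for this example: VIF = 1 / (1 - R^2) from regressing one
# predictor on all the other predictors.
vif_by_hand <- function(target, others, data) {
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

preds <- c("age", "bmi", "waist")
vifs  <- sapply(preds, function(p) vif_by_hand(p, setdiff(preds, p), d))
round(vifs, 2)

# Standard error of the bmi coefficient with and without its near-duplicate
se_with    <- summary(lm(sbp ~ age + bmi + waist, data = d))$coefficients["bmi", "Std. Error"]
se_without <- summary(lm(sbp ~ age + bmi, data = d))$coefficients["bmi", "Std. Error"]
c(se_with = se_with, se_without = se_without)
```

In this setup bmi and waist have high VIFs while age stays near 1, and the standard error for bmi shrinks considerably once waist is dropped.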

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From cor.test(phaa$age, phaa$systolic_bp), what is the Pearson r and its 95% CI? Does the CI exclude zero? Translate the magnitude into plain English (small / moderate / strong).

Model answer: Pearson r is around 0.30–0.35, 95% CI roughly (0.22, 0.41), clearly excluding zero. The magnitude is moderate by Cohen's benchmarks (small ~0.1, moderate ~0.3, strong ~0.5+), a real but modest age-BP relationship. The CI excluding zero plus a non-trivial effect size means age explains some of the variation in BP. Because r² is only about 0.09 to 0.12, roughly 90% of the variance is attributable to other factors.

2. In summary(model_3), what is the coefficient on age and its p-value? In one sentence, state the adjusted association of age and systolic BP, and whether the 95% CI from confint() excludes zero.

Model answer: summary(model_3) typically shows an age coefficient of ~0.5 mmHg per year (range 0.3–0.7 depending on covariate inclusion) with p < 0.001 in a sample of n > 500. The 95% CI from confint() excludes zero. Adjusted association: each additional year of age is associated with roughly 0.5 mmHg higher systolic BP, after accounting for gender, smoking, BMI, depression score, and physical activity. Because the model is linear, the difference accumulates at a constant rate: 0.5 mmHg per year over 40 years gives a predicted difference of about 20 mmHg between ages 30 and 70.

3. Look at VIF(model_3). Which predictor has the highest VIF? Is it above 5 or 10? If you removed it, how would you expect the SE on a correlated predictor to change?

Model answer: VIF(model_3) typically shows the highest VIF for whichever predictor is most correlated with the others (often BMI), usually around 2–3, below the conventional 5/10 thresholds. If a predictor with VIF > 5 were removed, the SE on its correlated counterpart would drop noticeably (typically by 15–30%), and the point estimate would shift slightly as the model re-attributes shared variance. Multicollinearity doesn't bias coefficients, but it inflates their SEs and widens their CIs, so true effects can appear non-significant.

Knowledge check: this section

1. What type of outcome variable is linear regression most suitable for?

Dichotomous outcomes
Continuous or near-continuous outcomes
Nominal categorical outcomes

Linear regression is suitable for modelling the outcome when it is measured on a continuous or near-continuous scale, such as birth weight, blood pressure, or body mass index.

2. In the equation Y = β₀ + β₁X₁ + ε, what does β₁ represent?

The predicted value of Y when X₁ = 0
The random error in the model
The change in Y for each one-unit increase in X₁

β₁ is the regression coefficient that describes how the mean value of Y changes for each one-unit increase in X₁. The intercept (β₀) is the value of Y when X₁ = 0.

3. What is the key advantage of a multivariable regression model over a simple regression model?

It can control for confounding by extraneous variables
It always produces larger R² values
It eliminates the need for the error term

The key advantage is that each β coefficient estimates the effect of its predictor after controlling for all other variables in the model, thereby reducing bias from confounding variables.

Reflection

Think of a continuous outcome variable in your field of interest. What predictors would you include in a regression model? How would you decide which variables are confounders versus intervening variables?

Model answer: Pick an outcome (e.g., HbA1c in adults with type-2 diabetes). Predictors: age, sex, BMI, physical activity, dietary patterns, medication adherence, sleep duration, depression score, SES. Confounders vs. intervening variables: confounders are causes of both exposure and outcome that exist before the exposure (age, SES, family history); intervening variables are on the causal pathway from exposure to outcome (medication adherence is between treatment intensity and HbA1c). The distinction is decided by a DAG, not by statistical tests: if a variable is a mediator and you want the total effect of an exposure, do NOT adjust for it (otherwise you block part of the causal path you're trying to estimate). If you want the direct effect, DO adjust. Pre-registering the DAG and the adjustment set prevents post hoc rationalisation.

Section 2

Hypothesis Testing & Effect Estimation

⏱ Estimated time: 20 minutes

Section 2 of 4

Hypothesis Testing & Effect Estimation

ANOVA decomposition, the overall F-test, individual t-tests, and the coefficient of determination.

The ANOVA decomposition

Partitioning total variation in Y

Sums of squares identity

\[ \underbrace{\sum(Y_i - \bar{Y})^2}_{\color{#0B7B6B}{\text{SST}}} = \underbrace{\sum(\hat{Y}_i - \bar{Y})^2}_{\color{#C2410C}{\text{SSM}}} + \underbrace{\sum(Y_i - \hat{Y}_i)^2}_{\color{#6D28D9}{\text{SSE}}} \]

SST: total variation; SSM: explained by model; SSE: residual (unexplained)

SSM

Model sum of squares. Variation explained by predictors. Degrees of freedom = k.

SSE

Error sum of squares. Residual variation. Degrees of freedom = n minus (k + 1).

SST

Total sum of squares. Degrees of freedom = n minus 1.

The F-test

Overall model significance

Overall F-statistic

\[ \color{#0B7B6B}{F} = \frac{\color{#C2410C}{\text{MSM}}}{\color{#6D28D9}{\text{MSE}}} = \frac{\text{SSM}/k}{\text{SSE}/(n-k-1)} \]

F: overall F-ratio; MSM: mean square, model; MSE: mean square, error

Null hypothesis: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$.

Example 14.1: birth weight on gestation length (n = 5,000). F(1, 4998) = 1,790, p < 0.0001, R² = 0.26. Each additional week of gestation adds about 124.5 grams of birth weight (95 percent CI: 118.7 to 130.3). (Dohoo et al., 2012)

Individual coefficients

The t-test for each slope

t-statistic for $\hat{\beta}_j$ (Eq 14.6)

\[ \color{#0B7B6B}{t} = \frac{\color{#C2410C}{\hat{\beta}_j} - \color{#6D28D9}{\beta^*}}{\color{#1D4ED8}{\text{SE}(\hat{\beta}_j)}}, \quad df = n - (k+1) \]

t: t-statistic; β̂ⱼ: estimated coefficient; β*: null value (usually 0); SE: standard error

95 percent confidence interval

\[ \color{#C2410C}{\hat{\beta}_j} \;\pm\; \color{#6D28D9}{t_{0.025}} \cdot \color{#1D4ED8}{\text{SE}(\hat{\beta}_j)} \]

β̂ⱼ: estimated coefficient; t₀.₀₂₅: critical t-value; SE: standard error

A non-significant result means insufficient precision to distinguish the effect from zero, not that the effect is zero. Prediction intervals for new observations are wider than confidence intervals, and widen away from the mean of X.

Model fit

R-squared and adjusted R-squared

Coefficient of determination

\[ \color{#0B7B6B}{R^2} = \frac{\color{#C2410C}{\text{SSM}}}{\color{#1D4ED8}{\text{SST}}} = 1 - \frac{\color{#6D28D9}{\text{SSE}}}{\color{#1D4ED8}{\text{SST}}} \]

R²: proportion of variance explained; SSM: model sum of squares; SSE: error sum of squares; SST: total sum of squares

Adjusted R-squared

\[ \color{#0B7B6B}{R^2_{\text{adj}}} = 1 - \frac{\color{#C2410C}{\text{MSE}}}{\color{#1D4ED8}{\text{MST}}} \]

R²_adj: R² adjusted for the number of predictors; MSE: mean square error; MST: mean square total

R² always rises

Adding a variable never lowers R², and in practice almost always raises it, regardless of whether the variable contains useful information.

Adjusted R² can fall

Adding an uninformative variable increases the MSE relative to MST, so adjusted R² declines. Use it to compare models of different sizes; test groups of terms with a partial F-test.

Carry forward

Testing done; now predictor form

The F-test and t-tests quantify whether the model explains variation; R² quantifies how much.
In observational studies, variable selection methods that maximize F produce inflated significance. Report adjusted R² alongside R².
A significant overall model can still contain poorly coded predictors. A later section addresses that directly.

Introduction and Overview

An earlier section set up the regression model. This section turns to the question of whether the model is doing useful work: how much of the variation in the outcome does it actually explain, and which individual coefficients are meaningfully different from zero? The ANOVA decomposition and the formal tests of model significance are how those questions get answered.

Learning Objectives

Decompose the variability of Y using the ANOVA sums-of-squares table.
Use the overall F-test to assess whether a regression model explains useful variation.
Test individual coefficients with t-tests and report effect sizes with 95% confidence intervals.
Interpret R² and adjusted R² as measures of model fit, and recognise their limits.

The ANOVA Table

The idea behind regression is that information in the X-variables can be used to predict the value of Y. The formal way this is approached is to ascertain how much of the sums of squares (SS) of Y we can explain with knowledge of the X-variable(s).

Source	Sums of Squares	df	Mean Square	F-test
Model (regression)	SSM = Σ(Ŷ_i − Ȳ)²	dfM = k	MSM = SSM/dfM	MSM/MSE
Error (residual)	SSE = Σ(Y_i − Ŷ_i)²	dfE = n−(k+1)	MSE = SSE/dfE
Total	SST = Σ(Y_i − Ȳ)²	dfT = n−1	MST = SST/dfT

Here, k is the number of predictor variables in the model (not counting the intercept). When the SS are divided by their degrees of freedom (df), the result is a mean square, denoted MSM (model), MSE (error), and MST (total). The MSE is our estimate of the error variance σ², and the square root of the MSE is called the root MSE or the standard error of prediction.
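Every entry in the ANOVA table can be computed directly from a fitted model. The sketch below does so on simulated data with k = 2 predictors and compares the hand-built F-statistic and root MSE with what summary() reports. The data-generating values are assumptions for illustration only.

```r
# Simulated data with k = 2 predictors; coefficients are illustrative.
set.seed(14)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
k   <- 2

# Sums of squares
sst <- sum((y - mean(y))^2)             # total
ssm <- sum((fitted(fit) - mean(y))^2)   # model (explained)
sse <- sum(residuals(fit)^2)            # error (residual)

# Mean squares and the overall F-test
msm    <- ssm / k
mse    <- sse / (n - (k + 1))
f_stat <- msm / mse
p_val  <- pf(f_stat, k, n - (k + 1), lower.tail = FALSE)

c(SST = sst, SSM_plus_SSE = ssm + sse)
c(F = f_stat, p = p_val, root_MSE = sqrt(mse), R2 = ssm / sst)

# Compare with the values R reports
summary(fit)$fstatistic
summary(fit)$sigma
```

summary(fit)$sigma is R's name for the root MSE, and the "value" element of summary(fit)$fstatistic is the overall F.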

Assessing the Significance of a Linear Regression Model

We use the F-test from the ANOVA table to assess whether the predictors collectively have a statistically significant relationship with the outcome. The null hypothesis is H₀: β₁ = β₂ = … = β_k = 0.

In plain terms, the overall F-test asks a single yes-or-no question: taken as a set, do the predictors track the outcome better than simply predicting its overall mean for everyone? If the answer is no, the F ratio sits near 1; a large F with a small p-value says the predictors are doing real work.

Example 14.1: Birth Weight on Gestation Length

A simple linear regression model with birth weight (-bwt-) as the outcome and gestation length (-gest-) as the sole predictor was fit using the bw5k dataset (n = 5,000).

Results: F(1, 4998) = 1,790.09, P < 0.0001, R² = 0.2637. The coefficient for -gest- is 124.5 gm per week (95% CI: 118.7–130.3), meaning for each additional week of gestation, birth weight increases by approximately 124.5 gm.
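As a consistency check on these results, R² can be recovered from the F-statistic and its degrees of freedom alone. Rearranging the definition of F and substituting into R² = SSM/SST, with SST = SSM + SSE:

\[ F = \frac{\text{SSM}/k}{\text{SSE}/(n-k-1)} \;\Rightarrow\; \frac{\text{SSM}}{\text{SSE}} = \frac{kF}{n-k-1}, \qquad R^2 = \frac{\text{SSM}}{\text{SSM}+\text{SSE}} = \frac{kF}{kF + (n-k-1)} \]

With k = 1 and n = 5,000:

\[ R^2 = \frac{1790.09}{1790.09 + 4998} = \frac{1790.09}{6788.09} \approx 0.2637 \]

This matches the reported R², which illustrates that the F-test and R² are two views of the same sums-of-squares decomposition.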

Testing Individual Regression Coefficients

A t-test with n−(k+1) degrees of freedom is used to evaluate the significance of any individual regression coefficient. The usual null hypothesis is H₀: β_j = 0.

t-test for a regression coefficient (Eq 14.6)

\[ \color{#0B7B6B}{t} = \frac{\color{#C2410C}{\hat{\beta}_j} - \color{#6D28D9}{\beta^*}}{\color{#1D4ED8}{\text{SE}(\hat{\beta}_j)}} \]

The t-statistic is the estimated coefficient minus its null value (usually 0), divided by the coefficient’s standard error.

Predictions & intervals

Two sources of uncertainty stack when you use a fitted model to predict. The first is uncertainty about where the regression line itself sits, captured by the standard error of the fitted mean. The second is the natural scatter of an individual observation around that line. A 95% confidence interval for the mean of Y at a chosen value x* uses only the first source: Ŷ ± t₀.₀₂₅·SE(Ŷ). A prediction interval for a single new individual adds the second source, so it is always wider than the confidence interval. Both intervals widen as x* moves further from the mean of X₁, because the line is pinned down most tightly near the centre of the data.
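Both kinds of interval are available from predict() in base R. The sketch below uses simulated age and blood pressure data (all values are assumptions) and requests intervals at the mean age and at a point 25 years above it, then rebuilds the confidence interval by hand from the fitted value, its standard error, and the critical t-value.

```r
# Simulated data; not the course dataset.
set.seed(528)
n   <- 60
age <- runif(n, 20, 80)
sbp <- 100 + 0.5 * age + rnorm(n, sd = 10)
fit <- lm(sbp ~ age)

# Two target ages: the sample mean, and 25 years above it
new <- data.frame(age = c(mean(age), mean(age) + 25))

conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

# The same confidence interval by hand: fitted value +/- t * SE(fitted value)
p      <- predict(fit, newdata = new, se.fit = TRUE)
t_crit <- qt(0.975, df = df.residual(fit))
manual <- cbind(lwr = p$fit - t_crit * p$se.fit,
                upr = p$fit + t_crit * p$se.fit)

# Compare interval widths
width <- function(m) unname(m[, "upr"] - m[, "lwr"])
data.frame(age = new$age, ci_width = width(conf_int), pi_width = width(pred_int))
```

The prediction interval is wider at both ages, and both intervals are wider at the age further from the mean.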

R² and adjusted R²

R² (the coefficient of determination) describes the proportion of variance in the outcome “explained” by the predictor variables. One formula: R² = SSM/SST = 1 − (SSE/SST). Unfortunately, R² never decreases when a variable is added to the model, even one that carries no information about the outcome. The adjusted R² = 1 − (MSE/MST) adjusts for the number of predictors and is useful for comparing models with different numbers of variables.
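The behaviour of the two measures can be checked by adding a pure-noise predictor to a model. In the sketch below, noise is generated independently of y (by construction), so any gain in R² is spurious. The data are simulated and the numbers are assumptions.

```r
# Simulated data; `noise` is unrelated to y by construction.
set.seed(532)
n     <- 80
x1    <- rnorm(n)
noise <- rnorm(n)
y     <- 1 + 0.7 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_big   <- lm(y ~ x1 + noise)
s_small   <- summary(fit_small)
s_big     <- summary(fit_big)

c(R2_small  = s_small$r.squared,     R2_big  = s_big$r.squared)
c(adj_small = s_small$adj.r.squared, adj_big = s_big$adj.r.squared)

# Adjusted R-squared from its definition, 1 - MSE / MST (k = 2 predictors)
mse <- sum(residuals(fit_big)^2) / (n - 3)
mst <- sum((y - mean(y))^2) / (n - 1)
adj_by_hand <- 1 - mse / mst
adj_by_hand
```

R² for the larger model cannot be lower than for the smaller one. Adjusted R² rises only when the added term's t-statistic exceeds 1 in absolute value, so for a noise variable it will usually, though not always, fall.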

Testing groups of predictor variables

Sometimes it is necessary to simultaneously evaluate the significance of a group of X-variables (e.g., a set of indicator variables for a nominal variable). We compare the SSE of the full model with the SSE of the reduced model (without the group) using a partial F-test. This tells us whether the set of variables as a group contributes significantly to the model.
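In R, a partial F-test is what anova() computes when given two nested lm fits. The sketch below tests a three-level nominal predictor (two indicator variables) as a group and reproduces the statistic by hand from the two error sums of squares. The variable region and all numeric values are hypothetical.

```r
# Simulated data; `region` is a hypothetical three-level nominal predictor.
set.seed(536)
n      <- 150
region <- factor(sample(c("north", "south", "west"), n, replace = TRUE))
age    <- runif(n, 20, 80)
sbp    <- 100 + 0.5 * age + ifelse(region == "west", 4, 0) + rnorm(n, sd = 10)

reduced <- lm(sbp ~ age)            # model without the group of indicators
full    <- lm(sbp ~ age + region)   # model with the group (k - 1 = 2 indicators)

# Built-in partial F-test for nested models
tab <- anova(reduced, full)
tab

# The same statistic by hand
sse_r <- sum(residuals(reduced)^2)
sse_f <- sum(residuals(full)^2)
q     <- nlevels(region) - 1        # number of terms being tested together
f_partial <- ((sse_r - sse_f) / q) / (sse_f / df.residual(full))
p_partial <- pf(f_partial, q, df.residual(full), lower.tail = FALSE)
c(F = f_partial, p = p_partial)
```

Testing the indicators together avoids the trap of judging a nominal variable by whichever single level happens to have the smallest p-value.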

Interpreting the F-statistic

The F-test has a straightforward interpretation only when the X-variables are manipulated treatments in a controlled experiment. In observational studies, the F-statistic is influenced by the number of variables available, their correlations, the total number of subjects, and the method used for variable selection. Most variable selection methods tend to maximise F, meaning the observed F overestimates the actual significance of the model.

🎲 Interactive: What Does a p-Value Actually Mean?

Run hundreds of simulated studies. Each study fits a regression of Y on X with a chosen true effect and sample size. Watch the distribution of p-values build up. With no real effect, p-values are uniform on [0,1]. With a real effect, p-values pile up near zero. Power = the proportion below α.

One simulated study (most recent)

A scatter of n points; black line = OLS fit; t-statistic and p-value displayed.

Distribution of p-values across studies

Histogram of all p-values run so far. Red region = p < α (significant).

True effect (β) 0.50

Sample size (n) 30

Residual SD (σ) 1.00

α (significance threshold) 0.05

Try this: switch truth to "no effect", run 1,000 studies. The histogram is flat; that is what a uniform p-value distribution looks like. Now switch to "real effect" and run again: the histogram piles up at zero. Power is just how much it piles up at zero.
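A rough offline version of this simulation can be written in a few lines of R. The sketch below reuses the interactive's default settings (true effect 0.5, n = 30, residual SD 1, α = 0.05); sim_p() is a helper written for this example, and drawing X from a standard normal distribution is an assumption.

```r
# Helper for this example: simulate one study and return the p-value for the slope.
sim_p <- function(beta, n = 30, sigma = 1) {
  x <- rnorm(n)                           # assumption: X is standard normal
  y <- beta * x + rnorm(n, sd = sigma)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}

set.seed(576)
alpha  <- 0.05
p_null <- replicate(1000, sim_p(beta = 0))     # no real effect
p_alt  <- replicate(1000, sim_p(beta = 0.5))   # real effect

type1_rate <- mean(p_null < alpha)   # should sit near alpha
power_est  <- mean(p_alt < alpha)    # empirical power
c(type1_rate = type1_rate, power = power_est)

# Optional: view the two p-value distributions
# hist(p_null, breaks = 20)
# hist(p_alt,  breaks = 20)
```

Under the null, about 5% of studies are "significant" purely by chance; under the alternative, the share below α is the empirical power for that effect size and sample size.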

Knowledge check: Section 2

1. What does the F-test in the ANOVA table assess?

Whether individual regression coefficients differ from zero Whether the predictors collectively have a significant relationship with the outcome Whether the residuals are normally distributed

The F-test from the ANOVA table tests the null hypothesis that all regression coefficients (except the intercept) are simultaneously equal to zero. It assesses the overall significance of the model.

2. What does R² (the coefficient of determination) measure?

The proportion of variance in the outcome explained by the predictor variables The correlation between two predictor variables The probability that the model is correct

R² = SSM/SST = 1 − (SSE/SST). It represents the amount of variance in the outcome variable that is “explained” or “accounted for” by the predictor variables in the model.

3. Why is adjusted R² preferred over R² when comparing models with different numbers of predictors?

Adjusted R² is always larger than R² R² decreases when variables are added to the model R² always increases with more predictors, regardless of their usefulness; adjusted R² penalises for added variables

R² never decreases as variables are added to a regression model, even when they add nothing useful. The adjusted R² = 1 − (MSE/MST) accounts for the number of variables and will tend to decline if the added variables contain little additional information about the outcome.

Reflection

Consider a regression model you have seen in published research or coursework. How would you interpret the R² value? What does a low R² mean practically, and does it necessarily indicate a poor model?

Model answer: R² quantifies the proportion of variance in the outcome explained by the predictors in the model. A low R² (e.g., 0.05) means 5% of variance is explained, which does not necessarily mean the model is bad. In epidemiology, low R² is common because health outcomes have many causes; even a well-specified model of CVD might have R² = 0.15 because genetics, environment, and chance all contribute. What matters is (a) whether the model coefficients are estimating the causal quantity of interest with reasonable precision, and (b) whether the model fits the data (residual diagnostics). A high R² with biased coefficients is worse than a low R² with unbiased ones (Anscombe, 1973). For prediction-focused work R² matters more; for causal inference, it is secondary to identification.

Section 3

Nature of X-Variables & Collinearity

⏱ Estimated time: 20 minutes

Section 3 of 4

Scaling, indicator coding, hierarchical indicators, and diagnosing multicollinearity with the variance inflation factor.

Continuous predictors

Scaling and centering

Subtracting a reference value from X makes the intercept interpretable without changing the slope or its standard error.

Centered gestation (example)

\[ \color{#0B7B6B}{\text{gest39}} = \color{#C2410C}{\text{gest}} - 39 \]

where gest39 = centred gestation and gest = gestational age (weeks).

Before centering

Intercept = −1,514 g (birth weight at 0 weeks). Uninterpretable.

After centering at 39 weeks

Intercept = 3,341 g (birth weight at average gestation). Meaningful.

Categorical predictors

Indicator (dummy) variables

A nominal variable with $j$ levels needs $j - 1$ indicator variables. The omitted level is the reference category.

Multicollinearity

Variance inflation factor

VIF formula (Eq 14.12)

\[ \color{#0B7B6B}{\text{VIF}_j} = \frac{1}{1 - \color{#C2410C}{R^2_{X_j}}} \]

where VIFⱼ = variance inflation factor for predictor j and R²(Xⱼ) = proportion of the variance of Xⱼ explained by the other predictors.

$R^2_{X_j}$ is the R-squared from regressing $X_j$ on all other predictors. VIF = 1: no collinearity. VIF > 10: serious inflation.

Before centering gest

Correlation between gest and gest-squared = 0.99. VIF = 131. SE of gest inflated 11-fold.

After centering

VIF drops to 1.54. SE returns near its original value. Interpretation restored.

Responding to collinearity

Three options, one principle

Drop one

Use the causal diagram to decide which correlated predictor is the more proximal cause.

Combine

Form a composite score or principal component from the correlated variables.

Keep with caution

If both are needed on substantive grounds, report the inflated standard errors and wide confidence intervals explicitly.

Collinearity inflates standard errors but does not bias coefficient estimates. Predictions remain accurate, but the individual coefficients become imprecise and hard to interpret.

Carry forward

What to take into the next section

Center continuous predictors to make the intercept meaningful and reduce collinearity in constructed terms.
A $j$-level nominal variable needs $j - 1$ indicators with an explicit reference category.
Check VIF for all predictors. Values above 5 deserve attention; above 10 signals serious collinearity.

Introduction and Overview

An earlier section evaluated the model's overall fit. This section turns to a practical question that often determines whether your model gives sensible answers: are your predictors entered correctly? Continuous, categorical, indicator, and polynomial predictors all need different handling, and highly correlated predictors (multicollinearity) can destabilize coefficient estimates without obvious warning signs.

Learning Objectives

Choose appropriate scaling for continuous predictors so that coefficients are interpretable.
Convert nominal and ordinal categorical predictors into indicator variables (dummy coding).
Recognise hierarchical indicator structures and code them correctly.
Detect collinearity using correlation matrices and the variance inflation factor (VIF).
Decide when collinear predictors should be dropped, combined, or kept with caution.

Types of Predictor Variables

The X-variables can be continuous or categorical. Categorical variables can be either nominal (levels with no meaningful numerical representation, e.g., race or city of residence) or ordinal (ordered levels, e.g., severity: low, medium, high). Nominal and ordinal variables with more than 2 levels must be converted to indicator variables before entering the regression.

Scaling Variables

Often the predictor variables have a limited range of possible or sensible values. For example, if gestation length is a predictor, the intercept reflects birth weight at 0 weeks, which is meaningless. It is useful to scale these variables by subtracting the lowest possible sensible value (or the average) before entering them into the model. This makes the intercept interpretable without changing the regression coefficient or its SE.

Example: Subtracting 39 weeks (the average gestation length) from -gest- gives gest39 = gest − 39. Now β₀ reflects birth weight for a 39-week gestation (3,341 g), a much more meaningful value than the original constant of −1,514 g.
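The claim that centring changes the intercept but not the slope or its SE can be verified directly. The sketch below uses synthetic gestation and birth weight values (assumptions for illustration, not the course dataset), and `fit_line` is a hypothetical helper.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope, slope SE) from a simple OLS fit of y on x."""
    n = len(x)
    design = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    mse = np.sum(resid ** 2) / (n - 2)
    slope_se = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
    return beta[0], beta[1], slope_se

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
gest = rng.normal(39, 2, size=500)                              # weeks
bwt = 3300 + 120 * (gest - 39) + rng.normal(0, 400, size=500)   # grams

b0_raw, b1_raw, se_raw = fit_line(gest, bwt)        # intercept refers to 0 weeks
b0_c, b1_c, se_c = fit_line(gest - 39, bwt)         # intercept refers to 39 weeks
print(b0_raw, b0_c, b1_raw, b1_c, se_raw, se_c)
```

The centred intercept equals the raw intercept plus 39 times the slope, which is simply the fitted birth weight at 39 weeks.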

Regular (Disjoint) Indicator Variables

Indicator variables (also called dummy variables) are created variables whose values have no direct physical relationship to the characteristic being described. For a nominal variable with j levels, we need j − 1 indicator variables. The omitted level becomes the referent (comparison) category.

Example: For mother’s race with 3 categories, we create 2 indicator variables (X₁ and X₂). Race 3 (with both indicators = 0) becomes the referent. β₁ estimates the difference in outcome between races 1 and 3, while β₂ estimates the difference between races 2 and 3.
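A short sketch of this coding, with level 3 as the referent. The data are synthetic and the group means are made up for illustration; only the coding scheme follows the text. With indicators as the only predictors, each coefficient reproduces a difference in group means.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(4)
race = rng.integers(1, 4, size=600)                 # hypothetical levels 1, 2, 3
level_means = np.array([0.0, 3200.0, 3350.0, 3300.0])   # index 0 unused
y = level_means[race] + rng.normal(0, 300, size=600)

# j - 1 = 2 indicators; level 3 has both equal to 0 and is the referent.
x1 = (race == 1).astype(float)
x2 = (race == 2).astype(float)
design = np.column_stack([np.ones(len(y)), x1, x2])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]

group_mean = {k: y[race == k].mean() for k in (1, 2, 3)}
print(b0, group_mean[3])                     # intercept = mean of the referent
print(b1, group_mean[1] - group_mean[3])     # level 1 versus level 3
print(b2, group_mean[2] - group_mean[3])     # level 2 versus level 3
```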

Hierarchical (Incremental) Indicator Variables

If the predictor variables are ordinal in type (reflecting relative changes in an underlying characteristic), hierarchical indicator variables are often preferred. These contrast the outcome in each level against the level immediately preceding it (assuming all hierarchical variables are in the model).

Example: For mother’s education (4 levels), the disjoint indicators compare each level with the lowest (baseline) level. With hierarchical indicators, each coefficient instead reflects the step up from the level immediately below: the coefficient for level 4 is the difference between level 3 (some college) and level 4 (university degree), so the coefficients show the incremental effect of each step up in education.

Variable	Coefficient, disjoint indicator coding	Coefficient, hierarchical coding
meduc_c4=2 (high school diploma)	20.046	20.046
meduc_c4=3 (some college)	53.270	33.224
meduc_c4=4 (university degree)	80.599	27.329
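The pattern in the table (each hierarchical coefficient is the difference between successive disjoint coefficients, e.g. 53.270 − 20.046 = 33.224) can be reproduced with a small sketch. The data are synthetic. Hierarchical indicators are built here as "level ≥ k", which is one common construction and is assumed rather than taken from the source.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(5)
level = rng.integers(1, 5, size=800)                 # hypothetical ordinal levels 1..4
level_means = np.array([0.0, 3200.0, 3220.0, 3253.0, 3281.0])   # index 0 unused
y = level_means[level] + rng.normal(0, 300, size=800)
ones = np.ones(len(y))

# Disjoint coding: indicator k is 1 only at level k (level 1 is the referent).
disjoint = np.column_stack([ones] + [(level == k).astype(float) for k in (2, 3, 4)])
# Hierarchical coding: indicator k is 1 at level k and every level above it.
hierarchical = np.column_stack([ones] + [(level >= k).astype(float) for k in (2, 3, 4)])

b_disjoint = np.linalg.lstsq(disjoint, y, rcond=None)[0][1:]
b_hier = np.linalg.lstsq(hierarchical, y, rcond=None)[0][1:]

print(b_disjoint)            # each level versus level 1
print(b_hier)                # each level versus the level just below
print(np.cumsum(b_hier))     # running sum recovers the disjoint coefficients
```

Both codings describe the same fitted model; only the contrasts reported by the coefficients differ.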

Detecting Highly Correlated (Collinear) Variables

If the predictor variables are too highly correlated, a number of problems arise. The estimated effect of each variable depends on the other predictors in the model. With highly (positively) correlated predictors, the estimated βs will themselves be highly and negatively correlated, and in extreme cases none of the individual coefficients will be significantly different from zero despite a significant overall F-test.

Variance inflation factor (Eq 14.12)

\[ \color{#0B7B6B}{\text{VIF}_j} = \frac{1}{1 - \color{#C2410C}{R^2_{X_j}}} \]

The variance inflation factor for a predictor grows as the share of its variance explained by the other predictors approaches 1, signalling multicollinearity.
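A minimal sketch of Eq 14.12: each predictor is regressed on all the others, and the resulting R² is turned into a VIF. The function name `vif` and the three synthetic predictors are assumptions for illustration; statistical packages provide their own VIF routines.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (n x p): regress it on the other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        sse = np.sum((target - others @ beta) ** 2)
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1 - sse / sst               # share of X_j explained by the other predictors
        out[j] = 1 / (1 - r2)
    return out

# Synthetic predictors (an assumption for illustration).
rng = np.random.default_rng(6)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of a

vif_two = vif(np.column_stack([a, b]))
vif_three = vif(np.column_stack([a, b, c]))
print(vif_two)     # both close to 1
print(vif_three)   # a and c are heavily inflated, b is not
```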

Collinearity Example

When a quadratic term (-gest_sq-) was added to a model already containing -gest-, the correlation between the two was 0.99, giving a VIF of 131. The SE of -gest- increased over 11 times (from 2.94 to 32.99). Centring -gest- by subtracting 39 (the mean) reduced the VIF from 131 to just 1.54 and the SE back down to 3.58.
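The effect of centring on a quadratic term can be sketched with synthetic gestation values (normally distributed around 39 weeks, an assumption; real gestation lengths are skewed, so the numbers will not match those quoted above). With only two predictors, the VIF reduces to 1 / (1 − r²), where r is their correlation. `corr_and_vif` is a hypothetical helper.

```python
import numpy as np

def corr_and_vif(x):
    """Correlation between x and x squared, and the implied two-predictor VIF."""
    r = np.corrcoef(x, x ** 2)[0, 1]
    return r, 1 / (1 - r ** 2)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(7)
gest = rng.normal(39, 2.5, size=1000)    # weeks

r_raw, vif_raw = corr_and_vif(gest)              # gest versus gest squared
r_centred, vif_centred = corr_and_vif(gest - 39) # centred versions

print(r_raw, vif_raw)
print(r_centred, vif_centred)
```

Over a range far from zero, a variable and its square are almost linearly related; after centring, the square measures distance from the centre and is nearly unrelated to the linear term.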

Knowledge check: this section

1. For a nominal variable with 4 categories, how many indicator (dummy) variables are needed?

4 3 2

For a nominal variable with j levels, we need j − 1 indicator variables. The omitted level becomes the referent (reference) category for comparison. With 4 categories, we need 3 indicator variables.

2. What does a VIF value greater than 10 suggest?

The variable is not a significant predictor The model has good predictive validity There is serious collinearity between predictor variables

A conservative guide for interpreting VIFs is that values above 10 indicate serious collinearity. While this does not necessarily mean the model is useless, it should always be taken as a warning about the interpretation of regression coefficients and the increase in their standard errors.

3. What is the primary purpose of centring a continuous variable before adding it to a regression model?

To reduce correlations between constructed variables (e.g., power terms or interaction terms) To improve the R² of the model To change the regression coefficient of the predictor

Centring reduces the correlation between a variable and its constructed derivatives (such as power terms or interaction terms). It does not change the predictions or the fit of the model, only the values and interpretation of the regression coefficients and the intercept.

Reflection

Why might highly correlated predictor variables cause problems in a multivariable regression model? What strategies would you use to detect and address collinearity in your own analyses?

Model answer: Highly correlated predictors cause variance inflation: the OLS estimator distributes shared variance between the correlated predictors, producing large standard errors and unstable coefficients (a small change in data shifts a coefficient by a lot). The point estimates remain unbiased in expectation but are noisy (White, 1980; Long & Ervin, 2000). Detection: compute VIF (rule of thumb > 5 or > 10 is concerning), examine the correlation matrix among predictors, run condition indices on the design matrix. Strategies to address: (a) drop one of the correlated pair (justified by DAG: keep the one closer to the causal mechanism); (b) combine them into a single index (principal-component or composite score); (c) use shrinkage methods (ridge regression, LASSO) that handle collinearity by design (Belsley, Kuh, & Welsch, 1980); (d) increase sample size if feasible. None of these is a substitute for substantive thinking about whether both variables are needed.

Section 4

Interaction & Causal Interpretation

⏱ Estimated time: 20 minutes

Section 4 of 4

Testing effect modification with a product term, and reading regression coefficients through a causal diagram.

Interaction terms

When one effect depends on another

Model with interaction (Eq 14.15)

\[ \color{#0B7B6B}{Y} = \beta_0 + \beta_1 \color{#1D4ED8}{X_1} + \beta_2 \color{#1D4ED8}{X_2} + \color{#C2410C}{\beta_3} (\color{#1D4ED8}{X_1} \times \color{#1D4ED8}{X_2}) + \varepsilon \]

where Y = outcome, β₃ = interaction coefficient, and X₁ × X₂ = product of the two predictors.

If $\beta_3 \neq 0$, the slope of $X_1$ on $Y$ equals $\beta_1 + \beta_3 X_2$, changing with the value of $X_2$.

Example 14.9: weight gain × birth order, $\beta_3 = -88.4$, p = 0.010. High weight gain adds 227 g in first births but 139 g in later births.

Interactions in practice

Categorical interactions, and restraint

Categorical × categorical

Cross every indicator: a 3-level by 4-level interaction needs (3−1) × (4−1) = 6 product terms, tested together with a partial F-test.

Practical advice

Limit interactions to those with biological relevance. Investigate 3- and 4-way interactions only with good, biologically sound reasons.

Interpreting with interactions

Main effects at a specific value

When $\beta_3 \neq 0$, plot the relationship of $X_1$ and $Y$ at representative values of $X_2$ to communicate the interaction effect clearly.

Causal interpretation

The DAG decides what to adjust for

Confounders versus mediators

Two roles, two decisions

Confounder

Causes both exposure and outcome. Sits outside the causal path. Include it to reduce bias and estimate the total effect of the exposure.

Mediator (intervening variable)

Lies on the causal path from exposure to outcome. Exclude it to estimate the total effect. Include it only to estimate the direct effect.

The DAG, drawn before analysis, makes this classification explicit. Statistical tests cannot make this distinction for you.

Carry forward

What to take to the final assessment

A significant interaction term means the slope of one predictor changes across levels of another. Main effects must then be interpreted at specific values.
A DAG drawn before analysis decides which variables are confounders (adjust for them) and which are mediators (exclude for total effects).
The next lesson extends this to model-building strategies: deciding which variables should enter the model and in what order.

Introduction and Overview

Earlier sections set up a model with main effects only. This section takes two final design steps: testing whether the effect of one predictor depends on another (interaction) and giving the resulting coefficients a defensible causal interpretation. Both push linear regression beyond a curve-fitting exercise into a tool for answering causal questions, anchored in the DAG-based framework you met in an earlier lesson.

Learning Objectives

Specify and test interaction terms between two predictors.
Interpret a model with interactions correctly: main effects no longer have a single overall meaning.
Use a DAG to decide which covariates belong in the model for a causal question.
Distinguish confounders from mediators and explain why adjusting for a mediator can mislead.
Translate a fitted regression into a defensible causal claim, with explicit assumptions.

Detecting and Modelling Interaction

Given the component cause model, we might expect to see interaction when 2 factors act synergistically or antagonistically. In previous sections, models contained only main effects, which assumes that the association between X₁ and Y is the same at all levels of X₂. An interaction term tests whether the effect of one variable depends on the level of another.

Model with interaction term (Eq 14.15)

\[ \color{#0B7B6B}{Y} = \beta_0 + \beta_1 \color{#1D4ED8}{X_1} + \beta_2 \color{#1D4ED8}{X_2} + \color{#C2410C}{\beta_3} (\color{#1D4ED8}{X_1} \times \color{#1D4ED8}{X_2}) + \varepsilon \]

Adding the interaction coefficient on the product of two predictors lets the effect of one predictor depend on the level of the other.

We assess interaction by testing whether β₃ = 0. If the interaction is absent (i.e., β₃ is not significantly different from 0), the main effects (additive) model is deemed adequate. If the interaction is needed, centring becomes useful because it allows us to interpret β₁ and β₂ as linear effects when the centred version of the other variable is zero.
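To see why centring helps once an interaction is in the model, the sketch below fits Eq 14.15 twice on synthetic data: once with X₂ on its raw scale and once with X₂ centred at 50. All names and values are hypothetical. The interaction coefficient is unchanged, while β₁ moves from "slope of X₁ when X₂ = 0" to "slope of X₁ when X₂ = 50".

```python
import numpy as np

def fit_interaction(x1, x2, y):
    """OLS fit of y on x1, x2 and their product; returns (b0, b1, b2, b3)."""
    design = np.column_stack([np.ones(len(y)), x1, x2, x1 * x2])
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(8)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(50, 5, size=n)           # X2 = 0 lies far outside the data
y = 10 + 2.0 * x1 + 0.3 * x2 + 0.1 * x1 * x2 + rng.normal(size=n)

b_raw = fit_interaction(x1, x2, y)
b_centred = fit_interaction(x1, x2 - 50, y)

# Slope of X1 at a chosen value of X2 is b1 + b3 * X2 (raw scale).
slope_at_50 = b_raw[1] + b_raw[3] * 50
print(b_raw[1], b_centred[1], slope_at_50, b_raw[3], b_centred[3])
```

After centring, β₁ can be read directly as the effect of X₁ at a typical value of X₂ rather than at an impossible one.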

Interaction between two dichotomous variables

Example 14.9: The dichotomous versions of maternal weight gain (wtgain_c2: <30 lb vs ≥30 lb) and total birth order (tbo_c2: primiparous vs multiparous) were evaluated. The main effects model showed both factors were significant. Adding the interaction term (wg_c2*tbo_c2) revealed a significant interaction (β₃ = −88.4, P = 0.010).

This means the positive effect of multiparous birth on birth weight is present if weight gain is low, but is negligible if weight gain is high. Similarly, high weight gain has a bigger effect in primiparous births (227 g) than in multiparous births (139 g).
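The two weight-gain effects follow from the interaction equation. The worked version below assumes 0/1 coding with low weight gain and primiparous birth as the referents, and takes 227 g as the weight-gain coefficient β₁ (implied by the primiparous effect quoted above rather than stated as a coefficient).

\[ \text{effect of high weight gain} = \beta_1 + \beta_3 \times \text{tbo} \]

\[ \text{primiparous } (\text{tbo} = 0): \quad 227 + (-88.4)(0) = 227 \text{ g} \]

\[ \text{multiparous } (\text{tbo} = 1): \quad 227 + (-88.4)(1) = 138.6 \approx 139 \text{ g} \]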

Interactions with categorical variables

Interactions involving categorical variables (with more than 2 levels) are modelled by including products between all indicator variables needed in the main effects model. For example, the interaction between a 3-level and a 4-level categorical variable requires (3−1) × (4−1) = 6 product variables. These 6 variables should be tested and explored as a group using the partial F-test.
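The count of product variables can be made concrete with a small sketch. The two categorical variables and the helper `indicators` are hypothetical; level 0 of each variable is taken as the referent.

```python
import numpy as np

def indicators(codes, n_levels):
    """Disjoint 0/1 indicators for levels 1..n_levels-1; level 0 is the referent."""
    return np.column_stack([(codes == k).astype(float) for k in range(1, n_levels)])

# Synthetic categorical variables (an assumption for illustration).
rng = np.random.default_rng(9)
a = rng.integers(0, 3, size=400)     # 3-level variable
b = rng.integers(0, 4, size=400)     # 4-level variable

A = indicators(a, 3)                 # 2 indicator columns
B = indicators(b, 4)                 # 3 indicator columns

# Every indicator of one variable multiplied by every indicator of the other.
products = np.column_stack(
    [A[:, i] * B[:, j] for i in range(A.shape[1]) for j in range(B.shape[1])]
)
design = np.column_stack([np.ones(len(a)), A, B, products])
print(products.shape, design.shape)
```

The full design has 1 + 2 + 3 + 6 = 12 columns, one per cell of the 3 × 4 table, and the 6 product columns are the group to be tested together.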

Practical advice on interaction terms

In many multivariable analyses, the number of possibilities for interaction is large and there is no single correct way to assess if interaction is present. Unless the potential number of interactions is small, interactions should be limited to those of biological relevance. It is generally recommended that 3- and 4-way interactions only be investigated when there are good, biologically sound reasons for doing so.

Causal Interpretation of a Multivariable Linear Model

So far, we have focused on the technical interpretation of regression coefficients. When making causal inferences, extra care is needed to ensure that only the appropriate variables are included in the analysis. A causal diagram is very helpful in this regard.

Key Causal Principle

If a variable is an intervening variable (on the causal pathway between exposure and outcome), including it in the model will change the interpretation (Greenland, Pearl, & Robins, 1999). For example, if gestation length is an intervening variable between cigarette smoking and birth weight, including -gest- in the model adjusts away part of the causal effect of smoking. The total effect of smoking would be obtained from a model without -gest-, while the direct effect (not mediated through gestation) would require including it.

Example: Causal Model for Smoking & Birth Weight

Our objective is to evaluate the effects of cigarette smoking (-cig-) on birth weight (-bwt-). The causal diagram indicates that gestation length (-gest-) is an intervening variable between -cig- and -bwt-. Consequently, -gest- and -wtgain- should be excluded from the model when estimating the total causal effect of smoking on birth weight.

The model includes: -white- (potential confounder), -college- (potential confounder), and -cig_2- (the exposure of interest). The interaction between -cig_2- and -white- was assessed.
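The difference between the total and direct effect can be illustrated with a simulation in which the true causal structure is known. Everything here is synthetic and hypothetical (variable names, effect sizes, and a single confounder `conf`); it is a sketch of the principle, not a re-analysis of the birth weight data.

```python
import numpy as np

def coef(columns, y):
    """OLS coefficients for y on an intercept plus the given predictor columns."""
    design = np.column_stack([np.ones(len(y))] + columns)
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Synthetic causal structure (an assumption for illustration):
# conf -> smoke, conf -> bwt, smoke -> gest -> bwt, smoke -> bwt.
rng = np.random.default_rng(10)
n = 20000
conf = rng.normal(size=n)                                       # confounder
smoke = (rng.normal(size=n) + 0.5 * conf > 0).astype(float)     # exposure
gest = 39 - 0.5 * smoke + rng.normal(0, 1.5, size=n)            # intervening variable
bwt = (3300 - 150 * smoke + 100 * (gest - 39) + 50 * conf
       + rng.normal(0, 300, size=n))

# True direct effect by construction: -150 g.
# True total effect by construction: -150 + 100 * (-0.5) = -200 g.
total = coef([smoke, conf], bwt)[1]          # mediator left out
direct = coef([smoke, conf, gest], bwt)[1]   # mediator included
print(total, direct)
```

Leaving the intervening variable out recovers the total effect; adding it removes the part of the effect that runs through gestation length.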

Knowledge check: this section

1. What does a significant interaction term (β₃) in a regression model indicate?

The two predictor variables are highly collinear The main effects model fits the data better The effect of one predictor on the outcome depends on the level of the other predictor

A significant interaction term indicates that the association between one predictor and the outcome varies across levels of the other predictor. The main effects (additive) model is no longer sufficient to describe the relationship.

2. When estimating the total causal effect of an exposure, what should you do with intervening variables?

Include them in the model to control for their effects Exclude them from the model Use them as the outcome variable instead

Including intervening variables in the model adjusts away part of the causal effect that operates through the mediator. To estimate the total effect, intervening variables should be excluded; including them gives the direct effect only.

3. What tool is recommended before building a multivariable model to help distinguish confounders from intervening variables?

A causal diagram A correlation matrix A residual plot

A causal diagram (directed acyclic graph) is very helpful for identifying which variables are confounders (should be included), which are intervening variables (may need to be excluded for total effects), and which are effects of the outcome (should not be included).

Reflection

Consider an exposure–outcome relationship you are interested in. Draw (or describe) a causal diagram identifying potential confounders and intervening variables. How would the choice of which variables to include affect your estimate of the causal effect?

Model answer: For an exposure (say, dietary fibre intake) and outcome (incident type-2 diabetes), the DAG is fibre → diabetes, with confounders age, sex, SES, BMI at baseline, physical activity, smoking, and family history (all pointing into both fibre intake and diabetes). Intervening variables on the causal pathway: HbA1c, insulin sensitivity, and weight change during follow-up. Effect of variable inclusion: (a) adjusting for confounders gives an unbiased estimate of the causal effect of fibre on diabetes (the total effect), provided the DAG is correct; (b) adjusting for an intervening variable (e.g., HbA1c) gives the direct effect bypassing that mediator, but blocks the indirect effect, attenuating the total causal effect estimate; (c) adjusting for a collider (e.g., a hospitalisation event affected by both fibre and diabetes status) introduces selection bias. The DAG makes these distinctions explicit; routine "control for everything" defaults blur them and produce biased estimates.

HSCI 410 · Lesson 3
