Linear Regression
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Identify when least squares regression is an appropriate analytical tool
- Construct a linear model with control of confounding and identification of interaction
- Interpret regression coefficients from both technical and causal perspectives
- Convert nominal, ordinal, or continuous predictors into indicator variables
- Assess model assumptions including linearity, homoscedasticity, and normality of residuals
- Detect and address collinearity among predictor variables
- Identify study designs that require a time-series approach to analysis
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Regression Analysis
Why Linear Regression?
Up to this point, most examples of relating an outcome to an exposure have been based on qualitative outcome variables—that is, variables that are categorical or dichotomous. Linear regression is suitable for modelling the outcome when it is measured on a continuous or near-continuous scale. Examples include birth weight, blood pressure, body mass index, and disease frequency at a regional level.
In regression analysis, the relationship between the outcome and the predictors is asymmetric—we think the value of the outcome is caused by (or we wish to predict it by) the value of another variable (the predictor). Using X-variables to predict Y does not necessarily imply causation; we might just be estimating predictive associations.
The Simple Regression Model
When only one predictor variable is used, the model is called a simple regression model. The term “model” denotes the formal statistical formula that describes the relationship between the predictor and the outcome.
The model is written as:

Y = β0 + β1X1 + ε

In this equation, β0 is the intercept (or constant), β1 is the regression coefficient for X1, and ε is the error term. The errors are assumed to be normally and independently distributed (ε ~ N(0, σ²)). We estimate these errors by residuals—the differences between the observed values and the values predicted by the model.
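The least squares estimates for the simple model have a closed form. A minimal sketch, using small illustrative numbers standing in for gestation length and birth weight (not the bw5k data):

```python
# Closed-form least squares for the simple model Y = b0 + b1*X + e.
# Hypothetical data standing in for gestation (weeks) and birth weight (gm).
x = [36, 37, 38, 39, 40, 41, 42]
y = [2600, 2800, 3000, 3200, 3350, 3500, 3650]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²;  b0 = ȳ − b1*x̄
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# The residuals estimate the error term; with an intercept they sum to zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b1, 1), round(b0, 1))
```

The intercept here falls far outside the range of the data, which is why scaling (discussed later in this lesson) is often useful.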
The Multivariable Model
Almost without exception, the regression models used by epidemiologists will contain more than one predictor variable. These are known as multiple regression or multivariable models.
Multivariate indicates 2 or more outcome variables; multivariable denotes more than 1 predictor. In epidemiology, we almost always mean multivariable models.
With two predictors, the model becomes Y = β0 + β1X1 + β2X2 + ε. A major difference from simple regression is that β1 now estimates the effect of X1 on Y after controlling for the effect of X2. This is the key advantage of multivariable analysis—it accounts for confounding by extraneous variables.
In observational studies, incorporating more than one predictor almost always leads to a more complete understanding of how the outcome varies, and it decreases the chance that the regression coefficients for exposures of interest are biased by confounding variables. The βs are not biased by any variable included in the equation, but they can be biased if confounding variables are omitted from the equation.
Assuming we have not included intervening variables or effects of the outcome in our model, the βs are not confounded by any variable in the regression equation. However, from a causal perspective, if intervening variables are included, the coefficients do not estimate the causal effect. One can never be sure that there are no important unmeasured confounders that were omitted from the model.
A major trade-off in model-building is to avoid omitting necessary confounding variables while not including variables of little importance. Including too many unimportant variables increases the number of βs estimated and may lead to poor performance of the equation on future datasets. Also, having to measure unnecessary variables increases the cost of future work.
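The consequence of omitting a confounder can be sketched with simulated data. The causal structure assumed below (X2 causes both X1 and Y, with a true effect of X1 on Y of 2.0) is hypothetical, chosen only to illustrate the bias:

```python
import numpy as np

# Sketch: omitting a confounder biases the exposure coefficient.
# Assumed structure: x2 -> x1 and x2 -> y; true effect of x1 on y is 2.0.
rng = np.random.default_rng(0)
n = 10_000
x2 = rng.normal(size=n)                        # confounder
x1 = 0.8 * x2 + rng.normal(size=n)             # exposure, influenced by x2
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)   # outcome

def ols(cols, y):
    """Least-squares coefficients; intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_crude = ols([x1], y)[1]      # confounder omitted: biased upward
b_adj = ols([x1, x2], y)[1]    # confounder included: near the true 2.0
print(round(b_crude, 2), round(b_adj, 2))
```

The crude coefficient absorbs part of the confounder's effect; including x2 in the equation recovers the effect of x1 alone.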
Section 1 Knowledge Check
1. What type of outcome variable is linear regression most suitable for?
2. In the equation Y = β0 + β1X1 + ε, what does β1 represent?
3. What is the key advantage of a multivariable regression model over a simple regression model?
Reflection
Think of a continuous outcome variable in your field of interest. What predictors would you include in a regression model? How would you decide which variables are confounders versus intervening variables?
Hypothesis Testing & Effect Estimation
The ANOVA Table
The idea behind regression is that information in the X-variables can be used to predict the value of Y. The formal way this is approached is to ascertain how much of the sums of squares (SS) of Y we can explain with knowledge of the X-variable(s).
| Source | Sums of Squares | df | Mean Square | F-test |
|---|---|---|---|---|
| Model (regression) | SSM = Σ(Ŷᵢ − Ȳ)² | dfM = k | MSM = SSM/dfM | MSM/MSE |
| Error (residual) | SSE = Σ(Yᵢ − Ŷᵢ)² | dfE = n − (k + 1) | MSE = SSE/dfE | |
| Total | SST = Σ(Yᵢ − Ȳ)² | dfT = n − 1 | MST = SST/dfT | |
Here, k is the number of predictor variables in the model (not counting the intercept). When each SS is divided by its degrees of freedom (df), the result is a mean square—denoted MSM (model), MSE (error), and MST (total). The MSE is our estimate of the error variance σ², and its square root, the root MSE, is also called the standard error of prediction.
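The decomposition SST = SSM + SSE can be verified numerically. A minimal sketch for a simple regression (k = 1) with illustrative data:

```python
# ANOVA decomposition for a simple regression fit (illustrative data).
x = [36, 37, 38, 39, 40, 41, 42]
y = [2510, 2790, 3050, 3180, 3360, 3490, 3700]

n, k = len(x), 1
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSM = sum((yh - ybar) ** 2 for yh in yhat)             # model SS
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual SS
SST = sum((yi - ybar) ** 2 for yi in y)                # total SS

MSM = SSM / k               # dfM = k
MSE = SSE / (n - (k + 1))   # dfE = n − (k + 1); estimates sigma^2
F = MSM / MSE
root_mse = MSE ** 0.5       # the standard error of prediction
print(round(F, 1), round(root_mse, 1))
```

Note that SSM + SSE reproduces SST exactly (up to floating-point error), which is what lets the F-test partition the variation of Y.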
Assessing the Significance of a Linear Regression Model
We use the F-test from the ANOVA table to assess whether the predictors collectively have a statistically significant relationship with the outcome. The null hypothesis is H0: β1 = β2 = … = βk = 0.
A simple linear regression model with birth weight (-bwt-) as the outcome and gestation length (-gest-) as the sole predictor was fit using the bw5k dataset (n = 5,000).
Results: F(1, 4998) = 1,790.09, P < 0.0001, R² = 0.2637. The coefficient for -gest- is 124.5 gm per week (95% CI: 118.7–130.3), meaning for each additional week of gestation, birth weight increases by approximately 124.5 gm.
Testing Individual Regression Coefficients
A t-test with n−(k+1) degrees of freedom is used to evaluate the significance of any individual regression coefficient. The usual null hypothesis is H0: βj = 0.
There are 2 sources of variation in play: (1) variation from estimating the regression parameters (the usual SE), and (2) variation of a new observation about the regression line. A prediction interval for a new observation must incorporate both sources, and the further the value x* is from the mean of X1, the greater the variability in the prediction. A 95% confidence interval is calculated as: 95% CI = Ŷ ± t0.05 × SE.
R² (the coefficient of determination) describes the proportion of variance in the outcome “explained” by the predictor variables: R² = SSM/SST = 1 − (SSE/SST). Unfortunately, R² never decreases—and almost always increases—as variables are added to the model. The adjusted R² = 1 − (MSE/MST) penalises the number of predictors and is useful for comparing models with different numbers of variables.
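The contrast between the two measures can be sketched directly from the formulas. The SSE values below are illustrative placeholders, chosen so that the added predictors in model B are nearly useless:

```python
# Sketch: R² rises mechanically when predictors are added,
# while adjusted R² can fall. SS values are illustrative.
n = 100
SST = 500.0

def r2_pair(SSE, k):
    r2 = 1 - SSE / SST                 # R² = 1 − SSE/SST
    MSE = SSE / (n - (k + 1))
    MST = SST / (n - 1)
    adj = 1 - MSE / MST                # adjusted R² = 1 − MSE/MST
    return r2, adj

r2_a, adj_a = r2_pair(SSE=300.0, k=2)  # model A: 2 predictors
r2_b, adj_b = r2_pair(SSE=299.0, k=5)  # model B: 3 extra, nearly useless predictors
print(round(r2_a, 3), round(adj_a, 3), round(r2_b, 3), round(adj_b, 3))
```

Model B has the higher R² but the lower adjusted R², which is why adjusted R² is preferred when comparing models of different sizes.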
Sometimes it is necessary to simultaneously evaluate the significance of a group of X-variables (e.g., a set of indicator variables for a nominal variable). We compare the SSE of the full model with the SSE of the reduced model (without the group) using a partial F-test. This tells us whether the set of variables as a group contributes significantly to the model.
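The partial F-test compares the SSE of the two models relative to the MSE of the full model. A minimal sketch, with placeholder SSE values rather than output from a real fit:

```python
# Partial F-test for a group of q variables dropped from the full model.
# F = [(SSE_reduced − SSE_full) / q] / [SSE_full / (n − (k_full + 1))]
# All numbers below are illustrative placeholders.
n = 200            # observations
k_full = 6         # predictors in the full model
q = 3              # indicator variables tested as a group

SSE_full = 410.0
SSE_reduced = 455.0    # model refit without the group

F = ((SSE_reduced - SSE_full) / q) / (SSE_full / (n - (k_full + 1)))
print(round(F, 2))
```

The resulting F is compared against an F distribution with q and n − (k_full + 1) degrees of freedom.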
The F-test has a straightforward interpretation only when the X-variables are manipulated treatments in a controlled experiment. In observational studies, the F-statistic is influenced by the number of variables available, their correlations, the total number of subjects, and the method used for variable selection. Most variable selection methods tend to maximise F, meaning the observed F overestimates the actual significance of the model.
Section 2 Knowledge Check
1. What does the F-test in the ANOVA table assess?
2. What does R² (the coefficient of determination) measure?
3. Why is adjusted R² preferred over R² when comparing models with different numbers of predictors?
Reflection
Consider a regression model you have seen in published research or coursework. How would you interpret the R² value? What does a low R² mean practically, and does it necessarily indicate a poor model?
Nature of X-Variables & Collinearity
Types of Predictor Variables
The X-variables can be continuous or categorical. Categorical variables can be either nominal (levels with no meaningful numerical representation, e.g., race or city of residence) or ordinal (ordered levels, e.g., severity: low, medium, high). Nominal and ordinal variables with more than 2 levels must be converted to indicator variables before entering the regression.
Scaling Variables
Often the predictor variables have a limited range of possible or sensible values. For example, if gestation length is a predictor, the intercept reflects birth weight at 0 weeks—which is meaningless. It is useful to scale these variables by subtracting the lowest possible sensible value (or the average) before entering them into the model. This makes the intercept interpretable without changing the regression coefficient or its SE.
Example: Subtracting 39 weeks (the average gestation length) from -gest- gives gest39 = gest − 39. Now β0 reflects birth weight for a 39-week gestation (3,341 gm), a much more meaningful value than the original constant of −1,514 gm.
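The claim that centring shifts only the intercept is easy to verify. A sketch with hypothetical gestation/birth-weight pairs (not the bw5k coefficients quoted above):

```python
# Sketch: centring a predictor changes the intercept but not the slope.
# Hypothetical data for gestation (weeks) and birth weight (gm).
gest = [36, 37, 38, 39, 40, 41, 42]
bwt = [2600, 2800, 3000, 3200, 3350, 3500, 3650]

def fit(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
          / sum((a - xb) ** 2 for a in x))
    return yb - b1 * xb, b1            # (intercept, slope)

b0_raw, b1_raw = fit(gest, bwt)
b0_c, b1_c = fit([g - 39 for g in gest], bwt)   # gest39 = gest − 39

# Slope is unchanged; the centred intercept is the predicted
# birth weight at 39 weeks (b0_raw + 39 * b1_raw).
print(round(b0_raw), round(b0_c), round(b1_raw, 1), round(b1_c, 1))
```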
Regular (Disjoint) Indicator Variables
Indicator variables (also called dummy variables) are created variables whose values have no direct physical relationship to the characteristic being described. For a nominal variable with j levels, we need j − 1 indicator variables. The omitted level becomes the referent (comparison) category.
Example: For mother’s race with 3 categories, we create 2 indicator variables (X1 and X2). Race 3 (with both indicators = 0) becomes the referent. β1 estimates the difference in outcome between races 1 and 3, while β2 estimates the difference between races 2 and 3.
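With disjoint indicators for a single nominal predictor, the fitted coefficients reproduce the group means: β0 is the referent mean, and β1 and β2 are the differences from it. A sketch with made-up outcome values for 3 race categories:

```python
# Sketch: disjoint indicator coefficients equal differences in group
# means relative to the referent (race 3 here). Data are hypothetical.
data = [(1, 3100), (1, 3150), (2, 3300), (2, 3250), (3, 3000), (3, 3050)]

def mean(vals):
    return sum(vals) / len(vals)

m1 = mean([y for r, y in data if r == 1])
m2 = mean([y for r, y in data if r == 2])
m3 = mean([y for r, y in data if r == 3])   # referent category

# For a one-way layout, OLS on the two indicators gives exactly:
b0 = m3        # intercept = referent mean
b1 = m1 - m3   # race 1 vs race 3
b2 = m2 - m3   # race 2 vs race 3
print(b0, b1, b2)
```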
Hierarchical (Incremental) Indicator Variables
If the predictor variables are ordinal in type (reflecting relative changes in an underlying characteristic), hierarchical indicator variables are often preferred. These contrast the outcome in each level against the level immediately preceding it (assuming all hierarchical variables are in the model).
Example: For mother’s education (4 levels), the disjoint indicators compare each level to the lowest (baseline). The hierarchical indicators instead show: the coefficient for level 4 reflects the difference between level 3 (some college) and level 4 (university degree), showing the incremental effect of each step up in education.
| Variable | Disjoint (Indicator) Coefficient | Hierarchical Coefficient |
|---|---|---|
| meduc_c4=2 (high school diploma) | 20.046 | 20.046 |
| meduc_c4=3 (some college) | 53.270 | 33.224 |
| meduc_c4=4 (university degree) | 80.599 | 27.329 |
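The two coding schemes carry the same information: each hierarchical coefficient is the difference between successive disjoint coefficients, which can be checked against the table above:

```python
# Sketch: hierarchical coefficients are successive differences of the
# disjoint coefficients (values taken from the table above, in gm).
disjoint = {2: 20.046, 3: 53.270, 4: 80.599}   # each vs level 1 (referent)

hierarchical = {2: disjoint[2],                 # level 2 vs level 1
                3: disjoint[3] - disjoint[2],   # level 3 vs level 2
                4: disjoint[4] - disjoint[3]}   # level 4 vs level 3
print({k: round(v, 3) for k, v in hierarchical.items()})
```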
Detecting Highly Correlated (Collinear) Variables
If the predictor variables are too highly correlated, a number of problems arise. The estimated effect of each variable depends on the other predictors in the model. With highly correlated predictors, the βs will be highly and negatively correlated, and in extreme cases none of the individual coefficients will be significantly different from zero despite a significant overall F-test.
When a quadratic term (-gest_sq-) was added to a model already containing -gest-, the correlation between the two was 0.99, giving a VIF of 131. The SE of -gest- increased over 11 times (from 2.94 to 32.99). Centring -gest- by subtracting 39 (the mean) reduced the VIF from 131 to just 1.54 and the SE back down to 3.58.
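The mechanism is easy to reproduce. With only two predictors, each VIF is 1/(1 − r²), where r is their correlation. A sketch with an illustrative range of gestation values (not the bw5k data, so the VIFs differ from the figures quoted above):

```python
# Sketch: VIF for a predictor and its quadratic term, before and after
# centring. With two predictors, VIF = 1 / (1 − r²). Data illustrative.
gest = list(range(30, 45))   # hypothetical gestation lengths, mean 37

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def vif(a, b):
    return 1 / (1 - corr(a, b) ** 2)

raw = vif(gest, [g ** 2 for g in gest])                          # huge
centred = vif([g - 37 for g in gest], [(g - 37) ** 2 for g in gest])  # ~1
print(round(raw, 1), round(centred, 2))
```

Centring makes the linear and quadratic terms nearly orthogonal (exactly so for a symmetric range), which is why the VIF collapses toward 1.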
Section 3 Knowledge Check
1. For a nominal variable with 4 categories, how many indicator (dummy) variables are needed?
2. What does a VIF value greater than 10 suggest?
3. What is the primary purpose of centring a continuous variable before adding it to a regression model?
Reflection
Why might highly correlated predictor variables cause problems in a multivariable regression model? What strategies would you use to detect and address collinearity in your own analyses?
Interaction & Causal Interpretation
Detecting and Modelling Interaction
Given the component cause model, we might expect to see interaction when 2 factors act synergistically or antagonistically. In previous sections, models contained only main effects—assuming the association of X1 to Y is the same at all levels of X2. An interaction term tests whether the effect of one variable depends on the level of another.
With an interaction term, the model becomes Y = β0 + β1X1 + β2X2 + β3X1X2 + ε, and we assess interaction by testing whether β3 = 0. If the interaction is absent (i.e., β3 is not significantly different from 0), the main effects (additive) model is deemed adequate. If the interaction is needed, centring becomes useful because it allows us to interpret β1 and β2 as the linear effects of each variable when the centred version of the other variable is zero.
Example 14.9: The dichotomous versions of maternal weight gain (wtgain_c2: <30 lb vs ≥30 lb) and total birth order (tbo_c2: primiparous vs multiparous) were evaluated. The main effects model showed both factors were significant. Adding the interaction term (wg_c2*tbo_c2) revealed a significant interaction (β3 = −88.4, P = 0.010).
This means the positive effect of multiparous birth on birth weight is present if weight gain is low, but is negligible if weight gain is high. Similarly, high weight gain has a bigger effect in primiparous births (227 gm) than in multiparous births (139 gm).
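The structure of such a model can be sketched with simulated data that echo the pattern in the example (the true coefficients below are invented for illustration, not estimates from the actual dataset):

```python
import numpy as np

# Sketch: main effects plus an interaction between two dichotomous
# predictors. Simulated data with invented coefficients echoing the
# example: the effect of one factor shrinks when the other is present.
rng = np.random.default_rng(1)
n = 4000
wg = rng.integers(0, 2, n)    # 1 = high weight gain
tbo = rng.integers(0, 2, n)   # 1 = multiparous

# Assumed truth: bwt = 3000 + 220*wg + 90*tbo − 88*wg*tbo + noise
bwt = 3000 + 220 * wg + 90 * tbo - 88 * wg * tbo + rng.normal(0, 100, n)

X = np.column_stack([np.ones(n), wg, tbo, wg * tbo])
b0, b1, b2, b3 = np.linalg.lstsq(X, bwt, rcond=None)[0]

# Effect of tbo when wg = 0 is b2; when wg = 1 it is b2 + b3.
print(round(b2), round(b2 + b3))
```

Reading the fit this way (b2 versus b2 + b3) is exactly the logic used in the interpretation above: the effect of one factor must be reported separately at each level of the other.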
Interactions involving categorical variables (with more than 2 levels) are modelled by including products between all indicator variables needed in the main effects model. For example, the interaction between a 3-level and a 4-level categorical variable requires (3−1) × (4−1) = 6 product variables. These 6 variables should be tested and explored as a group using the partial F-test.
In many multivariable analyses, the number of possibilities for interaction is large and there is no single correct way to assess if interaction is present. Unless the potential number of interactions is small, interactions should be limited to those of biological relevance. It is generally recommended that 3- and 4-way interactions only be investigated when there are good, biologically sound reasons for doing so.
Causal Interpretation of a Multivariable Linear Model
So far, we have focused on the technical interpretation of regression coefficients. When making causal inferences, extra care is needed to ensure that only the appropriate variables are included in the analysis. A causal diagram is very helpful in this regard.
If a variable is an intervening variable (on the causal pathway between exposure and outcome), including it in the model will change the interpretation. For example, if gestation length is an intervening variable between cigarette smoking and birth weight, including -gest- in the model adjusts away part of the causal effect of smoking. The total effect of smoking would be obtained from a model without -gest-, while the direct effect (not mediated through gestation) would require including it.
Our objective is to evaluate the effects of cigarette smoking (-cig-) on birth weight (-bwt-). The causal diagram indicates that gestation length (-gest-) is an intervening variable between -cig- and -bwt-. Consequently, -gest- and -wtgain- should be excluded from the model when estimating the total causal effect of smoking on birth weight.
The model includes: -white- (potential confounder), -college- (potential confounder), and -cig_2- (the exposure of interest). The interaction between -cig_2- and -white- was assessed.
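The total-versus-direct distinction can be sketched with simulated data under an assumed causal diagram (cig → gest → bwt, plus a direct cig → bwt path). All coefficients below are invented for illustration, not estimates from the bw5k data:

```python
import numpy as np

# Sketch: total vs direct effect when a mediator is excluded/included.
# Assumed diagram: cig -> gest -> bwt, and cig -> bwt directly.
rng = np.random.default_rng(2)
n = 20_000
cig = rng.integers(0, 2, n)                    # exposure (0/1)
gest = 39 - 0.5 * cig + rng.normal(0, 1, n)    # mediator (weeks)
bwt = 3200 - 100 * cig + 120 * (gest - 39) + rng.normal(0, 150, n)
# Assumed truth: direct effect −100; total effect −100 + 120*(−0.5) = −160.

def coef_of_cig(*extra):
    X = np.column_stack([np.ones(n), cig, *extra])
    return np.linalg.lstsq(X, bwt, rcond=None)[0][1]

total = coef_of_cig()         # mediator excluded -> total causal effect
direct = coef_of_cig(gest)    # mediator included -> direct effect only
print(round(total), round(direct))
```

Including the mediator "adjusts away" the portion of the smoking effect transmitted through gestation length, which is why -gest- is excluded when the total effect is the target.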
Section 4 Knowledge Check
1. What does a significant interaction term (β3) in a regression model indicate?
2. When estimating the total causal effect of an exposure, what should you do with intervening variables?
3. What tool is recommended before building a multivariable model to help distinguish confounders from intervening variables?
Reflection
Consider an exposure–outcome relationship you are interested in. Draw (or describe) a causal diagram identifying potential confounders and intervening variables. How would the choice of which variables to include affect your estimate of the causal effect?
Lesson 2 — Comprehensive Assessment
This final assessment covers all material from this lesson. You must answer all 15 questions correctly (100%) and complete the final reflection to finish the lesson.
Final Reflection
Reflect on the full chapter. How does linear regression differ from the categorical outcome methods you have previously studied? In what situations would you choose linear regression, and what are the key assumptions you need to verify before trusting your results?
Final Assessment (15 Questions)
1. Linear regression is most appropriate when the outcome variable is:
2. In the simple regression model Y = β0 + β1X1 + ε, what does β0 represent?
3. Residuals in a regression model are:
4. In a multivariable model, β1 represents the effect of X1 on Y:
5. The root MSE (root mean square error) in a regression model is:
6. The null hypothesis for the overall F-test in regression is:
7. If R² = 0.26 in a regression model, what can we conclude?
8. Why should we use adjusted R² rather than R² when comparing models?
9. To code a nominal variable with 5 categories for regression, you would create:
10. What is the purpose of scaling a predictor variable (e.g., subtracting the mean)?
11. A VIF of 1.0 for a predictor indicates that:
12. In the model Y = β0 + β1X1 + β2X2 + β3X1*X2 + ε, a significant β3 indicates:
13. Including an intervening variable in a causal regression model will:
14. The regression coefficient for a dichotomous predictor (coded 0/1) represents:
15. Ignoring measurement errors in predictor variables tends to: