# Lesson 3 — Linear Regression (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,098 words • ~27.6 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 3, Linear Regression. This is the first regression model in the curriculum, and honestly it's the workhorse of quantitative epidemiology. If you understand linear regression deeply, every other model you meet in this course is a variation on the same idea.

**Sarah:** Let me set the stakes for that. Why do we need regression at all? In the earlier material, you saw measures of association like risk ratios, odds ratios, mean differences. Those are great when you have a single exposure and a single outcome. But the real world is messier. You almost always have multiple variables that move together. Age. Sex. Body mass index. Smoking. Income. They're all entangled. Regression is the tool we use to pull the threads apart.

**Kiffer:** Right. And linear regression specifically is for continuous outcomes. By continuous I mean a variable that can take a wide range of numeric values on a more or less smooth scale. Birth weight in grams. Systolic blood pressure in millimeters of mercury. Body mass index. Cholesterol. Test scores. Anything where the outcome is a number, not a yes-or-no, and where the gaps between values are meaningful.

**Sarah:** And we'll talk later in the course about what to do when the outcome is binary, or a count, or a time-to-event. Those need different models, generalized linear models, logistic regression, Poisson, Cox. But linear regression is the parent model. The conceptual scaffolding for everything else.

**Kiffer:** Three sections today. First, simple linear regression with a single predictor. We'll spend real time on this because every piece of vocabulary we develop here transfers directly to the more complicated models. Second, multiple linear regression, with more than one predictor. That's where regression becomes a tool for adjustment, for controlling confounding. And third, diagnostics. The checks that tell you whether your model is actually trustworthy.

**Sarah:** Let's start with section one. Simple linear regression.

**Kiffer:** The basic model. I'll say it in symbols and then in plain words. The symbolic version is, Y equals beta zero plus beta one times X plus epsilon.

**Sarah:** And in plain words, that's saying the outcome equals an intercept plus a slope times the predictor plus an error term. Five pieces. Let's walk through each one because every piece of regression notation in this course descends from this equation.

**Kiffer:** Yeah. Let's anchor it with an example we'll carry through the whole episode. Birth weight as a function of gestational age. Birth weight in grams is our outcome, the Y. Gestational age in weeks, how long the pregnancy lasted before delivery, is our predictor, the X. Babies who stay in the womb longer tend to weigh more at birth. So we expect a positive relationship.

**Sarah:** And this is a real example from the lesson. The textbook fits exactly this model on a dataset of five thousand births. So we're using a setup that real epidemiologists have actually run.

**Kiffer:** Okay. Piece one. Y. The outcome. In our example, that's birth weight in grams for a particular baby. Each baby in the dataset has their own value of Y.

**Sarah:** Piece two. X. The predictor, sometimes called the explanatory variable or the exposure. In our example, that's gestational age in weeks. Each baby also has a value of X.

**Kiffer:** Piece three. Beta zero. Beta is a Greek letter. We pronounce it bay-tah. Beta zero is the intercept. The expected value of Y when X equals zero. In words, the expected birth weight of a baby with a gestational age of zero weeks.

**Sarah:** Which, of course, is meaningless. There is no such baby. Zero weeks of gestation isn't a person. So even though the intercept is mathematically required for the line to exist, the literal interpretation can be nonsensical. We'll come back to this when we talk about centering predictors.

**Kiffer:** Piece four. Beta one. The slope. Also called the regression coefficient. This is the expected change in Y for a one-unit increase in X. In our example, the expected change in birth weight in grams for each additional week of gestation.

**Sarah:** And this is the piece you usually care about. Beta one is the answer to the question, how does the outcome change as the predictor changes? The intercept anchors the line in space. The slope tells you the direction and steepness of the relationship. The slope is the science.

**Kiffer:** And piece five. Epsilon. The Greek letter for E. Pronounced ep-si-lon. The error term. Epsilon represents everything that affects birth weight that isn't gestational age. Maternal nutrition, genetics, smoking, the baby's sex, hundreds of other factors we haven't measured. All of that is bundled into epsilon.

**Sarah:** And we make an assumption about epsilon. We assume the errors are random, with a mean of zero, and approximately Normally distributed with some constant variance. We write that error variance as sigma squared, using the Greek letter sigma. Don't worry too much about the formal assumption right now. The point is that epsilon is the wiggle. The piece of the outcome that the model can't predict.

**Kiffer:** So putting it all together. Birth weight equals intercept plus slope times gestational age plus error. The model says, on average, a baby's weight at birth is some baseline number plus a fixed amount per week of gestation, with random variation on top.
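Written out in symbols, the model and the error assumption described above look roughly like this. The subscript $i$ indexing individual babies is notation added here for clarity; the episode states the equation without it.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Here $Y_i$ is birth weight in grams, $X_i$ is gestational age in weeks, and $\sigma^2$ is the constant error variance.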

**Sarah:** Now how do we actually get values for beta zero and beta one? We have a dataset of five thousand babies. We know each baby's birth weight and gestational age. How do we draw the best line through that cloud of points?

**Kiffer:** This is where ordinary least squares comes in. Ordinary least squares is the estimation method. The standard abbreviation is OLS, but I want to keep saying the full phrase because the words tell you what the method actually does.

**Sarah:** Ordinary least squares finds the values of beta zero and beta one that minimize the sum of squared residuals.

**Kiffer:** Let's unpack that phrase. A residual is the difference between an observed value of Y and the value the line predicts for that same X. So for a baby who weighed three thousand four hundred grams at thirty nine weeks, if our line predicts three thousand three hundred fifty grams at thirty nine weeks, the residual is fifty grams. The line under-predicted by fifty.

**Sarah:** Now do that for every baby in the dataset. Five thousand residuals. Some positive, where the line under-predicted. Some negative, where it over-predicted. We square each one, which makes them all positive and penalizes big misses more than small ones. Then we add up all five thousand squared residuals. That sum is what we want to make as small as possible.

**Kiffer:** And least squares is the method that finds the unique line that minimizes that sum. There's calculus behind it. Software does the work. But the intuition is simple. We're picking the line that comes closest, in a squared-distance sense, to all the data points at once.
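A minimal sketch of the least-squares idea described above, in plain Python rather than the R the lesson uses. The five babies below are invented for illustration and are not the textbook's data; `fit_line` and `sum_squared_residuals` are hypothetical helper names.

```python
# Toy data: gestational age (weeks) and birth weight (grams) for five
# hypothetical babies. These numbers are invented for illustration.
weeks = [36, 37, 38, 39, 40]
grams = [2900, 3100, 3150, 3400, 3450]

def fit_line(x, y):
    """Closed-form ordinary least squares for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum over all points of (observed minus predicted), squared."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_line(weeks, grams)
ssr_best = sum_squared_residuals(weeks, grams, b0, b1)

# Any other line does worse. Here is one with a slightly flatter slope.
ssr_other = sum_squared_residuals(weeks, grams, -1740, 130)

print(f"intercept = {b0:.1f}, slope = {b1:.1f} g/week")
print(f"SSR at the OLS line: {ssr_best:.0f}; at the other line: {ssr_other:.0f}")
```

On this toy data the fitted intercept is negative, which echoes the earlier point that the intercept at zero weeks has no literal interpretation.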

**Sarah:** Once you have your estimates, the next question is how confident you are in them. This is where statistical inference enters. Three pieces of inference go with every regression coefficient. The standard error. The confidence interval. And the t-test.

**Kiffer:** The standard error of beta one tells you the precision of the slope estimate. It's basically the typical wiggle you'd expect in beta one if you ran the same study on a different sample drawn from the same population. Small standard error means you can pin the slope down precisely. Big standard error means the data don't tightly constrain it.

**Sarah:** And from the standard error you can build a 95 percent confidence interval. The estimate plus or minus one point nine six times the standard error. That gives you a range of plausible values for the true slope. If the textbook example gives a coefficient of one hundred twenty four point five grams per week with a confidence interval running from one hundred eighteen point seven to one hundred thirty point three, that's saying we're 95 percent confident the true effect of an additional week of gestation is somewhere in that interval.

**Kiffer:** And the third inferential piece is the t-test of the slope against zero. The null hypothesis is that beta one equals zero, meaning gestational age has no effect on birth weight. The alternative is that beta one is not zero. The t-statistic is the estimate divided by its standard error. If that ratio is large in absolute value, the p-value is small, and we reject the null.

**Sarah:** In the birth weight example, the t-statistic is enormous and the p-value is well below zero point zero zero zero one. The slope of one hundred twenty four point five grams per week is wildly statistically distinguishable from zero. Which, frankly, isn't surprising. Of course gestational age affects birth weight. But the formal test is still important because in many studies the answer isn't obvious.
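The three inferential pieces can be tied together with the numbers quoted above. This is a sketch under two assumptions that the episode does not state: that the interval was built as the estimate plus or minus 1.96 standard errors, and that a Normal approximation to the t distribution is adequate at this sample size. The standard error is backed out from the interval because it is not quoted directly.

```python
from statistics import NormalDist

estimate = 124.5                  # slope quoted in the episode, grams per week
ci_low, ci_high = 118.7, 130.3    # 95% interval quoted in the episode

# Back out the standard error, assuming the interval is estimate +/- 1.96 * SE.
z_crit = 1.96
std_error = (ci_high - ci_low) / (2 * z_crit)

# t-statistic against the null hypothesis that the slope is zero.
t_stat = estimate / std_error

# Two-sided p-value, using the Normal approximation to the t distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"SE ~ {std_error:.2f}, t ~ {t_stat:.1f}, p below 0.0001: {p_value < 0.0001}")
```

A t-statistic in the low forties is what "enormous" means here: the estimate sits dozens of standard errors away from zero.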

**Kiffer:** Now let's talk about R squared. R-squared is a single number that summarizes how much of the variation in the outcome the model accounts for. It's pronounced R-squared, sometimes called the coefficient of determination.

**Sarah:** R-squared ranges from zero to one. An R-squared of zero means the model explains none of the variance in Y. An R-squared of one means it explains all of it, which essentially never happens with real data. An R-squared of zero point two six, like in the birth weight example, means about 26 percent of the variation in birth weight is accounted for by gestational age. The remaining 74 percent is residual. Other factors. Random noise. Things we haven't modeled.

**Kiffer:** And here's where I want to be really careful. R-squared is not a measure of model quality on its own. Students sometimes get fixated on R-squared like it's a grade. It isn't.

**Sarah:** Why not?

**Kiffer:** Two reasons. First, you can have a low R-squared and still have a real, important effect. Suppose you're studying whether a public health intervention reduces blood pressure. Your R-squared might only be zero point zero five because blood pressure varies for hundreds of reasons. But the slope on the intervention might be exactly the size you wanted, and statistically significant, and policy-relevant. Low R-squared, real finding.

**Sarah:** And second, you can have a high R-squared on a model that's substantively wrong. If you put predictors in your model that don't belong, like mediators or proxies for the outcome itself, you can drive R-squared up while making the causal interpretation worse. R-squared rewards explanatory power, but it doesn't ask whether the explanation is the right one.

**Kiffer:** So treat R-squared as a piece of context, not a verdict. The slope, the confidence interval, the diagnostics, and the causal logic all matter more for the question of whether your model is doing useful work.
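For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch in plain Python, using five invented data points (not the textbook's data, so the result will not match the 0.26 quoted above):

```python
# Toy data, invented for illustration: gestational age (weeks), birth weight (grams).
weeks = [36, 37, 38, 39, 40]
grams = [2900, 3100, 3150, 3400, 3450]

n = len(weeks)
x_bar = sum(weeks) / n
y_bar = sum(grams) / n

# Ordinary least squares slope and intercept for one predictor.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, grams))
         / sum((x - x_bar) ** 2 for x in weeks))
intercept = y_bar - slope * x_bar

# Residual variation (what the line misses) versus total variation around the mean.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weeks, grams))
ss_tot = sum((y - y_bar) ** 2 for y in grams)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared = {r_squared:.3f}")
```

The toy R-squared is high only because five hand-picked points sit close to a line; it says nothing about whether the model is the right one.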

**Sarah:** Okay. That's section one. Simple linear regression. Y equals beta zero plus beta one X plus epsilon. The slope is the change in Y per unit change in X. We estimate it by ordinary least squares, minimizing the sum of squared residuals. The standard error gives you precision. The confidence interval gives you a range of plausible values. The t-test gives you a p-value against the null of zero. R-squared tells you how much variance is explained, but isn't a verdict on model quality.

**Kiffer:** Let's move on to section two, multiple linear regression.

**Sarah:** Multiple linear regression is the same logic with more than one predictor. The model becomes Y equals beta zero plus beta one X1 plus beta two X2 plus and so on, plus epsilon. As many predictors as you want. Each one gets its own slope coefficient.

**Kiffer:** Quick terminology note that comes up in the lesson. Multiple, with an L-E ending, refers to the number of predictors. Multivariable, the term epidemiologists prefer, also refers to multiple predictors. Multivariate, with an A-T-E ending, technically refers to multiple outcomes. Most of what we do in this course is multivariable, even though people use the words loosely.

**Sarah:** And the interpretation of each beta is where things get really interesting. In multiple regression, beta one is the expected change in Y for a one-unit increase in X1, holding all other predictors constant.

**Kiffer:** That phrase, holding all other predictors constant, is doing enormous work. Let me unpack it.

**Sarah:** Yeah. Suppose we extend the birth weight example. Now we model birth weight as a function of gestational age, maternal age, smoking status, and prenatal care. Four predictors. The slope on gestational age, beta one, now means the expected change in birth weight per additional week of gestation, holding maternal age, smoking, and prenatal care constant.

**Kiffer:** In other words, we're asking, if we compare two babies who are identical on smoking, maternal age, and prenatal care, but differ in gestational age by one week, how different is their expected birth weight? That's beta one in the multiple regression model. Same idea for each other coefficient. Each beta gives you the partial effect of its predictor at fixed values of the others.

**Sarah:** And this is exactly the move that makes regression a tool for adjustment. Adjustment for confounding. The thing you've been hearing about since the earlier material. If your exposure of interest is, say, smoking, and a known confounder is maternal age, you put both in the model. The slope on smoking now estimates the smoking effect adjusted for maternal age.

**Kiffer:** Which is a huge deal. In the simpler descriptive measures from the earlier material, you can stratify by a confounder, calculate the measure of association within each stratum, and pool. That works for one or two confounders. But once you have five, ten, fifteen variables you want to adjust for, stratification becomes impractical because the strata get too small to estimate anything. Regression handles all of it in one model.
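A small sketch of what adjustment does, in plain Python. Everything here is an assumption made for illustration: the eight rows are invented, the outcome is built with no noise as 3400 - 200 × smoker + 100 × older mother, and `ols` is a hypothetical helper that solves the normal equations by Gauss-Jordan elimination. It is not the lesson's R code.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Each row must already include a leading 1.0 for the intercept.
    """
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting, then clear this column from every other row.
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * w for v, w in zip(a[k], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

# Invented data: (smoker, older_mother). Smokers are mostly the younger mothers.
people = [(0, 0)] + [(1, 0)] * 3 + [(0, 1)] * 3 + [(1, 1)]
weight = [3400 - 200 * s + 100 * m for s, m in people]

crude = ols([[1.0, s] for s, m in people], weight)
adjusted = ols([[1.0, s, m] for s, m in people], weight)

print(f"crude smoking slope:    {crude[1]:.1f} g")
print(f"adjusted smoking slope: {adjusted[1]:.1f} g")
```

With these made-up numbers the crude slope overstates the smoking effect (about -250 g) because smokers also tend to be in the lower-weight age group; adding the confounder recovers the -200 g that was built in.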

**Sarah:** But, and this is a big but, regression adjustment isn't magic. It's only as good as the variables you decide to include. And here's where the directed acyclic graph framework comes in. The directed acyclic graph, abbreviated DAG, is a diagram that maps the assumed causal relationships among your variables. Exposure here. Outcome there. Arrows between them showing what affects what.

**Kiffer:** We covered DAGs earlier. The key idea, and the reason they matter for regression, is that the DAG tells you which variables you should adjust for and which you shouldn't. Not every variable in your dataset is a confounder. Some are mediators. Some are colliders. And adjusting for the wrong type of variable does damage.

**Sarah:** Let me define the three roles. A confounder is a variable that affects both the exposure and the outcome through pathways other than the exposure-outcome path itself. Maternal age affects smoking, and maternal age affects birth weight through other channels. Adjusting for confounders is the whole reason we run multiple regression.

**Kiffer:** A mediator, sometimes called an intervening variable, is a variable on the causal path between the exposure and the outcome. Smoking causes reduced gestational age, which causes lower birth weight. Gestational age is a mediator between smoking and birth weight. If you adjust for it, you block part of the effect you're trying to estimate. You'd get the direct effect of smoking on birth weight, the part not going through gestation. But you'd lose the total effect.

**Sarah:** And a collider is a variable that the exposure and the outcome both cause. The arrows from exposure and outcome both point into the collider. Adjusting for a collider, by including it in the regression, opens a non-causal pathway that was blocked before. It can induce an association between exposure and outcome where there wasn't one. That's collider bias.
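Collider bias can be shown with a deliberately extreme toy case in plain Python. The data are invented: exposure and outcome each take the values 0, 1, 2 in every combination, so they are exactly unrelated, and the collider is simply their sum. `ols` is a hypothetical helper that solves the normal equations.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Each row must already include a leading 1.0 for the intercept.
    """
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * w for v, w in zip(a[k], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

# Every (exposure, outcome) pair once: no association between the two by design.
grid = [(x, y) for x in range(3) for y in range(3)]
outcome = [y for x, y in grid]

# The collider is caused by both exposure and outcome (here, their sum).
unadjusted = ols([[1.0, x] for x, y in grid], outcome)
collider_adjusted = ols([[1.0, x, x + y] for x, y in grid], outcome)

print(f"exposure slope without the collider: {unadjusted[1]:+.2f}")
print(f"exposure slope with the collider:    {collider_adjusted[1]:+.2f}")
```

The exposure has no relationship with the outcome until the collider enters the model, at which point a slope of -1 appears out of nothing. Real colliders are noisier, but the direction of the problem is the same.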

**Kiffer:** So the DAG-based discipline is, draw your causal diagram first, before you build the regression. Identify the confounders. Adjust for them. Identify the mediators. Don't adjust for them, unless you specifically want to estimate the direct effect rather than the total effect. Identify the colliders. Don't adjust for them under any circumstances.

**Sarah:** And this is one of the most important methodological habits for an epidemiologist to develop. The garbage-in-garbage-out problem in regression isn't just about bad data. It's about bad variable selection. A regression with the wrong adjustment set can give you a confidently wrong answer, with tight confidence intervals and small p-values. The numbers look great. The answer is misleading.

**Kiffer:** The next caveat is the linearity assumption. Linear regression assumes the relationship between each predictor and the outcome is, well, linear. A straight line. The same change per unit no matter where you are on the predictor's range.

**Sarah:** And that's a strong assumption. Plenty of real-world relationships aren't linear. The classic example is alcohol and mortality. Light drinkers actually have lower mortality than non-drinkers in many studies. Heavy drinkers have much higher mortality. The shape is a J. Or maybe a U. The lowest risk is somewhere off zero.

**Kiffer:** If you fit a straight line through that J-shape, the slope can come out anywhere depending on the range of alcohol intake in your sample. You might conclude alcohol has no effect, or even a protective effect, when really the relationship is just non-linear and your linear model is averaging over the curve in a misleading way.

**Sarah:** There are fixes. You can include a quadratic term, the predictor squared, to let the model bend. In our running example that would be gestational age squared. You can use spline functions for more flexible shapes. You can transform the predictor or the outcome to a log scale. We'll touch on these again at the end of the episode under diagnostics. The general lesson is that linearity is an assumption, not a fact, and you have to check it.
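A sketch of how a straight line can mislead on a J-shaped relationship, and how a quadratic term helps. The five points are invented (built as 10 + (x - 1)², not real mortality data), and `ols` is a hypothetical helper that solves the normal equations in plain Python.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Each row must already include a leading 1.0 for the intercept.
    """
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * w for v, w in zip(a[k], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

# Invented J-shape: risk dips at one drink per day, then climbs.
drinks = [0, 1, 2, 3, 4]
risk = [10 + (d - 1) ** 2 for d in drinks]   # 11, 10, 11, 14, 19

# Straight line over the full range, and over light drinkers only.
line_full = ols([[1.0, d] for d in drinks], risk)
line_light = ols([[1.0, d] for d in drinks[:3]], risk[:3])

# Adding the squared term lets the fitted curve bend.
quadratic = ols([[1.0, d, d ** 2] for d in drinks], risk)

print(f"linear slope, full range:     {line_full[1]:+.2f}")
print(f"linear slope, 0 to 2 drinks:  {line_light[1]:+.2f}")
print("quadratic coefficients:", [round(b, 2) for b in quadratic])
```

The same underlying curve gives a linear slope of zero in one sample and a clearly positive slope in another, purely because of the range of intake covered.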

**Kiffer:** Alright, the next move is categorical predictors.

**Sarah:** Right. Many predictors aren't continuous. Race. Smoking status. Education level. Region. These are categorical variables. They have a finite number of unordered or ordered levels. You can't just plug a category like Black, White, or Asian into a regression equation as a number, because regression expects numeric input.

**Kiffer:** The solution is dummy coding. For a categorical variable with k levels, you create k minus one binary indicator variables. Each indicator is zero or one. One level is left out as the reference category. The coefficients on the indicators tell you the difference between each level and the reference.

**Sarah:** Concrete example. Suppose education has four levels. Less than high school. High school diploma. Some college. University degree. That's k equals four. So we create three dummy indicators. The omitted level, say less than high school, becomes the reference. The coefficient on the high-school-diploma indicator is the average difference in the outcome between people with high school and people with less than high school. The coefficient on the some-college indicator is the difference between some college and less than high school. And so on.

**Kiffer:** The choice of reference category is arbitrary in a mathematical sense but matters for interpretation. Pick a reference level that gives you meaningful comparisons. Often you pick the most common category, or a baseline that other categories are naturally contrasted against.
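Dummy coding for the four-level education example can be sketched in a few lines of plain Python. `dummy_code` is a hypothetical helper written for this illustration; statistical packages normally do this step automatically.

```python
levels = [
    "Less than high school",
    "High school diploma",
    "Some college",
    "University degree",
]
reference = "Less than high school"

def dummy_code(value, levels, reference):
    """Return k - 1 zero/one indicators, one for each non-reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels if level != reference]

for person in ["Some college", "Less than high school", "University degree"]:
    print(person, "->", dummy_code(person, levels, reference))
```

The reference level is the row of all zeros, which is why its expected outcome ends up in the intercept and every other coefficient reads as a difference from it.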

**Sarah:** And the textbook covers a couple of refinements. Hierarchical or incremental indicators where you compare each level to the one immediately below, instead of all to a common baseline. Useful for ordinal variables where the steps matter. We won't go deep on that here, but know it exists.

**Kiffer:** And the last big topic in section two is interactions.

**Sarah:** An interaction, sometimes called effect modification, is when the effect of one predictor depends on the level of another. The multiple regression model we've described so far assumes the slope on X1 is the same regardless of X2. The interaction model allows the slope on X1 to change with X2.

**Kiffer:** You model an interaction by including a product term. You take X1 times X2 and add it as another predictor. The model becomes Y equals beta zero plus beta one X1 plus beta two X2 plus beta three times X1 times X2 plus epsilon. Beta three is the interaction coefficient. If beta three is significantly different from zero, the effect of X1 on Y depends on the value of X2.
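A sketch of the product-term model in plain Python. The data are invented and noise-free, built as 3000 + 20 × X1 + 50 × X2 + 10 × X1 × X2, where X1 stands in for weight gain and X2 for a first-birth indicator. `ols` is a hypothetical helper that solves the normal equations.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Each row must already include a leading 1.0 for the intercept.
    """
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * w for v, w in zip(a[k], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

# Invented grid: x1 = weight gain (arbitrary units), x2 = first birth (0/1).
pairs = [(x1, x2) for x1 in range(4) for x2 in range(2)]
y = [3000 + 20 * x1 + 50 * x2 + 10 * x1 * x2 for x1, x2 in pairs]

# The product x1 * x2 enters the model as one more predictor column.
b0, b1, b2, b3 = ols([[1.0, x1, x2, x1 * x2] for x1, x2 in pairs], y)

slope_when_x2_is_0 = b1
slope_when_x2_is_1 = b1 + b3
print(f"slope on x1 when x2 = 0: {slope_when_x2_is_0:.1f}")
print(f"slope on x1 when x2 = 1: {slope_when_x2_is_1:.1f}")
```

Once the product term is in the model, beta one is no longer "the" effect of X1. It is the slope on X1 when X2 equals zero, and beta three is how much that slope shifts per unit of X2.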

**Sarah:** The textbook example is maternal weight gain interacting with whether the mother is having her first baby or a later baby. The effect of weight gain on birth weight is bigger in first-time mothers than in mothers who've given birth before. So weight gain and first-birth status interact.

**Kiffer:** Interactions are powerful because they let regression capture the kind of effect modification you saw earlier. They're also dangerous because the number of possible interaction terms grows fast. With ten predictors, you have forty-five possible two-way interactions. The lesson's advice is to limit interactions to those with a biological or theoretical rationale, not to fish for them.

**Sarah:** Okay. Section two summary. Multiple linear regression has multiple predictors. Each beta is interpreted holding all other predictors constant. This is the move that makes regression a tool for adjustment. The DAG tells you what to adjust for. Linearity is an assumption you have to check. Categorical predictors get dummy coded with one level as reference. Interactions are modeled as product terms when biological reasoning suggests the effect of one predictor depends on another.

**Kiffer:** That brings us to section three, diagnostics.

**Sarah:** And this is where I think a lot of regression analysis falls down in practice. People fit the model, read off the coefficients, write up the paper. They never check whether the model's assumptions are actually satisfied. Diagnostics are how you check.

**Kiffer:** Linear regression rests on four core assumptions. Linearity. Independence. Homoscedasticity. And normality of residuals. Let's go through each one and then talk about how you check it.

**Sarah:** Assumption one. Linearity. The relationship between each predictor and the outcome is linear, holding others constant. We just talked about this.

**Kiffer:** Assumption two. Independence. The observations are independent of each other. Each baby's birth weight is its own data point, not influenced by any other baby in the dataset. Independence is mostly a study-design property. If you sampled a clustered population like classrooms or hospitals, independence is violated and you need a different model. Mixed-effects models, for example.

**Sarah:** Assumption three. Homoscedasticity. That's a mouthful. It comes from Greek and means same scatter. Homo for same. Scedastic from a root meaning to scatter. Sometimes called homogeneity of variance. The assumption is that the variance of the residuals is constant across the range of fitted values. The scatter around the line should be the same whether you're at low values of X or high values of X.

**Kiffer:** When the residual scatter changes across the range of predictions, that's heteroscedasticity. Hetero meaning different. And it's bad news for inference. Your standard errors get distorted. Your confidence intervals are wrong. The slope estimate itself is still unbiased, but the precision claims around it aren't trustworthy.

**Sarah:** Assumption four. Normality of residuals. The error terms are approximately Normally distributed with that constant variance we just talked about. The Normal distribution is the bell curve. Symmetric, with most of the mass clustered around the middle and progressively less out in the tails.

**Kiffer:** In moderate to large samples, normality of residuals matters less than it sounds. The central limit theorem tends to rescue your inference even when the residuals aren't strictly Normal. But severe departures, especially heavy tails or strong skew, can still cause problems.

**Sarah:** Now how do you check these assumptions? The standard answer is diagnostic plots. There are four classic ones, and in R you get all four with a single line of code. You fit the model, then call the plot function on the fitted object. Four plots come up. Let's walk through each.

**Kiffer:** Diagnostic plot one. Residuals versus fitted. On the X axis, the fitted values, what the model predicts. On the Y axis, the residuals, what's left over. What should it look like?

**Sarah:** Random scatter around the horizontal line at zero. No pattern. No curve. No fanning out or funneling in. If you see a curve, that's a signal of non-linearity. The model's straight line isn't capturing the shape. If you see fanning, where the residuals get bigger as the fitted values get bigger, that's heteroscedasticity. The variance is changing with the prediction.

**Kiffer:** Diagnostic plot two. The quantile-quantile plot of residuals. Often called the Q-Q plot for short, but I'll keep saying quantile-quantile for clarity. This plot compares the actual residuals to what you'd expect if they were perfectly Normally distributed.

**Sarah:** On the X axis, the theoretical quantiles from a Normal distribution. On the Y axis, the actual quantiles of your residuals. If your residuals are Normal, the points fall on a straight diagonal line. If the points bend up above the line at the upper end, you have a heavy upper tail. If they bend down below the line at the lower end, a heavy lower tail. If both happen, so the points trace an S shape, you have heavy tails on both sides.
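The coordinates behind a quantile-quantile plot can be computed by hand. This plain-Python sketch makes two simplifying assumptions: the plotting positions are (i - 0.5) / n, which is one common convention and not necessarily what R uses, and the residuals are standardized by their sample standard deviation. The five residuals are invented, with one large positive value.

```python
from statistics import NormalDist, mean, stdev

# Invented residuals with one unusually large positive value.
residuals = [-1.0, -0.5, 0.0, 0.5, 4.0]
n = len(residuals)

# Y axis: the observed residuals, standardized and sorted.
m, s = mean(residuals), stdev(residuals)
observed = sorted((r - m) / s for r in residuals)

# X axis: Normal quantiles at plotting positions (i - 0.5) / n.
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for t, o in zip(theoretical, observed):
    print(f"theoretical {t:+.2f}   observed {o:+.2f}")
```

The largest observed value sits well above its theoretical counterpart, which is the "bends above the line at the upper end" pattern that signals a heavy upper tail.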

**Kiffer:** Diagnostic plot three. The scale-location plot. Sometimes called the spread-location plot. The square root of the absolute standardized residuals on the Y axis. The fitted values on the X axis. What you want is a flat horizontal trend, ideally with random scatter.

**Sarah:** If the line slopes upward, the residual variance is growing with the prediction. That's a clearer way to see heteroscedasticity than the residuals-versus-fitted plot, because it removes the sign of the residual and just looks at the magnitude.

**Kiffer:** Diagnostic plot four. Residuals versus leverage. This one identifies influential observations. Points that are pulling the regression line around more than their fair share.

**Sarah:** Leverage is a measure of how unusual the predictor values are for that observation. A point with extreme X values has high leverage. Influence is leverage combined with how far the point is from the line. A high-leverage point that sits exactly on the line has little influence. A high-leverage point that's far from the line has a lot.

**Kiffer:** The standard summary statistic for influence is Cook's distance. Named after Dennis Cook, the statistician who developed it in 1977. Cook's distance combines leverage and residual size into a single number. Values above zero point five are typically flagged as worth investigating. Above one point zero is a serious flag.
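Leverage and Cook's distance can be computed directly for a one-predictor model. This plain-Python sketch uses the standard formulas for simple regression (leverage as 1/n plus the squared distance of x from its mean over the total, and Cook's distance from the residual, the leverage, and the residual variance). The six data points are invented, with the last one placed far out on x and well off the trend.

```python
# Invented data: five points on a clean line, plus one extreme, off-trend point.
x = [1, 2, 3, 4, 5, 10]
y = [2, 4, 6, 8, 10, 5]
n, p = len(x), 2          # p counts the fitted coefficients: intercept and slope

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
resid_var = sum(e ** 2 for e in residuals) / (n - p)

# Leverage: how unusual each x value is relative to the others.
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Cook's distance: residual size and leverage combined into one number.
cooks = [(e ** 2 / (p * resid_var)) * h / (1 - h) ** 2
         for e, h in zip(residuals, leverage)]

for xi, h, d in zip(x, leverage, cooks):
    print(f"x = {xi:>2}   leverage = {h:.2f}   Cook's D = {d:.2f}")
```

Only the last point crosses the flagging thresholds mentioned above. Note that it has also dragged the fitted slope far below the trend of the other five points, which is exactly the behavior this plot is meant to expose.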

**Sarah:** And the practical use of this plot is to catch a single observation that's driving your conclusions. If removing one or two points changes the slope substantially, you need to think hard about why. Is the data point a coding error? Is it a real but extreme observation? Is it a different kind of person who maybe shouldn't be in the analysis? You don't automatically delete influential points. You investigate them.

**Kiffer:** Now what do you do if your diagnostics flag a problem? Three broad strategies.

**Sarah:** First. Transformations. If the residuals show a curved pattern or growing variance, sometimes a log transformation of the outcome fixes both at once. Take the log of birth weight instead of birth weight itself. Or a square-root transformation, which is gentler than a log. The transformation can pull a curved but steadily increasing relationship into something approximately linear and stabilize the variance. It won't straighten a J-shape that changes direction. For that you need the quadratic or spline terms we mentioned earlier.
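A sketch of what a log transformation of the outcome can do, in plain Python. The data are invented and noise-free: the outcome grows by 50 percent per unit of x, which is curved on the raw scale and exactly linear on the log scale. `fit_line` is a hypothetical helper.

```python
import math

def fit_line(x, y):
    """Closed-form ordinary least squares for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Invented outcome that multiplies by 1.5 for each one-unit step in x.
x = [0, 1, 2, 3, 4]
y = [100 * 1.5 ** xi for xi in x]

# Raw scale: a straight line leaves a curved residual pattern.
b0, b1 = fit_line(x, y)
raw_residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Log scale: the same data fall on a straight line.
log_y = [math.log(yi) for yi in y]
c0, c1 = fit_line(x, log_y)
log_residuals = [ly - (c0 + c1 * xi) for xi, ly in zip(x, log_y)]

print("raw residuals:", [round(e, 2) for e in raw_residuals])
print("log residuals:", [round(e, 6) for e in log_residuals])
print(f"log-scale slope = {c1:.4f}  (log of 1.5 = {math.log(1.5):.4f})")
```

The raw residuals run positive, negative, positive across the range, which is the curve you would see in a residuals-versus-fitted plot. After the transformation the slope is interpreted on the log scale, as a proportional rather than an absolute change.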

**Kiffer:** Second. Robust regression methods. These are alternatives to ordinary least squares that are less sensitive to outliers and to violations of the Normal-error assumption. Methods like M-estimators down-weight observations with large residuals so they don't dominate the fit. Quantile regression, which models the median or other quantiles instead of the mean, is another robust option.

**Sarah:** And third, the most general fix. Generalized linear models with appropriate link functions. Generalized linear models, abbreviated GLM, extend the linear regression framework to outcomes that aren't continuous and Normally distributed. Logistic regression for binary outcomes uses a logit link. Poisson regression for counts uses a log link. We'll cover GLMs in the next several lessons. For now, just know that when your outcome really doesn't fit the linear-Normal mold, you don't force it. You pick the right model for the outcome.

**Kiffer:** And one related point. If your independence assumption is violated because of clustering or repeated measures, you need a model that accounts for that structure. Mixed-effects models. Generalized estimating equations. Time-series methods if you have observations over time on the same units. The lesson flags this as the natural place where linear regression hands off to more elaborate frameworks.

**Sarah:** Let's pull the takeaways together. There's a lot here, and a beginning epidemiology student should walk away with a few clear ones.

**Kiffer:** Yeah. Let me list them. First takeaway. Linear regression is the workhorse for continuous outcomes. The model is, outcome equals intercept plus slope times predictor plus error. The slope is the change in the outcome per one-unit increase in the predictor. That's almost always what you care about.

**Sarah:** Second. Estimation by ordinary least squares. The line that minimizes the sum of squared residuals. Inference proceeds through the standard error of the slope, the 95 percent confidence interval, and the t-test against zero. R-squared tells you how much variance is explained, but it's not a verdict on model quality.

**Kiffer:** Third. Multiple linear regression generalizes the model to multiple predictors. Each beta is interpreted holding all other predictors constant. This is the move that makes regression a tool for adjustment. Including a confounder gives you the exposure effect adjusted for that confounder.

**Sarah:** Fourth. The directed acyclic graph is your guide to which variables to include. Confounders, yes. Mediators, only if you specifically want the direct effect rather than the total effect. Colliders, never. Adjusting for the wrong variables can produce confidently wrong answers.

**Kiffer:** Fifth. The linearity assumption is real. Non-linear relationships, like the J-shape from alcohol and mortality, can mislead a linear model. Categorical predictors need dummy coding with one reference level. Interactions are modeled as product terms when biology suggests the effect of one predictor depends on another.

**Sarah:** Sixth. Four assumptions. Linearity. Independence. Homoscedasticity. Normality of residuals. You check them with diagnostic plots. Residuals versus fitted. Quantile-quantile plot of residuals. Scale-location plot. Residuals versus leverage with Cook's distance for influential observations.

**Kiffer:** And seventh. When assumptions are violated, you have options. Transformations like log or square root can rescue linearity and homoscedasticity. Robust regression methods are less sensitive to outliers and non-Normal errors. Generalized linear models with appropriate link functions handle outcomes that aren't continuous and Normally distributed. Mixed-effects and time-series models handle non-independence.

**Sarah:** And one more practical recommendation. Don't skip the diagnostic plots. The time it takes to fit a model is two seconds in R. The time it takes to look at the four diagnostic plots is another two seconds. There is no excuse for skipping it. Most regression mistakes in published research would be caught if the authors actually looked at their residuals.

**Kiffer:** And one habit I'd encourage. Before you fit any regression, draw the directed acyclic graph by hand. On paper. Even for a simple analysis. The discipline of asking, what causes what, and which arrows do I need to block, will save you from a lot of bad analysis. The DAG comes before the regression. Always.

**Sarah:** Next up is Lesson 4. Model Building Strategies. How to systematically build a regression model when you have many candidate predictors. The DAG-based, theory-first approach you previewed today is the answer.

**Kiffer:** Take care, everyone.

**Sarah:** See you in the next episode.
