# Lesson 4 — Model Building Strategies (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5234 words • ~28.3 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 4, Model Building Strategies. This is the lesson where we step back from any one regression technique and ask the bigger question. When you have a dataset and dozens of candidate predictors, how do you decide which ones go into the model?

**Sarah:** And the answer this lesson keeps coming back to is, it depends on what you're trying to do. Model building isn't a single recipe. The strategy you pick depends on the goal of your analysis.

**Kiffer:** Exactly. And the lesson is really sharp on this point. There are three broad goals you might have when you fit a regression model. Description. Prediction. Or causal inference. Each one calls for a different approach. If you mix them up, you can do everything technically right and still produce a misleading result.

**Sarah:** Let's define those three goals carefully, because they sound similar but they're really different.

**Kiffer:** Description is the simplest. You just want to summarize patterns in your data. What's the average birth weight by maternal age group? How does smoking prevalence vary across provinces? You're not trying to predict the future and you're not trying to estimate a causal effect. You're describing what's there. The bar for a descriptive model is mostly that the summary be honest and the patterns be reported transparently.

**Sarah:** Prediction is the next goal. You want to estimate an outcome for a new individual or a future observation. Given everything I know about this patient, what's their predicted risk of a heart attack in the next ten years? You don't necessarily care which variables are doing the work. You care about getting the prediction right. The bar for a predictive model is out-of-sample accuracy.

**Kiffer:** And causal inference is the third goal. You want to estimate the effect of a specific exposure on an outcome. Does smoking cause low birth weight, and if so, by how much? You care a lot about which variables are doing the work, because the whole point is to isolate the effect of that one exposure. The bar for a causal model is unbiased estimation of the effect of the exposure on the outcome, with confounding controlled.

**Sarah:** And the model-building strategy genuinely differs across these three goals. Let me try to make that concrete. For prediction, you want any variable that improves out-of-sample accuracy, even if you have no idea why it works. For causal inference, you want exactly the variables that block confounding pathways, and you specifically don't want variables that lie on the causal pathway between your exposure and your outcome.

**Kiffer:** And for description, you mostly want a parsimonious model that summarizes the data well, without overinterpreting any single coefficient. So three goals, three strategies. The lesson walks through four model-building strategies that map onto these goals. Let's go through them one at a time.

**Sarah:** Let's start with strategy one. Purposeful selection for causal inference. This is the recommended approach when your goal is to estimate the effect of an exposure on an outcome.

**Kiffer:** And the heart of purposeful selection is that you don't let the data choose your variables. You let your subject-matter knowledge choose them. Specifically, you start by drawing a directed acyclic graph, which we abbreviate as DAG. We covered DAGs earlier, but let me re-define them quickly for anyone who needs the refresher.

**Sarah:** A directed acyclic graph is a diagram of the causal relationships you believe exist among your variables. Each variable is a node. Each arrow is a hypothesized causal effect, pointing from cause to effect. Acyclic means no variable can cause itself through a loop. You draw the graph based on what you know from theory, prior literature, and biological reasoning, before you ever look at the data.

**Kiffer:** Right. And the DAG helps you identify three categories of variables that matter for causal inference. Confounders, mediators, and colliders.

**Sarah:** Let me define each. A confounder is a variable that causes both your exposure and your outcome. Income could be a confounder for the smoking-and-birth-weight question, because income affects both whether someone smokes and the birth weight of their baby. Confounders open backdoor pathways that produce spurious associations, so you need to adjust for them by including them in the model.

**Kiffer:** A mediator is a variable that lies on the causal pathway between your exposure and your outcome. Smoking might cause reduced placental blood flow, which then reduces birth weight. Placental blood flow is the mediator. The exposure causes the mediator, and the mediator causes the outcome. If you condition on a mediator when your goal is the total effect, you block the indirect pathway and you understate the effect of the exposure.

**Sarah:** And a collider is a variable that is caused by both your exposure and your outcome, or by two other variables on relevant paths. Conditioning on a collider opens a non-causal pathway and can introduce bias even when none existed before. Colliders are the trickiest of the three because the bias is created by the act of adjusting, not by the act of leaving them out.

**Kiffer:** And the rule for purposeful selection is, include confounders, exclude mediators, exclude colliders. That sounds simple, but it requires you to know which variable is which, which requires you to think carefully about the causal structure before you fit anything.

**Sarah:** And this is why the lesson keeps emphasizing subject-matter knowledge. The DAG is not something a statistical algorithm can generate for you. It comes from your understanding of the system. Two researchers looking at the same dataset can draw different DAGs, and that's not a bug, that's the discipline showing where the assumptions live.

**Kiffer:** Let me make this concrete with the example from the lesson. Suppose you want to study the effect of cigarette smoking on birth weight. You also have data on the mother's race, education level, total birth order, gestation length, number of babies born, and weight gain during pregnancy.

**Sarah:** The DAG would tell you that gestation length and weight gain are intervening variables. They lie on the causal pathway between smoking and birth weight. Smoking affects how long the pregnancy lasts and how much weight the mother gains, which then affects birth weight.

**Kiffer:** And if your goal is the total effect of smoking on birth weight, you would not include gestation length or weight gain in the model. Including them would block the indirect effect that passes through them, and you'd underestimate the total impact of smoking. The number you report would be the effect of smoking that is not mediated through gestation length or weight gain, which is a much smaller number and a much more confusing thing to interpret.
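
One way to see why adjusting for a mediator shrinks the estimate is a small worked equation. It assumes a simple linear system with one mediator and no confounding, which is a simplification of the lesson's example. The symbols `X` (smoking), `M` (a mediator such as gestation length), `Y` (birth weight), and the coefficients `a`, `b`, and `c'` are illustrative and do not come from the lesson.

```latex
\begin{aligned}
M &= a\,X + \varepsilon_M, \\
Y &= c'\,X + b\,M + \varepsilon_Y, \\
\text{total effect of } X \text{ on } Y &= c' + a\,b, \\
\text{coefficient on } X \text{ after adjusting for } M &= c'.
\end{aligned}
```

When the indirect part `a*b` has the same sign as `c'`, the adjusted coefficient is smaller in magnitude than the total effect, which is the underestimate Kiffer describes.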

**Sarah:** But race and education might be confounders. They could each affect both the likelihood of smoking and the baby's birth weight. Those should be retained in the model, regardless of whether they reach statistical significance in your sample.

**Kiffer:** And that point is really important. In purposeful selection for causal inference, you keep confounders in the model based on prior knowledge, not based on a P-value. A confounder that happens to look statistically non-significant in your particular sample is still doing the job of blocking a backdoor path. Dropping it because the P-value is greater than zero point zero five would reintroduce the bias you were trying to control.
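
The include-confounders, exclude-mediators rule can be sketched in a few lines of Python. This is a minimal illustration, not a full backdoor-criterion check: the arrows in `dag` are one hypothetical reading of the smoking example, and the function names `descendants` and `classify` are made up for this sketch.

```python
# Hypothetical DAG for the smoking and birth-weight example.
# Keys are causes, values are their direct effects; every node appears as a key.
dag = {
    "race": ["smoking", "birth_weight"],
    "education": ["smoking", "birth_weight"],
    "smoking": ["gestation_length", "weight_gain", "birth_weight"],
    "gestation_length": ["birth_weight"],
    "weight_gain": ["birth_weight"],
    "birth_weight": [],
}


def descendants(graph, node, skip=None):
    """Return every variable reachable from `node`, never passing through `skip`."""
    seen = set()
    stack = [n for n in graph.get(node, []) if n != skip]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(n for n in graph.get(current, []) if n != skip)
    return seen


def classify(graph, exposure, outcome):
    """Label each remaining variable by its role relative to exposure and outcome."""
    roles = {}
    downstream_of_exposure = descendants(graph, exposure)
    downstream_of_outcome = descendants(graph, outcome)
    for var in graph:
        if var in (exposure, outcome):
            continue
        if var in downstream_of_exposure and outcome in descendants(graph, var):
            roles[var] = "mediator"      # on the pathway: exclude for the total effect
        elif var in downstream_of_exposure and var in downstream_of_outcome:
            roles[var] = "collider"      # common effect: exclude
        elif (exposure in descendants(graph, var)
              and outcome in descendants(graph, var, skip=exposure)):
            roles[var] = "confounder"    # causes both: include
        else:
            roles[var] = "other"         # needs case-by-case judgment
    return roles


roles = classify(dag, "smoking", "birth_weight")
adjustment_set = sorted(v for v, r in roles.items() if r == "confounder")
print(roles)
print("Adjust for:", adjustment_set)
```

Note that no P-value appears anywhere in the sketch. The adjustment set comes entirely from the arrows you drew.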

**Sarah:** Purposeful selection also covers another decision. The functional form of each predictor. By functional form I mean, does the predictor enter the model linearly, or as a polynomial, or as a spline, or as categories?

**Kiffer:** And again, you let subject-matter knowledge guide that choice. If you know from biology that the relationship between maternal age and birth weight is roughly U-shaped, with worse outcomes for very young and very old mothers, you don't fit a straight line. You use a flexible functional form like a restricted cubic spline that can capture the curve. We'll come back to splines in detail in a few minutes.

**Sarah:** Okay, on to strategy two. Stepwise selection, which covers forward selection, backward elimination, and bidirectional stepwise procedures. These are automated algorithms that let the data choose which variables to include based on statistical criteria.

**Kiffer:** Forward selection starts with no predictors and adds them one at a time, choosing whichever variable most improves the model at each step. It stops when no remaining variable meets some inclusion threshold, like a P-value of zero point one or zero point one five.

**Sarah:** Backward elimination does the opposite. It starts with all candidate predictors in the model and removes them one at a time, dropping whichever variable contributes the least at each step. It stops when every remaining variable meets the retention criterion. Backward is generally considered better than forward, because each variable is evaluated in the context of all the others.

**Kiffer:** And stepwise selection alternates between forward and backward steps. Add a variable, then check whether any previously added variable should be dropped, then look for the next variable to add, and so on. It keeps going until the model stabilizes.

**Sarah:** These methods are still common in software packages and you'll see them in the older literature. But the lesson is clear that stepwise selection is discouraged for causal inference, and there are very good reasons.

**Kiffer:** Three big problems. First, inflated Type I error. By Type I error I mean a false positive, declaring an effect exists when it really doesn't. When you test many candidate variables and pick the ones that look significant, you're effectively running many hypothesis tests. The probability that at least one will look significant by chance balloons. The published P-value next to a variable that survived stepwise selection is not the P-value you think it is.
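
The arithmetic behind that first problem can be sketched as follows. The sketch assumes the tests are independent and that every null hypothesis is true. Stepwise tests are not independent, so treat the numbers as a rough illustration. The function name `familywise_error_rate` is made up for this example.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests: {rate:.3f}")
```

Under these assumptions, twenty candidate tests at the 0.05 level give roughly a 64 percent chance of at least one false positive.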

**Sarah:** Second, unstable selection. If you bootstrap your data, drawing slightly different samples and rerunning the stepwise procedure, you often get very different final models. That's a sign that the selected variables aren't really the ones the data are pointing to. They're the ones that happened to win in your particular sample, and a slightly different sample could have given you a noticeably different model.

**Kiffer:** Third, biased confidence intervals and biased coefficients. The coefficients on the variables that survive the selection procedure are systematically too large in absolute value. The standard errors don't account for the fact that you went through a model search, so the reported confidence intervals are too narrow. Your inference is over-confident.

**Sarah:** And on top of all that, stepwise selection has no concept of confounding versus mediation. It can drop a known confounder just because the P-value happens to fall above the threshold, leaving you with a model that fails to control for the confounding pathway you were trying to block. It can also keep a mediator, because mediators are usually highly predictive of the outcome, which biases your estimate of the exposure effect.

**Kiffer:** So the modern guidance is, for causal inference, don't use pure stepwise selection. Use purposeful selection, guided by your DAG and subject-matter knowledge. Stepwise might still play a small role for screening within a wider strategy, but it shouldn't be the strategy.

**Sarah:** That brings us to strategy three. Penalized regression for prediction. This is the modern toolkit for predictive models when you have many candidate predictors and you want a model that performs well on new data.

**Kiffer:** The big idea behind penalized regression is that ordinary linear regression fits the coefficients by minimizing the sum of squared residuals. That's just the prediction error on the training data. With many predictors, that minimization can over-fit the noise, and the resulting coefficients perform badly on new data.

**Sarah:** Penalized regression adds a penalty to that minimization. You're now minimizing prediction error plus a penalty that grows with the size of the coefficients. The model has to pay a cost for making coefficients large, which keeps the model from fitting noise. You're trading a little bit of bias for a lot less variance.

**Kiffer:** Penalized regression comes in three main flavors. LASSO, ridge regression, and elastic net. Let's spell out each name. LASSO stands for Least Absolute Shrinkage and Selection Operator. The penalty in LASSO is the sum of the absolute values of the coefficients. The key feature in plain words is, the LASSO penalty has a sharp edge that pushes some coefficients exactly to zero.

**Sarah:** And that's why LASSO is called a selection operator. By forcing some coefficients all the way to zero, it effectively drops those variables from the model. So LASSO does variable selection automatically. You give it a hundred candidate predictors, you tune the penalty, and you end up with a much smaller subset of predictors that the algorithm has decided are pulling their weight.

**Kiffer:** Ridge regression uses a different penalty. The sum of squared coefficients. This shrinks all coefficients toward zero, but unlike LASSO, it doesn't push any of them all the way to zero. Every variable stays in the model, just with a smaller coefficient than ordinary regression would give it.

**Sarah:** Ridge is especially useful when you have many predictors that are correlated with each other. Ordinary regression would give each one a wildly unstable coefficient, with huge standard errors. Ridge regression handles the correlation gracefully by shrinking the correlated coefficients together, sharing the credit between them in a stable way.

**Kiffer:** And elastic net is a compromise. It uses a weighted combination of the LASSO penalty and the ridge penalty. So you get some variable selection, like LASSO, and some graceful handling of correlated predictors, like ridge. Elastic net often performs well when LASSO and ridge would each have advantages.
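
To see the difference between the three penalties, here is a minimal sketch for the simplest possible case: a single coefficient whose unpenalized estimate is already known. That is an assumption made to keep the math in closed form. With several correlated predictors you would need an iterative solver from a statistics library. The names `lam` (penalty strength) and `mix` (share of the LASSO penalty) are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, stopping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_coefficient(ols, lam, mix):
    """Minimize 0.5*(ols - b)**2 + lam*(mix*abs(b) + 0.5*(1 - mix)*b**2) over b.

    mix=1.0 is the LASSO penalty, mix=0.0 is the ridge penalty.
    """
    return soft_threshold(ols, lam * mix) / (1 + lam * (1 - mix))


ols_estimates = [2.0, 0.3, -1.2]  # hypothetical unpenalized coefficients
lam = 0.5

lasso = [elastic_net_coefficient(b, lam, mix=1.0) for b in ols_estimates]
ridge = [elastic_net_coefficient(b, lam, mix=0.0) for b in ols_estimates]
blend = [elastic_net_coefficient(b, lam, mix=0.5) for b in ols_estimates]

print("LASSO:      ", lasso)
print("ridge:      ", ridge)
print("elastic net:", blend)
```

In this toy setup the LASSO sets the small middle coefficient to exactly zero, ridge shrinks all three but keeps every one nonzero, and the blend lands in between.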

**Sarah:** Quick rule of thumb on when to use which. If you have many predictors and you want a sparse model that uses just a few of them, LASSO. If you have many correlated predictors and you want them all in the model with stabilized coefficients, ridge. If you want a bit of both, elastic net. And if you have no idea, elastic net is often a safe default because it can behave like either of the others depending on how you tune it.

**Kiffer:** And one more thing. Penalized regression introduces a tuning parameter that controls the strength of the penalty. People sometimes call it lambda. You don't get to pick that by hand. You pick it by cross-validation, which we'll talk about in a few minutes.

**Sarah:** And strategy four. Model averaging. Instead of picking one final model and pretending it's the right one, you average predictions across multiple models, weighting each by its evidence in the data.

**Kiffer:** And the most principled version of this is Bayesian model averaging. You consider a set of plausible models. For each one you compute the probability that it's the correct model given the data. And you combine the predictions, weighting each model by its probability.

**Sarah:** When does this make sense? When there's genuine uncertainty about which model is correct. If three different specifications all look reasonable from a subject-matter standpoint, and the data don't strongly favor any one over the others, picking a single model is overconfident. Averaging across them is more honest about the uncertainty.

**Kiffer:** Bayesian model averaging is more demanding computationally and conceptually than the other strategies. You need a Bayesian framework. You need prior probabilities on the candidate models. But it's a really useful tool when model uncertainty is itself a major contributor to uncertainty in your final estimates. Reporting one model and ignoring the alternatives can hide that uncertainty.
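
The weighting step can be sketched with one common shortcut, which the lesson itself does not specify: approximate each model's posterior probability from its BIC, assuming equal prior probability for every candidate. A full Bayesian analysis would set priors explicitly. The BIC values and predictions below are hypothetical numbers chosen only to show the arithmetic.

```python
import math


def bic_model_weights(bics):
    """Approximate posterior model probabilities from BIC values.

    Assumes every candidate model has the same prior probability.
    """
    best = min(bics)
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical BIC values and predictions from three candidate models.
bics = [1002.0, 1000.0, 1004.0]
predictions = [3150.0, 3200.0, 3100.0]  # e.g. predicted birth weight in grams

weights = bic_model_weights(bics)
averaged = sum(w * p for w, p in zip(weights, predictions))

print("weights:", [round(w, 3) for w in weights])
print("averaged prediction:", round(averaged, 1))
```

The model with the lowest BIC gets the most weight, but the other two still pull the averaged prediction toward their own values.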

**Sarah:** Okay, let's pivot to a topic that comes up across all four strategies. Collinearity. We met this earlier. Let me re-introduce it because it's central to model building.

**Kiffer:** Collinearity, also called multicollinearity, is when two or more predictors in your model are highly correlated with each other. It doesn't bias your coefficients on average. But it inflates standard errors, which makes individual coefficients unstable and confidence intervals wide.

**Sarah:** The intuition is, if two predictors carry almost the same information, the regression has trouble deciding how much credit to give to each one. The total effect of the pair is well-estimated, but the individual contributions are blurry. You see this when small changes in the dataset cause coefficients to jump around dramatically, sometimes flipping sign.

**Kiffer:** And the standard diagnostic for this is the Variance Inflation Factor, abbreviated VIF. The Variance Inflation Factor for a predictor tells you how much the variance of its coefficient is inflated due to its correlation with other predictors in the model. A VIF of one means no inflation. A VIF of four means the variance is four times what it would be without collinearity.

**Sarah:** Two thresholds people commonly use. A VIF greater than ten is widely treated as a flag that demands action. A VIF greater than five deserves attention and a closer look. Below five is generally considered acceptable. These are rules of thumb, not laws of physics, but they're a useful starting point.
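
A minimal sketch of the VIF calculation and the two thresholds, under one simplifying assumption: with only two predictors, the R-squared from regressing one on the other equals their squared correlation. With more predictors you would regress each predictor on all the others. The function names and the data are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the rest."""
    return 1 / (1 - r_squared)


def vif_two_predictors(x1, x2):
    """Two-predictor special case: R^2 is the squared correlation."""
    return vif_from_r_squared(pearson_r(x1, x2) ** 2)


def vif_flag(vif):
    """Apply the rule-of-thumb thresholds quoted in the lesson."""
    if vif > 10:
        return "demands action"
    if vif > 5:
        return "deserves a closer look"
    return "generally acceptable"


# Hypothetical predictors: x2 nearly duplicates x1, x3 is unrelated to x1[:4].
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
x3 = [1, -1, -1, 1]

high_vif = vif_two_predictors(x1, x2)
low_vif = vif_two_predictors(x1[:4], x3)
print(round(high_vif, 1), vif_flag(high_vif))
print(round(low_vif, 1), vif_flag(low_vif))
```

An R-squared of 0.75 among the predictors gives a VIF of exactly four, which matches the example Kiffer gives.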

**Kiffer:** And what do you do when collinearity is a problem? Three main options the lesson highlights. One, remove redundant predictors. If two variables are essentially measuring the same thing, just drop one. Pick whichever one has the better measurement properties or fewer missing values.

**Sarah:** Two, combine correlated predictors using principal components. By that I mean Principal Components Analysis, abbreviated PCA. PCA takes a set of correlated predictors and constructs a smaller set of uncorrelated components that capture most of the variation. You use the components in the regression instead of the original variables. The cost is interpretability. A coefficient on a principal component is harder to explain than a coefficient on the original variable.

**Kiffer:** Three, use ridge regression. As we said earlier, the ridge penalty handles correlated predictors gracefully by shrinking them together. So if your goal is prediction and your only problem is collinearity, ridge can solve it without you having to drop variables.

**Sarah:** Now let's talk about nonlinear relationships. This is where the functional form of a continuous predictor stops being a straight line.

**Kiffer:** And these are everywhere in epidemiology. The classic example is the J-shaped relationship between alcohol intake and mortality. Light drinkers sometimes appear to have lower mortality than abstainers, and heavy drinkers have higher mortality. That's a curve, not a straight line. Forcing a straight line through that data would give you a misleading summary of the relationship.

**Sarah:** Three main tools for handling nonlinearity. Polynomials, splines, and categorization. Let's walk through each.

**Kiffer:** Polynomials add power terms of your predictor to the model. You include the predictor itself, then its square, optionally its cube, and so on. A quadratic model can capture a U-shaped or inverted-U relationship. A cubic model can capture more complex curves. The advantage is simplicity. You're still in the linear regression framework, you just have extra terms.

**Sarah:** But polynomials have global influence. The shape of the curve at one end of the predictor distribution depends on the data at the other end. They can also be very sensitive to extreme values. And there's the practical headache of collinearity between a predictor and its square, which is usually eased by centering the predictor before squaring, that is, subtracting the mean before raising to a power.
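
The centering trick is easy to check numerically. This sketch uses a hypothetical set of evenly spaced maternal ages. The correlation drops all the way to zero here only because this toy distribution is symmetric. With skewed data, centering reduces the correlation but usually does not remove it.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


ages = list(range(20, 41))  # hypothetical maternal ages, 20 through 40

# Correlation between the raw predictor and its square.
raw_r = pearson_r(ages, [a ** 2 for a in ages])

# Subtract the mean first, then square.
mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]
centered_r = pearson_r(centered, [c ** 2 for c in centered])

print(f"raw:      {raw_r:.3f}")
print(f"centered: {centered_r:.3f}")
```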

**Kiffer:** Splines are a different approach. You divide the range of the predictor into segments, and you fit a piecewise function across the segments, with the pieces joined smoothly at points called knots. Within each segment the relationship can curve, but the curve is local. What happens at the high end doesn't distort the fit at the low end.

**Sarah:** And the standard choice in epidemiology is the restricted cubic spline. It's a cubic spline that's constrained to be linear in the tails, beyond the outermost knots. That keeps the curve from doing wild things at the extremes where data are sparse. Restricted cubic splines with three to five knots can capture a wide range of nonlinear relationships and they're well-behaved.

**Kiffer:** Where do you put the knots? Common practice is to put them at fixed percentiles of the predictor distribution, like the tenth, fiftieth, and ninetieth percentiles for a three-knot spline. With more knots you spread them out across more percentiles. That way the knot placement reflects where your data actually live, not arbitrary cut points you imposed by hand.
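
Here is a minimal sketch of a three-knot restricted cubic spline basis with knots at the tenth, fiftieth, and ninetieth percentiles. It uses a commonly seen truncated-power form of the basis, which is an assumption on our part since the lesson does not give a formula. The ages are hypothetical, and the helper names `percentile`, `cube_plus`, and `rcs_basis` are made up for this example.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def cube_plus(value):
    """Truncated cubic: value**3 when positive, otherwise 0."""
    return value ** 3 if value > 0 else 0.0


def rcs_basis(x, knots):
    """Restricted cubic spline terms for one value of x.

    Returns len(knots) - 1 terms: x itself plus len(knots) - 2 nonlinear terms.
    """
    first, second_last, last = knots[0], knots[-2], knots[-1]
    scale = (last - first) ** 2  # keeps the nonlinear terms on roughly the scale of x
    gap = last - second_last
    terms = [x]
    for knot in knots[:-2]:
        term = (
            cube_plus(x - knot)
            - cube_plus(x - second_last) * (last - knot) / gap
            + cube_plus(x - last) * (second_last - knot) / gap
        )
        terms.append(term / scale)
    return terms


ages = list(range(15, 46))  # hypothetical maternal ages, 15 through 45
knots = [percentile(ages, p) for p in (10, 50, 90)]
print("knots:", knots)

for age in (16, 25, 35, 44):
    print(age, [round(t, 3) for t in rcs_basis(age, knots)])
```

The nonlinear term is zero below the first knot and changes at a constant rate above the last knot, which is the linear-in-the-tails restriction Sarah describes. You would pass these terms to your regression as ordinary predictors.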

**Sarah:** Categorization is the third tool, but it's also the one the lesson is most cautious about. Categorization means breaking the continuous variable into bins. Age becomes age groups. Body mass index becomes underweight, normal, overweight, obese. The model then estimates a separate coefficient for each category.

**Kiffer:** And there are several problems with categorization. First, it loses information. You're assuming everyone within a category has the same risk, which is biologically implausible. A forty-year-old and a fifty-nine-year-old don't really have the same risk just because they're both in the forty to fifty-nine age band.

**Sarah:** Second, the choice of cut points is arbitrary. If you choose them based on the data, you risk biased results. If you choose them based on convention, the conventions might not match the underlying biology. Either way, the cut points carry assumptions you usually can't justify.

**Kiffer:** Third, you're imposing a step function on a relationship that's almost certainly smooth. The biological process doesn't suddenly jump at age sixty. It changes gradually. A smooth functional form like a spline respects that, while categorization throws it away.

**Sarah:** So the lesson's recommendation is, prefer splines to categorization for nonlinear continuous predictors. Use categorization mainly when the audience really needs categorical reporting, or when you're using about five categories purely to control for confounding rather than to describe a relationship.

**Kiffer:** Okay, interactions. This is the next big modeling decision. An interaction term in regression is what we called effect modification earlier in this series. The effect of one predictor depends on the level of another.

**Sarah:** Mathematically you create an interaction by multiplying two predictors and adding the product as a new term in the model. So if your predictors are smoking and maternal age, the interaction is smoking times age. The coefficient on that product term tells you whether the effect of smoking on birth weight changes as maternal age changes.

**Kiffer:** And the lesson is really firm on one rule. Include interactions based on subject-matter rationale, not via fishing. Don't test every possible pairwise interaction across all your predictors and then report the ones that come out significant.

**Sarah:** The reason is the same as with stepwise selection. If you have ten predictors, you have forty-five possible two-way interactions. If you test all of them, you'll find some that look significant by chance. Your apparent discoveries are mostly just noise.
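
The count and the cost of fishing can be checked in a few lines. The predictor names are placeholders, and the expected number of false positives assumes none of the interactions is real and each is tested at the 0.05 level.

```python
from itertools import combinations
from math import comb

predictors = [f"x{i}" for i in range(1, 11)]  # ten hypothetical predictors
pairs = list(combinations(predictors, 2))     # every possible two-way interaction

alpha = 0.05
expected_false_positives = len(pairs) * alpha

print("two-way interactions:", len(pairs))
print("expected false positives if none are real:", expected_false_positives)
```

So even when no interaction exists, testing all forty-five would turn up about two "significant" ones on average.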

**Kiffer:** So the disciplined approach is, before you fit the model, decide which interactions you have prior reason to suspect. Maybe the effect of smoking on birth weight differs by maternal age, because pregnancy biology changes over the life course. Maybe the effect of an air pollutant differs by underlying cardiovascular health. Pre-specify those and test them. Skip the rest.

**Sarah:** And one rule about interactions that's worth saying out loud. If you include an interaction between two variables, you must also include the main effects of both. Including only the product term, without the components, gives you a model that's hard to interpret and usually misspecified.
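
The product term and the main-effects rule can be sketched as follows. The coefficients are hypothetical numbers in grams, chosen only to show the arithmetic, and the function names are made up for this example.

```python
def design_row(smoker, age):
    """Predictor row with both main effects and their product term."""
    # Order: intercept, smoking, age, smoking x age
    return [1.0, smoker, age, smoker * age]


def smoking_effect_at(age, b_smoking, b_interaction):
    """Predicted difference in outcome, smoker versus non-smoker, at a given age."""
    return b_smoking + b_interaction * age


# Hypothetical coefficients, not estimates from any real dataset.
b_smoking = -100.0
b_interaction = -5.0

print(design_row(smoker=1, age=30))
for age in (20, 35):
    print(age, smoking_effect_at(age, b_smoking, b_interaction))
```

With an interaction in the model, the coefficient on smoking alone is the effect at age zero, which is why the effect has to be reported at specific ages rather than as a single number.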

**Kiffer:** Three-way interactions, where the effect of one variable depends on the joint level of two others, are even riskier. They're hard to interpret, they need a lot of data to estimate well, and they're prone to over-interpretation. Include them only when you have strong prior reason to expect a three-way effect.

**Sarah:** Now let's talk about model evaluation. Once you've chosen a strategy and fit a model, how do you decide whether the model is any good? And how do you compare it to a competing model?

**Kiffer:** The two most common information criteria are AIC and BIC. AIC is the Akaike Information Criterion, named for the Japanese statistician Hirotugu Akaike. BIC is the Bayesian Information Criterion.

**Sarah:** Both criteria reward a model for fitting the data well and penalize it for having more parameters. The general form is the same. Take a measure of fit, which is the negative of twice the log-likelihood, and add a penalty that grows with the number of parameters in the model. Lower values are better.

**Kiffer:** The difference between AIC and BIC is in the size of the penalty. AIC uses a penalty of two times the number of parameters. BIC uses a penalty of the natural log of the sample size, times the number of parameters. For any sample bigger than about seven, the natural log of the sample size is larger than two, so BIC penalizes complexity more harshly.

**Sarah:** And what that means in practice is that BIC tends to favor more parsimonious models than AIC. If you select by BIC, you'll usually end up with fewer predictors. If you select by AIC, you'll often keep a slightly larger model.

**Kiffer:** Neither one is universally right. The lesson notes that BIC has a Bayesian interpretation. Differences in BIC between models can be translated into approximate evidence ratios. A BIC difference between zero and two is weak evidence. Two to six is positive evidence. Six to ten is strong evidence. Ten or more is very strong evidence.
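
The two penalties and the evidence bands can be sketched like this. The log-likelihoods, parameter counts, and sample size are hypothetical. How to treat a difference that lands exactly on a band boundary is our choice, since the lesson does not say.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 log-likelihood plus 2 per parameter."""
    return -2 * log_likelihood + 2 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2 log-likelihood plus ln(n) per parameter."""
    return -2 * log_likelihood + math.log(n_obs) * n_params


def bic_evidence(delta_bic):
    """Label an absolute BIC difference using the bands quoted in the lesson."""
    delta = abs(delta_bic)
    if delta < 2:
        return "weak"
    if delta < 6:
        return "positive"
    if delta < 10:
        return "strong"
    return "very strong"


# Hypothetical fits: the larger model gains 4 log-likelihood units for 3 extra parameters.
n_obs = 500
small = {"log_likelihood": -1200.0, "n_params": 4}
large = {"log_likelihood": -1196.0, "n_params": 7}

aic_small, aic_large = aic(**small), aic(**large)
bic_small, bic_large = bic(n_obs=n_obs, **small), bic(n_obs=n_obs, **large)

print("AIC prefers:", "large" if aic_large < aic_small else "small")
print("BIC prefers:", "large" if bic_large < bic_small else "small")
print("BIC evidence:", bic_evidence(bic_large - bic_small))
```

In this made-up comparison AIC picks the larger model and BIC picks the smaller one, which is the disagreement Sarah describes.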

**Sarah:** AIC is more about predictive accuracy. It's targeting the model that would predict best on new data. So if your goal is prediction, AIC has a slight philosophical lean in that direction. If your goal is selecting the most likely true model from a fixed set, BIC has the edge.

**Kiffer:** And one more important point. AIC and BIC are most useful for comparing non-nested models, where one model isn't simply a subset of another. For nested models, where one model's predictors are contained in another's, you can use a likelihood ratio test or a partial F-test to compare them, which has stronger statistical properties for that specific comparison.
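
For the nested case, a likelihood ratio test can be sketched with the standard library alone when the larger model adds exactly one parameter. That restriction is ours, made so the chi-square tail has a closed form. For more than one added parameter you would use a chi-square function from a statistics library. The log-likelihoods are hypothetical.

```python
import math


def lrt_one_added_parameter(ll_reduced, ll_full):
    """Likelihood ratio test when the full model adds exactly one parameter.

    Returns the test statistic and its p-value from a chi-square
    distribution with 1 degree of freedom.
    """
    statistic = max(0.0, 2 * (ll_full - ll_reduced))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical log-likelihoods for a reduced model and a model with one extra term.
statistic, p_value = lrt_one_added_parameter(ll_reduced=-1200.0, ll_full=-1198.08)
print(f"statistic = {statistic:.2f}, p = {p_value:.3f}")
```

A statistic near 3.84 corresponds to a p-value near 0.05 on one degree of freedom.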

**Sarah:** Now cross-validation. This is the workhorse for evaluating predictive models, and it deserves its own discussion.

**Kiffer:** The basic idea is simple. You split your data into two parts. A training set and a test set. You fit the model on the training set. Then you evaluate it on the test set, which the model has never seen. The performance on the test set is your estimate of how the model would do on truly new data.

**Sarah:** And the reason this matters is that any model fit on a dataset will do well on that dataset, almost by construction. The whole point of fitting is to minimize error on the data in front of you. So in-sample fit is a flattering measure. It tells you how well the model fits the noise in your particular sample, not how well it generalizes.

**Kiffer:** And the more flexible your model, the bigger the gap between in-sample fit and out-of-sample fit. With enough parameters, a model can fit your training data almost perfectly and still predict terribly on new data. That's overfitting. The whole point of cross-validation is to make overfitting visible.

**Sarah:** There are several flavors of cross-validation. The simplest is a single train-test split. Take eighty percent of your data for training and twenty percent for testing. Fit on training. Evaluate on test.

**Kiffer:** More efficient is k-fold cross-validation. You split the data into k equal pieces, often five or ten. You take turns using each piece as the test set while training on the other pieces. You then average the test performance across all the folds. That gives you a more stable estimate because you're using all the data for both training and testing, just at different times.
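
The k-fold procedure Kiffer describes can be sketched with nothing but the standard library. The model here is a one-predictor least-squares line, and the data are a hypothetical straight line with a fixed alternating disturbance so the result is reproducible. The function names are made up for this example.

```python
import random


def k_fold_indices(n_obs, k, seed=0):
    """Shuffle 0..n_obs-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Hypothetical data: y = 2 + 3x with an alternating +0.5 / -0.5 disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

folds = k_fold_indices(len(xs), 5)
cv_mse = cross_validated_mse(xs, ys, k=5)
print("fold sizes:", [len(f) for f in folds])
print("cross-validated MSE:", round(cv_mse, 3))
```

Every observation lands in exactly one test fold, so each one is predicted once by a model that never saw it.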

**Sarah:** And there's leave-one-out cross-validation, which is the extreme case where k equals the sample size. Each individual is used as the test set once, while the model is trained on everyone else. It's computationally expensive, and while its error estimate is nearly unbiased, that estimate can be quite variable.

**Kiffer:** Cross-validation is also how you tune the penalty parameter in penalized regression. You try different values of the penalty, evaluate each by cross-validation, and pick the value that gives the best out-of-sample performance. That's the modern recipe for predictive modeling.
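
That tuning recipe can be sketched for the simplest ridge model: one predictor, no intercept, with data assumed to be centered already. The simulated data, the candidate grid, and the function names are all hypothetical. A real analysis would use a library implementation with many predictors.

```python
import random


def ridge_slope(xs, ys, lam):
    """One-predictor ridge without intercept: minimizes sum((y - b*x)**2) + lam*b**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=5):
    """Average held-out squared error across k folds for one penalty value."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # data are already in random order
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        slope = ridge_slope(train_x, train_y, lam)
        squared = [(ys[i] - slope * xs[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Hypothetical simulated data with a modest true slope and plenty of noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(30)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]  # candidate penalty strengths
errors = {lam: cv_error(xs, ys, lam) for lam in grid}
best_lam = min(errors, key=errors.get)

for lam in grid:
    print(f"lambda = {lam:>5}: CV error = {errors[lam]:.3f}")
print("chosen lambda:", best_lam)
```

The penalty is chosen by held-out error, not by fit to the training data, which would always favor no penalty at all.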

**Sarah:** And one really important distinction. Cross-validation is essential for predictive models. For causal inference models, cross-validation is less central, because you're not necessarily aiming for predictive accuracy. You're aiming for an unbiased causal estimate, which is a different target. A causal model that has poor out-of-sample prediction can still be the right answer, if the variables are right and the structure is right.

**Kiffer:** Let's pull the takeaways together. There's a lot in this lesson, so let's keep them concrete.

**Sarah:** Takeaway one. The goal of the analysis dictates the model-building strategy. Description, prediction, or causal inference. Each calls for a different approach. Mixing them up produces models that are technically defensible but that fail to answer the question you actually have.

**Kiffer:** Takeaway two. For causal inference, use purposeful selection guided by a directed acyclic graph. Include confounders. Exclude mediators and colliders. Make decisions based on subject-matter knowledge, not on P-values from your sample. Confounders stay in the model regardless of significance.

**Sarah:** Takeaway three. Stepwise selection, whether forward, backward, or bidirectional, is discouraged for causal inference. The reasons are inflated Type I error, unstable model selection, biased confidence intervals, and the algorithm's inability to distinguish confounders from mediators. If you see stepwise selection in a paper that claims causal effects, that's a red flag worth raising.

**Kiffer:** Takeaway four. For prediction, the modern toolkit is penalized regression. LASSO, the Least Absolute Shrinkage and Selection Operator, when you want a sparse model that drops irrelevant variables. Ridge regression when you have many correlated predictors and want stable coefficients. Elastic net when you want a bit of both.

**Sarah:** Takeaway five. Bayesian model averaging is useful when there's genuine uncertainty about which model is correct. Instead of picking one model, you average across plausible candidates, weighted by their evidence in the data.

**Kiffer:** Takeaway six. Watch for collinearity. Variance Inflation Factor, abbreviated VIF, greater than ten is a flag that demands action. VIF greater than five deserves attention. Solutions include removing redundant predictors, using Principal Components Analysis, or using ridge regression.

**Sarah:** Takeaway seven. Model nonlinear relationships through restricted cubic splines when you can. Polynomials are an alternative but have global influence and tail problems. Categorization loses information and imposes implausible step functions, so it's discouraged for the main exposure of interest.

**Kiffer:** Takeaway eight. Include interactions based on subject-matter rationale, not via fishing across all possible pairs. If you include an interaction term, also include both main effects. Three-way interactions are rarely justified.

**Sarah:** Takeaway nine. Use AIC, the Akaike Information Criterion, and BIC, the Bayesian Information Criterion, to compare non-nested models. Both reward fit and penalize complexity. Lower values are better. BIC penalizes complexity more harshly than AIC and tends to favor smaller models.

**Kiffer:** Takeaway ten. For predictive models, cross-validation is essential. Split your data into training and test sets. Fit on training. Evaluate on test. Use k-fold cross-validation for a more stable estimate. And use cross-validation to tune the penalty parameter in penalized regression.

**Sarah:** And takeaway eleven, the meta-takeaway. Model building is not a single algorithm. It's a sequence of decisions, each shaped by the goal of your analysis, your subject-matter knowledge, and the statistical properties of your candidate models. The best modelers move fluidly between data, theory, and diagnostics. They don't outsource the thinking to a stepwise procedure. The point is that the goal of the analysis dictates the strategy, end to end.

**Kiffer:** Next up is Lesson 5, Logistic Regression. The standard model for binary outcomes. Almost every concept from Lesson 4 carries over. Purposeful selection, penalized regression, splines for nonlinearity, interactions, AIC and BIC, cross-validation. The link function changes, but the model-building strategy is the same.

**Sarah:** Thanks for working through this one with us. Take care, everyone.

**Kiffer:** See you in Lesson 5.
