# Lesson 10: Mixed Models for Continuous Data
## v3 Expanded Podcast Transcript

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 10, Mixed Models for Continuous Data. This is the lesson where the abstract talk about clustered data from Lesson 9 turns into something you can actually fit and interpret.

**Sarah:** And it's worth saying out loud where we are in the arc. Lesson 9 was the diagnosis. Clustering is everywhere in public health data. Patients nested inside clinics. Students nested inside schools. Repeated blood pressures nested inside individuals. Households nested inside neighbourhoods. And ignoring that structure inflates your type one error rate.

**Kiffer:** Right. Lesson 9 ended with a roadmap of methods. Cluster fixed effects. Robust standard errors. Generalized estimating equations. Mixed models. Survey-weighted approaches. Today we take the deepest single branch of that roadmap, the linear mixed model for continuous outcomes, and develop it carefully. The next two lessons keep building on it. Lesson 11 extends the framework to discrete outcomes, binary, count, ordinal. Lesson 12 specializes mixed models to repeated-measures designs.

**Sarah:** So whatever we work through today is going to keep paying forward. Random intercepts, random slopes, REML, BLUPs, contextual effects. All of these come back.

**Kiffer:** Exactly. Let's start with the basic structure of the model. The full name is the linear mixed-effects model, which gets shortened to LME. It's also called a multilevel model, a hierarchical linear model, a variance components model. Different fields use different names for the same thing.

**Sarah:** Why does it have so many names?

**Kiffer:** Because it was reinvented independently in animal science, in education research, in psychometrics, in survey statistics, and in epidemiology. Each field developed its own vocabulary. Hierarchical linear model is the name you'll see in education. Multilevel model is more common in sociology and political science. Mixed model or linear mixed-effects model is what you'll see in biostatistics and epidemiology. Same equations underneath.

**Sarah:** Okay. So what does the equation actually look like? Let's say we have observations on a continuous outcome, like systolic blood pressure, for individual i nested inside cluster j, where j might be a clinic.

**Kiffer:** The simplest version, the random-intercept model, says the outcome for individual i in cluster j equals a fixed-effects part plus a random-effects part plus residual error. The fixed-effects part is just ordinary linear regression. An overall intercept plus the population-averaged slopes for each predictor. Age, smoking status, body mass index, sex, whatever you put in the model.

**Sarah:** Okay, and what about the random-effects part?

**Kiffer:** In the simplest model, that's a single number per cluster, called u sub j. It's how much cluster j's intercept deviates from the overall intercept. So clinic seven might have u equal to plus three, meaning the average patient at clinic seven has systolic blood pressure three points higher than the population average, even after accounting for age, smoking, and so on.

**Sarah:** And clinic twelve might have u equal to minus four. Same patient profile, but predicted four points lower than the population average.

**Kiffer:** Exactly. Each clinic gets its own intercept around the overall average. And then within each clinic, individual patients vary around that clinic-specific level by an amount we call the residual error, epsilon sub i j.

**Sarah:** Two things going on at the same time. Clusters differ on average. Individuals differ within clusters.

**Kiffer:** Right. And the random-intercept model partitions the total variance into exactly those two components. Sigma squared sub g, the between-cluster variance, captures how much clinics differ from each other. Sigma squared, the within-cluster residual variance, captures how much patients within a clinic differ from each other.

**Sarah:** And the intraclass correlation coefficient, which we met in Lesson 9, falls right out of those two numbers.

**Kiffer:** Exactly. The intraclass correlation coefficient equals sigma squared sub g divided by the sum of sigma squared sub g and sigma squared. The proportion of total variance that lives between clusters.

**Sarah:** Let's do a quick worked example. Suppose you fit a null random-intercept model, which means no predictors, just an intercept and a random effect for clinic. And you get sigma squared sub g equals two hundred and sigma squared equals eight hundred.

**Kiffer:** Then the intraclass correlation coefficient is two hundred over one thousand, which is zero point two zero. Twenty percent of the variation in systolic blood pressure is between clinics. Eighty percent is within clinics.

**Sarah:** And as we discussed in Lesson 9, an intraclass correlation coefficient of twenty percent is actually quite large in epidemiology. With cluster sizes of, say, thirty, your design effect would be one plus twenty-nine times zero point two, which is about six point eight.

**Kiffer:** Which means a naive analysis that ignored the clustering would have effective sample size about one-seventh of what it claims. So fitting the mixed model isn't a luxury. It's the difference between sound inference and false confidence.

**Sarah:** Now there's a key assumption embedded here. The cluster-level random effects, the u sub j values, are assumed to come from a Normal distribution with mean zero and variance sigma squared sub g.

**Kiffer:** Right. We're not estimating each clinic's deviation as a free parameter. That would be a fixed-effects model with a separate intercept for each clinic. Instead we're saying the clinic deviations are random draws from a Normal distribution, and what we estimate is the variance of that distribution.

**Sarah:** Which is a really efficient parameterization. If you have thirty clinics and you used cluster fixed effects, you'd be estimating thirty separate intercepts. With a random intercept, you estimate one parameter, the variance, and the model effectively borrows strength across clinics.

**Kiffer:** And that borrowing of strength is what makes mixed models so powerful when some clusters have very few observations. We'll come back to that idea when we talk about BLUPs and shrinkage later.

**Sarah:** Okay, let's talk about random slopes. The random-intercept model lets each cluster have its own baseline. But what if the effect of a predictor itself varies across clusters?

**Kiffer:** That's exactly what random slopes are for. Suppose we're studying the effect of smoking on blood pressure across thirty clinics. A random-intercept model says smokers have, on average, some particular blood pressure increase relative to nonsmokers, and this is the same in every clinic. Only the baseline level differs.

**Sarah:** So the lines for smokers and nonsmokers in each clinic would be parallel, just shifted up or down.

**Kiffer:** Right. A random-slopes model relaxes that. It says the smoking-blood pressure relationship itself can vary by clinic. In some clinics smoking might be associated with a five-point increase. In others, ten points. In others, almost no difference.

**Sarah:** Why might that vary across clinics in real life?

**Kiffer:** Lots of reasons. Patient mix. Some clinics might serve patients who are heavier smokers. Or have more comorbidities that interact with smoking. Or the average duration of smoking history might differ. Or there might be unmeasured confounders that vary across clinics.

**Sarah:** And in a multi-site clinical trial of a blood pressure medication, the random slope on treatment is itself the question of scientific interest. Some sites might show a strong treatment effect, others a weaker one, because of differences in adherence, patient population, clinical practice.

**Kiffer:** Right. The variance of the random slope tells you how much the treatment effect varies across sites. If the variance is essentially zero, the treatment effect is uniform. If it's large, you've got real heterogeneity, and that has scientific and policy implications.

**Sarah:** So how does the equation actually change when we add a random slope?

**Kiffer:** Now each cluster has two random pieces. A random intercept, u sub zero j. And a random slope on the predictor in question, u sub one j. Cluster j's regression line has intercept beta zero plus u sub zero j and slope beta one plus u sub one j. Each clinic has its own line, and those lines are no longer parallel.

**Sarah:** And we have to model the joint distribution of those two random effects. They're typically assumed to come from a bivariate Normal with three parameters. The variance of the random intercept, the variance of the random slope, and the covariance between them.

**Kiffer:** And that covariance has a real interpretation. A positive covariance means clinics with higher baseline blood pressure also tend to have stronger smoking effects. A negative covariance means the opposite. The covariance should generally be estimated rather than fixed at zero, because forcing it to zero can bias the variance estimates and gives the model less flexibility.

**Sarah:** And there's a practical caveat. Random slopes models have more parameters, and identifiability can become a problem. If you don't have enough clusters, or enough observations within clusters, or enough variation in the predictor within clusters, the model may not converge or may give you a singular covariance matrix.

**Kiffer:** Right. The practical advice is, only include a random slope if you have a theoretical reason to think the effect varies across clusters and you have enough data to support it. A common workflow is to fit the random-intercept model first, then test whether adding a random slope improves the fit.

**Sarah:** Let's pause on terminology for a second. Mixed models are sometimes called hierarchical models or multilevel models. Why?

**Kiffer:** Because the data have a hierarchy of levels. Level one is the individual observation. Patient. Student. Repeated measurement. Level two is the cluster. Clinic. School. Person. And predictors can enter at each level.

**Sarah:** Patient age is a level-one predictor. It varies within clinics. Whether the clinic is urban or rural, or how many doctors it employs, those are level-two predictors. They're constant within a clinic but vary across clinics.

**Kiffer:** And the model can include both. You can fit a model with patient age, smoking, body mass index, and sex at level one, plus clinic urbanicity and clinic size at level two, plus a random intercept for clinic. That gives you a clean way to ask, after adjusting for individual patient characteristics, do urban clinics have systematically higher or lower blood pressure than rural ones.

**Sarah:** Now the section three material gets at something quite subtle. Within-cluster versus between-cluster effects, also called contextual effects. This is one of the trickier ideas in the lesson, but it's worth slowing down on because it's a common source of misleading conclusions in epidemiology.

**Kiffer:** The setup is this. Suppose income is your predictor and self-rated health is your outcome, and you have data on individuals nested in neighbourhoods. Income varies both within neighbourhoods, because people in the same neighbourhood have different incomes, and between neighbourhoods, because some neighbourhoods are wealthier on average than others.

**Sarah:** And those two sources of variation can have different effects.

**Kiffer:** Exactly. The within-neighbourhood effect of income on health captures how much your own income matters relative to your neighbours. The between-neighbourhood effect captures how much living in a wealthier neighbourhood matters, beyond your own income.

**Sarah:** And those two effects can be quite different in size. Wealthier neighbourhoods often have better infrastructure, better schools, cleaner air, more health services, less crime. So the between-neighbourhood effect of average income on health might be much larger than the within-neighbourhood effect of personal income.

**Kiffer:** Right. And if you fit a standard regression that just uses individual income, you're estimating some weighted average of those two effects, which doesn't necessarily correspond to either one. That's where the ecological fallacy and the atomistic fallacy come from.

**Sarah:** Quick definitions. The ecological fallacy is when you take a group-level association and wrongly attribute it to individuals. So if you find that wealthier neighbourhoods have lower mortality, and you conclude that giving any individual more money will reduce their personal mortality, you've potentially confused a neighbourhood-level effect with an individual-level effect.

**Kiffer:** And the atomistic fallacy is the reverse. You take an individual-level effect and wrongly assume it scales up to groups. If higher individual income predicts better individual health, that doesn't mean a country with higher average income will have proportionally better average health, because the individual-level effect might saturate or interact with other factors.

**Sarah:** We talked about this earlier on ecological studies. Mixed models with proper centering give you a way to disentangle the two.

**Kiffer:** Right. The mechanic is called group-mean centering. Instead of putting raw income into the model, you split it into two pieces. The neighbourhood mean income, which is constant within a neighbourhood. And the individual deviation from the neighbourhood mean, which captures how an individual differs from their local average.

**Sarah:** So if you live in a neighbourhood with average income of fifty thousand, and your personal income is sixty thousand, your within-neighbourhood deviation is plus ten thousand. The neighbourhood-level predictor is fifty thousand. Both go into the model as separate predictors.

**Kiffer:** Exactly. The coefficient on the within-neighbourhood deviation is the within-cluster effect, beta sub W. The coefficient on the neighbourhood mean is the between-cluster effect, beta sub B. And if those two coefficients differ significantly, you have a contextual effect.

**Sarah:** And the contextual effect itself, the additional benefit of living in a high-income neighbourhood beyond your own income, is the difference between the two.

**Kiffer:** Right. This kind of decomposition is increasingly standard in social epidemiology because it answers a different question than ordinary regression does. You're separating individual-level from contextual-level mechanisms, which has direct implications for whether interventions should target individuals, places, or both.

**Sarah:** Okay, let's switch to the practical side. How do you actually fit one of these models? In R, the workhorse package is lme4.

**Kiffer:** Right. The function is lmer, lowercase l m e r, for linear mixed-effects regression. The syntax is intuitive once you've seen it. You write the fixed-effects part exactly as you would in lm. Outcome tilde age plus smoker plus body mass index plus female. And then you add a parenthesized term for the random effects.

**Sarah:** Like one bar clinic id.

**Kiffer:** Exactly. The expression in parentheses one bar clinic id reads as, give every clinic its own intercept, drawn from a Normal distribution. The one before the bar means random intercept. If you wanted a random slope on smoking too, you'd write one plus smoker bar clinic id, which says give every clinic its own intercept and its own smoking effect, jointly Normal.

**Sarah:** Let's walk through a typical workflow. The lesson has an activity using a dataset called phaa clinics dot csv, with thirty clinics and roughly nine hundred and sixty patients.

**Kiffer:** First step, fit the null model. Just an intercept and a random effect for clinic. Outcome tilde one plus parenthesis one bar clinic id. This gives you the baseline variance partition. Run the icc function from the performance package, and you get the intraclass correlation coefficient directly.

**Sarah:** And that tells you whether clustering is meaningful before you start adding predictors.

**Kiffer:** Right. Second step, add patient-level fixed effects. Age, smoker, body mass index, female. Refit. Look at how much the residual variance shrank. The within-clinic variance should drop because you've explained some of it with patient predictors.

**Sarah:** And the between-clinic variance, sigma squared sub g, may or may not change much. It depends on whether the patient mix differs systematically across clinics.

**Kiffer:** Third step, add cluster-level fixed effects. Clinic urbanicity. Clinic size, perhaps standardized. These should reduce the between-clinic variance further if they explain why clinics differ from each other.

**Sarah:** Fourth step, ask whether the random intercept is even needed. There's a function called ranova in the lmerTest package that runs likelihood ratio tests on the random effects. We'll come back to the boundary issue in a moment.

**Kiffer:** Fifth step, fit a random-slopes model. Add one plus smoker bar clinic id, and then compare it to the random-intercept-only model with anova.

**Sarah:** Now this raises an issue we need to walk through carefully. Inference for fixed effects in mixed models. Plain lme4 famously does not print p-values for the fixed effects.

**Kiffer:** And the reason is principled. In ordinary linear regression, the residual degrees of freedom are easy to compute. Number of observations minus number of parameters. The t-statistic for a coefficient is exactly t-distributed under the null.

**Sarah:** But in mixed models, the right denominator degrees of freedom isn't obvious.

**Kiffer:** Right. You don't have a simple count, because the random effects are doing some intermediate amount of work. So Doug Bates, who wrote lme4, decided that printing a fake p-value would be misleading. He left it out.

**Sarah:** Which annoys students for about two weeks until they discover lmerTest.

**Kiffer:** Right. The lmerTest package is a wrapper that adds approximate degrees of freedom using one of two methods. Satterthwaite, which is the default. Or Kenward-Roger, which is more accurate but slower.

**Sarah:** Walk us through what those approximations actually do.

**Kiffer:** Satterthwaite estimates the effective degrees of freedom for each fixed-effect coefficient by matching the moments of the test statistic to a t distribution. It's based on a calculation involving the variances of the variance components and how they propagate through the standard error.

**Sarah:** Conceptually, it's saying, if I had a t-statistic with this much uncertainty in the denominator, what t distribution should I compare it to.

**Kiffer:** Exactly. And Kenward-Roger goes a step further. It also adjusts the standard errors themselves to account for uncertainty in the variance components, plus computes Satterthwaite-like degrees of freedom on the adjusted statistic. It's the gold standard for small to moderate samples but can be computationally expensive in large datasets.

**Sarah:** So what's the practical recommendation for students writing this up?

**Kiffer:** For most reports, lmerTest's default Satterthwaite is good enough and comes out automatically. If you have a small number of clusters, fewer than thirty or so, switch to Kenward-Roger, which you can request explicitly. If you have thousands of clusters, even plain Wald tests are fine, because the degrees of freedom are effectively infinite.

**Sarah:** Now inference for the random effects, the variance components themselves, is a different story.

**Kiffer:** Right. Suppose you want to test whether sigma squared sub g equals zero. That is, whether you need a random intercept at all. The natural tool is the likelihood ratio test. Fit the model with the random intercept. Fit the model without. Compare twice the log-likelihood difference to a chi-squared distribution.

**Sarah:** But there's a subtlety. The null hypothesis, sigma squared sub g equals zero, is on the boundary of the parameter space. Variances can't be negative.

**Kiffer:** Right. And when the null sits on the boundary, the standard chi-squared reference distribution is too conservative. The actual sampling distribution of the test statistic under the null is a fifty-fifty mixture of a chi-squared with one degree of freedom and a point mass at zero.

**Sarah:** And what does that boundary issue mean in practice when we're computing p-values?

**Kiffer:** It means the correct p-value is half the standard chi-squared p-value. So if you compute a chi-squared with one degree of freedom and get p equals zero point zero six, the boundary-corrected p-value is zero point zero three. The standard test is conservative by a factor of two.

**Sarah:** And the practical advice is, just halve the p-value when you're testing a single variance component on the boundary. The ranova function in lmerTest does this for you automatically.

**Kiffer:** Right. Some people skip the formal test entirely and use information criteria like the Akaike information criterion, AIC, or the Bayesian information criterion, BIC, for model selection. AIC penalizes complexity, BIC penalizes it more strongly. Both are valid for comparing nested mixed models.

**Sarah:** But there's an important rule about which likelihood to use when comparing models.

**Kiffer:** Yes. This is the maximum likelihood versus restricted maximum likelihood question, ML versus REML. They're two different ways to estimate the variance components in a mixed model.

**Sarah:** Walk us through the difference between those two methods.

**Kiffer:** Maximum likelihood, ML, estimates all the parameters at once by maximizing the full likelihood. The problem is, ML doesn't account for the degrees of freedom you've used up estimating the fixed effects. So it produces variance component estimates that are slightly biased downward.

**Sarah:** It's analogous to how the maximum likelihood estimator of a sample variance divides by n, but the unbiased estimator divides by n minus one. Same logic.

**Kiffer:** Exactly. Restricted maximum likelihood, REML, fixes that bias. It restricts the likelihood to a subspace orthogonal to the fixed-effects design, which effectively accounts for the degrees of freedom lost in estimating the fixed effects. The result is less biased variance component estimates, especially when the number of clusters is small.

**Sarah:** So REML is generally preferred for variance estimation. But there's a catch, right?

**Kiffer:** But REML log-likelihoods are not comparable across models with different fixed effects. Because the restricted subspace itself depends on the fixed-effects design, changing the design changes what likelihood you're computing. You'd be comparing apples and oranges.

**Sarah:** So the rule of thumb is, when you're comparing models with different fixed effects, refit them with ML and use ML likelihoods for the comparison. When you're comparing models with the same fixed effects but different random effects, you can use REML.

**Kiffer:** And when you're reporting your final variance component estimates, use REML. Most software, including lme4, defaults to REML for fitting and gives you a one-line option to switch to ML when you need it.

**Sarah:** Practical example. You have model two with patient and clinic predictors and a random intercept. You want to add a random slope for smoking, model three. Both fit with REML. Compare them with anova or with the likelihood ratio test, halving the p-value for the boundary.

**Kiffer:** Now imagine you wanted to test whether body mass index belongs in the fixed-effects part. You'd refit both models with ML, not REML, and compare those ML likelihoods. Different question, different rule.

**Sarah:** Now let's talk about predictions and what comes out the other side. There's a beautiful concept called the BLUP, the best linear unbiased predictor.

**Kiffer:** Right. The random effects, the u sub j values, are not directly observed. They're latent variables. After we fit the model and estimate the variance components, we can compute predicted values for each cluster's random effect, and those predicted values are the BLUPs.

**Sarah:** And BLUPs have this lovely property called shrinkage.

**Kiffer:** Yes. The BLUP for cluster j is a weighted average of the cluster's own raw mean and the overall mean. The weight depends on the cluster size and the intraclass correlation coefficient.

**Sarah:** Walk us through the intuition for that shrinkage formula.

**Kiffer:** The shrinkage factor is sigma squared sub g divided by the sum of sigma squared sub g and sigma squared over m, where m is the cluster size. When m is large, sigma squared over m is small, and the shrinkage factor approaches one. The BLUP basically equals the cluster's raw mean.

**Sarah:** And what happens when the cluster size m is small?

**Kiffer:** Sigma squared over m is large, so the denominator dominates and the shrinkage factor is small. The BLUP is pulled strongly toward the overall mean. The cluster doesn't have enough data to support a strong claim about its own deviation, so the model borrows from the population.

**Sarah:** Quick worked example. Suppose sigma squared sub g equals five, sigma squared equals twenty. For a hospital with one hundred patients, the shrinkage factor is five over five plus zero point two, which is five over five point two, about zero point nine six. The BLUP is essentially the raw hospital mean.

**Kiffer:** For a hospital with only four patients, the shrinkage factor is five over five plus five, which is one half. The BLUP is pulled halfway toward the overall mean. The model is saying, with only four patients, your hospital's raw mean is too noisy to trust on its own.

**Sarah:** And this borrowing of strength is one of the most attractive features of mixed models. Small clusters get stabilized estimates. Large clusters get faithful estimates. Nobody is over-fit.

**Kiffer:** And if you were running a public reporting program for hospitals, this is exactly what you want. A hospital that happened to have one bad outcome in four patients shouldn't be flagged as an outlier the same way a hospital with one bad outcome in four hundred patients would be.

**Sarah:** BLUPs are sometimes called empirical Bayes estimates, by the way.

**Kiffer:** Right. The Bayesian interpretation is, we have a prior on the cluster effects, namely the Normal distribution with mean zero and variance sigma squared sub g. We update that prior with the cluster's own data to get a posterior. The BLUP is the posterior mean. Empirical because we estimated the prior variance from the data instead of fixing it in advance.

**Sarah:** Okay, let's talk about diagnostics. After you've fit a mixed model, how do you check that the model is actually adequate?

**Kiffer:** Mixed models give you residuals at multiple levels of the hierarchy. Level-one residuals, which are the individual-level errors, the difference between each observation and its cluster-specific predicted value. Level-two residuals, which are the BLUPs themselves, representing how each cluster deviates from the overall mean.

**Sarah:** And you really should check both levels, not just one.

**Kiffer:** Right. The recommended workflow is to start at the highest level and work down. First check the BLUPs. Make a Q-Q plot to see whether they're roughly Normal. Look for influential clusters that are far from the bulk. Check whether the BLUPs are correlated with cluster-level predictors that aren't yet in the model, which might indicate omitted variables.

**Sarah:** And then you drop down to the level-one residuals.

**Kiffer:** Right. Plot the residuals against fitted values to check for nonlinearity or heteroscedasticity. Make a Q-Q plot for Normality. Look for outliers within clusters. Plot residuals against each predictor.

**Sarah:** Why work from the top down rather than the other way?

**Kiffer:** Because problems at the cluster level can manifest as apparent problems at the individual level. If one clinic is poorly fit because of unmeasured cluster characteristics, all of its patients might look like outliers at level one. Catching that at level two saves you from chasing ghosts at level one.

**Sarah:** And what are the options if the assumptions are violated?

**Kiffer:** There are several options. Box-Cox transformations of the response. Robust standard errors as a backstop. Switching to a generalized mixed model with a different distributional family, which is what Lesson 11 will introduce.

**Sarah:** On Box-Cox in particular, there's a quirk worth flagging. When you compare models fit on different transformations of the response, you have to use ML, not REML, for the likelihood comparison.

**Kiffer:** Right. Because the response variable itself is changing, REML log-likelihoods aren't comparable across the lambda values you're considering. Use ML, find the best lambda, then refit your final model with REML for the variance component estimates.

**Sarah:** Now there's one big conceptual point I want to make sure we land before we wrap. The fixed-effects coefficients in a mixed model have a specific interpretation, and the random-effects part adds something distinct that ordinary regression cannot give you.

**Kiffer:** Right. The fixed-effects coefficients in an LME, in lmer, give you the average effect across clusters. They're conditional-on-the-cluster effects, but they describe the typical clinic. So if you fit smoking as a fixed effect and the coefficient is six millimetres of mercury, the interpretation is, in a typical clinic, smokers have systolic blood pressure six millimetres of mercury higher than nonsmokers.

**Sarah:** And the random-effects part adds information about how that effect varies.

**Kiffer:** Exactly. If you also fit a random slope on smoking, the variance of the random slope tells you the spread. So you might learn that the smoking effect averages six millimetres of mercury but ranges from about two to ten across clinics. That's a fundamentally different kind of statement than ordinary regression can make.

**Sarah:** Mixed models give you both pieces. The population-level summary, and the cluster-level heterogeneity around it.

**Kiffer:** And that's a lens you can take into the rest of the course. Whenever a mixed model is used, ask both questions. What's the average effect? And how much does it vary across clusters?

**Sarah:** Let me also flag the comparison with generalized estimating equations from Lesson 9, because students always ask. Mixed models give cluster-specific or conditional effects. Generalized estimating equations give population-averaged or marginal effects. For continuous outcomes with the identity link, the two coincide. They produce essentially the same fixed-effects estimates.

**Kiffer:** Right. The big differences emerge for non-Gaussian outcomes, like binary or count outcomes, where conditional and marginal effects can be numerically quite different. We'll discuss that in Lesson 11.

**Sarah:** Let's pull the takeaways together.

**Kiffer:** First. The linear mixed-effects model combines fixed effects, the population-level coefficients shared across clusters, with random effects, which let one or more coefficients vary across clusters. Random intercepts let baselines vary. Random slopes let predictor effects vary. Most applied work uses random intercepts, sometimes with one or two random slopes, depending on the scientific question.

**Sarah:** Second. The variance components, sigma squared sub g for between-cluster variance and sigma squared for within-cluster variance, partition the total variance and yield the intraclass correlation coefficient. Reporting those numbers is part of the scientific contribution. Even before you discuss specific covariates, telling readers how much of the variation lives between clusters is informative.

**Kiffer:** Third. Estimation is by likelihood. Restricted maximum likelihood, REML, is preferred for unbiased variance component estimates. Plain maximum likelihood, ML, is required when comparing models with different fixed effects. R's lme4 package with the lmer function is the standard tool, with lmerTest layered on top for fixed-effect p-values via Satterthwaite or Kenward-Roger.

**Sarah:** Fourth. Inference comes in two flavours. For fixed effects, t-tests with approximate degrees of freedom, plus confidence intervals. For random effects, likelihood ratio tests with the boundary correction, halving the chi-squared p-value.

**Kiffer:** Fifth. Predictions and diagnostics live at multiple levels. BLUPs give cluster-specific predictions with shrinkage that borrows strength from the population, especially helping small clusters. Residuals exist at every level of the hierarchy and should be examined top-down, starting with cluster-level BLUPs and working down to individual residuals. The two layers tell you different things, and you need both.

**Sarah:** And one more conceptual point that ties everything together. The fixed effects describe the average. The random effects describe the variability. Mixed models give you both, which is why they're so well suited to questions that involve heterogeneity, treatment-effect variation, and contextual mechanisms.

**Kiffer:** Whenever someone tells you a treatment effect is six points, the next question should be, six points on average. How much does it vary across sites, across populations, across contexts. A mixed model is one of the most natural ways to answer that follow-up.

**Sarah:** Next up is Lesson 11. Mixed Models for Discrete Data.

**Kiffer:** Right. We extend everything we built today to binary outcomes, count outcomes, and ordinal outcomes. The conceptual structure is the same. Random intercepts, random slopes, contextual effects, BLUPs, model selection. The estimation gets harder, because the likelihood no longer has a closed form. The link function reshapes interpretation. And the choice between conditional and marginal effects, between mixed-model coefficients and GEE coefficients, becomes consequential, in some cases by a factor of two.

**Sarah:** Take care, everyone.

**Kiffer:** See you in Lesson 11.
