# Lesson 11: Mixed Models for Discrete Data
## v3 Expanded Podcast Transcript

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 11, Mixed Models for Discrete Data. This is the natural sequel to Lesson 10. Last lesson we built linear mixed models for continuous outcomes. This lesson we extend everything we did to outcomes that aren't continuous.

**Sarah:** And by discrete outcomes you mean binary outcomes, count outcomes, ordinal outcomes. Things that don't fit a Normal distribution.

**Kiffer:** Right. Binary, like did the patient get referred to a specialist, yes or no. Count, like how many emergency department visits did this person have last year. Ordinal, like a self-rated health scale from one to five. None of those are continuous outcomes that we'd model with a Normal distribution.

**Sarah:** And the framework we used in Lesson 10, the linear mixed model with a random intercept or random slope, doesn't translate directly. We need a new tool.

**Kiffer:** We do. The tool is the generalized linear mixed model, often abbreviated GLMM. The way to think about it is exactly the way you'd guess from the name. Generalized linear models, GLMs, are what we use when the outcome isn't continuous. Logistic regression is a GLM for binary outcomes. Poisson regression is a GLM for counts. The link function is what makes the model linear on a transformed scale.

**Sarah:** So a generalized linear mixed model is a generalized linear model that also has random effects. We take the linear predictor from the GLM and add a random effect, the same way we did in linear mixed models last lesson.

**Kiffer:** Exactly. If you understood Lesson 10, you already understand the structural part of Lesson 11. The random intercept does the same thing it did before. It captures between-cluster variation. The random slope, when present, lets the effect of a predictor vary across clusters. All of those concepts carry over.

**Sarah:** What changes is the link function and the likelihood.

**Kiffer:** And the interpretation. There's a new interpretive distinction we're going to spend a lot of time on today. It's called the subject-specific versus population-averaged distinction. It does not arise in linear mixed models. It is unavoidable in GLMMs. And it's probably the single biggest conceptual hurdle in this lesson.

**Sarah:** Okay. Let's start with the simplest case. Logistic regression with a random intercept. Walk us through what that looks like.

**Kiffer:** Picture the clinic dataset we've been using since Lesson 9. Patients are nested within clinics. The outcome we used in Lesson 10 was something continuous, like systolic blood pressure. Now imagine the outcome is binary instead. Was the patient referred to a specialist, yes or no. We code yes as one and no as zero.

**Sarah:** And without random effects, you'd just fit ordinary logistic regression. Logit of the probability of referral on the predictors.

**Kiffer:** Right. The logit link maps probabilities, which live between zero and one, onto the real line. So we model the log-odds of referral as a linear combination of predictors. To make this a mixed model, we add a random intercept for clinic. So the log-odds of referral for patient i in clinic j becomes the intercept, plus the predictor effects, plus a clinic-level random intercept that is Normal with mean zero and variance sigma squared g.

**Sarah:** And the role of the random intercept is the same as last lesson. Some clinics have higher overall referral rates than others, even after adjusting for patient case-mix. The random intercept absorbs that clinic-level variation.

**Kiffer:** Exactly. And just like in linear mixed models, observations within the same clinic are correlated, because they share that clinic's random intercept. The model accounts for that correlation correctly.

**Sarah:** And in R, the syntax is almost identical to lmer. Just swap lmer for glmer and add a family argument.

**Kiffer:** Right. In the lme4 package, glmer is the function for generalized linear mixed models. You write referred tilde age plus female plus smoker plus body mass index plus clinic urban plus, in parentheses, one given clinic id. Then family equals binomial. That fits a logistic GLMM.

**Sarah:** And the output looks similar to lmer. Fixed effect estimates, standard errors, a variance for the random intercept.

**Kiffer:** Similar in shape. The fixed effects are now log-odds ratios rather than mean differences. You exponentiate them to get odds ratios. And the random-effect variance is on the log-odds scale, which is harder to interpret directly than a variance on the original outcome scale.

**Sarah:** Which is part of why GLMMs feel less intuitive than linear mixed models. Everything is happening on a transformed scale.

**Kiffer:** That's exactly the right framing. The random effects live on the link-function scale. So in logistic regression they live on the log-odds scale. You always have to remember to exponentiate or back-transform when you're trying to communicate results.

**Sarah:** Okay. Let's hit the conceptual distinction you flagged earlier. Subject-specific versus population-averaged. What's going on there?

**Kiffer:** This is the part of the lesson that I want everyone to slow down for, because it's subtle and it really matters. In a linear mixed model with an identity link, subject-specific and population-averaged effects are the same number. There's no distinction. In a GLMM with a non-linear link, like logit, they are different numbers. That's because of the math of how non-linear functions interact with averaging.

**Sarah:** Okay, give me the intuition first. Then we can do an example.

**Kiffer:** Intuition. The subject-specific effect, sometimes called the conditional or cluster-specific effect, is the effect within a particular cluster, holding the cluster's random effect fixed. So in our clinic example, the subject-specific odds ratio for being a smoker tells you, within a given clinic, how much higher are the odds of referral for a smoker compared to a non-smoker.

**Sarah:** It's the within-clinic effect. Two patients at the same clinic, same age, same body mass index, but one smokes and one doesn't. How do their odds compare?

**Kiffer:** Exactly. The population-averaged effect, sometimes called the marginal effect, is the effect averaged across the whole population of clusters. So you ask, across all clinics, on average, how do the odds of referral compare between smokers and non-smokers?

**Sarah:** And those two questions give different numerical answers when the link function is non-linear. Why?

**Kiffer:** Because of something called non-collapsibility. The odds ratio is non-collapsible across strata. If two clinics have different baseline referral rates, and you compute the odds ratio for smoking within each clinic and then average, you do not get the same number as if you pooled everyone together and computed an unconditional odds ratio. The math just doesn't commute.

**Sarah:** And which one is bigger?

**Kiffer:** Subject-specific is always larger in magnitude than population-averaged. That's a general property in logistic GLMMs. The conditional effect is more extreme. The marginal effect is shrunken toward the null.

**Sarah:** Is there a formula that connects them?

**Kiffer:** There is, and it's a really useful sanity check. Approximately, the population-averaged log-odds equals the subject-specific log-odds divided by the square root of one plus 0.346 times the random-effect variance. The constant 0.346 is specific to the logit link. For the probit link it would be one instead.

**Sarah:** Let's plug numbers into that. The lesson does an example with antibiotic resistance in cattle herds, but let's stay with our clinic referral example.

**Kiffer:** Sure. Suppose we fit the logistic GLMM and the subject-specific log-odds for smoking is 0.85, and the random-intercept variance for clinic is 1.2.

**Sarah:** So the subject-specific odds ratio is e to the 0.85, which is about 2.34. Within a given clinic, smokers have about 2.3 times the odds of being referred compared to non-smokers.

**Kiffer:** Right. Now let's convert. The denominator in the formula is the square root of one plus 0.346 times 1.2, which is the square root of 1.4152, about 1.19. So the population-averaged log-odds is 0.85 divided by 1.19, which is about 0.71. Exponentiate, and the population-averaged odds ratio is about 2.03.

**Sarah:** So the within-clinic effect is 2.34. The across-population effect is 2.03. Both real, but they're answering different questions.

**Kiffer:** And that gap grows as the random-effect variance grows. If the clinics were even more heterogeneous, say variance of three or four, the gap between subject-specific and population-averaged would be much wider.

**Sarah:** Which one do you actually want?

**Kiffer:** It depends entirely on the question. If you're a clinician and you want to know, for my patient sitting in front of me, how does smoking change the odds of needing a referral, you want subject-specific. That's the within-clinic, within-context effect. If you're a policy-maker and you want to know, across the whole population of clinics, on average, what would happen to referral rates if smoking dropped, you want population-averaged.

**Sarah:** And glmer gives you subject-specific by default.

**Kiffer:** It does. Because the random effects are explicitly in the model, the coefficients are interpreted conditional on those random effects being at their average value. To get population-averaged, you'd typically fit a separate model using generalized estimating equations, or GEE for short.

**Sarah:** GEE. Let's name what that is. Generalized estimating equations is a different estimation framework. It doesn't model random effects explicitly. Instead, it models the within-cluster correlation as a nuisance parameter and estimates only the population-averaged regression coefficients.

**Kiffer:** Right. In R, the geepack package gives you geeglm. The syntax is almost identical to glm, but you specify a cluster id and a working correlation structure. Exchangeable is the most common default. The output is population-averaged odds ratios, and they will generally be smaller in magnitude than the glmer subject-specific odds ratios.

**Sarah:** Working correlation. Quick definition. That's GEE's stand-in for the random-effects structure. You're telling it how observations within a cluster relate to each other. Exchangeable means every pair within a cluster has the same correlation, like a random intercept. Independent assumes no within-cluster correlation. Autoregressive means correlation decreases with distance, which is good for longitudinal data with regular spacing. Unstructured estimates a separate correlation for every pair.

**Kiffer:** And the nice property of GEE is that even if you specify the working correlation wrong, your fixed-effect estimates are still consistent. The standard errors get adjusted using a sandwich estimator, which is robust to misspecification of the correlation structure. So GEE is forgiving in a way that GLMMs are not.

**Sarah:** Which is a big practical attraction. You don't have to be exactly right about the correlation structure to get reasonable inference.

**Kiffer:** Right. The trade-off is that GEE doesn't model the random effects, so you can't predict cluster-specific quantities, you don't get an ICC out of it, and you can't easily handle multi-level structures with crossed or nested random effects. It's a trade. Robust population-averaged inference, but less flexibility.

**Sarah:** And it's actually pretty common to fit both and report both.

**Kiffer:** It is. The lesson recommends running glmer for the subject-specific interpretation and then geeglm for the population-averaged interpretation, and putting them side by side. They tell complementary stories. The subject-specific tells you how the predictor operates within a cluster. The population-averaged tells you what you'd see if you aggregated across clusters and looked at the marginal association.

**Sarah:** Two more things on this distinction before we move on. The lesson talks about the median odds ratio and the latent variable intraclass correlation coefficient. Those felt like new vocabulary.

**Kiffer:** Yeah, those are both ways to summarize how much heterogeneity there is between clusters in a logistic GLMM. The median odds ratio, often abbreviated MOR, gives you a feel for how much the odds of the outcome vary between two randomly chosen clusters.

**Sarah:** How does that work concretely?

**Kiffer:** Imagine you randomly pick two clinics out of your sample. You take one patient from each, both with the same covariates. The two patients have different odds of referral, just because they're at different clinics. The median odds ratio is the median of that ratio, taken over all possible pairs of clinics. The formula is e to the square root of two times sigma squared g, all multiplied by 0.6745.

**Sarah:** And what does the number mean in plain language?

**Kiffer:** An MOR of one means no between-clinic variation. Larger values mean more heterogeneity. An MOR of two would mean that, when you compare two random clinics, the median ratio of odds is two. So a typical between-clinic gap is about a doubling of odds. That's substantial.

**Sarah:** And it's nice because it's on the same scale as the fixed-effect odds ratios. You can directly compare. Is the between-clinic heterogeneity comparable to the smoking effect? Or is it larger? Or smaller?

**Kiffer:** Exactly. It puts the random-effect variance on a scale that's interpretable next to the things you care about.

**Sarah:** Let's compute one quickly. With our sigma squared g of 1.2, the median odds ratio is e to the square root of two times 1.2, which is the square root of 2.4, about 1.549, times 0.6745. That product is about 1.045. And e to the 1.045 is about 2.84.

**Kiffer:** Right. So the median odds ratio in our example is about 2.84. The typical between-clinic gap is bigger than the within-clinic smoking effect, which was an odds ratio of 2.34. That's a striking finding. It says, plainly, where you go for care matters more than whether you smoke. Which is the kind of insight you can only see when you take the clustering seriously.

**Sarah:** And the latent variable ICC?

**Kiffer:** The intraclass correlation coefficient, ICC, in linear mixed models was easy. It was the random-intercept variance divided by the total variance. In logistic GLMMs, there's no obvious total variance, because the residual variance isn't a parameter we estimate. The latent-variable ICC handles this by imagining the binary outcome as the result of thresholding an underlying continuous variable.

**Sarah:** So you imagine that there's a hidden continuous risk score for each person, and the person gets referred if that score crosses some threshold.

**Kiffer:** Exactly. And in a logistic model, the underlying error term on that latent scale follows a standard logistic distribution, which has variance pi squared over three, about 3.29. So the latent-variable ICC is sigma squared g divided by sigma squared g plus pi squared over three. It tells you, on the latent scale, what proportion of variance is between clusters.

**Sarah:** And just to ground that. If sigma squared g is, say, 1.2, then the latent ICC is 1.2 divided by 1.2 plus 3.29, which is 1.2 over 4.49, about 0.27. Twenty-seven percent of latent-scale variance is between clinics.

**Kiffer:** Right. That's a meaningful clustering effect. You'd take it seriously.

**Sarah:** Let's pivot to count outcomes. The same framework applies, but the link function is different.

**Kiffer:** Right. For counts, the natural model is Poisson with a log link. So we model log of the expected count as a linear combination of predictors. To make it a GLMM, we add a random intercept on the log scale. Same structure, different link, different distribution.

**Sarah:** And because the link is log, exponentiating gives multiplicative effects on the rate scale.

**Kiffer:** Yes. If the log scale is additive, the original scale is multiplicative. So a random intercept on the log scale becomes a random multiplier on the rate scale. If the random effects are Normal on the log scale, the multipliers exp of u are log-Normal on the rate scale.

**Sarah:** Make that concrete. Imagine we're modeling annual emergency department visits across patients nested within neighborhoods. Some neighborhoods, on average, generate more visits than others, even after we adjust for age and chronic conditions.

**Kiffer:** Right. We'd fit log of expected visits as a function of fixed effects, plus a neighborhood-level random intercept on the log scale. Suppose the random-intercept variance is 0.25. Then on the rate scale, exp of u has a log-Normal distribution. The standard deviation of u is 0.5, so a typical high-rate neighborhood has rate multiplier exp 0.5, about 1.65. A typical low-rate neighborhood has rate multiplier exp negative 0.5, about 0.61. So even after adjusting for everything we've measured, the highest-rate neighborhoods see roughly 1.65 over 0.61, about 2.7 times the visit rate of the lowest-rate neighborhoods. That's the kind of structural inequality the random effect surfaces.

**Sarah:** And there's a nice connection between Poisson GLMMs and the negative binomial distribution, which we covered earlier in the course for overdispersion.

**Kiffer:** Yeah, this is one of those connections that makes a lot of things click. The negative binomial distribution can be derived as a Poisson-gamma mixture. You assume that the Poisson rate parameter itself varies across observations according to a gamma distribution. When you marginalize over that gamma, you get a negative binomial. The variance is greater than the mean, which captures overdispersion.

**Sarah:** So a Poisson GLMM with gamma-distributed random effects on the rate scale is essentially a negative binomial.

**Kiffer:** Effectively, yes. The two perspectives meet there. If you've got count data with overdispersion and you want to write down a clustered model, a Poisson GLMM with Normal random effects on the log scale is the standard approach. That's what most software does. If you instead use gamma random effects on the rate scale, you recover the negative binomial closed form.

**Sarah:** And in practice, when do you reach for which?

**Kiffer:** If your overdispersion is structured, meaning you can attribute it to known clusters like clinics or households or repeated visits to the same patient, you want a Poisson GLMM with random effects for those clusters. If the overdispersion is just generic extra variation without a clear clustering structure, a plain negative binomial is often enough.

**Sarah:** And the lesson notes you can also combine them. Negative binomial GLMMs do exist. They handle overdispersion plus explicit clustering.

**Kiffer:** They do, and they're useful when you have both. Like emergency department visit counts that are overdispersed even within the same patient, plus you want to add a household-level random effect for shared environment.

**Sarah:** What about ordinal outcomes? Things with three or more ordered categories.

**Kiffer:** The standard fixed-effects model for ordinal data is the proportional odds logistic regression, which we covered earlier in the course. That extends naturally to a GLMM by adding random effects to the latent variable underlying the ordinal categories.

**Sarah:** So the latent-variable approach we used for binary outcomes does double duty here.

**Kiffer:** It does. You imagine a continuous latent severity variable, and the observed ordinal category is determined by which thresholds the latent value crosses. The random effect lets the latent severity shift up or down for each cluster. Within a clinic, all patients share that clinic's latent shift.

**Sarah:** Multinomial is harder, though.

**Kiffer:** Multinomial is definitely harder. Each non-reference category gets its own set of regression coefficients, and you typically want correlated random effects across categories. The dimensionality of the random-effects integral grows quickly. There's specialized software for it, but it's less standard than logistic or Poisson GLMMs, and convergence can be a real fight.

**Sarah:** There's also zero-inflated models, which we touched on earlier in the course. How do random effects interact with those?

**Kiffer:** Zero-inflated count models have a count process and a separate point mass at zero, modeling the probability of being a structural zero. You can add random effects to the count part, the zero-inflation part, or both. Same conceptual idea. Each component has its own linear predictor, and each can carry random effects to capture clustering.

**Sarah:** Okay, let's hit estimation. You said GLMMs are harder to fit than linear mixed models, and that there's no closed-form likelihood.

**Kiffer:** Right. This is the part of the lesson that gets technical, but it's worth understanding even if you're not going to derive anything yourself, because it explains why GLMM fits sometimes fail and what to do when they do.

**Sarah:** Walk through the problem. Why is the likelihood not closed form?

**Kiffer:** In a linear mixed model with Normal random effects and Normal residuals, you can integrate out the random effects analytically. Everything is Gaussian, and the marginal likelihood has a closed form. In a GLMM with, say, a binomial outcome and Normal random effects, the joint likelihood is not Gaussian. When you try to integrate out the random effects, the integral has no analytic solution.

**Sarah:** So you have to approximate.

**Kiffer:** You have to approximate. And the menu of approximations is what the lesson lays out. There are several methods, and they trade off accuracy against computational cost.

**Sarah:** Let's go through them in order from cheapest to most accurate.

**Kiffer:** Sure. Cheapest is quasi-likelihood. There are two flavors. Marginal quasi-likelihood, MQL, and penalized quasi-likelihood, PQL. Both work by Taylor-expanding the model and turning it into something that looks like a linear mixed model that you can fit iteratively. MQL gives population-averaged-style estimates. PQL gives subject-specific-style estimates.

**Sarah:** And the catch?

**Kiffer:** The catch is that quasi-likelihood, especially first-order quasi-likelihood, can be markedly biased. Simulation studies have shown this consistently. With binary outcomes, with small clusters, with large random-effect variances, first-order quasi-likelihood can underestimate both fixed effects and variance components substantially. Second-order PQL is better but still not great in tough cases.

**Sarah:** So quasi-likelihood is fast but unreliable.

**Kiffer:** Right. The standard recommendation is to avoid first-order MQL altogether and to use quasi-likelihood only as a last resort when nothing else will fit.

**Sarah:** Next up?

**Kiffer:** Next is the Laplace approximation. The Laplace approximation replaces the integrand with a Normal distribution centered at its mode. It's mathematically equivalent to adaptive Gauss-Hermite quadrature with a single quadrature point.

**Sarah:** And that's what glmer in lme4 uses by default.

**Kiffer:** It is. Laplace is a good practical default. It's much more accurate than quasi-likelihood, and it's fast enough for most applied problems. It can struggle when the random-effects variance is very large or the cluster sizes are very small, but in typical applications it's reasonable.

**Sarah:** Then adaptive Gauss-Hermite quadrature with more points.

**Kiffer:** Right. The next step up is to evaluate the integrand at multiple carefully chosen points, called quadrature points, and take a weighted sum. Adaptive quadrature centers and scales the points based on the mode and curvature of each cluster's integrand, which makes it much more efficient than non-adaptive quadrature.

**Sarah:** And how many points do you need?

**Kiffer:** Default in many packages is around seven. The lesson recommends checking sensitivity. Try seven, then nine, then maybe fifteen. If the estimates are stable, you're fine. If they shift meaningfully, you need more points. The catch is that computational cost grows with the number of random effects. With one random effect and seven points you do seven evaluations per cluster. With two correlated random effects and seven points each, you do forty-nine evaluations per cluster. It scales fast.

**Sarah:** So adaptive quadrature is the gold standard, but it can become impractical.

**Kiffer:** It is, and it can. Especially with multiple random effects or crossed random effects. That's where Markov chain Monte Carlo comes in.

**Sarah:** Markov chain Monte Carlo, MCMC, is the Bayesian alternative.

**Kiffer:** Right. Instead of trying to maximize a likelihood, you put priors on all the parameters and you sample from the posterior distribution using MCMC. There are R packages like brms and rstanarm that wrap up Stan to fit GLMMs Bayesian-style with sensible default priors. MCMC handles complex random-effect structures gracefully and gives you full uncertainty quantification, but it's computationally expensive and has its own learning curve.

**Sarah:** And when you can't get glmer to converge, switching to a Bayesian fit is one of the recommended escape hatches.

**Kiffer:** It is. Often a Bayesian fit will converge where a frequentist fit fails, partly because the priors regularize the estimation, and partly because MCMC just handles awkward likelihood surfaces better than gradient-based optimization.

**Sarah:** Let's talk practical convergence problems. What does it look like when glmer is failing, and what do you do?

**Kiffer:** lme4 will throw warnings. The most common is a warning that the optimizer didn't converge, or that the gradient at the solution is too large, or that the random-effect variance estimate is on the boundary, meaning it's pinned at zero. Those warnings are real. You should not ignore them.

**Sarah:** What's your toolkit for fixing them?

**Kiffer:** Several things to try. First, simplify the random-effects structure. If you have correlated random slopes and intercepts, try independent random effects. If that still fails, drop the random slope and keep only the random intercept. Often the data don't support all the random effects you've asked for.

**Sarah:** Second?

**Kiffer:** Second, switch optimizers. lme4 has multiple optimizer options. The default is bobyqa. You can also try Nelder-Mead, or load the optimx package and try a few alternatives. Sometimes one optimizer succeeds where another fails.

**Sarah:** Third?

**Kiffer:** Third, increase quadrature points if you're using glmer with the nAGQ argument greater than one. Sometimes the issue is approximation error rather than the underlying optimization.

**Sarah:** Fourth?

**Kiffer:** Fourth, switch to a Bayesian fit. As we said, MCMC often converges where frequentist methods fail. Use brms or rstanarm with weakly informative priors and let the chains run long enough.

**Sarah:** And fifth, switch to GEE for the population-averaged effect.

**Kiffer:** Right. If your scientific question is well-served by a population-averaged answer, GEE doesn't require numerical integration at all. It just needs you to specify a working correlation structure. It's robust and it converges easily. So for some questions, GEE is the practical answer when GLMMs won't behave.

**Sarah:** Okay. Let's hit one more topic from the lesson before takeaways. Inference and diagnostics in GLMMs. What's different from linear mixed models?

**Kiffer:** Most things are similar in spirit but trickier in practice. For fixed effects, you use Wald-type tests, the estimate divided by its standard error compared to a Normal distribution. Those work fine when sample sizes are large and the parameters are far from boundaries.

**Sarah:** And when they're not?

**Kiffer:** When estimates are near boundary values, like a variance close to zero or a probability close to zero or one, Wald tests can be misleading. Confidence intervals can include impossible values, and p-values are unreliable. The recommended alternative is a likelihood ratio test or profile likelihood confidence interval, but those require maximum likelihood estimation, not quasi-likelihood.

**Sarah:** So if you've fit with PQL, you don't get good likelihood-based inference.

**Kiffer:** You don't. That's another reason to prefer Laplace or full quadrature when you can. They give you a real likelihood you can use for inference.

**Sarah:** What about residuals and diagnostics?

**Kiffer:** Residuals in GLMMs are harder to read than in linear mixed models. Pearson residuals and deviance residuals are the standard, but their distributions are not as clean as raw residuals in linear models. The DHARMa package in R generates simulated residuals that have a uniform distribution under the correct model, which makes diagnostic plots much more interpretable. I recommend it strongly.

**Sarah:** What kinds of problems do those diagnostics surface?

**Kiffer:** Several common ones. Overdispersion, where the residuals show more variability than the model expects, which often means you should switch from Poisson to negative binomial or add a random effect. Zero inflation, where there are more zeros than the model predicts. Misspecified functional form for a continuous predictor, like a non-linear age effect modeled as linear. And outlier clusters where one clinic or one patient is dragging the fit. DHARMa surfaces all of those visually.

**Sarah:** And the lesson is firm that diagnostics aren't optional. Even if your model converges and your coefficients look sensible, you have to check the fit.

**Kiffer:** Especially in GLMMs, where the estimation is approximate and the likelihood surface can be difficult. Stable estimates do not mean a well-fitting model. You have to look.

**Sarah:** And the lesson also mentions related models that share the random-effects spirit. Random-effects survival models, latent-variable formulations, beta-binomial models for grouped binary data.

**Kiffer:** Right. The beta-binomial in particular is worth knowing about. If your data are grouped binary, like the proportion of animals testing positive in each herd, and you only have group-level predictors, the beta-binomial is a closed-form alternative to a logistic GLMM. No numerical integration. It directly estimates the ICC. It's simple and robust. Its limitation is that it doesn't naturally handle individual-level predictors or multiple hierarchical levels.

**Sarah:** So beta-binomial is a nice tool when you have aggregated data with no individual covariates.

**Kiffer:** Exactly. And for the data structures you'll see earlier in this series, that's not always the case, but it shows up enough that it's worth recognizing.

**Sarah:** Let's pull the takeaways together.

**Kiffer:** Three big ones.

**Sarah:** First. Generalized linear mixed models, GLMMs, extend the mixed-model framework to non-Normal outcomes. Binary outcomes get logistic GLMMs. Counts get Poisson or negative binomial GLMMs. Ordinal outcomes get cumulative-link GLMMs. The structural ideas from Lesson 10, random intercepts, random slopes, variance partitioning, all carry over. What changes is the link function, the likelihood, and crucially, the interpretation of effects.

**Kiffer:** Second. The subject-specific versus population-averaged distinction is fundamental in GLMMs. They are not the same number when the link is non-linear. Subject-specific is the within-cluster effect, what glmer gives you. Population-averaged is the across-population effect, what generalized estimating equations, GEE, gives you. Both are legitimate. Which one you want depends on whether your question is about a typical individual within a cluster or about the population as a whole. Often the right answer is to report both.

**Sarah:** Third. Estimation of GLMMs is harder than estimation of linear mixed models because the likelihood has no closed form. You will encounter Laplace approximation, adaptive Gauss-Hermite quadrature, penalized quasi-likelihood, and Markov chain Monte Carlo as different approximation strategies. Laplace, the default in glmer, is a reasonable starting point. Convergence problems are real, and the toolkit for them is to simplify the random-effects structure, switch optimizers, increase quadrature points, switch to Bayesian fitting, or fall back to GEE if a population-averaged answer is acceptable.

**Kiffer:** And one more thing I'd add as a piece of practical advice. When you fit a GLMM in a real analysis, document everything. The estimation method, the number of quadrature points, any convergence warnings, any sensitivity checks you did. GLMMs are subtle enough that reproducibility depends on those details. Don't just paste a coefficient table and move on. Show your reader that you tested the fit.

**Sarah:** And tie it back to where this fits in the course arc. We've now built mixed models for continuous outcomes in Lesson 10, and mixed models for discrete outcomes in Lesson 11. Together that covers most of the modeling toolbox you need for clustered data.

**Kiffer:** Right. And the last big piece of clustering we haven't tackled is repeated measures. Same person, measured many times. That's the most common form of clustering in applied epidemiology, and it deserves its own lesson, which is exactly where we're going next.

**Sarah:** Lesson 12. Repeated Measures Data.

**Kiffer:** Lesson 12 specializes the mixed-model framework to repeated measures, where the cluster is the individual and the units are the repeated observations on that individual over time. We'll see how within-subject correlation can be modeled either through random effects, the mixed-model approach, or through a structured residual covariance matrix, the marginal or GEE approach. We'll talk about handling missing data, unequal time spacing between visits, and how to test for change over time. It pulls together everything we've built in the past few lessons.

**Sarah:** And Lesson 12 is the final lesson of the course, so it's also where the course wraps up. A capstone moment for the whole modeling arc.

**Kiffer:** Exactly. So Lesson 12 is the natural next step, and the closer. Take care, everyone.

**Sarah:** See you there.