# Lesson 2 — Data Cleaning & Descriptive Analyses (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5586 words • ~30.0 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 2, Data Cleaning and Descriptive Analyses. Lesson 1 was about the structured workflow. Lesson 2 is the lesson where the workflow finally gets executed on actual data.

**Sarah:** And before we get to any of the content, I want to set expectations about how this lesson feels in practice. Because I think when students hear the phrase data cleaning, they imagine a fairly mechanical process. You open the file, you fix some typos, you move on. That is not what this is.

**Kiffer:** Yeah. The honest version, and the figure the lesson gives, is that working analysts spend somewhere between sixty and eighty percent of their time on data preparation, cleaning, and exploration before any modeling happens. That's not because they're slow. That's because the data, when they arrive, are almost never in the shape you need them to be in.

**Sarah:** Sixty to eighty percent. So if you're budgeting a study, modeling is the visible tip and cleaning is the rest of the iceberg.

**Kiffer:** Exactly. And there's a related principle in the lesson that I want to lift out, because it justifies the whole exercise. The quality of your analysis can never exceed the quality of your data. You can run the most sophisticated regression in the world, with the best causal diagram, and if the underlying values are wrong, the answer is wrong. You don't recover from bad data with fancier statistics.

**Sarah:** Garbage in, garbage out, but with confidence intervals.

**Kiffer:** Pretty much. So the lesson is structured to take that seriously. There are three big content sections. Section one is data cleaning, which is the detection and repair of errors and the handling of missing values. Section two is descriptive statistics for individual variables, which is how you summarize what's there once it's clean. And section three is exploring relationships between variables, which is where you start looking at how things move together.

**Sarah:** And throughout the lesson there's a running example, the phaa_survey dataset, a simulated public-health survey of about eight hundred adults. We won't drag listeners through every R command, but I'll mention the dataset when it's useful. Let's start with section one.

**Kiffer:** Section one. Data cleaning.

**Sarah:** The lesson opens by reminding you that data travel through a pipeline before they ever reach you. Collection, then entry, then storage, then analysis. Errors can enter at every single one of those stages.

**Kiffer:** And it's worth slowing down on that, because students sometimes assume errors are mostly typos. They are not. At the collection stage, you can have poorly worded survey items, interviewer bias, instruments that drift out of calibration, participants who misunderstand a question. At the entry stage, transcription mistakes when paper forms get keyed in. At the storage stage, version control problems, format incompatibilities, the file someone emailed you last week is not the file someone emailed you this week. And at the analysis stage, miscoded variables, inappropriate transformations, software bugs.

**Sarah:** So when we say data cleaning, we are not just talking about fixing one kind of error. We are talking about a whole class of problems with very different sources.

**Kiffer:** Right. And the first concrete cleaning technique the lesson introduces is the range check. The idea is simple. For each variable, define a plausible biological or measurement range, and flag any value outside it.

**Sarah:** And the lesson gives examples that are deliberately a little extreme so they stick. A height of two hundred and fifty centimeters. A weight of five kilograms for an adult. An age of two hundred.

**Kiffer:** Yeah, and those examples are useful because every one of them is, for practical purposes, impossible in an adult. So they're easy to flag. The harder ones are unusual but plausible. A systolic blood pressure of two hundred and forty. That's extreme. It might be a real measurement on a patient in a hypertensive emergency, or it might be a keying error, like a mistyped digit. You can't tell without going back to the source.

**Sarah:** And the lesson distinguishes hard limits from soft limits. Hard limits are the biologically impossible values, the ones you flag as definite errors. Soft limits are the unusual but possible values that you flag for review.
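
A minimal Python sketch of the hard-limit and soft-limit idea, applied to systolic blood pressure. The upper hard limit of 300 mmHg echoes the lesson's blood pressure example; the other thresholds, the `range_check` name, and the sample values are illustrative assumptions, and the course itself works in R.

```python
# Illustrative limits for adult systolic blood pressure in mmHg (assumptions).
HARD_LIMITS = (30, 300)   # outside this range: treat as a definite error
SOFT_LIMITS = (80, 200)   # outside this range: unusual, flag for review

def range_check(value, hard=HARD_LIMITS, soft=SOFT_LIMITS):
    """Classify one value as 'missing', 'error', 'review', or 'ok'."""
    if value is None:
        return "missing"
    if value < hard[0] or value > hard[1]:
        return "error"
    if value < soft[0] or value > soft[1]:
        return "review"
    return "ok"

systolic = [118, 240, 1200, None, 95]
flags = [range_check(v) for v in systolic]
for value, flag in zip(systolic, flags):
    print(value, flag)
```

The check only labels values. Deciding what to do with an "error" or "review" flag still means going back to the source record.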

**Kiffer:** Right. And the lesson has a beautiful applied example here that I want to read out loud, because it captures what good cleaning looks like. A cohort study of cardiovascular disease collected blood pressure on twenty-four hundred participants. During cleaning, fifteen systolic values exceeded three hundred millimeters of mercury, which is biologically implausible. The analyst traced twelve of those to a decimal-point shift, where one thousand two hundred had been entered instead of one hundred and twenty point zero. Those twelve were corrected from the original paper forms. Three could not be verified, and were set to missing with a note that said original form illegible, value implausible. And every decision was logged with a date stamp.

**Sarah:** And that last sentence is the one I want students to internalize. Every decision was logged. The audit trail is not optional. We'll come back to that.

**Kiffer:** The second cleaning technique is the logical check, which the lesson sometimes calls a consistency check. Range checks look at one variable at a time. Logical checks look across variables and ask whether the values are mutually compatible.

**Sarah:** Give us some concrete examples so listeners can picture what to look for.

**Kiffer:** Sure. If someone is recorded as a never-smoker, they should not have a non-zero pack-year value. Pack-years is a smoking dose measure that's only defined for people who have smoked. So a never-smoker with five pack-years is logically impossible. If someone is recorded as not having any children, they should not have a parity greater than zero. Parity is the number of pregnancies that reached viable gestational age. If you've got zero children but a parity of three, something is wrong somewhere.

**Sarah:** And the lesson gives a few more from the medical record world. A discharge date that's earlier than an admission date. A date of diagnosis that's earlier than a date of birth, which actually appears in one of the lesson's knowledge checks. Ages that don't match birth dates. Skip patterns in questionnaires that aren't being respected, where a participant answered a follow-up question they should have skipped.

**Kiffer:** And the practical move is, run cross-tabulations between related variables, and inspect the impossible cells. Smoking status by pack-years. Parental status by parity. Admission date and discharge date in a single comparison. Most logical errors will jump out within minutes if you actually look.
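
A short Python sketch of two of the logical checks mentioned here, the never-smoker with pack-years and the discharge before admission. The record layout, field names, and values are hypothetical, chosen only to illustrate the cross-variable comparison.

```python
from datetime import date

# Hypothetical records; field names are assumptions for illustration.
records = [
    {"id": 1, "smoking": "never", "pack_years": 0.0,
     "admitted": date(2024, 3, 1), "discharged": date(2024, 3, 4)},
    {"id": 2, "smoking": "never", "pack_years": 5.0,
     "admitted": date(2024, 3, 2), "discharged": date(2024, 3, 6)},
    {"id": 3, "smoking": "current", "pack_years": 12.5,
     "admitted": date(2024, 3, 9), "discharged": date(2024, 3, 2)},
]

def logical_flags(rec):
    """Return a list of cross-variable inconsistencies found in one record."""
    problems = []
    if rec["smoking"] == "never" and rec["pack_years"] > 0:
        problems.append("never-smoker with non-zero pack-years")
    if rec["discharged"] < rec["admitted"]:
        problems.append("discharge date before admission date")
    return problems

# Keep only the records that fail at least one check.
flagged = {}
for rec in records:
    problems = logical_flags(rec)
    if problems:
        flagged[rec["id"]] = problems

print(flagged)
```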

**Sarah:** And before we move from logic checks to missing data, the lesson sneaks in a related topic that fits here. Outlier detection. An outlier is an observation that lies far from the bulk of the data. It might be a real extreme value, it might be a data error, or it might be a person from a slightly different population. The cleaning question is, do you flag it, and if so, with what rule.

**Kiffer:** Three rules show up. The interquartile-range rule, where you flag values below the first quartile minus one point five times the interquartile range, or above the third quartile plus one point five times the interquartile range. That's actually the same rule that draws the whiskers on a box plot. The z-score rule, where you flag values whose standardized distance from the mean is greater than three. And visual inspection, where you literally look at a histogram or a box plot and see what jumps out.

**Sarah:** And the interquartile-range rule is generally preferred for skewed data, because it uses percentiles, which are robust. The z-score rule uses the mean and standard deviation, which are themselves pulled around by extreme values. So extreme values inflate the standard deviation, which shrinks their own z-scores and can hide them from the rule. That's called the masking effect.
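
To make the contrast concrete, here is a small Python sketch that applies both rules to made-up data containing three extreme values. It uses `statistics.quantiles` with its default settings, which is one of several quartile conventions, so the fences may differ slightly from what R reports.

```python
import statistics

# Made-up measurements: 17 typical values plus three extreme ones.
values = [10, 11, 11, 12, 12, 12, 13, 13, 13, 13,
          14, 14, 14, 15, 15, 16, 16, 88, 92, 95]

# IQR rule: fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flags = [v for v in values if v < lower_fence or v > upper_fence]

# z-score rule: flag values more than 3 standard deviations from the mean.
mean, sd = statistics.mean(values), statistics.stdev(values)
z_flags = [v for v in values if abs(v - mean) / sd > 3]

print("IQR rule flags:", iqr_flags)
print("z-score rule flags:", z_flags)
```

On this data the IQR rule flags all three extreme values, while the z-score rule flags none, because the extremes pull the mean up and inflate the standard deviation enough to keep every z-score under three.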

**Kiffer:** And the practical rule of thumb the lesson is sharp on. Never delete an outlier just because it is extreme. Investigate it. Decide whether it's a data error you can fix, a real but unusual value you should keep, or evidence that your sample is mixed. Document the decision in the audit trail.

**Sarah:** The third piece is missing data, and the lesson gives this its own substantial treatment, because missing data are everywhere in epidemiology and the analytic implications are big.

**Kiffer:** First step is just quantification. How much is missing, by variable, and by participant. The variable view tells you which fields are problematic. The participant view tells you whether the missingness is concentrated in particular people, which often means a whole questionnaire section was skipped or a follow-up visit was lost.

**Sarah:** And then look for patterns. Are some variables missing more often together? That can tell you something about the data-generating process. If items five through ten of a questionnaire are missing as a block for the same fifty participants, those fifty people probably reached a logical break and stopped, or the page got skipped.
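
A minimal Python sketch of the three views described here: missingness by variable, by participant, and by pattern. The rows and variable names are invented for illustration, with `None` standing in for a missing value.

```python
from collections import Counter

# Hypothetical survey rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000, "q5": 2, "q6": 3},
    {"id": 2, "age": 61, "income": None, "q5": None, "q6": None},
    {"id": 3, "age": 47, "income": 38000, "q5": None, "q6": None},
    {"id": 4, "age": None, "income": 41000, "q5": 1, "q6": 4},
]
variables = ["age", "income", "q5", "q6"]

# Variable view: proportion missing in each column.
missing_by_variable = {
    v: sum(r[v] is None for r in rows) / len(rows) for v in variables
}

# Participant view: number of missing fields per person.
missing_by_participant = {
    r["id"]: sum(r[v] is None for v in variables) for r in rows
}

# Pattern view: which sets of variables go missing together.
patterns = Counter(
    tuple(v for v in variables if r[v] is None) for r in rows
)

print(missing_by_variable)
print(missing_by_participant)
print(patterns.most_common())
```

A pattern such as `('q5', 'q6')` appearing repeatedly is the block-missingness signal described above.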

**Kiffer:** Right. And once you've quantified missingness, you have to characterize the mechanism, because the mechanism dictates which analytic strategies are valid. The lesson references the missing-data framework from an earlier lesson, which itself comes from a 1976 paper by Donald Rubin. Three mechanisms.

**Sarah:** Walk through them slowly. These names are confusing because they sound similar but mean different things.

**Kiffer:** Yeah, the names are notoriously confusing. The first one is missing completely at random, abbreviated MCAR. Under missing completely at random, the probability that a value is missing has nothing to do with anything, observed or unobserved. It's pure chance. The classic example is a lab tech who randomly drops some test tubes. Whether your value is missing is unrelated to your other characteristics, and unrelated to what your value would have been.

**Sarah:** And under missing completely at random, complete-case analysis, where you just drop anyone with missing data, is unbiased. You lose precision because the sample is smaller, but the estimate is still pointing at the right place.

**Kiffer:** Second mechanism. Missing at random, abbreviated MAR. The name is misleading, because missing at random does not mean random. It means the probability of missingness depends on observed variables, but not on the missing value itself, once you condition on what you've observed. The example the lesson gives is, older participants are more likely to have missing income data, but among people of the same age, missingness is unrelated to actual income.

**Sarah:** So if you have age, you can model the missingness, and you can recover unbiased estimates. That's the regime where multiple imputation works.

**Kiffer:** Right. Third mechanism. Missing not at random, abbreviated MNAR. Here the probability of missingness depends on the missing value itself, even after conditioning on the observed variables. The classic example is high earners refusing to disclose their income. The reason the value is missing is the value. And missing not at random is the hard case, because the observed data, by definition, do not contain the information you need to fix the bias.

**Sarah:** And the warning the lesson hammers home is that the missing data mechanism is an assumption. You can never verify it from the observed data alone. So you should always conduct sensitivity analyses, where you assume different mechanisms and see how the conclusions shift.

**Kiffer:** And that brings us to the cleaning approaches themselves. Three families. The simplest is listwise deletion, also called complete-case analysis. You drop any participant who has any missing data on any variable in your model. It's easy to implement. The cost is that it loses information, sometimes a lot of information, and it can introduce bias unless data are missing completely at random.

**Sarah:** And in a multivariable model with many variables, even small amounts of missingness on individual variables can stack. If you have ten variables, each with five percent missing independently, you can lose forty percent of your sample to listwise deletion.
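
The forty percent figure follows from multiplying the per-variable probabilities of being observed, under the stated assumption that missingness is independent across the ten variables:

$$
P(\text{complete case}) = (1 - 0.05)^{10} = 0.95^{10} \approx 0.599,
\qquad
P(\text{dropped}) = 1 - 0.95^{10} \approx 0.401 .
$$

If missingness clusters in the same participants, the loss is smaller than this, which is one reason to check the participant view before assuming the worst.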

**Kiffer:** That's right. The second family is single imputation. You fill in missing values with a single best guess. Mean imputation for a continuous variable, replacing missing values with the variable's mean. Median imputation for skewed continuous variables. Mode imputation for categorical variables.

**Sarah:** And single imputation is widely used, and the lesson is critical of it for two reasons. First, it underestimates variance, because every imputed value is the same and the variation those participants would have contributed gets removed. Second, it distorts relationships between variables, because the imputed values carry no information about how the missing variable relates to the others.

**Kiffer:** And it treats the imputed value as if it were known with certainty, which it isn't. So your standard errors come out too small, and your confidence intervals come out too narrow.
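
A quick Python illustration of the variance problem with mean imputation. The observed values and the number of missing entries are made up; the point is only that filling gaps with the mean leaves the mean unchanged while shrinking the standard deviation.

```python
import statistics

observed = [42.0, 55.0, 61.0, 48.0, 70.0, 39.0]   # made-up observed values
n_missing = 4                                      # made-up count of gaps

# Mean imputation: every missing value is replaced by the observed mean.
mean_observed = statistics.mean(observed)
imputed = observed + [mean_observed] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(f"mean before: {mean_observed:.2f}, after: {statistics.mean(imputed):.2f}")
print(f"SD before:   {sd_observed:.2f}, after: {sd_imputed:.2f}")
```

The imputed dataset also looks larger than it really is, which is where the too-small standard errors come from.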

**Sarah:** The third family is multiple imputation, and the lesson considers it the gold standard for handling missing data when the proportion of missingness is non-trivial.

**Kiffer:** The idea of multiple imputation is, instead of filling in a single best guess, you create several plausible imputed datasets, typically five to twenty, where each imputed value is drawn from a model of the variable conditional on everything else. You analyze each completed dataset separately, and then you pool the results using a procedure called Rubin's rules. Rubin's rules combine the within-dataset variance and the between-dataset variance, so the final standard error correctly reflects the uncertainty introduced by the missing data.

**Sarah:** And under the missing-at-random assumption, multiple imputation gives you valid inference. Unbiased estimates and standard errors that are honest about what you don't know.
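
Here is a minimal Python sketch of the pooling step for a single coefficient, using the standard form of Rubin's rules: the pooled estimate is the average across imputations, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The `pool_rubin` name and the five estimates and standard errors are hypothetical; in practice a multiple-imputation package would handle this, along with the degrees of freedom that this sketch leaves out.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean(variances)          # average squared standard error
    between = statistics.variance(estimates)     # variance of the m estimates
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Hypothetical results from m = 5 imputed datasets.
estimates = [1.9, 2.1, 2.0, 2.2, 1.8]
std_errors = [0.50, 0.52, 0.49, 0.51, 0.50]

pooled, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error comes out larger than the typical single-dataset standard error, because the between-imputation term adds the uncertainty due to the missing values.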

**Kiffer:** And the lesson's recommendation is, fit your primary analysis with multiple imputation, then run sensitivity analyses with different imputation models, including some that explore what happens if the data are actually missing not at random. If your conclusions hold up across the sensitivity analyses, you've got something defensible.

**Sarah:** Last piece of section one is the audit trail. We mentioned it earlier with the blood pressure example, but it's worth saying directly. Every cleaning decision should be recorded. What was changed, why, by whom, and when. The standard the lesson sets is that another analyst should be able to replicate the cleaning process from your raw data using your documentation alone.

**Kiffer:** And the practical version of that, in the R workflow this course uses, is, never overwrite the raw file. Read it in, apply your transformations in a script, save the cleaned version under a new name. The raw file stays untouched. The script is the audit trail.
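
The same pattern, sketched in Python rather than the course's R: read the raw file, apply changes in code, log each change, and write the result under a new name. File names, column names, and the 300 mmHg threshold are assumptions for illustration, and a temporary directory stands in for a real project folder.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "survey_raw.csv"      # hypothetical raw file, never edited
clean_path = workdir / "survey_clean.csv"  # cleaned copy under a new name
raw_path.write_text("id,sbp\n1,118\n2,1200\n3,131\n")  # stand-in raw data

with raw_path.open(newline="") as f:
    rows = list(csv.DictReader(f))

audit_log = []
for row in rows:
    if float(row["sbp"]) > 300:  # implausible systolic value
        audit_log.append({
            "id": row["id"], "variable": "sbp",
            "old": row["sbp"], "new": "",
            "reason": "value implausible, source unavailable, set to missing",
            "date": date.today().isoformat(),
        })
        row["sbp"] = ""

with clean_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "sbp"])
    writer.writeheader()
    writer.writerows(rows)

print(audit_log)
```

Rerunning the script against the untouched raw file reproduces the cleaned file and the log, which is the replication standard described above.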

**Sarah:** And one more piece that lives between cleaning and descriptive analysis is data transformation. When a continuous variable is heavily right-skewed, like income or biomarker concentrations or hospital length of stay, the lesson recommends a log transformation. The natural log compresses the long right tail, so the distribution becomes more symmetric and parametric methods behave better.

**Kiffer:** And there's a specific gotcha with logs. The natural log of zero is undefined. So if your variable contains zeros, you add a small constant first, typically one, and take the natural log of x plus one. That preserves the zeros without breaking the transformation. The square root transformation is the gentler cousin and is often used for count data. And the inverse, one over x, is for very strongly right-skewed variables like reaction times. Pick the transformation that brings the distribution closest to symmetric, and remember that all subsequent interpretations have to account for the transformation.
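
A small Python sketch of the log of x plus one transformation on a variable that contains zeros. The length-of-stay values are made up. `math.log1p` computes ln(1 + x) and `math.expm1` reverses it, which is the step needed to report results back on the original scale.

```python
import math

length_of_stay = [0, 1, 2, 3, 5, 8, 40]   # made-up days in hospital, includes a zero

# ln(1 + x): zeros map to zero instead of raising a math domain error.
log_transformed = [math.log1p(x) for x in length_of_stay]

# Back-transform with exp(y) - 1 to return to the original units.
back_transformed = [math.expm1(y) for y in log_transformed]

for raw, logged in zip(length_of_stay, log_transformed):
    print(f"{raw:>3} -> {logged:.3f}")
```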

**Sarah:** Section two. Descriptive statistics for individual variables.

**Kiffer:** Once the data are clean, the first thing you do is describe them. And the descriptive statistics you choose depend on what kind of variable you're looking at. For continuous variables, you want a measure of central tendency and a measure of spread, and ideally a sense of the shape.

**Sarah:** Three measures of central tendency. The mean, the median, and the mode.

**Kiffer:** The arithmetic mean is the sum of all values divided by the number of observations. It uses every data point. It's the basis of most parametric statistics. And it works well for symmetric, approximately normal distributions.

**Sarah:** The median is the middle value when you sort the observations. If you have an even number of observations, it's the average of the two middle values. The median is robust to outliers. Adding or removing an extreme value barely moves the median, whereas it can yank the mean around significantly.

**Kiffer:** And the mode is the most frequently occurring value. The mode is the most useful for categorical data, where the mean and median don't really make sense. It's also occasionally useful in continuous data for spotting digit preference, where participants round to nice numbers, or for identifying bimodal distributions where there are really two peaks.

**Sarah:** And the rule of thumb the lesson lands on is, use the mean for symmetric distributions, the median for skewed distributions or when outliers are present, and the mode for categorical data.

**Kiffer:** Right. And there's a related diagnostic that's useful. If the mean and median are very different, that itself tells you the distribution is skewed. Mean greater than median means right skew, with a long right tail. Mean less than median means left skew, with a long left tail.

**Sarah:** Measures of spread. The standard deviation is the most commonly reported. Roughly, it's the square root of the average squared deviation from the mean, with the sample version dividing by n minus one rather than n. Same units as the original variable, which makes it easy to interpret.

**Kiffer:** The range is just the maximum minus the minimum. It's intuitive but uses only two data points and is extremely sensitive to outliers.

**Sarah:** And the interquartile range, abbreviated IQR, is the difference between the seventy-fifth and the twenty-fifth percentile. So it covers the middle fifty percent of the data. Like the median, it's robust to outliers, because percentiles are not pulled around by extreme values.

**Kiffer:** And the reporting convention the lesson sets up, which you'll see in essentially every epidemiology paper, is, for normally distributed variables, report mean plus or minus standard deviation. For skewed distributions, report median with the interquartile range.
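
A minimal Python sketch of the two reporting conventions, using the standard library's `statistics` module. The `summarize` helper and both samples are invented for illustration, and the quartiles use `statistics.quantiles` defaults, which may differ slightly from R's default quantile method.

```python
import statistics

def summarize(values):
    """Return the summaries used for symmetric and for skewed variables."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

symmetric = [4.8, 5.0, 5.1, 5.2, 4.9, 5.0, 5.3, 4.7]   # made-up values
skewed = [1, 1, 2, 2, 3, 3, 4, 30]                     # made-up, long right tail

sym, skw = summarize(symmetric), summarize(skewed)
print(f"symmetric: mean {sym['mean']:.2f} (SD {sym['sd']:.2f})")
print(f"skewed:    median {skw['median']} (IQR {skw['q1']} to {skw['q3']})")
```

In the skewed sample the mean sits well above the median, which is the right-skew diagnostic mentioned earlier.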

**Sarah:** Visualizations for continuous variables. Four key plot types.

**Kiffer:** Histograms. A histogram divides the variable's range into bins and counts the observations per bin. The shape of the histogram tells you whether the distribution is symmetric or skewed, unimodal or bimodal, and whether there are gaps that might indicate erroneous values.

**Sarah:** Box plots. A box plot draws a box from the first quartile to the third quartile, with a line at the median, and whiskers extending out to the most extreme observations that still fall within one point five times the interquartile range of the box. Individual outliers beyond the whiskers are drawn as separate points. Box plots are particularly good for comparing distributions across groups.

**Kiffer:** Density plots. A density plot is essentially a smoothed histogram. Instead of bins with discrete heights, you get a continuous curve that estimates the underlying density of the variable. They're nice for comparing the shape of two or three distributions on the same axis.

**Sarah:** And the fourth one is the quantile-quantile plot, abbreviated as the Q-Q plot. A quantile-quantile plot puts the quantiles of your observed data on one axis and the quantiles of a theoretical distribution, usually the normal distribution, on the other axis. If your data follow that theoretical distribution, the points fall on the diagonal line. Systematic departures from the diagonal indicate non-normality.

**Kiffer:** And the quantile-quantile plot is essentially the visual answer to the question, is this variable normal enough that I can use parametric methods. The lesson notes that there are also formal tests for normality, like the Shapiro-Wilk test for samples under five thousand, but with very large samples those tests will reject normality even for trivial departures. So visual assessment with the quantile-quantile plot should always accompany the formal test.
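
For listeners who want to see what goes into a quantile-quantile plot, here is a Python sketch that computes the pairs of points without drawing them. The `qq_points` helper, the plotting positions (i + 0.5) / n, and the sample are assumptions for illustration; plotting libraries and R's `qqnorm` use their own conventions.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    observed = sorted(values)
    n = len(observed)
    reference = NormalDist(mean(observed), stdev(observed))
    # Plotting positions (i + 0.5) / n are one common convention.
    return [(reference.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(observed)]

sample = [4.7, 4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.3, 9.6]   # made-up, one high value
points = qq_points(sample)
for theoretical, actual in points:
    print(f"{theoretical:6.2f}  {actual:6.2f}")
```

Points where the observed value sits far from the theoretical one, like the largest value here, are the departures from the diagonal that the plot makes visible.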

**Sarah:** Now categorical variables. The descriptive moves are simpler. A frequency table, which is just the count and percentage in each category. A bar chart, which shows the same information visually. And a mosaic plot when you have two categorical variables and you want to see the joint distribution.

**Kiffer:** And the lesson is sharp on a stylistic point that I want to defend, because students sometimes push back on it. Pie charts are usually a bad choice. They make it hard for readers to compare proportions, because human visual perception is much better at comparing the lengths of bars than the angles of pie slices.

**Sarah:** And the failure case is the pie chart with five or six similar-sized slices. You cannot tell at a glance which is largest. A bar chart with the same data takes one glance.

**Kiffer:** Almost always, bar charts are better. The exception is when you have two or three categories with very different sizes, and you specifically want to communicate that they sum to the whole. Even then, a stacked bar chart usually works as well or better.

**Sarah:** Section three. Exploring relationships between variables.

**Kiffer:** This is where descriptive analysis starts getting more interesting, because you're moving from one-variable summaries to two-variable explorations. And the choice of tools depends on what kinds of variables you're pairing up. There are three combinations. Two continuous, two categorical, and one of each.

**Sarah:** Two continuous variables. The visualization is the scatter plot. Each observation is a point with one variable on the x-axis and the other on the y-axis. The summary statistic is the correlation coefficient.

**Kiffer:** Two flavors of correlation, and they go with different shapes of relationship. Pearson correlation measures the strength of the linear association between two variables. The coefficient ranges from negative one, which is a perfect negative linear relationship, through zero, which is no linear relationship, to positive one, which is a perfect positive linear relationship.

**Sarah:** And Spearman correlation is the rank-based alternative. Instead of using the raw values, Spearman ranks each variable from smallest to largest and computes Pearson correlation on the ranks. That makes Spearman robust to outliers, and it picks up monotonic but non-linear relationships, where the variables move in the same direction together but not in a straight line.

**Kiffer:** So if a scatter plot shows a clear curve where both variables increase together but the relationship bends, Pearson will underestimate the strength of the association and Spearman will catch it. Use Pearson when the relationship looks linear. Use Spearman when it's monotonic but curved, or when you have outliers.
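
A self-contained Python sketch of that contrast, with Pearson and Spearman written out by hand so the rank step is visible. The data are synthetic: y grows exponentially with x, so the relationship is perfectly monotonic but strongly curved. The function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [math.exp(v) for v in x]   # monotonic but far from linear

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Spearman comes out at one because the ranks line up perfectly, while Pearson is noticeably lower because a straight line fits the curve poorly.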

**Sarah:** And a basic interpretation rule. A correlation around zero point one is weak. Around zero point three is moderate. Around zero point five or higher is strong. Those are rough thresholds, and they vary by field, but they're a starting point.

**Kiffer:** And the warning that we say in every Office Hours episode, but that we have to say again. Correlation is not causation. A scatter plot with a strong correlation tells you the variables are associated. It doesn't tell you which one causes the other, or whether either causes the other, or whether some third variable causes both. That question gets answered by the causal diagram and the regression.

**Sarah:** Two categorical variables. The basic move is the cross-tabulation, which is just a contingency table showing the joint distribution of the two variables. For a two-by-two table you'd see four cells. For a five-by-three table you'd see fifteen.

**Kiffer:** And the formal test of association is the chi-squared test of independence. The chi-squared test compares the observed cell counts to the cell counts you'd expect if the two variables were independent of each other. A small p-value means the observed pattern would be unlikely if the variables were independent, which is evidence that they are associated.

**Sarah:** And the chi-squared test has a sample-size assumption. The expected counts in each cell should be at least five, roughly. When you have small samples or sparse cells, the chi-squared approximation breaks down.

**Kiffer:** And the alternative for small samples is Fisher's exact test, which computes the exact probability of seeing the observed table under independence, without relying on a large-sample approximation. It's the right tool when you've got a two-by-two with very small counts.
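
A Python sketch of both ideas on one small two-by-two table: the expected counts and chi-squared statistic, and a two-sided Fisher's exact p-value computed from the hypergeometric distribution. The table is made up, the function names are illustrative, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is a common convention; a statistics package would normally do this for you.

```python
from math import comb

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):  # hypergeometric probability that the top-left cell equals k
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_observed = prob(a)
    low, high = max(0, c1 - r2), min(r1, c1)
    return sum(prob(k) for k in range(low, high + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

table = [[3, 1], [1, 3]]   # made-up sparse 2x2 table
expected = expected_counts(table)
print("expected counts:", expected)
print("any expected count below 5:", any(e < 5 for row in expected for e in row))
print("chi-squared statistic:", chi_squared_statistic(table))
print("Fisher exact p-value:", round(fisher_exact_two_sided(table), 4))
```

Every expected count in this table is two, well under the rough threshold of five, so this is the kind of table where the exact test is the safer choice.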

**Sarah:** And mosaic plots are the visualization that pairs with cross-tabulations. The area of each tile is proportional to the cell count, and the tiles can be shaded to highlight cells that contribute most to the chi-squared statistic. They're a nice way to see association in a categorical pair.

**Kiffer:** Third combination. One continuous variable and one categorical variable. The natural visualization is the stratified box plot, where you draw a separate box plot for each level of the categorical variable, on the same axis.

**Sarah:** And the descriptive summary is the mean or median of the continuous variable within each category. So if you're looking at systolic blood pressure by smoking status, you'd report the mean systolic in current smokers, in former smokers, and in never-smokers.

**Kiffer:** And the formal tests come in two flavors depending on how many groups you've got. For a two-group comparison, the t-test. The t-test asks whether the means of the two groups differ by more than chance. There's an assumption of approximately normal distributions within groups, which the quantile-quantile plot from earlier helps you assess.

**Sarah:** And for more than two groups, ANOVA, which is short for analysis of variance. ANOVA is a generalization of the t-test to multiple groups. It tests whether the means differ across any of the groups, by comparing the variance between group means to the variance within groups.

**Kiffer:** If the variance between groups is large relative to the variance within groups, the means are probably really different. If they're similar, the differences are probably noise. And if the overall ANOVA is significant, you typically follow up with post-hoc tests to figure out which specific pairs of groups differ.
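
The between-versus-within comparison can be written out directly. This Python sketch computes the one-way ANOVA F statistic by hand for three made-up groups of systolic readings; the function name and the data are illustrative, and turning F into a p-value (which needs the F distribution) is left to a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up systolic readings (mmHg) by smoking status.
never = [121, 122, 123]
former = [122, 123, 124]
current = [125, 126, 127]

f_statistic = one_way_anova_f([never, former, current])
print(f"F = {f_statistic:.2f}")
```

With these numbers the between-group mean square is thirteen times the within-group mean square, so F is thirteen.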

**Sarah:** And there's a non-parametric counterpart for skewed distributions, the Kruskal-Wallis test, which does the same job as ANOVA but on the ranks rather than the raw values.

**Kiffer:** Now there's a warning the lesson is sharp on, and that we have to honor before we close section three. The garden of forking paths.

**Sarah:** We covered this earlier, but it's worth saying again. The garden of forking paths is the idea that if you make many small analytic decisions in response to what you see in the data, the chances that you eventually find a significant result by chance are much higher than the nominal p-value suggests. Each fork in the road, each comparison you decide to run because something looked interesting, multiplies the implicit number of tests.

**Kiffer:** And exploratory data analysis, which is what we're doing in this lesson, is exactly the kind of activity where the garden of forking paths is the deepest. You look at a scatter plot, it looks interesting, you compute a correlation, that's significant, so you stratify by sex, that's also significant in one of the strata, you write that up. Without realizing it, you've run twenty implicit tests.
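
As a rough worked figure, assume the twenty implicit tests are independent, each is judged at a 0.05 threshold, and every null hypothesis is true. Then the chance of at least one false positive is

$$
P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64 .
$$

Real forking-path analyses are rarely independent, so this is an illustration of scale rather than an exact error rate.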

**Sarah:** So the lesson's stance is, exploratory analysis is for orientation. It is not for hypothesis testing. You use exploratory analysis to understand your data, to spot problems, to inform the model specification, to check the causal diagram. You do not use it to identify a significant pattern and then write that up as if it were the pre-specified analysis.

**Kiffer:** Save the formal tests for the analyses that were pre-specified in your protocol, informed by your causal diagram. The descriptive analyses in this lesson are setting up the inferential analyses in the rest of the course, starting with Lesson three on linear regression. They are not the inferential analyses themselves.

**Sarah:** And practically, what that looks like is, your descriptive table goes in the paper, and the descriptive plots go in your supplementary materials, but the p-values that anchor the conclusions all come from the pre-specified models. Not from cross-tabs that surfaced an interesting pattern in week two of cleaning.

**Kiffer:** Before we close section three, there's one more piece of the lesson that bridges into the regression work coming up, and that's building scales from multi-item questionnaires.

**Sarah:** Good catch. Walk us through it.

**Kiffer:** Right. Public-health surveys often ask several items that together measure a latent construct. Depression, anxiety, social support, perceived stress. The lesson uses the seven depression items and the five anxiety items in the phaa_survey dataset as the worked example. Before you can use those items as a single scale in a regression, you have to answer two questions. Do the items hang together, and how many underlying factors are they actually measuring.

**Sarah:** The first question is internal consistency, and the standard tool is Cronbach's alpha. Alpha is a number that usually falls between zero and one and captures how strongly the items correlate with each other. A raw alpha at or above zero point seven is acceptable. Zero point eight or higher is good. If alpha is low, you've got items that are not measuring the same thing, and summing them into a scale will not work.
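
A minimal Python sketch of raw Cronbach's alpha using its usual formula: k / (k - 1) times one minus the ratio of summed item variances to the variance of the total score. The three items and five respondents are made up and are not the phaa_survey items; the `cronbach_alpha` name is illustrative.

```python
import statistics

def cronbach_alpha(items):
    """Raw alpha; `items` is a list of per-item score lists, one score per respondent."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(scores) for scores in items)
    totals = [sum(person) for person in zip(*items)]   # total score per respondent
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

# Made-up responses: three items answered by five respondents.
item_1 = [1, 2, 3, 4, 5]
item_2 = [2, 2, 3, 5, 5]
item_3 = [1, 3, 3, 4, 4]

alpha = cronbach_alpha([item_1, item_2, item_3])
print(f"Cronbach's alpha = {alpha:.3f}")
```

These invented items track each other closely, so alpha lands above zero point nine. Items that move independently would pull it toward zero.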

**Kiffer:** The second question is dimensionality, which is answered with exploratory factor analysis. The practical move is, you run a parallel analysis alongside a scree plot, and you keep the factors whose eigenvalues sit above the parallel-analysis line. For the depression items in this dataset, that's one factor. For the depression and anxiety items combined, it's two factors. And in the two-factor solution, the depression items load on one factor and the anxiety items load on the other, which is what you want to see.

**Sarah:** And the rule of thumb for loadings is, anything above zero point four contributes meaningfully to the factor. If an item loads below that, you'd consider dropping it or treating it as a separate construct. Once both checks pass, you sum or average the items into a derived scale variable, and you carry that forward as a single predictor or covariate in the regression models in Lessons three and beyond. That's the bridge from descriptive cleaning into modeling.

**Kiffer:** Alright. Let's pull the takeaways.

**Sarah:** Takeaway one. Cleaning is most of the work. Sixty to eighty percent of analyst time, before any modeling happens. The quality of your analysis cannot exceed the quality of your data.

**Kiffer:** Takeaway two. Range checks for individual variables. Define hard limits where values are biologically impossible, and soft limits where values are unusual but possible. Investigate before you delete. The decimal-point shift example, where one thousand two hundred should have been one hundred and twenty point zero, is the canonical case where a flagged value led you back to a fixable error.
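
One way to sketch the hard-limit and soft-limit idea in Python. The variable (systolic blood pressure in mmHg), the cut-offs, and the function name `range_check` are illustrative assumptions, not the lesson's actual limits.

```python
def range_check(value, hard, soft):
    """Classify a value as 'impossible', 'unusual', or 'ok'.

    `hard` and `soft` are (low, high) tuples.
    """
    if not hard[0] <= value <= hard[1]:
        return "impossible"
    if not soft[0] <= value <= soft[1]:
        return "unusual"
    return "ok"

# Hypothetical limits, chosen only for this example.
HARD = (40, 300)
SOFT = (80, 200)

readings = [118.0, 1200.0, 210.0, 120.0]
flags = [range_check(v, HARD, SOFT) for v in readings]
print(list(zip(readings, flags)))
```

The function only flags. Deciding whether a flagged value is a decimal-point shift, a unit mix-up, or a genuine extreme is still an investigation step, which is why nothing is deleted here.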

**Sarah:** Takeaway three. Logical checks across variables. Pack-years should not exist for never-smokers. Parity should be zero for non-parents. Discharge dates should follow admission dates. Run cross-tabulations and inspect the impossible cells.
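
A small Python sketch of two of these cross-variable rules. The records and field names are made up for illustration and are not the phaa variable names.

```python
from datetime import date

# Made-up records; field names are hypothetical.
records = [
    {"id": 1, "smoking": "never", "pack_years": 0.0,
     "admitted": date(2024, 3, 1), "discharged": date(2024, 3, 4)},
    {"id": 2, "smoking": "never", "pack_years": 12.5,
     "admitted": date(2024, 3, 2), "discharged": date(2024, 3, 5)},
    {"id": 3, "smoking": "current", "pack_years": 20.0,
     "admitted": date(2024, 3, 9), "discharged": date(2024, 3, 7)},
]

def logical_flags(rec):
    """Return the list of cross-variable rules this record violates."""
    problems = []
    if rec["smoking"] == "never" and rec["pack_years"] > 0:
        problems.append("pack-years recorded for a never-smoker")
    if rec["discharged"] < rec["admitted"]:
        problems.append("discharge precedes admission")
    return problems

# Keep only the records that break at least one rule.
flagged = {r["id"]: logical_flags(r) for r in records if logical_flags(r)}
print(flagged)
```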

**Kiffer:** Takeaway four. Missing data analysis. Quantify the missingness by variable and by participant. Look for patterns. Then characterize the mechanism. Missing completely at random, where missingness is unrelated to anything. Missing at random, where missingness depends on observed variables but not on the missing value itself. Missing not at random, where missingness depends on the missing value, even after conditioning on the observed variables.
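
The quantification step can be sketched in plain Python as follows. The rows and column names are made up, and `None` stands in for a missing value.

```python
# Made-up rows; None marks a missing value. Column names are hypothetical.
rows = [
    {"age": 34, "income": 52000, "sbp": 118},
    {"age": 51, "income": None, "sbp": 131},
    {"age": None, "income": None, "sbp": 125},
    {"age": 46, "income": 61000, "sbp": None},
]
columns = ["age", "income", "sbp"]

# Share of participants missing each variable.
missing_by_variable = {
    col: sum(r[col] is None for r in rows) / len(rows) for col in columns
}

# Number of missing variables for each participant (row order preserved).
missing_by_participant = [sum(r[col] is None for col in columns) for r in rows]

# Distinct missingness patterns: one True/False per column, with counts.
patterns = {}
for r in rows:
    key = tuple(r[col] is None for col in columns)
    patterns[key] = patterns.get(key, 0) + 1

print(missing_by_variable, missing_by_participant, patterns)
```

These counts describe how much is missing and where. They do not identify the mechanism, which remains an assumption.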

**Sarah:** Takeaway five. The mechanism is an assumption that cannot be fully verified from the observed data. Sensitivity analyses under different assumed mechanisms are how you defend the results.

**Kiffer:** Takeaway six. The three handling strategies. Listwise deletion is simple, loses information, and is generally unbiased only under missing completely at random. Single imputation, like mean imputation, underestimates variance and distorts relationships. Multiple imputation is the gold standard under missing at random, because Rubin's rules combine the within-imputation and between-imputation variability, so the standard errors come out honest.
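
A minimal sketch of Rubin's rules for a single coefficient, assuming the model has already been fitted on each imputed dataset. The estimates and squared standard errors below are made up, and `pool_rubin` is a hypothetical helper name.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Made-up coefficients and squared standard errors from five imputations.
est = [1.0, 1.2, 0.8, 1.1, 0.9]
var = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled_est, pooled_se = pool_rubin(est, var)
print(pooled_est, pooled_se)
```

Each imputed dataset on its own reports a standard error of 0.2 here. The pooled standard error is larger because the estimates disagree across imputations, and that disagreement is the extra uncertainty single imputation leaves out.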

**Sarah:** Takeaway seven. Audit trail. Every cleaning decision logged. Never overwrite the raw data. The script is the audit.

**Kiffer:** Takeaway eight. Descriptive statistics for individual variables. Mean and standard deviation for symmetric continuous. Median and interquartile range for skewed continuous. Frequencies and percentages for categorical.
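
A short Python illustration of why the summary should match the shape of the distribution. The values are made up (think of length of stay in days) with one long-tail observation.

```python
from statistics import mean, median, quantiles, stdev

# Made-up right-skewed values.
stay = [1, 2, 2, 3, 3, 4, 5, 30]

m, sd = mean(stay), stdev(stay)
q1, q2, q3 = quantiles(stay, n=4)  # quartile cut points, default method

print(f"mean (SD): {m:.2f} ({sd:.2f})")
print(f"median (IQR): {median(stay)} ({q1} to {q3})")
```

The single value of 30 pulls the mean above the third quartile, while the median and interquartile range still describe where most of the values sit.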

**Sarah:** Takeaway nine. The four key plots for continuous variables. Histograms for shape, box plots for medians and quartiles and outliers, density plots for smoothed shape comparisons, and quantile-quantile plots for assessing normality.

**Kiffer:** Takeaway ten. Pie charts are almost always inferior to bar charts. Use bar charts.

**Sarah:** Takeaway eleven. Two continuous variables. Scatter plots and correlation. Pearson for linear relationships, Spearman for monotonic but non-linear relationships or for skewed data. And correlation is not causation.
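
To make the Pearson versus Spearman distinction concrete, here is a standard-library Python sketch on made-up data where the relationship is perfectly monotonic but not linear. The helper names are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic, but curved
r, rho = pearson(x, y), spearman(x, y)
print(round(r, 3), round(rho, 3))
```

Spearman reaches one because the ordering of the two variables agrees exactly, while Pearson falls short of one because the points do not lie on a straight line.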

**Kiffer:** Takeaway twelve. Two categorical variables. Cross-tabulations, chi-squared test of independence for adequate sample sizes, Fisher's exact test for small samples, and mosaic plots for visualization.
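
A sketch of the chi-squared test of independence for a two-by-two table, written with the Python standard library. The counts are made up, no continuity correction is applied, and the closed-form p-value used here holds only for one degree of freedom.

```python
from math import erfc, sqrt

def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts.

    Returns (statistic, p_value, smallest expected count).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # Expected count under independence: row total * column total / n.
    expected = [[r * k / n for k in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value, min(min(row) for row in expected)

# Made-up counts: rows are exposed / unexposed, columns are case / non-case.
table = [[30, 20], [20, 30]]
stat, p_value, min_expected = chi_squared_2x2(table)
print(stat, round(p_value, 4), min_expected)
```

The smallest expected count is returned because a common rule of thumb, which is an assumption here rather than something stated in the episode, is to prefer Fisher's exact test when any expected count falls below about five.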

**Sarah:** Takeaway thirteen. One continuous and one categorical. Stratified box plots. Means or medians within categories. The t-test for two-group comparisons. Analysis of Variance, abbreviated ANOVA, for multiple-group comparisons.

**Kiffer:** Takeaway fourteen. Building scales. When several survey items measure one underlying construct, check internal consistency with Cronbach's alpha and check dimensionality with exploratory factor analysis before you sum the items into a scale score. An alpha at or above zero point eight is good. A clean one-factor solution with loadings above zero point four supports treating the items as a single scale.

**Sarah:** Takeaway fifteen. The garden of forking paths warning. Exploratory analysis is for orientation, not for hypothesis testing. Save formal tests for pre-specified analyses informed by your causal diagram. Descriptive plots go in supplementary materials. The p-values that anchor conclusions come from pre-registered models.

**Kiffer:** And takeaway sixteen, which is more of a habit than a fact. Plot before you fit. A box plot of systolic blood pressure by smoking status takes one second and tells you whether the comparison you're about to run in Lesson three is driven by a few extreme values or by a real shift in the center of the distribution. Always plot first.

**Sarah:** Next up is Lesson three. Linear Regression. Now that you've cleaned the data and described it, the regression framework starts. The continuous outcomes you summarized here become the dependent variables. The predictors you cleaned become the independent variables. And the causal diagram from Lesson one tells you which predictors belong in the adjustment set. The pieces all start fitting together.

**Kiffer:** Take care, everyone.

**Sarah:** See you there.
