# Lesson 1 — A Structured Approach to Data Analysis (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5400 words • ~29 min audio*

---

**Sarah:** Welcome to Office Hours, the companion podcast. I'm Sarah, and this is the first episode of a new chapter, so let me set the stage. This is the analytic-methods part of the series, building on the evidence-evaluation and study-design material that came before.

**Kiffer:** And I'm Kiffer. Yeah, this one has a different feel from the earlier material. The first part of the series was about reading evidence. The second was about designing studies. This part is where we actually build out the regression and modeling toolkit.

**Sarah:** And the orange R boxes that have been showing up throughout the earlier material finally come into their own here. Linear regression. Logistic regression. Multinomial and ordinal models. Poisson and negative binomial for counts. Survival analysis. Mixed models for clustered and longitudinal data. And increasingly causal inference tools for observational data.

**Kiffer:** All of those are coming. But Lesson 1 isn't any one of them yet. Lesson 1 is the lesson before the toolkit. It's called A Structured Approach to Data Analysis, and the entire premise is that before you fit a single model, there is a workflow you need to have in place.

**Sarah:** And the lesson is honest about why this lesson exists at all. There is a really strong temptation, especially for students who have been waiting two courses to get to the analysis, to jump straight into the sophisticated model that will produce the ultimate answer.

**Kiffer:** Yeah. And the textbook this course is built on, Dohoo and colleagues, Methods in Epidemiologic Research, is just blunt about it. That rarely works out. The results will be wrong, because important preliminary steps got skipped. Data analysis is iterative. You move forward, you discover something about the data, you back up several steps, you try again. A structured template won't be the only way to do this, but it'll be applicable in most situations.

**Sarah:** Four sections in the lesson. Section 1 is why a structured approach, plus the causal diagram and the project skeleton. Section 2 is data coding, entry, and file management. Section 3 is program files, editing, and verification. And Section 4 is data processing and unconditional associations.

**Kiffer:** Let's start with Section 1. The lesson opens with one principle and then makes it concrete. The principle is, before you start any work with your data, construct a plausible causal diagram of the problem you're investigating.

**Sarah:** And the term causal diagram, in this course, means a directed acyclic graph. The acronym is DAG. Directed because the arrows have direction, acyclic because the arrows can't form loops, and graph in the mathematical sense, just nodes and arrows.

**Kiffer:** And students have met DAGs before. One earlier lesson introduced them informally, and another went deeper. So this is a refresher with a sharper purpose.

**Sarah:** Three structural pieces still do all the work. Walk through them.

**Kiffer:** Piece one. The fork. You have a variable C with arrows going out to both X and Y. So C is a common cause of X and Y. The shape looks like X is being pointed at by C, and Y is being pointed at by C. C is the confounder. The rule is, adjust for it.

**Sarah:** And just to define confounder for anyone who needs it. A confounder is a common cause of your exposure and your outcome, so it is associated with both and produces some of the apparent association between them. If you don't adjust for it, the exposure-outcome estimate is biased.

**Kiffer:** Piece two. The chain. X points to M, M points to Y. M is a mediator. It sits on the causal path between exposure and outcome. The rule is, do not adjust if you want the total effect of X on Y, because adjusting for the mediator blocks the path you actually care about.

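**Sarah:** And piece three. The collider. X points to Z, Y also points to Z. Z is a common effect of X and Y. The rule there is, do not adjust, because conditioning on a common effect creates an association between X and Y that isn't causal. And watch for it in selection, because restricting your sample on Z is a form of conditioning on it.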

**Kiffer:** And what those three pieces let you do, in this part of the series, is fix two things before you fit anything. The estimand and the adjustment set.

**Sarah:** Define estimand for me. That word is going to come up a lot in this course.

**Kiffer:** Estimand is the causal quantity you are trying to estimate. Not the estimate, which is the number that comes out of your model, and not the estimator, which is the formula. The estimand is the thing in the world you are trying to learn about. Something like the average causal effect of education on health, in this population, at this time.

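**Sarah:** And the adjustment set is the set of covariates you condition on so that the exposure-outcome association can be read causally. In practice, those are the variables that go on the right-hand side of the regression alongside the exposure.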

**Kiffer:** Right. The DAG tells you what should be in the adjustment set. Adjust for confounders. Don't adjust for mediators if you want the total effect. Don't adjust for colliders. Both decisions get made on the DAG, before you write any model code.
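
The three rules can be sketched in code. This is a minimal illustration in Python, standing in for the lesson's R. The `role` helper and the node names are assumptions made for the example, and the helper only inspects direct arrows, whereas a real DAG needs path-based rules.

```python
# Edges of the three building blocks, written as (cause, effect) pairs.
# Node names are placeholders, not variables from a real study.
edges = [
    ("C", "X"), ("C", "Y"),   # fork: C is a common cause of X and Y
    ("X", "M"), ("M", "Y"),   # chain: M sits on the path from X to Y
    ("X", "Z"), ("Y", "Z"),   # collider: Z is a common effect of X and Y
]

def role(node, exposure, outcome, edges):
    """Classify a node relative to exposure and outcome using direct edges only."""
    causes_x = (node, exposure) in edges
    causes_y = (node, outcome) in edges
    caused_by_x = (exposure, node) in edges
    caused_by_y = (outcome, node) in edges
    if causes_x and causes_y:
        return "confounder: adjust"
    if caused_by_x and causes_y:
        return "mediator: do not adjust for the total effect"
    if caused_by_x and caused_by_y:
        return "collider: do not adjust"
    return "other"

roles = {n: role(n, "X", "Y", edges) for n in ("C", "M", "Z")}
adjustment_set = [n for n, r in roles.items() if r.startswith("confounder")]
```

Here only `C` lands in `adjustment_set`, which matches the rule of adjusting for confounders and leaving mediators and colliders alone.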

**Sarah:** The lesson then puts numbers on this with a worked mediation example. Walk us through it slowly because I think this is the clearest demonstration in the lesson of why a DAG matters.

**Kiffer:** Sure. The DAG is education causes income, income causes health. That's a chain. Plus a smaller direct path from education to health that does not go through income. So the total effect of education on health has two pieces. The indirect path through income, and the direct path that bypasses income.

**Sarah:** And this is exactly the structure Baron and Kenny formalized in their 1986 paper. Three regressions.

**Kiffer:** Yeah. Reuben Baron and David Kenny published the paper in the Journal of Personality and Social Psychology, and it became one of the most cited methodology papers in the social sciences. The procedure goes like this. Regression one. Regress the outcome, health, on the exposure, education. The coefficient on education in that regression is the total effect, conventionally called c.

**Sarah:** Regression two. Regress the mediator, income, on the exposure, education. The coefficient is conventionally called a. That's the path from education to income.

**Kiffer:** Regression three. Regress the outcome on both the exposure and the mediator. So health on education and income, jointly. The coefficient on income is called b. The coefficient on education in this regression is called c-prime, and it represents the direct effect of education on health, holding income constant.

**Sarah:** And the indirect effect, the part that goes through income, is the product a times b. Or equivalently the total minus the direct, c minus c-prime.

**Kiffer:** And in the lesson's simulated example, where we know the truth because we built the data, the total effect of education on health comes out around 0.6. The direct effect, c-prime, is about 0.3. The indirect effect through income is about 0.3. So roughly half of education's effect on health travels through income.
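
The three regressions can be reproduced on simulated data. This is a sketch in Python, standing in for the lesson's R, and it is not the lesson's own simulation. The path coefficients (0.6, 0.5, 0.3) are assumptions chosen so the results land near the figures quoted above, and `ols` is a small hypothetical helper that solves the normal equations.

```python
import random

def ols(y, *xs):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0, *vals] for vals in zip(*xs)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

random.seed(1)
n = 5000
education = [random.gauss(0, 1) for _ in range(n)]
income = [0.6 * e + random.gauss(0, 1) for e in education]          # assumed a = 0.6
health = [0.3 * e + 0.5 * i + random.gauss(0, 1)                     # assumed c' = 0.3, b = 0.5
          for e, i in zip(education, income)]

c = ols(health, education)[1]                    # regression 1: total effect
a = ols(income, education)[1]                    # regression 2: exposure to mediator
_, c_prime, b = ols(health, education, income)   # regression 3: direct effect and mediator path
indirect = a * b                                 # equals c - c_prime for OLS on one sample
```

With ordinary least squares fitted on the same sample, `a * b` and `c - c_prime` agree up to rounding, which is why the two ways of stating the indirect effect are interchangeable here.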

**Sarah:** Two cautions before we move on, because the lesson is sharp on these and I want to make sure they land.

**Kiffer:** First caution. The estimate of the indirect effect is only as credible as the DAG. If you have an unmeasured confounder of income and health, the indirect estimate is biased even though the regressions converge cleanly and the standard errors look fine. The math doesn't know about the confounder you didn't measure.

**Sarah:** And second caution. The Baron and Kenny approach assumes there is no exposure-mediator interaction. So no synergy or antagonism between education and income in their joint effect on health. In real data that assumption can be violated.

**Kiffer:** Which is why the modern alternative is the mediation package in R. There's a function called mediate. It can give you bootstrap confidence intervals for the indirect effect if you ask for them with the boot argument, and it can accommodate an exposure-mediator interaction when you include that interaction in the outcome model.

**Sarah:** Define bootstrap quickly because that is going to come up over and over in this course.

**Kiffer:** Bootstrap is a resampling method. You take your dataset, you draw a random sample with replacement of the same size as the original, you compute your estimate on that resample. Then you do that a thousand times. The distribution of those thousand estimates approximates the sampling distribution of your estimator, and you can read off a confidence interval from the percentiles. It is computationally heavy but mathematically light. You don't need a closed-form variance formula.
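
The resampling loop just described is short enough to write out. This is a sketch in Python, standing in for the lesson's R. The data are made up, and `bootstrap_ci` is a hypothetical helper that implements the percentile interval only, not any of the refined bootstrap variants.

```python
import random
import statistics

random.seed(42)
sample = [random.gauss(10, 2) for _ in range(200)]  # made-up data for illustration

def bootstrap_ci(data, stat, n_boot=1000, level=0.95):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    estimates = sorted(
        stat(random.choices(data, k=len(data)))  # resample with replacement, same size
        for _ in range(n_boot)
    )
    lo = estimates[int((1 - level) / 2 * n_boot)]
    hi = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

low, high = bootstrap_ci(sample, statistics.mean)
```

The same function works for any statistic that can be computed from a resample, which is the sense in which the method is computationally heavy but mathematically light.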

**Sarah:** And the mediation package was developed by Kosuke Imai and colleagues. The methodological foundations come out of work by Judea Pearl on causal inference and Tyler VanderWeele on mediation specifically. Each has written extensively on the assumptions you need to make for a mediation estimate to have a causal interpretation.

**Kiffer:** And the mediate function output gives you four lines. The ACME, which stands for Average Causal Mediation Effect, is the indirect path through income. The ADE, the Average Direct Effect, is the direct path. The total effect is the sum. And the proportion mediated tells you what fraction of the total runs through the mediator.

**Sarah:** In the lesson's run, ACME is about 0.30, ADE is about 0.30, total is about 0.60, and proportion mediated is about 0.50. So half of education's effect on health goes through income.

**Kiffer:** Right. And the punchline of the whole worked example is, the DAG fixed the estimand and the adjustment set before any data got fit. The R code is just the implementation.

**Sarah:** Then the lesson pivots from the DAG to the project skeleton. The structured approach starts with structured files.

**Kiffer:** Yeah. The convention the lesson recommends is one RStudio project per study. RStudio is the standard interface for working in R. It's a free environment that bundles your script editor, your console, your file browser, your plot viewer, and your version control all in one window.

**Sarah:** And inside the RStudio project, the lesson recommends a specific folder structure. Walk through it.

**Kiffer:** Folder one. data slash raw. This holds your original data files, exactly as they were collected or received. The rule on this folder is, never overwrite. Treat the contents as read-only. Whatever you do downstream, you can always come back to data slash raw and rebuild from there.

**Sarah:** Folder two. data slash processed. This holds the cleaned, derived datasets that come out of your scripts. Everything in this folder is reproducible, in the sense that anyone with the raw data and your scripts could regenerate it from scratch.

**Kiffer:** Folder three. R. This is where the analysis scripts live. The lesson recommends numbering them. Script 01 loads, script 02 cleans, script 03 builds the analytic dataset, and so on. The numbering forces you to be explicit about the order of operations.

**Sarah:** And folder four. output slash figures. Where the plots go. Numbered files, png or pdf, ready to drop into a paper or a poster.

**Kiffer:** Plus the project file itself. A dot Rproj file at the top level. Double-clicking it reopens the project. RStudio restores the scripts you had open and sets the working directory to the project root. It does not reload your packages in a fresh session, so each script should load its own at the top.
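
The folder layout can be scaffolded with a few lines of code. This is a sketch in Python, standing in for the lesson's R. The project name `my_study` is hypothetical, and the skeleton is built in a temporary directory so the example does not touch real files.

```python
from pathlib import Path
import tempfile

# The four folders named in the lesson.
FOLDERS = ["data/raw", "data/processed", "R", "output/figures"]

def make_skeleton(root):
    """Create the four folders under root and return their paths."""
    root = Path(root)
    created = []
    for name in FOLDERS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created

project_root = Path(tempfile.mkdtemp()) / "my_study"  # hypothetical project name
paths = make_skeleton(project_root)
```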

**Sarah:** Then the analysis itself runs on the tidyverse. Which is a collection of R packages that share a common design philosophy. List them out.

**Kiffer:** Five of the core packages. The dplyr package is for data manipulation. Filter rows, select columns, group, summarize, mutate. The ggplot2 package is for graphics. It's the grammar of graphics implementation. The readr package is for file input and output. Reading csv files, writing csv files, with sensible defaults. The tidyr package is for reshaping. Wide to long, long to wide. And stringr is for text manipulation.

**Sarah:** And one more package the lesson really emphasizes. The here package.

**Kiffer:** Here is a small package, but it solves one specific and very annoying problem. File paths. If you write code that says, read the file at C colon backslash users backslash kiffer backslash documents backslash project backslash data, that code will break the moment anyone else opens the project on their machine. The here package replaces that with a function called here that resolves paths relative to the project root, automatically.

**Sarah:** So your project becomes movable. You can email it to a colleague, or push it to a server, or hand it off to a future student, and the file paths just work.
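
The idea behind root-relative paths can be sketched as follows. This is Python standing in for the lesson's R, and `find_root` is a hypothetical helper that imitates the general approach by looking for an `.Rproj` file. It is not the here package's actual algorithm, and the file name `survey.csv` is made up.

```python
from pathlib import Path
import tempfile

def find_root(start, marker_suffix=".Rproj"):
    """Walk upward from start until a folder containing a marker file is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if any(p.suffix == marker_suffix for p in folder.iterdir()):
            return folder
    raise FileNotFoundError("no project marker found above " + str(start))

# Build a throwaway project to demonstrate.
root = Path(tempfile.mkdtemp()).resolve()
(root / "my_study.Rproj").touch()
(root / "R").mkdir()
(root / "data" / "raw").mkdir(parents=True)

# Called from inside R/, the path is still built from the project root.
raw_file = find_root(root / "R") / "data" / "raw" / "survey.csv"
```

No machine-specific prefix appears anywhere in the script, so the same code resolves correctly wherever the project folder is copied.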

**Kiffer:** Right. And that is the foundation. DAG plus project skeleton. Section 1 in a sentence is, draw the picture before you fit the model, and put the files in a place where future-you can find them.

**Sarah:** Section 2. Data coding, entry, and file management. The unglamorous middle of the workflow.

**Kiffer:** Yeah, and the lesson is explicit that almost all of this is carried forward from the earlier questionnaire design lesson. The same coding rules apply.

**Sarah:** Let's run through them. First. Single missing-value codes.

**Kiffer:** Pick a code that cannot possibly be a real value. The lesson recommends a large negative number like negative 999. The reason is, if you accidentally treat that code as a real value in an analysis, the answer will be obviously wrong. A mean age of negative 700 is a flag. A mean age that's a little high because some missing values got coded as 99 is invisible.
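
A small numeric illustration of why an impossible code is safer, in Python standing in for the lesson's R. The ages are made up.

```python
import statistics

MISSING = -999                               # the single missing-value code
ages = [34, 51, MISSING, 47, MISSING, 29]    # made-up values

naive_mean = statistics.mean(ages)           # sentinel mistakenly treated as data
clean = [a for a in ages if a != MISSING]    # drop the sentinel first
clean_mean = statistics.mean(clean)
```

The naive mean comes out negative, which nobody would mistake for a real age, while the cleaned mean is 40.25.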

**Sarah:** Second. Consistent coding conventions. The example the lesson uses is binary variables. Zero for no, one for yes. Always. Across every variable in the dataset.

**Kiffer:** And that consistency pays off in two places. One, when you write code, you don't have to remember which variable is coded which way. Two, when a regression coefficient comes out, you know immediately what its sign means. Positive means yes is associated with a higher value of the outcome than no.

**Sarah:** Third. Code on the form, not during data entry.

**Kiffer:** This one matters more than people realize. The codes should be printed on the data collection form itself, next to each response option. So the person filling in the form circles a number that already corresponds to a code. If you let the data entry person decode handwritten responses on the fly, you will get errors. They'll misread, they'll improvise, they'll make different decisions on different days.

**Sarah:** Fourth. Standardized variable naming.

**Kiffer:** Variable names should be descriptive but short. All lowercase. Words separated by underscores. So smoking status becomes smoke_status. Body mass index becomes bmi. No spaces. No special characters. No mixed case, because some statistical software is case-sensitive and that's a recipe for bugs.

**Sarah:** And the lesson notes a useful trick from Dohoo. If a name is getting too long, you can drop the vowels. Water cistern becomes wtr_cstrn. It looks weird but it's still readable, and it sorts predictably.
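
The naming rules can be expressed as two small functions. This is a sketch in Python, standing in for the lesson's R. Both helpers are hypothetical, and keeping each word's first letter when dropping vowels is an assumption. Choosing a short stem such as `bmi` is still a human decision.

```python
import re

def clean_name(label):
    """Lowercase, underscore-separated, no spaces or special characters."""
    return "_".join(re.findall(r"[a-z0-9]+", label.lower()))

def drop_vowels(name):
    """Shorten a name by removing vowels after the first letter of each word."""
    return "_".join(
        w[0] + re.sub(r"[aeiou]", "", w[1:]) for w in name.split("_") if w
    )

short = drop_vowels(clean_name("Water cistern"))
```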

**Kiffer:** Fifth. Date format. Use ISO 8601.

**Sarah:** Spell that out. ISO is the International Organization for Standardization, and 8601 is the specific standard for representing dates and times. The format is year-month-day, with four-digit year, two-digit month, two-digit day.

**Kiffer:** So a date like May 13th 2026 would be written 2026 dash 05 dash 13. The reason this matters is that ISO dates sort correctly when you sort alphabetically. Three-slash-five-slash-twenty-six is ambiguous. Is that March 5th or May 3rd? Twenty-six could be 1926 or 2026. ISO 8601 has none of those problems.
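
The sorting claim is easy to check. This is a sketch in Python, standing in for the lesson's R, with made-up visit dates.

```python
from datetime import date

visits = [date(2026, 5, 13), date(2025, 12, 1), date(2026, 1, 30)]  # made-up dates

iso = [d.isoformat() for d in visits]                 # e.g. "2026-05-13"
us_style = [d.strftime("%m/%d/%Y") for d in visits]   # e.g. "05/13/2026"

iso_sorted = sorted(iso)        # plain text sort, also chronological
us_sorted = sorted(us_style)    # plain text sort, not chronological
```

Sorted as text, the ISO strings start with the December 2025 visit, which is the earliest. The month-first strings start with January 2026 and push December 2025 to the end.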

**Sarah:** And then there's the documentation piece that ties all of these coding decisions together. The data dictionary.

**Kiffer:** Yeah. A data dictionary is a separate document, usually a spreadsheet, that lists every variable in your dataset. For each variable, the dictionary gives you the variable name, the variable label, what kind of variable it is, the allowed values or value range, the meaning of any codes, and the source of the data.

**Sarah:** Why does this matter so much?

**Kiffer:** Because without a data dictionary, the dataset becomes opaque to anyone who didn't collect it. And that includes future-you, six months from now, when you've forgotten that variable q14r recoded was the answer to a question about whether the participant ever lived within a kilometer of an industrial site. The data dictionary is what keeps the dataset interpretable.
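
A data dictionary can also be put to work as a check on the data. This is a sketch in Python, standing in for the lesson's R. The two dictionary entries, the allowed age range, and the `check_record` helper are all illustrative assumptions.

```python
# A two-variable dictionary; labels and limits are made up for the example.
data_dictionary = {
    "smoke_status": {"label": "Ever smoked (0 = no, 1 = yes)", "allowed": {0, 1}},
    "age_years": {"label": "Age at interview, years", "range": (18, 110)},
}

def check_record(record, dictionary):
    """Return a list of problems found when a record is compared with the dictionary."""
    problems = []
    for name, value in record.items():
        spec = dictionary.get(name)
        if spec is None:
            problems.append(f"{name}: not documented")
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: unexpected code {value}")
        elif "range" in spec and not spec["range"][0] <= value <= spec["range"][1]:
            problems.append(f"{name}: {value} outside {spec['range']}")
    return problems

problems = check_record({"smoke_status": 1, "age_years": 200, "q14r": 2}, data_dictionary)
```

The undocumented `q14r` and the implausible age are both reported, while the valid smoking code passes silently.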

**Sarah:** There's another piece in Section 2 that I want to flag, because it shows up in the R box. The principle of versioning files instead of overwriting them.

**Kiffer:** Yeah. The convention the lesson recommends is, when you do a substantial transformation of a dataset, save the result as a new file with a sequential number in the name. So bp01 might be the original, bp02 might be the version with missing values dropped, bp03 might be the version with derived variables added.

**Sarah:** And the rule is, never overwrite the previous version. Each step gets its own file.

**Kiffer:** Plus a small file log. A text file that records, for each numbered file, when it was created and what's in it. Number of observations, number of variables, what was changed since the previous version.

**Sarah:** And the elegant point the lesson makes is that if your transformations live in a script rather than in clicks, this whole versioning system happens automatically. The script is the log. The file names mirror the script's outputs.

**Kiffer:** Which is the bridge into Section 3. Program files, editing, and verification. Where the discipline gets tightened.

**Sarah:** And the central distinction in Section 3 is interactive mode versus program mode.

**Kiffer:** Interactive mode is when you sit at the console and type a command, hit return, look at the result, type another command. It's exploratory. It's how you poke at a dataset to understand what's in it. Useful for that.

**Sarah:** But the lesson is sharp. Interactive mode should not be used for any of the real processing or analysis. Because you can't reconstruct what you did. Two weeks later, you can't remember whether you dropped 47 rows or 74. You can't remember whether the variable was log-transformed before you fit the model or after.

**Kiffer:** Program mode is the alternative. Every command goes into a script. Numbered, commented, saved. The script can be re-run end to end. Six months from now, you can hand the script and the raw data to a colleague, and they can reproduce every figure and every number you reported.

**Sarah:** And the lesson is uncompromising on a related point. Don't manipulate data in spreadsheets.

**Kiffer:** Yeah. There are two reasons. First, spreadsheets are dangerous because of the sort command. If you sort one column without selecting the others, you destroy the alignment of records. Every row becomes garbage and you may not notice for weeks.

**Sarah:** Second, even if you sort safely, the changes you make in a spreadsheet aren't recorded. There's no log. If you delete a column, change a value, recode a category, those edits live in the file with no history.

**Kiffer:** And then the related rule. Don't manually edit datasets. If you find an error in the raw data, the temptation is to open the file, fix it, save it. Don't. Document the error in your script. Have the script make the correction in code. Then your raw data stays untouched and the correction is reproducible and auditable.
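
One way to keep corrections in code is a small correction table applied by the script. This is a sketch in Python, standing in for the lesson's R. The records, the correction, and its stated reason are invented for illustration.

```python
raw = [{"id": 1, "sbp": 128}, {"id": 2, "sbp": 25}, {"id": 3, "sbp": 141}]  # made-up records

# Each correction documents what changed and why.
corrections = [
    {"id": 2, "field": "sbp", "old": 25, "new": 125,
     "reason": "hypothetical: paper form shows 125, digit dropped at entry"},
]

def apply_corrections(rows, corrections):
    """Return corrected copies of the rows, leaving the originals untouched."""
    fixed = [dict(r) for r in rows]
    by_id = {r["id"]: r for r in fixed}
    for c in corrections:
        row = by_id[c["id"]]
        if row[c["field"]] != c["old"]:
            # Guard against applying a fix to data that no longer matches.
            raise ValueError(f"id {c['id']}: expected {c['old']}, found {row[c['field']]}")
        row[c["field"]] = c["new"]
    return fixed

cleaned = apply_corrections(raw, corrections)
```

The raw records still hold the original value after the call, so the correction can be audited or reversed by editing the table.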

**Sarah:** Which leads to the next pillar of Section 3. Version control.

**Kiffer:** Git is the standard tool. Git is a system for tracking changes in files over time. Every time you finish a meaningful piece of work, you commit. The commit captures the state of the project, with a short message describing what changed. You can go back to any previous commit, see exactly what was different, recover anything you accidentally broke.

**Sarah:** And Git is what's running underneath, locally, on your machine. The hosting platforms are GitHub and GitLab. They are websites that store your Git repositories in the cloud and add collaboration features.

**Kiffer:** GitHub is owned by Microsoft and it's the most widely used. GitLab is the main open source alternative. Both let you share code with collaborators, track issues, review changes before they get merged. For a public health analysis, the workflow looks like this. You commit locally as you work. Once a day or once a session, you push your commits up to the remote. Your collaborator pulls them down. Both of you stay in sync.

**Sarah:** And version control replaces a whole class of bad habits. The folder full of files named final, final_v2, final_FINAL, final_FINAL_actually. Every one of those is a Git commit waiting to happen.

**Kiffer:** The third piece of Section 3 is verification. Once you have a dataset, before you fit any models, you have to verify it. Three classes of check.

**Sarah:** First. Range checks. For every continuous variable, look at the minimum and maximum. Or the five smallest and five largest values. Are they plausible?

**Kiffer:** Right. Age 200. Body mass index 4. Systolic blood pressure 25. Those are flags. Either typos, or coding errors, or missing-value codes that didn't get translated to NA. Find them, trace them back, fix them in the script.

**Sarah:** And for categorical variables, the equivalent check is the frequency distribution. How many observations are in each category? Are there any unexpected categories that shouldn't be there? Like a sex variable with three levels when there should be two.

**Kiffer:** Plus a histogram for continuous variables. Just to look at the shape of the distribution. Are there obvious clusters in the wrong place? A spike at zero that suggests a coding artifact? A long thin tail of clearly wrong values?

**Sarah:** Second. Logical checks. Cross-variable consistency.

**Kiffer:** Yeah. If a respondent answered no to ever smoking, their pack-years should be zero. If a respondent said they have never been pregnant, their number of pregnancies should be zero or missing. If a respondent's date of diagnosis is before their date of birth, something is wrong.

**Sarah:** And those checks have to be coded explicitly. The dataset doesn't enforce them automatically.

**Kiffer:** Third. Cross-tabulations to spot impossible combinations. Especially for categorical pairs. If you cross-tabulate sex by pregnancy status and you get a non-zero count of pregnant males, that's a coding error somewhere.

**Sarah:** And the practical advice is, do verification before any analysis. Because if there are problems in the data, every analysis you run on top of it inherits them.
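
The three classes of check can each be written in a line or two. This is a sketch in Python, standing in for the lesson's R. The records and the plausible age limits are made up.

```python
from collections import Counter

rows = [  # made-up records with deliberate errors
    {"id": 1, "age": 34, "sex": "F", "pregnant": 1, "ever_smoked": 0, "pack_years": 0},
    {"id": 2, "age": 200, "sex": "M", "pregnant": 0, "ever_smoked": 1, "pack_years": 12},
    {"id": 3, "age": 51, "sex": "M", "pregnant": 1, "ever_smoked": 0, "pack_years": 5},
]

# Range check: minimum, maximum, and any value outside plausible limits.
ages = [r["age"] for r in rows]
age_range = (min(ages), max(ages))
out_of_range = [r["id"] for r in rows if not 0 <= r["age"] <= 110]

# Logical check: never-smokers should have zero pack-years.
inconsistent_smoking = [
    r["id"] for r in rows if r["ever_smoked"] == 0 and r["pack_years"] > 0
]

# Cross-tabulation: count every sex by pregnancy combination.
crosstab = Counter((r["sex"], r["pregnant"]) for r in rows)
```

Record 2 fails the range check, record 3 fails the logical check, and the cross-tabulation shows one pregnant male, which points back to record 3.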

**Kiffer:** Section 4. Data processing and unconditional associations. This is where you finally start to do something that looks like analysis, but it's still pre-modeling.

**Sarah:** The first decision in Section 4 is the analytic outcome. What kind of variable is your outcome, and what regression family does that imply?

**Kiffer:** Four cases. Continuous outcome. Linear regression. Binary outcome. Logistic regression. Counts. Poisson regression, or negative binomial if there's overdispersion. Time-to-event. Survival analysis.

**Sarah:** Define those quickly because they're going to be the spine of the rest of the course.

**Kiffer:** Linear regression models the mean of a continuous outcome as a linear function of the predictors. Logistic regression models the log-odds of a binary outcome. Poisson regression models the log of an expected count. Survival analysis models time until an event happens, while handling the fact that some people are still being followed when the study ends, which is called censoring.

**Sarah:** And the lesson is clear that the choice of regression family follows from the outcome, and the choice has to be made before you start fitting models. You don't get to pick the regression family based on which one gives the prettiest answer.

**Kiffer:** And there's a verification piece for the outcome too. If you planned linear regression, is the outcome roughly symmetric, without extreme skew or outliers? If you planned Poisson, are the mean and variance approximately equal? If the variance is much larger than the mean, that's overdispersion, and a flag for negative binomial. If you planned a multinomial model with three categories and one category has only seven observations, you might collapse to two categories. The data inform what's feasible.
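
The mean-versus-variance comparison for a count outcome takes three lines. This is a sketch in Python, standing in for the lesson's R, with made-up counts. It is only a crude first look, because the Poisson assumption strictly concerns the variance given the predictors.

```python
import statistics

counts = [0, 1, 0, 2, 0, 0, 9, 1, 0, 14, 0, 3]  # made-up event counts

mean_count = statistics.mean(counts)
var_count = statistics.variance(counts)      # sample variance
dispersion_ratio = var_count / mean_count    # near 1 is consistent with Poisson
```

Here the mean is 2.5 and the variance is several times larger, so these counts would point toward negative binomial rather than Poisson.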

**Sarah:** Then the predictor side. The lesson covers two main moves. Dummy coding for categorical predictors. And centering and scaling for continuous predictors.

**Kiffer:** Let's walk through dummy coding first, because it shows up in almost every regression we'll fit this term.

**Sarah:** If you have a categorical predictor with k categories, you create k minus one binary indicator variables. Each indicator is one if the observation is in that category, zero otherwise. One category is the reference, and its effect gets absorbed into the intercept.

**Kiffer:** So race or ethnicity with five categories becomes four indicators, with the fifth as reference. The coefficient on each indicator tells you the difference in outcome between that category and the reference, holding everything else constant.

**Sarah:** And R does this automatically when you use a factor variable in a regression. But understanding what's happening matters, because the choice of reference category determines what the coefficients mean.
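
What the software does behind the scenes can be written out by hand. This is a sketch in Python, standing in for the lesson's R. The `dummy_code` helper and the region variable are illustrative assumptions.

```python
def dummy_code(values, reference):
    """Build k - 1 indicator columns, leaving out the reference category."""
    levels = sorted(set(values) - {reference})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

region = ["north", "south", "east", "north", "west"]  # made-up categorical predictor
dummies = dummy_code(region, reference="north")
```

Four categories yield three indicators, and rows in the reference category are zero on all of them, which is how the reference gets absorbed into the intercept.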

**Kiffer:** Centering. For a continuous predictor, you subtract the mean. So the new variable is zero at the average value of the original. After centering, the intercept of the regression has a meaningful interpretation. It's the predicted outcome at the average value of the predictor, not at zero, which might be impossible.

**Sarah:** And scaling is dividing by the standard deviation. After centering and scaling, the predictor has mean zero and standard deviation one. The coefficient then represents the change in outcome for a one standard deviation change in the predictor.

**Kiffer:** Centering and scaling matter especially when you have interactions in the model, or when you want to compare effect sizes across predictors that are on very different scales. Coefficient on income in dollars versus coefficient on age in years. After scaling, both are in standard deviations, and they're directly comparable.
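
Centering and scaling amount to one line of arithmetic per value. This is a sketch in Python, standing in for the lesson's R, with made-up ages.

```python
import statistics

def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

age = [25, 35, 45, 55, 65]   # made-up ages
age_z = standardize(age)     # mean 0, standard deviation 1
```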

**Sarah:** Then unconditional associations. And let me make sure the term lands. Unconditional means not conditioning on anything else. Bivariate. Just one predictor against the outcome at a time. Before you fit any multivariable model with all predictors together.

**Kiffer:** And the reason you do this first is that unconditional associations tell you a lot of things you need to know before you start adjusting. They tell you the direction of each predictor-outcome association. They tell you the rough magnitude. They tell you the functional form. Whether the association looks linear or curved or step-shaped. They flag predictors with very weak or very strong associations.

**Sarah:** What tools does the lesson recommend?

**Kiffer:** Three combinations. Two continuous variables. Use a correlation coefficient and a scatter plot. Plus a simple linear regression of one on the other. The correlation gives you a number. The scatter plot shows you the shape. Both matter.

**Sarah:** One continuous and one categorical. Use one-way ANOVA, where ANOVA stands for analysis of variance, or simple linear regression with the categorical predictor as a factor. Plus box plots to show the distribution of the continuous variable within each level of the categorical.

**Kiffer:** And two categorical variables. Cross-tabulation, plus a chi-squared test, where the Greek letter chi looks like a slanted X. The cross-tabulation gives you the counts in every combination. The chi-squared test gives you a p-value for whether the two variables are associated.
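
Two of these statistics are simple enough to compute by hand, which shows what the numbers mean. This is a sketch in Python, standing in for the lesson's R. The data are made up and both helpers are hypothetical. In practice a statistics package would also report the p-value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

r = pearson_r([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.2])  # made-up pairs
stat = chi_squared([[30, 10], [20, 40]])                              # made-up 2 x 2 counts
```

For a two-by-two table there is one degree of freedom, so a statistic above about 3.84 corresponds to a p-value below 0.05.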

**Sarah:** And the lesson is sharp on a second use of unconditional associations. Looking at predictor pairs, not just predictor-outcome pairs.

**Kiffer:** Yeah. Because if two predictors are very strongly correlated with each other, you've got potential collinearity. Define that quickly.

**Sarah:** Collinearity is when two or more predictors carry essentially the same information. In a regression with both, the coefficients become unstable. Standard errors blow up. The estimated effect of either variable can flip sign if you remove the other. The model fits, but the individual coefficients are not interpretable.

**Kiffer:** And catching collinearity at the unconditional stage, by looking at correlations between pairs of predictors, lets you decide what to do before it sabotages the multivariable model. Drop one of the redundant predictors. Combine them into a composite. Use a dimension reduction method. Decide on the strategy before fitting.
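
A pairwise screen of the predictors might look like this. It is a sketch in Python, standing in for the lesson's R. The data are made up, and the 0.8 cutoff is an illustrative assumption, not a threshold from the lesson.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

predictors = {  # made-up values for six participants
    "weight_kg": [60, 72, 85, 90, 68, 77],
    "bmi": [21.0, 24.5, 29.0, 31.0, 23.0, 26.5],
    "age_years": [34, 51, 29, 62, 45, 38],
}

THRESHOLD = 0.8  # assumed cutoff for "very strongly correlated"
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson_r(predictors[a], predictors[b])) > THRESHOLD
]
```

Only the weight and body mass index pair is flagged, which is the cue to decide between dropping one, combining them, or reducing dimensions before fitting.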

**Sarah:** And the third use of unconditional associations is checking for confounding. The DAG identified the confounders in advance. Unconditional associations let you verify that those confounders behave as the DAG predicted.

**Kiffer:** Right. Look at the unconditional association between each putative confounder and the exposure. Look at the unconditional association between the confounder and the outcome. If both associations are present, the confounder is a real candidate for adjustment. If one is essentially zero, the variable might not need adjustment in this dataset, even though the DAG suggests it could be a confounder in principle.

**Sarah:** Then the last thing in Section 4 is the analysis log.

**Kiffer:** Yeah. The lesson recommends keeping a running document, separate from the scripts, where you record every model you fit, what it produced, and your interpretation. Date stamped. With links to the script and the output file.

**Sarah:** And why does that matter, given that the scripts already record what you did?

**Kiffer:** Because the scripts record the what. The analysis log records the why. Three months from now, when your supervisor asks why you decided to drop the interaction term in model six, the script tells you that you did it. The log tells you why. You looked at the cross-product, the coefficient was tiny, the standard error was huge, and removing it didn't change the main estimates. That kind of decision-trail is what makes the analysis defensible during peer review.

**Sarah:** And the lesson recommends analyzing in blocks. Not just dumping every analysis into one giant script.

**Kiffer:** Right. One block per substantive question. One log file per script. Same name, different extension. So script 04_unconditional.R produces 04_unconditional.log. The pairing is automatic.

**Sarah:** Okay. Let me try to pull the takeaways together for the assessment.

**Kiffer:** Yeah, please do.

**Sarah:** Takeaway one. The DAG comes first. Before any fitting, you draw the directed acyclic graph. Forks are confounders, adjust for them. Chains contain mediators, don't adjust if you want the total effect. Colliders are common effects, don't adjust. The DAG fixes your estimand and your adjustment set before you write any model code.

**Kiffer:** Takeaway two. The mediation example shows the DAG translating into a regression strategy. Three regressions for Baron and Kenny. Or the mediation package in R, with bootstrap confidence intervals, which is the modern preferred tool. Either way, the indirect effect is only as credible as the DAG.

**Sarah:** Takeaway three. Project skeleton. One RStudio project. Folders for raw data, processed data, scripts, and figures. The tidyverse for the actual analysis. And the here package for robust file paths.

**Kiffer:** Takeaway four. Coding standards. One missing-value code, like negative 999. Consistent zero-no, one-yes binary coding. Codes printed on the form, not improvised at entry. Lowercase variable names with underscores. ISO 8601 date format, year-month-day. And a data dictionary that documents every variable.

**Sarah:** Takeaway five. Reproducibility. Program mode, not interactive. No spreadsheet edits. No manual changes to raw data. Every transformation in code. Versioned files, not overwrites. Git for change tracking, GitHub or GitLab for hosting.

**Kiffer:** Takeaway six. Verification. Range checks. Logical checks across variables. Cross-tabulations for impossible combinations. Histograms for continuous variables, frequency tables for categorical. Do all this before you fit anything.

**Sarah:** Takeaway seven. Analytic outcome and predictor processing. Decide the regression family from the outcome. Continuous gets linear, binary gets logistic, counts get Poisson or negative binomial, time-to-event gets survival. Dummy code categorical predictors. Center and scale continuous predictors when interpretation calls for it.

**Kiffer:** Takeaway eight. Unconditional associations come before multivariable models. Predictor-outcome pairs to see direction, magnitude, and functional form. Predictor-predictor pairs to spot collinearity. Confounder relationships to verify the DAG. Cross-tabs and chi-squared for categorical pairs, scatter plots and correlation for continuous pairs, box plots for mixed pairs.

**Sarah:** And takeaway nine. Keep an analysis log. The scripts capture the what, the log captures the why. Date-stamped, paired with each script, separate from but parallel to the code. That decision-trail is what makes the analysis defensible six months from now.

**Kiffer:** And one orienting note before we close. This part has a different feel from the earlier material. The earlier parts were broader. Reading evidence, designing studies. This part is narrower and more technical. We are spending most of our time in regression and modeling territory, with R as the medium throughout. The good news is that everything we do here has been set up by the previous two parts. The bad news is that the lessons are denser. There's more to actually do, week to week.

**Sarah:** And one practical note. The reflection questions in this lesson are not optional. The final assessment for this material will assume you've sketched a DAG of your own research question, and that you've thought through what your project skeleton would look like. Sit with those reflections. They are the working version of what we just talked through.

**Kiffer:** Plus the small but important habit. Open RStudio. Make a new project. Build the four folders. Install the tidyverse and here. Even if you don't have data yet, having that scaffold sitting there means that when you do have data, the friction to get started is gone.

**Sarah:** Next up is Lesson 2. Data Cleaning and Descriptive Analyses. We move from setup to the first round of actual analysis. Detecting outliers, addressing missing values, producing the descriptive table that every analysis report needs. The Table 1 of a paper. The structured workflow we built today is what makes that work tractable.

**Kiffer:** Take care, everyone. Draw a DAG this week, even a rough one, and the rest of the course will feel a lot less abstract.

**Sarah:** See you in Lesson 2.
