A Structured Approach to Data Analysis
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Construct a causal diagram before beginning data analysis
- Establish a system for managing data-collection sheets, files, and variables
- Apply best practices for data coding, entry, and verification
- Process outcome and predictor variables appropriately for analysis
- Evaluate unconditional associations between variables
- Set up a systematic approach for keeping track of analyses
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Data Collection
Introduction and Overview
Welcome to HSCI 410. If you've come from HSCI 230 and 341, you've spent two courses learning how to read epidemiological evidence (230) and how to design and conduct epidemiologic studies (341). HSCI 410 picks up the next link in the chain: how to analyse the data those studies produce. The R skills you've been practicing in the orange boxes throughout 230 and 341 scale up here into the full statistical machinery of modern public-health analysis — linear, logistic, multinomial, ordinal, count, and survival regression; mixed models for clustered and longitudinal data; and increasingly causal-inference tools for observational data. This first lesson sets the foundation: a disciplined workflow for taking raw data from collection through to analysis-ready files. The structured approach has become especially important since the wider scientific community recognised a reproducibility crisis in published research (Ioannidis, 2005; Open Science Collaboration, 2015; Wikipedia: Replication crisis) and called for a manifesto of reproducible practice (Munafò et al., 2017). Across four content sections we walk through this in order: introduction and data collection (Section 1), data coding, entry, and file management (Section 2), program files, editing, and verification (Section 3), and data processing plus the first unconditional associations (Section 4).
Learning Objectives
- Explain why a structured, iterative analytic workflow outperforms diving straight into modelling.
- Sketch a causal diagram that distinguishes outcomes, predictors, confounders, and intervening variables.
- Identify the components of a usable data-collection sheet for primary data.
- Recognise where Section 1 sits in the larger pipeline that runs through Sections 2–4.
Why a Structured Approach?
When starting the analysis of a complex dataset, it is very helpful to have a structured approach in mind. Most people feel a strong pull to jump straight into the sophisticated analysis that will provide the ultimate answer. This rarely works out: skipping the preliminary steps leaves errors undetected, and the results are likely to be wrong.
Key Principle
Data analysis is an iterative process which often requires that you back up several steps as you gain more insight into your data — an idea Tukey championed when he distinguished exploratory from confirmatory work (Peng, 2011). A structured template, while not the only approach, will be applicable in most situations and will serve to guide your initial efforts; tidy-data conventions (Wickham, 2014) make each iteration cheaper.
Start with a Causal Diagram
Before you start any work with your data, it is essential to construct a plausible causal diagram of the problem you are about to investigate. This will help identify:
- Which variables are important outcomes and predictors
- Which are potential confounders
- Which might be intervening variables between your main predictors and outcomes
Practical Tip
Keep this causal diagram in mind throughout the entire data-analysis process. With large datasets, it will not be possible to include all predictors as separate entities. This can be handled by including blocks of variables (e.g., demographic characteristics) in the diagram instead of listing each variable.
Quick Refresher — DAGs from HSCI 341
When we say “causal diagram” in 410, we mean a directed acyclic graph (DAG): nodes for variables, directed arrows for direct causal effects, no cycles. You met these in HSCI 341 Lesson 1 — the three structural pieces still do all the work:
- Fork (X ← C → Y) — C is a confounder. Adjust for it.
- Chain (X → M → Y) — M is a mediator. Do not adjust if you want the total effect.
- Collider (X → Z ← Y) — Do not adjust, and watch for it in selection.
The DAG fixes your estimand (what causal quantity you are estimating) and your adjustment set (what goes on the right-hand side of the regression) before you fit anything. In 410 we use it as the bridge from a research question to a regression model.
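The fork and collider rules above can be checked numerically. The following is a minimal sketch on simulated data, using base R only; the variable names, effect sizes, and seed are illustrative assumptions and are not taken from the course materials. In the fork, the true effect of X on Y is set to 0.5, so the adjusted estimate should land near 0.5 while the unadjusted one is inflated. In the collider, A and B are generated independently, so any association that appears after adjusting for Z is an artefact of the adjustment.

```r
# Simulated illustration of the fork and collider rules (base R only).
# All names and coefficients here are hypothetical.
set.seed(341)
n <- 5000

# Fork: C causes both X and Y; the true effect of X on Y is 0.5.
C <- rnorm(n)
X <- 0.8 * C + rnorm(n)
Y <- 0.5 * X + 0.7 * C + rnorm(n)
fork_unadjusted <- coef(lm(Y ~ X))["X"]      # back-door path through C is open
fork_adjusted   <- coef(lm(Y ~ X + C))["X"]  # back-door path closed by adjusting for C

# Collider: A and B are independent, but both cause Z.
A <- rnorm(n)
B <- rnorm(n)
Z <- A + B + rnorm(n)
collider_unadjusted <- coef(lm(B ~ A))["A"]      # no adjustment for the collider
collider_adjusted   <- coef(lm(B ~ A + Z))["A"]  # adjusting for Z opens the path

round(c(fork_unadjusted, fork_adjusted,
        collider_unadjusted, collider_adjusted), 2)
```

Comparing the four numbers side by side shows why the adjustment set should be chosen from the DAG before fitting: the same act of adding a covariate removes bias in one structure and creates it in the other.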
From DAG to Regression: A Mediation Example
One of the clearest places to see this bridge is mediation. A DAG of the form X → M → Y with a residual direct path X → Y is the qualitative claim; the Baron & Kenny (1986) procedure introduced in 341 puts numbers on the direct and indirect components. Below we do that fitting in R, on simulated data so you can verify the answer against the truth.
R example (mediation package)
The DAG: education → income → health, with education → health directly. We will (i) run Baron & Kenny’s three regressions by hand, then (ii) replicate the result with the mediation package, which gives proper bootstrap confidence intervals for the indirect effect.
# install.packages(c("mediation", "dagitty"))
library(mediation)
# 1. Simulate data that match the DAG: education -> income -> health,
# with a smaller direct path education -> health.
set.seed(410)
n <- 800
education <- rnorm(n)
income <- 0.6 * education + rnorm(n) # path a
health <- 0.3 * education + 0.5 * income + rnorm(n) # direct + path b
dat <- data.frame(education, income, health)
# 2. Baron & Kenny by hand --------------------------------------------------
# Step 1: total effect c (health on education)
coef(lm(health ~ education, data = dat))["education"]
# Step 2: a (income on education)
fit_M <- lm(income ~ education, data = dat)
# Step 3: direct c' (education) and b (income), from health on both
fit_Y <- lm(health ~ education + income, data = dat)
coef(fit_Y) # c' on education, b on income
# Indirect effect = a * b (or equivalently c - c')
a <- coef(fit_M)["education"]
b <- coef(fit_Y)["income"]
a * b
# 3. Same answer, with bootstrap CIs, via the mediation package -------------
med <- mediate(fit_M, fit_Y,
               treat    = "education",
               mediator = "income",
               boot = TRUE, sims = 1000)
summary(med)
How to read this. The ACME (Average Causal Mediation Effect) is the indirect path through income; the ADE is the direct path; their sum is the total effect. About half of education’s effect on health travels through income. Two cautions worth carrying forward: (1) the estimate of the indirect effect is only as credible as the DAG — if there is an unmeasured confounder of income and health, the indirect estimate is biased even though the regressions converge cleanly; (2) Baron & Kenny assumes no exposure-mediator interaction, which is why mediation::mediate() is the preferred tool when in doubt.
R Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console / plot before answering.
1. Compare the total effect c from lm(health ~ education) with the direct effect c' (the education coefficient in fit_Y). Which is larger, and what does the difference imply about the role of income?
The total effect c from lm(health ~ education) is roughly 0.60, while the direct effect c' from fit_Y is about 0.30, so the total is larger. The difference (~0.30) is the indirect effect operating through income; income carries about half of education's effect on health. This is the classical Baron & Kenny mediation signal: a substantial drop from c to c' when the mediator is added to the model.
2. Multiply a * b by hand. How close is this product to the bootstrapped ACME from summary(med)? Does the 95% CI for ACME exclude zero?
With a ≈ 0.60 (the education coefficient in fit_M) and b ≈ 0.50 (the income coefficient in fit_Y), a×b ≈ 0.30. The bootstrapped ACME from summary(med) is reported as 0.301 with 95% CI (0.244, 0.358): almost exactly the product, with the CI clearly excluding zero. The agreement shows that the by-hand and package estimates match; the CI indicates that the indirect effect is statistically distinguishable from zero.
3. The output reports a Prop. Mediated of about 0.50. Translate that into a sentence about education, income, and health. What would change about your interpretation if the 95% CI for ACME crossed zero?
The structured approach starts with structured files. The convention below — one RStudio project, one folder per stage, one numbered script per task — scales from a homework assignment to a journal-ready paper.
# Create directories from R (or by hand). Run once at project start.
dir.create("data/raw", recursive = TRUE)
dir.create("data/processed", recursive = TRUE)
dir.create("R"); dir.create("output/figures", recursive = TRUE)
# tidyverse: dplyr (manipulation), ggplot2 (graphics), readr (file IO),
# tidyr (reshape), stringr (text). Install once.
# install.packages("tidyverse")
library(tidyverse)
library(here) # robust file paths
# A canonical pipeline: read -> clean -> save -> analyse
raw <- read_csv(here("data/raw/cohort.csv"))
clean <- raw |>
  filter(!is.na(outcome)) |>
  mutate(age_grp = cut(age, c(0, 30, 50, 70, Inf)),
         smoker  = factor(smoker, levels = c("No", "Yes")))
write_csv(clean, here("data/processed/cohort_clean.csv"))
# Sketch a DAG to anchor the analysis (HSCI 341 Lesson 1)
# library(dagitty)
# g <- dagitty("dag { smoker -> outcome ; age -> smoker ; age -> outcome }")
The pipe operator |> (or %>%) is the workhorse of the tidyverse: it chains verb -> verb -> verb so analysis reads top to bottom. Combined with here() for paths, your project is movable, shareable, and version-control-friendly out of the box.
R Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console / plot before answering.
1. After running dir.create() four times, what folder structure now exists in your project? Why is keeping data/raw/ separate from data/processed/ a defensible choice?
The four dir.create() calls create data/raw/, data/processed/, R/, and output/figures/. Keeping data/raw/ separate from data/processed/ is defensible because raw data are the irreplaceable artefact, the source of truth that should never be modified. Processed data are derived from raw; if a cleaning bug is discovered, you can re-derive from raw, but if you overwrote the raw file, the bug is permanent. The separation enforces the rule "raw is read-only; processed is regenerable."
2. Trace the pipeline raw |> filter(...) |> mutate(...). Which two columns does mutate() set in the clean object, and which of them did not exist in raw?
mutate() sets two columns in clean: age_grp (a factor with intervals 0–30, 30–50, 50–70, 70+) and smoker (a factor with levels "No" and "Yes" with a specified reference). Only age_grp is new; it is a categorisation of continuous age. smoker already existed in raw and is re-encoded with a controlled level order so regression coefficients align with the chosen reference.
3. Why does the script use here("data/raw/cohort.csv") instead of an absolute path like "C:/Users/.../cohort.csv"? Give one practical scenario where this matters.
here() resolves paths relative to the project root, so the same script works on any machine, in any user account, in any operating system, provided the project structure is unchanged. Absolute paths break instantly when the project is moved, shared with a collaborator on a different OS, or run inside a Docker container. Concrete scenario: you send your code to a co-author for review. They unzip the project on their Mac at ~/projects/this_study/; your hardcoded C:/Users/.../ would fail immediately, while here("data/raw/cohort.csv") just works.
Managing Data-Collection Sheets
It is important to establish a permanent storage system for all original data-collection sheets (survey forms, data-collection forms, etc.) that makes it easy to retrieve individual sheets if they are needed during the analysis.
Do not remove originals from your file. If you need a specific sheet for use at another location, make a photocopy. Never ship the original to another location without first making copies of all forms.
Set up a system for recording the insertion of data-collection sheets into the file so that you know how many remain to be collected before further work begins.
Once all forms have been collected, scan through all sheets for their completeness before doing anything else. If there are omissions, returning to the data source to complete the data will be more likely to succeed if done soon after collection rather than weeks or months later.
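The tracking and completeness checks described above can be kept in a small script once the sheets start arriving. This is a minimal sketch in base R under stated assumptions: the sheet identifiers, filing dates, and the two fields checked (age, smoker) are hypothetical, and a real project would read the log from a file instead of typing it in.

```r
# Hypothetical tracking log: six sheets were issued, four have been filed.
expected_ids <- sprintf("F%03d", 1:6)          # F001 ... F006
received <- data.frame(
  sheet_id   = c("F001", "F002", "F004", "F005"),
  date_filed = as.Date(c("2024-03-01", "2024-03-01", "2024-03-04", "2024-03-05")),
  age        = c(34, NA, 51, 47),
  smoker     = c("No", "Yes", NA, "No")
)

# 1. Which sheets remain to be collected before further work begins?
outstanding <- setdiff(expected_ids, received$sheet_id)
length(outstanding)

# 2. Which filed sheets have omissions that need a return to the data source?
n_missing  <- rowSums(is.na(received[, c("age", "smoker")]))
incomplete <- received$sheet_id[n_missing > 0]
incomplete
```

Running the second check as each batch is filed, and not only at the end, keeps the gap between collection and follow-up short.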
1. What should you construct before beginning any work with your data?
2. Why is data analysis described as an “iterative process”?
3. What should you do if you find omissions in data-collection sheets?
Reflection
Think of a research question you are interested in. Sketch out (describe) a causal diagram showing the key outcome, main predictors, potential confounders, and any intervening variables. How does this diagram help you plan your analysis?
Data Coding, Entry & File Management
Introduction and Overview
Section 1 set up the conceptual workflow and the discipline of starting from a causal diagram before you touch any data. Section 2 turns to the practical side: how do you encode raw responses, enter them into a workable file, organise files across a project, and keep track of what every variable in your dataset means? These are unglamorous tasks, but they determine whether the analysis you eventually run is reproducible. Organising data so that each variable is a column and each observation a row — the "tidy data" convention — makes downstream analysis dramatically easier (Wickham, 2014; Wikipedia: Tidy data).
Learning Objectives
- Apply coding conventions for missing values, numeric codes, and avoiding compound codes.
- Plan a data-entry workflow that minimises transcription error.
- Lay out a project folder structure that distinguishes raw, processed, and analysis files.
- Build and maintain a variable codebook that any collaborator could open and use.
Data Coding
Before entering data into a computer, careful coding is essential. Good coding practices prevent errors that can cascade throughout an entire analysis.
Data Entry
Some important issues to consider when entering your data into a computer file:
Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry. This dramatically reduces the error rate in your dataset.
Spreadsheets are a convenient tool for initial data entry, but they must be used with extreme caution. It is possible to sort a single column independently of the others, which misaligns every record and can destroy your entire dataset with one inappropriate “sort” command. Custom data-entry software provides a greater margin of safety.
As soon as the data-entry process has been completed, save the original data files in a safe location. In large, expensive trials, keep a copy of all originals stored in another location. Convert your data to the format your statistical software uses as soon as possible.
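The comparison step of double-data entry is easy to script. The sketch below uses base R and two small hypothetical entry passes (entry1, entry2) keyed on an id column; the column names and values are assumptions made for illustration. Records are aligned on id before comparing, so a difference in row order between the two files is not reported as a discrepancy.

```r
# Two hypothetical entry passes of the same three records.
entry1 <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47), sbp = c(120, 142, 135))
entry2 <- data.frame(id = c(3, 1, 2), age = c(47, 34, 15), sbp = c(153, 120, 142))

# Align on id so that row order cannot cause false mismatches.
both <- merge(entry1, entry2, by = "id", suffixes = c("_e1", "_e2"))

# Compare every variable cell by cell (rows = records, columns = variables).
vars <- c("age", "sbp")
mismatch <- sapply(vars, function(v) {
  both[[paste0(v, "_e1")]] != both[[paste0(v, "_e2")]]
})

# List each inconsistent cell so it can be checked against the paper sheet.
idx <- which(mismatch, arr.ind = TRUE)
discrepancies <- data.frame(
  id       = both$id[idx[, "row"]],
  variable = vars[idx[, "col"]]
)
discrepancies
```

Note that a cell that is missing in one pass compares as NA and is dropped by which(), so missing values would need a separate check in a fuller version.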
Keeping Track of Files
It is important to have a system for keeping track of all your files. Key recommendations:
- Assign a logical name with a 2-digit numerical suffix (e.g., brazil01). A 2-digit suffix allows you to have 99 versions that still sort correctly when listed alphabetically.
- When data manipulations are carried out, save the file with a new name (the next available number). Do not change data and then overwrite the file.
- Keep a simple log of files created with information about the contents (e.g., number of observations and variables).
bp01.odc (27/09/07) — Original blood pressure study data; spreadsheet; 1 record per measurement. 1092 obs, 8 vars.
bp01.dta (28/09/07) — Original file; Stata format. 1092 obs, 8 vars.
bp02.dta (30/09/07) — 45 records with missing values dropped. 1047 obs, 8 vars.
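The naming rule above can be automated so that nobody has to look up the next free number. The following is a minimal sketch in base R; next_version() is a hypothetical helper written for this illustration, not a function from any package. It also shows why the zero padding matters for sort order.

```r
# Hypothetical helper: build the next zero-padded file name from existing ones.
next_version <- function(existing, stem, ext) {
  pattern <- paste0("^", stem, "([0-9]{2})\\.", ext, "$")
  matched <- grep(pattern, existing, value = TRUE)
  used    <- as.integer(sub(pattern, "\\1", matched))
  nxt     <- if (length(used) == 0) 1L else max(used) + 1L
  stopifnot(nxt <= 99L)                # a 2-digit suffix allows versions 01 to 99
  sprintf("%s%02d.%s", stem, nxt, ext)
}

next_version(c("bp01.dta", "bp02.dta"), stem = "bp", ext = "dta")

# Zero padding keeps alphabetical order equal to numerical order.
padded   <- sort(c("bp10", "bp02", "bp01"), method = "radix")
unpadded <- sort(c("bp10", "bp2", "bp1"), method = "radix")
padded
unpadded
```

In the unpadded case version 10 sorts ahead of version 2, which is exactly the listing problem the 2-digit suffix avoids.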
The "save a new version, don't overwrite" rule is automatic if your transformations live in a script. The file log becomes the script's git history.
library(tidyverse)
# Read raw, never modify in place
bp_raw <- read_csv("data/raw/bp01.csv")
# Tidy: drop incomplete rows, build derived variables, lock factors
bp_clean <- bp_raw |>
  drop_na(systolic, diastolic, age) |>
  mutate(
    age_ct   = age - mean(age),   # centred
    age_ctsq = age_ct^2,          # quadratic term
    age_c3   = cut(age, c(0, 35, 55, Inf),
                   labels = c("young", "middle", "older")),
    htn      = factor(systolic >= 140 | diastolic >= 90,
                      levels = c(FALSE, TRUE),
                      labels = c("normotensive", "hypertensive"))
  )
# Persist as a new versioned file - and a small log line
write_csv(bp_clean, "data/processed/bp02.csv")
cat("bp02.csv", format(Sys.Date()), nrow(bp_clean), "obs",
    "\n", file = "data/file_log.txt", append = TRUE)
## At any time you can rebuild bp02 from bp01 by re-running this script.
This is the difference between “data analysis” and “data engineering”. A clicked-together SPSS workflow can't be re-derived. A script can — six months from now, by a colleague, on a different computer.
R Reflect on what you just ran
Use the questions below to interpret the output you produced. Look at your console / plot before answering.
1. The mutate() call creates four new variables (age_ct, age_ctsq, age_c3, htn). For each, state in one phrase what kind of variable it is (continuous, categorical, derived) and why a future analyst would want it pre-built.
age_ct is a continuous variable (mean-centred age), useful so that the regression intercept is the expected outcome at the sample mean age. age_ctsq is a derived continuous variable (centred age squared), which allows the model to fit non-linear age effects without high collinearity with linear age. age_c3 is a categorical (factor) variable (three age groups), convenient for stratified summaries and clinical-grouping interpretation. htn is a derived categorical (binary) variable, which allows easy contingency analyses and clinically intuitive subgroup reports. Pre-building these in the cleaning script means downstream analysis code references them by name instead of duplicating the recoding logic.
2. The threshold for htn is systolic >= 140 | diastolic >= 90. If you raised the systolic cutoff to 150, would the prevalence of "hypertensive" go up or down? What does this tell you about the sensitivity of categorical recodes to threshold choice?
3. The script writes bp02.csv rather than overwriting bp01.csv. Describe one error this rule would prevent that an SPSS point-and-click workflow would not.
Writing bp02.csv instead of overwriting bp01.csv preserves an audit trail of derivations. SPSS point-and-click workflows commonly overwrite the working dataset, so a bug discovered three steps later cannot be undone because the earlier state is gone. With versioned outputs, you can re-run from any intermediate point, diff one version against another, and verify the consequence of a single cleaning decision. Concrete bug-prevention: a labelling error in bp02 (e.g., reversed levels) is recoverable because bp01 remains intact and bp02 can be re-derived from it.
Keeping Track of Variables
Even a relatively focused study can give rise to a large number of variables once transformed and recoded variables have been created. Recommendations include:
Use short but informative names and have all related variables start with the same name. Long names can be shortened by removing vowels (e.g., wtr_cstrn for “water cistern”). If your statistics program is case sensitive, use ONLY lower-case letters. At some point, prepare a master list of all variables.
| Variable | Description |
|---|---|
| age | Original data (in years) |
| age_ct | Age after centring by subtraction of the mean |
| age_ctsq | Quadratic term (age_ct squared) |
| age_c2 | Age categorised into 2 categories (young vs old) |
| age_c3 | Age categorised into 3 categories |
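A master list like the one above can be generated from the dataset itself, so that it never drifts out of date. The sketch below is one possible approach in base R, not a prescribed method: the small dataset, the cut point of 45 years for age_c2, and the codebook layout are all assumptions made for illustration. The final check stops the script if a variable exists in the data without a written description.

```r
# Hypothetical dataset using the naming scheme from the table above.
dat <- data.frame(age = c(34, 51, NA, 47))
dat$age_ct   <- dat$age - mean(dat$age, na.rm = TRUE)
dat$age_ctsq <- dat$age_ct^2
dat$age_c2   <- cut(dat$age, c(0, 45, Inf), labels = c("young", "old"))  # 45 is an assumed cut point

# Descriptions are written by hand, keyed on the variable name.
descriptions <- c(
  age      = "Original data (in years)",
  age_ct   = "Age after centring by subtraction of the mean",
  age_ctsq = "Quadratic term (age_ct squared)",
  age_c2   = "Age categorised into 2 categories (young vs old)"
)

# Everything else in the codebook is derived from the data.
codebook <- data.frame(
  variable    = names(dat),
  class       = vapply(dat, function(x) class(x)[1], character(1)),
  n_missing   = vapply(dat, function(x) sum(is.na(x)), integer(1)),
  description = unname(descriptions[names(dat)]),
  row.names   = NULL
)

# Fail loudly if a variable was created without being documented.
stopifnot(!anyNA(codebook$description))
codebook
```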
1. Why should you never use compound codes?
2. What is the advantage of double-data entry?
3. When data manipulations are carried out, what should you do with the file?
Reflection
Describe a file-naming and version-control system you would use for a dataset in your own research area. How would you organise the variable names for a study with demographic, clinical, and outcome variables?
File naming: YYYYMMDD_studyname_dataset_version.ext (e.g., 20260516_smoking_cohort_clean_v03.csv). Version control with Git for code and DVC (Data Version Control) for data, plus weekly snapshots to a cloud backup. Variable naming: lowercase snake_case throughout; demographic prefix dem_ (e.g., dem_age, dem_sex, dem_education); clinical prefix cli_ (cli_bp_systolic, cli_hba1c); outcome prefix out_ (out_mi_5y, out_cvd_death). Avoid spaces and special characters; use a version suffix for re-derived variables (out_mi_5y_v2); a codebook (markdown + machine-readable JSON/YAML) accompanies the dataset, documenting type, valid range, missing-data codes, and derivation logic. This is the difference between a dataset a stranger can reproduce in 6 months and one only the original analyst can navigate.
Program Files, Data Editing & Verification
Introduction and Overview
Section 2 organised your files and variables. Section 3 turns to the analyses themselves: working in program mode rather than interactively (so the work is reproducible), editing data systematically rather than ad hoc, and verifying that the dataset behaves as expected before you draw any inferences from it. The discipline you build here is what separates a defensible analysis from a fragile one, and it has been argued that reproducibility is now the minimum standard for evaluating computational research (Peng, 2011).
Learning Objectives
- Contrast interactive and program-mode workflows and justify why programs are required for reproducible work.
- Edit data through scripted, documented steps rather than ad hoc fixes to the raw file.
- Run systematic verification checks on ranges, types, and internal consistency.
- Decide when verification is sufficient to move on to substantive analysis.
Program Mode vs. Interactive Processing
Statistical programs can be used in an interactive mode (selecting items from menus or typing in a command) or in program mode (compiling a series of commands into a program and then running it).
Interactive mode is very useful for exploring your data and trying out analyses. However, it should not be used for any of the “real” processing and/or analysis because it is very difficult to keep a clear record of steps taken. Consequently, it is difficult or impossible to reconstruct the analyses you have completed.
Program mode is the recommended approach. You compile the commands into a program and then run it. These program files can be saved and used to reconstruct any analyses you have carried out. Key tips: name files logically, structure the program to be easy to follow, use sequential indents, and document the file thoroughly with comments.
Critical Rule
Do all of the analyses in your statistical program. Don’t start doing basic statistics in a spreadsheet. You are going to need the statistical program eventually, and it will be much easier to keep track of all your analyses if they are all done there.
Data Editing
Before beginning any analyses, spend time editing your data. The most important components are:
Data Verification
Before you start any analyses, you must verify that your data are correct. This can be combined with data processing and involves going through all of your variables, one-by-one.
For continuous variables:
- Determine the number of valid observations and the number of missing values
- Check the maximum and minimum values (or the 5 smallest and 5 largest) to make sure they are reasonable; if they are not, find the error, correct it, and repeat the process
- Prepare a histogram of the data to get an idea of the distribution and see if it looks reasonable
For categorical variables:
- Determine the number of valid observations and the number of missing values
- Obtain a frequency distribution to see if the counts in each category look reasonable (and to make sure there are no unexpected categories)
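The checks listed above translate almost line for line into base R. This is a minimal sketch on a small hypothetical dataset that contains one implausible age and one unexpected category code; the variable names and the set of allowed codes ("F", "M") are assumptions for illustration.

```r
# Hypothetical data with one implausible age (430) and one stray code ("f").
dat <- data.frame(
  age = c(34, 51, NA, 47, 29, 430, 62, 58),
  sex = c("F", "M", "F", "f", "M", NA, "F", "M")
)

# Continuous variable: counts, extremes, distribution
n_valid   <- sum(!is.na(dat$age))
n_missing <- sum(is.na(dat$age))
range(dat$age, na.rm = TRUE)
head(sort(dat$age), 5)                 # 5 smallest (sort() drops NA by default)
tail(sort(dat$age), 5)                 # 5 largest
hist(dat$age, main = "age", xlab = "years")

# Categorical variable: counts per category, including missing
sex_counts <- table(dat$sex, useNA = "ifany")
sex_counts

# Any category outside the codes you expect?
observed   <- unique(dat$sex[!is.na(dat$sex)])
unexpected <- setdiff(observed, c("F", "M"))
unexpected
```

In practice the correction for a value such as 430 belongs in the cleaning script, after checking the original data-collection sheet, and the verification is then re-run.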
1. Why should program mode be preferred over interactive mode for “real” data analysis?
2. When verifying continuous variables, what should you examine first?
3. What is the purpose of attaching labels to categorical variable values?
Reflection
Imagine you receive a dataset where a colleague entered data interactively in a spreadsheet with no documentation. What steps would you take to clean, verify, and prepare the data for analysis? What problems might you encounter?
Data Processing & Unconditional Associations
Introduction and Overview
Sections 1–3 produced a verified, well-documented dataset. Section 4 finally takes the step every student is impatient to make: the actual analysis. We start with how to process outcome and predictor variables for analysis, handle multilevel data structure, and run the “unconditional associations” (single-predictor descriptions) that are the necessary first look at any dataset before you touch a multivariable model.
Learning Objectives
- Process outcome variables to fit the planned analysis (categorical, continuous, count, or rate).
- Process predictor variables, including categorisation, scaling, and recoding decisions.
- Recognise multilevel structure in your dataset and the implications for later modelling.
- Run and interpret unconditional (single-predictor) associations as the first analytic step.
- Keep an analytic log that allows you to reconstruct every decision you made.
Processing the Outcome Variable(s)
While verifying data, you can also start processing your outcome variable(s). Review the stated goals of the study to determine the format(s) that best suit those goals. Consider the following, based on outcome type:
Categorical outcome: Is the distribution of outcomes across categories acceptable? For example, if you planned a multinomial regression with a 3-category outcome, but very few observations fall in one of the categories, you might want to recode it to a 2-category variable.
Continuous outcome: Does the variable have the characteristics necessary for the planned analysis? If linear regression is planned, is the distribution approximately normal? If not, explore transformations. Note: It is the normality of the residuals which is ultimately important, but if the original variable is far from normal and there are no strong predictors, the residuals are unlikely to be normal.
Count or rate outcome: If Poisson regression is planned, are the mean and variance of the distribution approximately equal? If not, consider negative binomial regression or alternative analytic approaches.
Time-to-event outcome: What proportion of the observations are censored? You might also want to generate a simple graph of the empirical hazard function to get an idea what shape it has.
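For a count outcome, the mean-versus-variance comparison mentioned above takes one line. The sketch below uses base R and simulated counts, so the parameter values (a mean of 3 and a negative binomial size of 1) are illustrative assumptions. A ratio near 1 is what a Poisson distribution predicts; a ratio well above 1 suggests overdispersion. This is a crude marginal check only, since what ultimately matters is the dispersion that remains after predictors are in the model.

```r
set.seed(410)
n <- 2000
y_pois <- rpois(n, lambda = 3)            # variance equals the mean by construction
y_nb   <- rnbinom(n, mu = 3, size = 1)    # variance = mu + mu^2 / size, so overdispersed

# Variance-to-mean ratio: about 1 under a Poisson distribution.
dispersion <- function(y) var(y) / mean(y)

dispersion(y_pois)
dispersion(y_nb)
```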
Processing Predictor Variables
It is important to go through all predictor variables to determine how they will be handled:
- Missing values: Are there many? If so, you might need to abandon plans to use that predictor, or conduct 2 analyses (one on the subset where the predictor is present and one on the full dataset ignoring the predictor).
- Distribution: For continuous variables, is there a reasonable representation over the whole range of values? If not, it might be necessary to categorise the variable.
- Categorical variables: Are all categories reasonably well represented? If not, you might have to combine categories.
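The three checks above can be run over all predictors at once. This is a minimal sketch in base R on hypothetical data; the variable names, the values, and the rule of treating a category with fewer than 2 observations as sparse are assumptions chosen to keep the example small, not recommended thresholds.

```r
# Hypothetical predictors: one continuous with gaps, one categorical.
dat <- data.frame(
  bmi    = c(22.1, NA, 27.4, 31.0, NA, 24.8, NA, 29.3, 23.5, 26.0),
  region = factor(c("north", "north", "south", "east", "south",
                    "north", "west", "south", "north", "south"))
)

# Missing values: proportion missing for every predictor
prop_missing <- colMeans(is.na(dat))
prop_missing

# Distribution: is the whole range of the continuous predictor represented?
summary(dat$bmi)

# Categorical variables: counts, then combine sparse categories into "other"
table(dat$region)
sparse <- names(which(table(dat$region) < 2))
dat$region_c3 <- dat$region                     # keep the original, add a recoded copy
levels(dat$region_c3)[levels(dat$region_c3) %in% sparse] <- "other"
table(dat$region_c3)
```

Keeping region alongside region_c3 follows the naming convention from Section 2 and leaves the recoding decision reversible.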
Multilevel Data
If your data are multilevel (e.g., blood pressure measurements within individuals within centres), evaluate the hierarchical structure:
Key Questions for Multilevel Data
What is the average (and range) number of observations at one level in each higher-level unit? Are individuals uniquely identified within a hierarchical level? It is often useful to create one unique identifier for each observation in the dataset.
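The key questions above can be answered with a few tabulations. The following is a minimal sketch in base R using the blood pressure example; the data values are hypothetical. Person numbers restart at 1 inside each centre here, which is a common situation and the reason a combined identifier is built first.

```r
# Hypothetical data: measurements within individuals within centres.
# person restarts at 1 in every centre, so it is NOT unique on its own.
bp <- data.frame(
  centre = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  person = c(1, 1, 1, 2, 2, 1, 1, 2, 3),
  sbp    = c(118, 122, 120, 141, 138, 130, 127, 115, 150)
)

# Unique identifiers at the individual and observation level
bp$person_id <- paste(bp$centre, bp$person, sep = "_")
bp$obs_id    <- seq_len(nrow(bp))

# Measurements per individual: average and range
n_per_person <- as.integer(table(bp$person_id))
mean(n_per_person)
range(n_per_person)

# Individuals per centre: average and range
n_per_centre <- tapply(bp$person_id, bp$centre, function(x) length(unique(x)))
mean(n_per_centre)
range(n_per_centre)
```

Counting unique values of person instead of person_id would report 3 individuals when there are 5, which is the kind of error the unique identifier prevents.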
Unconditional Associations
Before proceeding with any multivariable analyses, it is important to evaluate unconditional associations within the data. These serve as the foundation for building more complex models.
| Variable Types | Analytical Approach |
|---|---|
| Two continuous variables | Correlation coefficient, scatterplot, simple linear regression |
| One continuous + one categorical | One-way ANOVA, simple linear or logistic regression |
| Two categorical variables | Cross-tabulation and χ² test |
When evaluating unconditional associations, pay attention to:
- Associations between predictors and outcome: Determine if there is any association at all; determine the functional form (is it linear?); get a simple picture of the strength and direction.
- Associations between pairs of predictors: Look for potential collinearity problems (highly correlated predictors).
- Confounding variables: Evaluate associations between the confounding variables and the key predictors of interest and the outcome.
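Each row of the table above corresponds to a short base R call. This sketch runs all three on simulated data; the variables (age, smoker, sbp, htn), the effect sizes, and the seed are assumptions made for illustration, and the 140 threshold is reused from the earlier blood pressure script only for convenience.

```r
set.seed(410)
n <- 300
age    <- rnorm(n, mean = 50, sd = 10)
smoker <- factor(sample(c("No", "Yes"), n, replace = TRUE))
sbp    <- 100 + 0.5 * age + 8 * (smoker == "Yes") + rnorm(n, sd = 10)
htn    <- factor(sbp >= 140, levels = c(FALSE, TRUE),
                 labels = c("normotensive", "hypertensive"))

# Two continuous variables: correlation, scatterplot, simple linear regression
r <- cor(age, sbp)
plot(age, sbp)                         # look at the functional form: is it linear?
fit_lin <- lm(sbp ~ age)
coef(fit_lin)

# One continuous + one categorical: one-way ANOVA
fit_aov <- aov(sbp ~ smoker)
summary(fit_aov)

# Two categorical variables: cross-tabulation and chi-squared test
tab <- table(smoker, htn)
tab
chi <- chisq.test(tab)
chi$p.value
```

The same calls applied to pairs of predictors, in place of predictor and outcome, give the first look at collinearity described in the list above.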
Keeping Track of Your Analyses
Before starting the more substantial analysis, set up a system for keeping track of your results:
1. Why should you evaluate unconditional associations before multivariable analyses?
2. If a continuous outcome variable is far from normally distributed, what should you do?
3. What is the appropriate analytical approach for evaluating the association between two categorical variables?
Reflection
Consider a dataset with 15 predictor variables and one continuous outcome. Describe the sequence of unconditional analyses you would carry out before fitting any multivariable models. How would you handle a predictor that has 30% missing values?
A Structured Approach to Data Analysis — Final Assessment
Bringing It All Together
This lesson laid out a structured workflow for taking a dataset from collection to first analysis. Section 1 insisted that you start with a causal diagram and a clear separation of outcomes, predictors, confounders, and intervening variables — before you touch the data. Section 2 turned that intent into discipline at the level of files: thoughtful coding, a project folder you could hand to a collaborator, and a codebook that captures every variable. Section 3 made the workflow reproducible by moving you out of point-and-click interactive mode into program-mode scripts, with systematic editing and verification. Section 4 closed the loop by processing outcomes and predictors, surfacing multilevel structure, and producing the unconditional associations that should always precede a multivariable model.
The thread running through all four sections is that an analysis is only as trustworthy as the steps that came before it. Every later lesson in HSCI 410 — linear regression, model building, logistic regression, count data, survival, mixed models — assumes you arrive at the modelling step with a clean, documented, and well-understood dataset. The structured approach is what makes the rest of the course possible. It also positions you to interpret p-values, confidence intervals, and effect sizes responsibly when you eventually report them (Wasserstein & Lazar, 2016; Greenland et al., 2016), and to avoid the analytic flexibility that drives false-positive results (Simmons, Nelson, & Simonsohn, 2011; see also Wikipedia: John Tukey on the historical roots of exploratory data analysis).
Key Takeaways from Lesson 1
- Always start with a causal diagram: it forces you to declare your outcome, predictors, confounders, and intervening variables before the data can mislead you.
- Data analysis is iterative; expect to back up several steps as you learn more about your data.
- Coding and file-management decisions made in the first hour shape every analysis you run afterward — treat them as part of the analysis, not pre-work.
- Use program mode, not interactive clicks: a script you can re-run is the only honest record of what you did.
- Verify the dataset (ranges, types, consistency) before you trust any descriptive or inferential output.
- Run unconditional associations first; they reveal data problems and crude effect sizes that a multivariable model can obscure.
Final Reflection
Reflecting on the entire lesson, what do you consider the most important step in the structured approach to data analysis? How would you apply this structured approach to a dataset you are currently working with or plan to work with in the future?
1. What is the first step recommended before beginning any data analysis?
2. Why should you avoid starting analyses in a spreadsheet?
3. What coding value should NOT be assigned to missing data?
4. Why is a 2-digit numerical suffix recommended for file names?
5. What is the danger of using the “sort” command in a spreadsheet for data entry?
6. What is the primary purpose of evaluating unconditional associations between pairs of predictors?
7. When processing a categorical outcome with 3 categories, when might you recode it to 2 categories?
8. What approach should be used to document what a program file does?
9. For verifying a continuous variable, what visual tool is recommended?
10. What is the appropriate analysis for the association between one continuous and one categorical variable?
11. If a predictor variable has many missing values, what options are available?
12. What should you do with log files from your analyses?
13. Why is interactive mode still useful despite its limitations?
14. When evaluating confounding variables, what should you specifically look for?
15. What does the chapter suggest you should do if a count/rate outcome’s mean and variance are not approximately equal?