A Structured Approach to Data Analysis

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Construct a causal diagram before beginning data analysis
Establish a system for managing data-collection sheets, files, and variables
Apply best practices for data coding, entry, and verification
Process outcome and predictor variables appropriately for analysis
Evaluate unconditional associations between variables
Set up a systematic approach for keeping track of analyses

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Research Question A focused, answerable statement that frames an analysis. Usually structured around a population, exposure (or determinant), comparator, and outcome (PECO/PICO). A clear research question drives the entire analytic plan.

Analytic Plan A pre-specified roadmap that links the research question to data collection, variables, statistical models, and decision rules. Reduces ad hoc decisions and protects against fishing expeditions.

Hypothesis A testable prediction about the relationship between variables. The null hypothesis (H₀) typically states no association; the alternative (H₁) states an association exists.

Exposure The factor whose effect on the outcome is of primary interest (e.g., a treatment, behaviour, or environmental agent). Sometimes called the predictor or independent variable.

Outcome The health state or event being predicted or explained (e.g., disease, recovery, death). Also called the dependent variable or response.

Covariate Any variable other than the primary exposure that may influence the outcome. May be a confounder, mediator, effect modifier, or simply a precision variable.

Confounder A variable associated with both the exposure and the outcome that is not on the causal pathway between them. Failing to adjust for confounders biases effect estimates.

DAG (Directed Acyclic Graph) A diagram of variables (nodes) and directed causal arrows (edges) with no cycles. Used to encode assumed causal structure and identify which variables to adjust for.

Data Dictionary / Codebook A document listing every variable in a dataset with its name, definition, type, allowable values, units, and coding rules. Essential for reproducibility and collaboration.

Reproducibility The ability of others (or future-you) to re-run an analysis on the same data and obtain the same results, given the code and documentation.

Data Verification Checking entered data against source records to confirm accuracy. Includes double-entry, range checks, consistency checks, and cross-tabulation against expected patterns.

Multilevel Data Observations nested within higher-level units (e.g., patients within clinics, students within schools). Requires methods that account for clustering and within-unit correlation.

Methods & Statistical Concepts

MCAR (Missing Completely at Random) Missingness is unrelated to any observed or unobserved variable. Complete-case analysis is unbiased but loses efficiency.

MAR (Missing at Random) Missingness depends only on observed variables. Multiple imputation and likelihood-based methods can give unbiased estimates.

MNAR (Missing Not at Random) Missingness depends on the unobserved value itself. Requires sensitivity analyses or explicit modeling assumptions; cannot be fixed by standard imputation.

Unconditional Association The crude (unadjusted) relationship between two variables, ignoring other covariates. Useful as a screening step before multivariable modeling.

Program File / Script A saved file containing analysis code (e.g., a .R script) that can be re-executed to reproduce results. Preferred over interactive (point-and-click) processing.

Data Coding Translating raw responses (e.g., “Yes”, “No”, “Don't know”) into numeric or categorical values suitable for analysis, following pre-specified rules.


Section 1

Introduction & Data Collection

⏱ Estimated time: 15 minutes

Lesson 1 · HSCI 410

A Structured Approach to Data Analysis

From reading evidence and designing studies to analysing the data those studies produce.

Why it matters now

Reproducibility as a professional expectation

An analysis is only as trustworthy as the steps that came before it. (Lesson 1 throughline)

The replication crisis, anticipated by Ioannidis (2005) and brought to a head in the 2010s (Open Science Collaboration, 2015; Munafò et al., 2017), raised the bar. Structured, scripted, documented workflows are now the expected standard in published public-health research.

Section 1 of 4


A structured, iterative workflow and the causal diagram as your first analytic move.

The core principle

Analysis is iterative, not linear

Jumping straight to the sophisticated model almost always produces wrong results.

Tukey's distinction between exploratory and confirmatory work is the intellectual anchor. A structured template makes every iteration cheaper and more defensible.

The essential first step

Three DAG structures

Fork

C is a confounder. Adjust for it.

Chain

M is a mediator. Do not adjust for it if you want the total effect.

Collider

Do not adjust for Z. Conditioning opens a spurious path.

DAG to regression

Mediation: education → income → health

Baron & Kenny decomposition

\[ \underbrace{a \times b}_{\text{indirect (ACME)}} + \underbrace{c'}_{\text{direct (ADE)}} = \underbrace{c}_{\text{total}}, \qquad 0.30 + 0.30 \approx 0.60 \]

a: effect of exposure on mediator
b: effect of mediator on outcome
c': direct effect (ADE)
c: total effect

About half of the total effect of education on health flows through income (proportion mediated ≈ 0.50).

Key caution: the indirect estimate is only as credible as the DAG. An unmeasured confounder on the income→health path biases the result even when the regressions look clean.

Project setup

A reproducible project skeleton

  .

  +-- R/                  numbered scripts: 01_load.R, 02_clean.R

  +-- data/raw/          read-only; never overwritten

  +-- data/processed/    regenerated entirely from R/

  +-- output/figures/     numbered figures for the paper

  +-- project.Rproj       one click reopens the environment

Pipelines read top to bottom with the pipe operator; here() keeps file paths portable across machines.

Data collection

Managing data-collection sheets

Protect originals

Never remove originals from the file. Ship photocopies only. The original is the irreplaceable record.

Track progress

Record insertion of each form so you know how many remain to be collected before further work begins.

Scan for completeness

Once all forms are in, check every sheet for omissions before any other step. Act on gaps promptly.

Carry forward

What to take into the next section

Draw the DAG first. It fixes what you are estimating and what goes in the model.
Expect iteration. Backing up several steps is normal and necessary.
Protect originals and scan for completeness before any analysis begins.

Introduction and Overview

Earlier courses in this series covered how to read epidemiological evidence (230) and how to design and conduct epidemiologic studies (341). This course addresses the next link in the chain: how to analyse the data those studies produce. The R skills introduced in the orange boxes throughout 230 and 341 scale up here into the full statistical machinery of modern public-health analysis, including linear, logistic, multinomial, ordinal, count, and survival regression; mixed models for clustered and longitudinal data; and causal-inference tools for observational data. This first lesson sets the foundation: a disciplined workflow for taking raw data from collection through to analysis-ready files. The structured approach has become especially important since the wider scientific community recognised a reproducibility crisis in published research (Ioannidis, 2005; Open Science Collaboration, 2015) and set out a manifesto for reproducible science (Munafò et al., 2017). The four content sections follow this workflow in order: introduction and data collection (this section); data coding, entry, and file management (Section 2); program files, editing, and verification (Section 3); and data processing plus the first unconditional associations (Section 4).

Learning Objectives

Explain why a structured, iterative analytic workflow outperforms diving straight into modelling.
Sketch a causal diagram that distinguishes outcomes, predictors, confounders, and intervening variables.
Identify the components of a usable data-collection sheet for primary data.
Recognise where this section sits in the larger pipeline that runs through later sections.

Why a Structured Approach?

When starting the analysis of a complex dataset, it is very helpful to have a structured approach in mind. Most people are strongly tempted to jump straight into the sophisticated analysis that will provide the ultimate answer. This rarely works out: when important preliminary steps are skipped, the results are likely to be wrong.

Key Principle

Data analysis is an iterative process which often requires that you back up several steps as you gain more insight into your data, an idea Tukey developed when he distinguished exploratory from confirmatory work (Peng, 2011). A structured template, while not the only approach, will be applicable in most situations and will serve to guide your initial efforts; tidy-data conventions (Wickham, 2014) make each iteration cheaper.

Start with a Causal Diagram

Before you start any work with your data, it is essential to construct a plausible causal diagram of the problem you are about to investigate. This will help identify:

Which variables are important outcomes and predictors
Which are potential confounders
Which might be intervening variables between your main predictors and outcomes

Practical Tip

Keep this causal diagram in mind throughout the entire data-analysis process. With large datasets, it will not be possible to include all predictors as separate entities. This can be handled by including blocks of variables (e.g., demographic characteristics) in the diagram instead of listing each variable.

Quick refresher: DAGs from an earlier course

When we say “causal diagram” in 410, we mean a directed acyclic graph (DAG): nodes for variables, directed arrows for direct causal effects, no cycles. You met these in an earlier course, and the three structural pieces still do all the work:

Fork (X ← C → Y): C is a confounder. Adjust for it.
Chain (X → M → Y): M is a mediator. Do not adjust if you want the total effect.
Collider (X → Z ← Y): Z is a collider. Do not adjust for it, and watch for it in sample selection, because conditioning on Z opens a spurious path between X and Y.

The DAG fixes your estimand (what causal quantity you are estimating) and your adjustment set (what goes on the right-hand side of the regression) before you fit anything. In 410 we use it as the bridge from a research question to a regression model.
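To make the three structures concrete, here is a minimal simulation sketch in base R. The effect sizes, the sample size, and the variable names are illustrative assumptions and do not come from any course dataset. In each case the crude and adjusted regressions are compared against a truth that is known by construction.

```r
# Simulate the three DAG structures and compare crude vs adjusted estimates.
# All coefficients and the sample size are illustrative assumptions.
set.seed(410)
n <- 20000

# Fork (X <- C -> Y): X has NO effect on Y; C drives both.
C_f <- rnorm(n)
X_f <- 0.8 * C_f + rnorm(n)
Y_f <- 0.8 * C_f + rnorm(n)
fork_crude    <- unname(coef(lm(Y_f ~ X_f))["X_f"])        # biased away from 0
fork_adjusted <- unname(coef(lm(Y_f ~ X_f + C_f))["X_f"])  # close to the true 0

# Chain (X -> M -> Y): the whole effect (0.6 * 0.5 = 0.3) runs through M.
X_c <- rnorm(n)
M_c <- 0.6 * X_c + rnorm(n)
Y_c <- 0.5 * M_c + rnorm(n)
chain_total    <- unname(coef(lm(Y_c ~ X_c))["X_c"])        # close to 0.3
chain_adjusted <- unname(coef(lm(Y_c ~ X_c + M_c))["X_c"])  # effect blocked, near 0

# Collider (X -> Z <- Y): X and Y are independent; both cause Z.
X_k <- rnorm(n)
Y_k <- rnorm(n)
Z_k <- X_k + Y_k + rnorm(n)
collider_crude    <- unname(coef(lm(Y_k ~ X_k))["X_k"])        # near 0, correct
collider_adjusted <- unname(coef(lm(Y_k ~ X_k + Z_k))["X_k"])  # spurious negative

round(c(fork_crude = fork_crude, fork_adjusted = fork_adjusted,
        chain_total = chain_total, chain_adjusted = chain_adjusted,
        collider_crude = collider_crude,
        collider_adjusted = collider_adjusted), 2)
```

The pattern to look for: adjustment removes bias in the fork, removes the effect you wanted in the chain, and creates an association that does not exist in the collider.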

From DAG to Regression: A Mediation Example

One of the clearest places to see this bridge is mediation. A DAG of the form X → M → Y with a residual direct path X → Y is the qualitative claim; the Baron & Kenny (1986) procedure introduced in 341 puts numbers on the direct and indirect components. Intuitively, the indirect effect is the part of education's benefit that appears only because more education tends to raise income, and higher income in turn improves health; the direct effect is whatever is left once that income pathway is set aside. Below we do that fitting in R, on simulated data so you can verify the answer against the truth.

The indirect effect is the product a × b; the direct effect is the path that does not pass through income. Their sum is the total effect.

R Fitting a mediation model in R (Baron & Kenny + the mediation package)

The DAG: education → income → health, with education → health directly. We will (i) run Baron & Kenny’s three regressions by hand, then (ii) replicate the result with the mediation package, which gives proper bootstrap confidence intervals for the indirect effect.

# install.packages(c("mediation", "dagitty"))
library(mediation)

# 1. Simulate data that match the DAG: education -> income -> health,
#    with a smaller direct path education -> health.
set.seed(410)
n         <- 800
education <- rnorm(n)
income    <- 0.6 * education + rnorm(n)              # path a
health    <- 0.3 * education + 0.5 * income + rnorm(n)  # direct + path b
dat       <- data.frame(education, income, health)

# 2. Baron & Kenny by hand --------------------------------------------------
#    Step 1: total effect c  (health on education)
coef(lm(health ~ education, data = dat))["education"]

#    Step 2: a (income on education)
fit_M  <- lm(income ~ education, data = dat)

#    Step 3: direct c' (education) and b (income), from health on both
fit_Y  <- lm(health ~ education + income, data = dat)
coef(fit_Y)                     # c' on education, b on income

# Indirect effect = a * b  (or equivalently c - c')
a <- coef(fit_M)["education"]
b <- coef(fit_Y)["income"]
a * b

# 3. Same answer, with bootstrap CIs, via the mediation package -------------
med <- mediate(fit_M, fit_Y,
                treat    = "education",
                mediator = "income",
                boot     = TRUE, sims = 1000)
summary(med)

Console output (truncated)

                   Estimate   95% CI Lower   95% CI Upper   p-value
ACME (indirect)       0.301          0.244          0.358    <0.001
ADE (direct)          0.296          0.225          0.366    <0.001
Total Effect          0.597          0.521          0.671    <0.001
Prop. Mediated        0.504          0.418          0.588    <0.001

How to read this. The ACME (Average Causal Mediation Effect) is the indirect path through income; the ADE (Average Direct Effect) is the direct path; their sum is the total effect. About half of education’s effect on health travels through income. Two cautions worth carrying forward: (1) the estimate of the indirect effect is only as credible as the DAG: if there is an unmeasured confounder of income and health, the indirect estimate is biased even though the regressions fit cleanly; (2) Baron & Kenny assumes no exposure-mediator interaction, which is why mediation::mediate() is the preferred tool when in doubt.
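The decomposition can also be checked without the mediation package. The sketch below uses base R only and re-simulates data with the same settings as the lesson's example. The helper name bk_paths and the choice of 500 resamples are assumptions made for illustration, and the percentile bootstrap shown here is a simple stand-in and not necessarily the exact procedure mediate() runs internally.

```r
# Base-R check of c = c' + a*b, plus a percentile bootstrap for a*b.
set.seed(410)
n         <- 800
education <- rnorm(n)
income    <- 0.6 * education + rnorm(n)
health    <- 0.3 * education + 0.5 * income + rnorm(n)
dat       <- data.frame(education, income, health)

# Hypothetical helper: returns a, b, c' (direct) and c (total) for one data set.
bk_paths <- function(d) {
  fit_y <- lm(health ~ education + income, data = d)
  c(a        = coef(lm(income ~ education, data = d))[["education"]],
    b        = coef(fit_y)[["income"]],
    c_direct = coef(fit_y)[["education"]],
    c_total  = coef(lm(health ~ education, data = d))[["education"]])
}

p        <- bk_paths(dat)
indirect <- p[["a"]] * p[["b"]]

# For OLS models fitted to the same rows, the identity holds up to rounding.
identity_holds <- isTRUE(all.equal(p[["c_total"]], p[["c_direct"]] + indirect))

# Proportion mediated = indirect / total.
prop_mediated <- indirect / p[["c_total"]]

# Percentile bootstrap: resample rows, recompute a*b each time.
boot_ab <- replicate(500, {
  q <- bk_paths(dat[sample(nrow(dat), replace = TRUE), ])
  q[["a"]] * q[["b"]]
})
ci_indirect <- quantile(boot_ab, c(0.025, 0.975))

identity_holds
round(c(indirect = indirect, prop_mediated = prop_mediated), 3)
round(ci_indirect, 3)
```

Resampling whole rows keeps each person's education, income, and health together, which is what lets the bootstrap reflect the sampling variability of the product a × b.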

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. Compare the total effect c from lm(health ~ education) with the direct effect c' (the education coefficient in fit_Y). Which is larger, and what does the difference imply about the role of income?

Model answer: The total effect c from lm(health ~ education) is roughly 0.60, while the direct effect c' from fit_Y is about 0.30, so the total is larger. The difference (~0.30) is the indirect effect operating through income; income carries about half of education's effect on health. This is the classical Baron & Kenny mediation signal: a substantial drop from c to c' when the mediator is added to the model.

2. Multiply a * b by hand. How close is this product to the bootstrapped ACME from summary(med)? Does the 95% CI for ACME exclude zero?

Model answer: The simulated paths are a = 0.6 (income on education) and b = 0.5 (health on income, from fit_Y), so a × b = 0.30; the fitted coefficients will be close to these values. The bootstrapped ACME from summary(med) is reported as 0.301 with 95% CI (0.244, 0.358), almost exactly the product, with the CI clearly excluding zero. The by-hand and package estimates agree, and the CI indicates that the indirect effect is statistically distinguishable from zero.

3. The output reports a Prop. Mediated of about 0.50. Translate that into a sentence about education, income, and health. What would change about your interpretation if the 95% CI for ACME crossed zero?

Model answer: Prop. Mediated ≈ 0.50: about half of the total effect of education on health flows through income, while the other half is a direct effect through other mechanisms, such as health-information access, health-system navigation, and health-promoting environments. If the 95% CI for ACME crossed zero, the indirect path through income would not be statistically significant; the interpretation would shift to "we cannot rule out that income contributes nothing to the education-health link in this dataset." Substantive caution: ACME's credibility is bounded by the no-unmeasured-confounding assumption on the M→Y path.


R A reproducible project skeleton in RStudio

The structured approach starts with structured files. The convention below, with one RStudio project, one folder per stage, and one numbered script per task, scales from a homework assignment to a journal-ready paper.

# Create directories from R (or by hand). Run once at project start.
dir.create("data/raw",        recursive = TRUE)
dir.create("data/processed",  recursive = TRUE)
dir.create("R");  dir.create("output/figures", recursive = TRUE)

# tidyverse: dplyr (manipulation), ggplot2 (graphics), readr (file IO),
# tidyr (reshape), stringr (text). Install once.
# install.packages("tidyverse")
library(tidyverse)
library(here)                                # robust file paths

# A canonical pipeline: read -> clean -> save -> analyse
raw <- read_csv(here("data/raw/cohort.csv"))
clean <- raw |>
  filter(!is.na(outcome)) |>
  mutate(age_grp = cut(age, c(0, 30, 50, 70, Inf)),
         smoker  = factor(smoker, levels = c("No", "Yes")))
write_csv(clean, here("data/processed/cohort_clean.csv"))

# Sketch a DAG to anchor the analysis (see the earlier DAG course)
# library(dagitty)
# g <- dagitty("dag { smoker -> outcome ; age -> smoker ; age -> outcome }")

Conventions worth defending

.
+-- R/               <- analysis scripts (numbered: 01_load.R, 02_clean.R)
+-- data/raw/        <- never overwritten; treated as read-only
+-- data/processed/  <- generated, fully reproducible from R/
+-- output/figures/  <- numbered .png/.pdf for the paper
+-- project.Rproj    <- one click reopens the whole environment

The pipe operator |> (or %>%) is the workhorse of the tidyverse: it chains verb -> verb -> verb so analysis reads top to bottom. Combined with here() for paths, your project is movable, shareable, and version-control-friendly out of the box.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. After running dir.create() four times, what folder structure now exists in your project? Why is keeping data/raw/ separate from data/processed/ a defensible choice?

Model answer: The four dir.create() calls create: data/raw/, data/processed/, R/, and output/figures/. Keeping data/raw/ separate from data/processed/ is defensible because raw data is the irreplaceable artefact, the source of truth that should never be modified. Processed data are derived from raw; if a cleaning bug is discovered, you can re-derive from raw, but if you overwrote the raw file, the bug is permanent. The separation enforces the rule "raw is read-only; processed is regenerable."

2. Trace the pipeline raw |> filter(...) |> mutate(...). One brand-new column is added by mutate() and one existing column is re-encoded in place. Which is which?

Model answer: Only one brand-new column appears in clean: age_grp, a factor with age bands (0–30, 30–50, 50–70, 70+) built by cutting continuous age. The mutate() call also rewrites smoker, but that column already existed in raw; factor(smoker, levels = c("No", "Yes")) re-encodes it in place so the model's reference level is "No". So the pipeline adds one column (age_grp) and transforms one (smoker); filter() only drops rows and creates no columns.

3. Why does the script use here("data/raw/cohort.csv") instead of an absolute path like "C:/Users/.../cohort.csv"? Give one practical scenario where this matters.

Model answer: here() resolves paths relative to the project root, so the same script works on any machine, in any user account, in any operating system, as long as the project structure is unchanged. Absolute paths break instantly when the project is moved, shared with a collaborator on a different OS, or run inside a Docker container. Concrete scenario: you send your code to a co-author for review. They unzip the project on their Mac at ~/projects/this_study/; your hardcoded C:/Users/.../ would fail immediately, while here("data/raw/cohort.csv") just works.


Managing Data-Collection Sheets

It is important to establish a permanent storage system for all original data-collection sheets (survey forms, data-collection forms, etc.) that makes it easy to retrieve individual sheets if they are needed during the analysis.

Protect your originals

Do not remove originals from your file. If you need a specific sheet for use at another location, make a photocopy. Never ship the original to another location without first making copies of all forms.

Track collection progress

Set up a system for recording the insertion of data-collection sheets into the file so that you know how many remain to be collected before further work begins.

Scan for completeness

Once all forms have been collected, scan through all sheets for their completeness before doing anything else. If there are omissions, returning to the data source to complete the data will be more likely to succeed if done soon after collection rather than weeks or months later.
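The tracking and completeness steps can be scripted once forms are logged electronically. The sketch below is a minimal base-R illustration; the tracking sheet, its column names, and the dates are hypothetical and stand in for whatever log your study keeps.

```r
# Hypothetical tracking sheet: one row per expected data-collection form.
forms <- data.frame(
  form_id  = sprintf("F%03d", 1:6),
  received = as.Date(c("2024-03-01", "2024-03-01", NA,
                       "2024-03-04", NA, "2024-03-05")),
  age      = c(34, NA, NA, 51, NA, 47),
  smoker   = c("No", "Yes", NA, NA, NA, "No")
)

# Track progress: how many forms are still outstanding?
n_outstanding <- sum(is.na(forms$received))

# Scan for completeness: among received forms, count blank fields per form.
fields           <- c("age", "smoker")
received         <- forms[!is.na(forms$received), ]
received$n_blank <- rowSums(is.na(received[, fields, drop = FALSE]))

# Forms to take back to the data source while recall is still fresh.
follow_up <- received$form_id[received$n_blank > 0]

n_outstanding
follow_up
```

Keeping "not yet received" separate from "received but incomplete" matters because the two call for different actions: chasing the form versus returning to the respondent or record.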

Knowledge check: this section

1. What should you construct before beginning any work with your data?

A frequency distribution of all variables
A regression model of the primary outcome
A plausible causal diagram of the problem under investigation

Before working with your data, you should construct a plausible causal diagram. This identifies which variables are important outcomes and predictors, which are potential confounders, and which might be intervening variables.

2. Why is data analysis described as an “iterative process”?

Because it only needs to be done once
Because it often requires backing up several steps as you gain more insight into your data
Because statistical software requires multiple iterations to converge

Data analysis is iterative because as you gain more insight into your data, you often need to revisit earlier steps, revise your approach, and re-examine your variables and models.

3. What should you do if you find omissions in data-collection sheets?

Return to the data source to complete the data as soon as possible
Delete those records from the dataset
Impute the values using statistical methods

Returning to the data source to complete missing data will more likely be successful if done soon after the data were initially collected, rather than weeks or months later when the analysis has begun.

Reflection

Think of a research question you are interested in. Sketch out (describe) a causal diagram showing the key outcome, main predictors, potential confounders, and any intervening variables. How does this diagram help you plan your analysis?

Model answer: Pick a question (e.g., does air-pollution exposure during pregnancy lower birth weight?). DAG: prenatal PM2.5 exposure → birth weight, with confounders maternal age, SES, smoking, prenatal-care utilisation, neighbourhood environment, and pre-pregnancy BMI all pointing into both exposure and outcome. Intervening variables: placental function and maternal hypertension on the causal path. The DAG helps plan the analysis by (a) identifying the minimal sufficient adjustment set (e.g., with dagitty: the set of confounders to control for to identify the total effect of PM2.5); (b) flagging mediators that must NOT be adjusted for if the total effect is the question (placental function); (c) flagging potential colliders (e.g., gestational age) that must be considered carefully; (d) supporting a mediation analysis if the indirect path through placental function is the question. The DAG turns a list of "things to control for" into a structured causal claim.


Section 2

Data Coding, Entry & File Management

⏱ Estimated time: 20 minutes

Section 2 of 4


Encoding responses, organising project files, and building a variable codebook.

Coding conventions

Three rules that prevent cascading errors

Missing values

Assign a code that is impossible as a real response (e.g., −999). Never use a value that could be legitimate data.

Numeric codes

Use numbers for all variables, including those originally recorded as text, so the software handles them uniformly.

No compound codes

One variable, one piece of information. A code that bundles sex and ethnicity (1 = male Caucasian, 2 = female Caucasian) breaks the moment you want to analyse either alone.

Data entry

Double-entry and the spreadsheet hazard

Double-entry

Two independent entries, then a programmatic comparison. The most effective way to reduce transcription error.

Spreadsheet caution

Sorting a single column destroys row alignment across the whole dataset. Custom software removes this risk.

Save originals immediately. Store a backup in a second location. Convert to your statistical software's format promptly.

File management

Versioning and the file log

  bp01.dta   (28/09) : original; 1092 obs, 8 vars

  bp02.dta   (30/09) : 45 missing-value records dropped; 1047 obs, 8 vars

  bp03.dta   (02/10) : age centred, htn factor added; 1047 obs, 11 vars

Two-digit suffix: 99 versions sort correctly. Never overwrite. The log is your audit trail.

Variable codebook

Naming conventions and the master list

  age       Original age in years

  age_ct     Age centred by subtracting the mean

  age_ctsq   Quadratic term (age_ct squared)

  age_c2     Age categorised into 2 groups

  age_c3     Age categorised into 3 groups

Group related variables by prefix; shorten long names by removing vowels (wtr_cstrn). One-line description per variable is the minimum standard.

Carry forward

What to take into the next section

Coding decisions made in the first hour shape every analysis that follows.
Double-entry, versioned files, and a log are the practical backbone of reproducibility.
A variable codebook captures what was measured, which is often more useful than the code itself.

Introduction and Overview

An earlier section set up the conceptual workflow and the discipline of starting from a causal diagram before you touch any data. This section turns to the practical side: how do you encode raw responses, enter them into a workable file, organise files across a project, and keep track of what every variable in your dataset means? These are unglamorous tasks, but they determine whether the analysis you eventually run is reproducible. Organising data so that each variable is a column and each observation a row (the tidy data convention) makes downstream analysis dramatically easier (Wickham, 2014).

Learning Objectives

Apply coding conventions for missing values, numeric codes, and avoiding compound codes.
Plan a data-entry workflow that minimises transcription error.
Lay out a project folder structure that distinguishes raw, processed, and analysis files.
Build and maintain a variable codebook that any collaborator could open and use.

Data Coding

Before entering data into a computer, careful coding is essential. Good coding practices prevent errors that can cascade throughout an entire analysis.

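Missing Values

Assign a code that cannot occur as a real response (e.g., −999). Never use a value that could be legitimate data.

Numeric Codes

Use numbers for all variables, including those originally recorded as text, so the software handles them uniformly.

No Compound Codes

One variable, one piece of information. A code that bundles sex and ethnicity (1 = male Caucasian, 2 = female Caucasian) breaks the moment you want to analyse either alone.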
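The three coding rules above (an impossible missing-value code, numeric codes, no compound codes) can be sketched in base R as follows. The data frame, the constant MISSING_CODE, and codes 3 and 4 of the compound variable are hypothetical, added only so that the example has something to split.

```r
# Hypothetical entered data. -999 marks a missing value, and sex_eth is a
# compound code: 1 = male Caucasian, 2 = female Caucasian,
# 3 = male other, 4 = female other (codes 3 and 4 are assumed here).
entered <- data.frame(
  id       = 1:5,
  systolic = c(128, -999, 141, 119, -999),
  sex_eth  = c(1, 2, 3, 4, 2)
)

MISSING_CODE <- -999

# Rule 1: convert the impossible missing-value code to NA before analysis,
# otherwise -999 is averaged in as if it were a real reading.
coded <- entered
coded$systolic[coded$systolic == MISSING_CODE] <- NA

# Rule 3: split the compound code into one variable per piece of information
# with an explicit lookup table. Rule 2: both new variables stay numeric (0/1).
lookup <- data.frame(
  sex_eth   = c(1, 2, 3, 4),
  female    = c(0, 1, 0, 1),
  caucasian = c(1, 1, 0, 0)
)
idx             <- match(coded$sex_eth, lookup$sex_eth)
coded$female    <- lookup$female[idx]
coded$caucasian <- lookup$caucasian[idx]
coded$sex_eth   <- NULL

mean_naive <- mean(entered$systolic)              # contaminated by -999
mean_clean <- mean(coded$systolic, na.rm = TRUE)  # real readings only
coded
```

Comparing mean_naive with mean_clean shows why the missing-value code must be converted before any summary is computed.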

Data Entry

Some important issues to consider when entering your data into a computer file:

Double-data entry

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry. This dramatically reduces the error rate in your dataset.
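A programmatic comparison of two entry files might look like the base-R sketch below. The two data frames are hypothetical, and the approach assumes both files share a record identifier (here id) and the same column names.

```r
# Two hypothetical, independently keyed versions of the same forms.
entry1 <- data.frame(id = 1:4, age = c(34, 51, 47, 29),
                     systolic = c(128, 141, 119, 135))
entry2 <- data.frame(id = 1:4, age = c(34, 15, 47, 29),
                     systolic = c(128, 141, 119, 153))

# Align on the record identifier so row order cannot cause false alarms.
entry1 <- entry1[order(entry1$id), ]
entry2 <- entry2[order(entry2$id), ]
stopifnot(identical(entry1$id, entry2$id))

# Cell-by-cell comparison. A value keyed in one file but blank in the other
# counts as a mismatch; blank in both counts as a match.
fields   <- setdiff(names(entry1), "id")
a        <- as.matrix(entry1[, fields, drop = FALSE])
b        <- as.matrix(entry2[, fields, drop = FALSE])
mismatch <- (a != b) | xor(is.na(a), is.na(b))
mismatch[is.na(mismatch)] <- FALSE

# List every discrepancy so it can be checked against the paper original.
hits <- which(mismatch, arr.ind = TRUE)
discrepancies <- data.frame(
  id     = entry1$id[hits[, "row"]],
  field  = fields[hits[, "col"]],
  entry1 = a[hits],
  entry2 = b[hits]
)
discrepancies
```

The output is a work list, not a correction: each flagged cell still has to be resolved by looking at the original data-collection sheet.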

Caution with spreadsheets

Spreadsheets are a convenient tool for initial data entry, but they must be used with extreme caution. It is possible to sort individual columns, which could destroy your entire dataset with one inappropriate “sort” command. Custom data-entry software provides a greater margin of safety.
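The sorting hazard is easy to reproduce. The sketch below uses a hypothetical three-record dataset in base R to contrast sorting one column in isolation with reordering whole rows.

```r
# Hypothetical three-record dataset.
dat <- data.frame(id     = c(1, 2, 3),
                  age    = c(51, 29, 47),
                  smoker = c("Yes", "No", "No"))

# Hazard: sorting one column on its own detaches it from its row.
broken     <- dat
broken$age <- sort(broken$age)   # ages are now 29, 47, 51 against ids 1, 2, 3

# Safe: reorder whole rows, so every value travels with its record.
sorted <- dat[order(dat$age), ]

# Record 1 is 51 years old in the source data.
age_id1_broken <- broken$age[broken$id == 1]   # another person's age
age_id1_sorted <- sorted$age[sorted$id == 1]   # still the correct age
c(broken = age_id1_broken, sorted = age_id1_sorted)
```

Nothing in the broken version looks wrong on screen, which is what makes this error so dangerous: every column is still full of plausible values.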

Save and back up immediately

As soon as the data-entry process has been completed, save the original data files in a safe location. In large, expensive trials, keep a copy of all originals stored in another location. Convert your data to the format your statistical software uses as soon as possible.

Keeping Track of Files

It is important to have a system for keeping track of all your files. Key recommendations:

Assign a logical name with a 2-digit numerical suffix (e.g., brazil01). A 2-digit suffix allows you to have 99 versions that still sort correctly when listed alphabetically.
When data manipulations are carried out, save the file with a new name (the next available number). Do not change data and then overwrite the file.
Keep a simple log of files created with information about the contents (e.g., number of observations and variables).

Example: File Log for a Blood Pressure Study

bp01.odc (27/09/07): Original blood pressure study data; spreadsheet; 1 record per measurement. 1092 obs, 8 vars.

bp01.dta (28/09/07): Original file; Stata format. 1092 obs, 8 vars.

bp02.dta (30/09/07): 45 records with missing values dropped. 1047 obs, 8 vars.
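The naming rule can be automated so that nobody has to remember the next free number. The helper below, next_version, is a hypothetical base-R function written for this illustration; it assumes file names of the form stem, two digits, extension.

```r
# Hypothetical helper: next free zero-padded version name for a file stem.
next_version <- function(stem, existing, ext = ".csv") {
  # Keep only names shaped like <stem><two digits><ext>.
  body   <- substr(existing, nchar(stem) + 1, nchar(existing) - nchar(ext))
  is_ver <- startsWith(existing, stem) & endsWith(existing, ext) &
            grepl("^[0-9]{2}$", body)
  used   <- as.integer(body[is_ver])
  nxt    <- if (length(used) == 0) 1L else max(used) + 1L
  stopifnot(nxt <= 99L)   # a two-digit suffix runs out after 99 versions
  sprintf("%s%02d%s", stem, nxt, ext)
}

existing <- c("bp01.csv", "bp02.csv")
new_name <- next_version("bp", existing)

# Why zero-padding matters: character sorting compares left to right.
unpadded <- sort(c("bp2", "bp10", "bp1"), method = "radix")
padded   <- sort(c("bp02", "bp10", "bp01"), method = "radix")

new_name
unpadded   # version 10 lands before version 2
padded     # versions appear in true order
```

In a real project, existing would come from list.files() on the processed-data folder, and the new name would be written to the file log together with the number of observations and variables.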

R A reproducible recoding pipeline (no overwrites, ever)

The "save a new version, don't overwrite" rule is automatic if your transformations live in a script. The file log becomes the script's git history.

library(tidyverse)

# Read raw, never modify in place
bp_raw <- read_csv("data/raw/bp01.csv")

# Tidy: drop incomplete rows, build derived variables, lock factors
bp_clean <- bp_raw |>
  drop_na(systolic, diastolic, age) |>
  mutate(
    age_ct    = age - mean(age),                              # centred
    age_ctsq  = age_ct^2,                                       # quadratic term
    age_c3    = cut(age, c(0, 35, 55, Inf),
                    labels = c("young", "middle", "older")),
    htn       = factor(systolic >= 140 | diastolic >= 90,
                       levels = c(FALSE, TRUE),
                       labels = c("normotensive", "hypertensive"))
  )

# Persist as a new versioned file - and a small log line
write_csv(bp_clean, "data/processed/bp02.csv")
cat("bp02.csv", format(Sys.Date()), nrow(bp_clean), "obs",
    "\n", file = "data/file_log.txt", append = TRUE)

## At any time you can rebuild bp02 from bp01 by re-running this script.

This is the difference between “data analysis” and “data engineering”. A clicked-together SPSS workflow cannot be re-derived. A script can be, six months from now, by a colleague, on a different computer.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. The mutate() call creates four new variables (age_ct, age_ctsq, age_c3, htn). For each, state in one phrase what kind of variable it is (continuous, categorical, derived) and why a future analyst would want it pre-built.

Model answer: age_ct is a continuous variable (mean-centred age); centring makes the regression intercept interpretable as the expected outcome at the mean age rather than at age zero. age_ctsq is a derived continuous variable (centred age squared) that lets the model fit non-linear age effects with less collinearity with the linear term than an uncentred square would have. age_c3 is a categorical (factor) variable (three age groups), convenient for stratified summaries and clinical-grouping interpretation. htn is a derived categorical (binary) variable that allows easy contingency analyses and clinically intuitive subgroup reports. Pre-building these in the cleaning script means downstream analysis code references them by name instead of duplicating the recoding logic.

2. The threshold for htn is systolic >= 140 | diastolic >= 90. If you raised the systolic cutoff to 150, would the prevalence of "hypertensive" go up or down? What does this tell you about the sensitivity of categorical recodes to threshold choice?

Model answer: Raising the systolic cutoff to 150 would lower the prevalence of the hypertensive classification (or, at best, leave it unchanged), because fewer people meet the systolic threshold; anyone with diastolic >= 90 is still classified as hypertensive. This illustrates how sensitive categorical recodes are to threshold choice: a 10 mmHg shift can move the prevalence by several percentage points, depending on how many people sit near the cut-point. The reproducibility lesson: any categorical cut-point should be defended by reference to clinical guidelines (or checked against explicit alternative cut-points in sensitivity analyses), and the analysis script should expose the threshold as a named constant rather than bury it in a formula.

3. The script writes bp02.csv rather than overwriting bp01.csv. Describe one error this rule would prevent that an SPSS point-and-click workflow would not.

Model answer: Writing bp02.csv instead of overwriting bp01.csv preserves an audit trail of derivations. Point-and-click workflows commonly overwrite the working dataset, so a bug discovered three steps later cannot be undone because the earlier state is gone. With versioned outputs, you can re-run from any intermediate point, diff one version against another, and verify the consequence of a single cleaning decision. A concrete example: a labelling error in bp02 (e.g., reversed factor levels) is recoverable because bp01 remains intact and bp02 can be re-derived from it.
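
The advice in answer 2, to keep the cut-point in a named constant and check alternatives, can be sketched as follows. This is a minimal illustration on simulated data, not the lesson's dataset: the data frame `bp_demo`, the constants `SBP_CUT` and `DBP_CUT`, and the helper `htn_prevalence()` are all hypothetical names introduced here.

```r
# Simulated blood pressures (illustrative only; not the course dataset)
set.seed(410)
bp_demo <- data.frame(
  systolic  = round(rnorm(500, mean = 130, sd = 18)),
  diastolic = round(rnorm(500, mean = 82,  sd = 11))
)

# Cut-points live in named constants, not inside the recoding formula
SBP_CUT <- 140
DBP_CUT <- 90

# Hypothetical helper: proportion classified hypertensive at given cut-points
htn_prevalence <- function(data, sbp_cut = SBP_CUT, dbp_cut = DBP_CUT) {
  mean(data$systolic >= sbp_cut | data$diastolic >= dbp_cut)
}

# Sensitivity analysis: same data, alternative systolic cut-points
cuts <- c(130, 140, 150)
prev <- sapply(cuts, function(k) htn_prevalence(bp_demo, sbp_cut = k))
data.frame(sbp_cut = cuts, prevalence = round(prev, 3))
```

Because the diastolic criterion is unchanged, prevalence can only fall or stay the same as the systolic cut-point rises; how far it falls depends on the simulated distribution and should not be read as a real-world estimate.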

Keeping Track of Variables

Even a relatively focused study can give rise to a large number of variables once transformed and recoded variables have been created. Recommendations include:

Use short but informative names and have all related variables start with the same name.
Long names can be shortened by removing vowels (e.g., wtr_cstrn for “water cistern”).
If your statistics program is case sensitive, use ONLY lower-case letters.
At some point, prepare a master list of all variables.

Variable	Description
`age`	Original data (in years)
`age_ct`	Age after centring by subtraction of the mean
`age_ctsq`	Quadratic term (age_ct squared)
`age_c2`	Age categorised into 2 categories (young vs old)
`age_c3`	Age categorised into 3 categories
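
One way to keep a master list like the table above in step with the data is to store it in the script itself. The sketch below is an assumption-laden illustration: `bp_demo` is a toy stand-in for the cleaned dataset, and `var_list` and `undocumented` are hypothetical names.

```r
# Toy data standing in for the cleaned dataset (illustrative values)
bp_demo <- data.frame(age = c(28, 41, 63, 57))
bp_demo$age_ct   <- bp_demo$age - mean(bp_demo$age)
bp_demo$age_ctsq <- bp_demo$age_ct^2
bp_demo$age_c3   <- cut(bp_demo$age, c(0, 35, 55, Inf),
                        labels = c("young", "middle", "older"))

# Master list: one row per variable, description typed once, by hand
var_list <- data.frame(
  variable    = c("age", "age_ct", "age_ctsq", "age_c3"),
  description = c("Original data (in years)",
                  "Age after centring by subtraction of the mean",
                  "Quadratic term (age_ct squared)",
                  "Age categorised into 3 categories"),
  stringsAsFactors = FALSE
)

# Add the storage class of each variable straight from the data
var_list$class <- unname(sapply(bp_demo[var_list$variable],
                                function(x) class(x)[1]))

# Flag variables present in the data but missing from the master list
undocumented <- setdiff(names(bp_demo), var_list$variable)
var_list
```

A non-empty `undocumented` vector is a prompt to update the list before the next analysis step.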

Knowledge check: this section

1. Why should you never use compound codes?

Because compound codes take up more storage space
Because compound codes are difficult to type
Because each variable should only code one piece of information; combining creates analysis problems

Only code one piece of information in a single variable. Compound codes (e.g., 1=male Caucasian, 2=female Caucasian) make it extremely difficult to separate and analyse each characteristic independently.

2. What is the advantage of double-data entry?

It allows comparison of two files to detect inconsistencies and reduce errors
It doubles the sample size
It allows the data to be entered twice as fast

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry because it dramatically reduces the entry error rate.

3. When data manipulations are carried out, what should you do with the file?

Overwrite the original file to save space
Save it with a new name (the next available number) and do not overwrite the original
Delete the original and keep only the modified version

Save the file with a new name (the next available number in your naming convention) so you always have a record of all versions and can trace back to the original data if needed.

Reflection

Describe a file-naming and version-control system you would use for a dataset in your own research area. How would you organise the variable names for a study with demographic, clinical, and outcome variables?

Model answer: File naming: YYYYMMDD_studyname_dataset_version.ext (e.g., 20260516_smoking_cohort_clean_v03.csv). Version control: Git for code and DVC (Data Version Control) for data, plus weekly snapshots to a cloud backup. Variable naming: lowercase snake_case throughout; demographic prefix dem_ (e.g., dem_age, dem_sex, dem_education); clinical prefix cli_ (cli_bp_systolic, cli_hba1c); outcome prefix out_ (out_mi_5y, out_cvd_death). Avoid spaces and special characters, and add a version suffix to re-derived variables (out_mi_5y_v2). A codebook (markdown plus machine-readable JSON/YAML) accompanies the dataset, documenting type, valid range, missing-data codes, and derivation logic. This is the difference between a dataset a stranger can reproduce in 6 months and one only the original analyst can navigate.

Section 3

Program Files, Data Editing & Verification

⏱ Estimated time: 15 minutes

Section 3 of 4

Program-mode scripts, systematic data editing, and verification before analysis.

Program vs. interactive

Why the script is the record

Interactive mode

Useful for exploration. Produces no durable record of steps. Cannot reconstruct what was done.

Program mode

Commands compiled into a script that can be saved, re-run, and shared. The audit trail is automatic.

All analyses, including descriptive statistics, should live in the script. Mixed spreadsheet-and-program workflows break the audit trail.

Data editing

Three components of scripted editing

Label variables

Attach a short description to each variable name so output is self-explanatory without the codebook open.

Label categories

Attach text labels to numeric codes: sex coded 0/1 displays as "male" and "female", never bare numbers.

Missing-value codes

Standardise every missing indicator (999, blank, full stop) to the single convention your software recognises as missing.

Verification

Check before you trust

Continuous variables

Count valid observations and missing values. Check 5 smallest and 5 largest for implausibility. Plot a histogram to assess distribution shape.

Categorical variables

Count valid observations and missing values. Frequency distribution across all categories. Confirm no unexpected categories have appeared.

Carry forward

What to take into the next section

Program mode makes every analytic step reproducible by default.
Systematic editing belongs in the cleaning script, not scattered across analysis files.
Verification catches data problems that sophisticated modelling would only hide, not fix.

Introduction and Overview

An earlier section organised your files and variables. This section turns to the analyses themselves: working in program mode rather than interactively (so the work is reproducible), editing data systematically rather than ad hoc, and verifying that the dataset behaves as expected before you draw any inferences from it. The discipline you build here is what separates a defensible analysis from a fragile one, and it has been argued that reproducibility is now the minimum standard for evaluating computational research (Peng, 2011).

Learning Objectives

Contrast interactive and program-mode workflows and justify why programs are required for reproducible work.
Edit data through scripted, documented steps rather than ad hoc fixes to the raw file.
Run systematic verification checks on ranges, types, and internal consistency.
Decide when verification is sufficient to move on to substantive analysis.

Program Mode vs. Interactive Processing

Statistical programs can be used in an interactive mode (selecting items from menus or typing in a command) or in program mode (compiling a series of commands into a program and then running it).

Interactive mode is very useful for exploring your data and trying out analyses. However, it should not be used for any of the “real” processing and/or analysis because it is very difficult to keep a clear record of steps taken. Consequently, it is difficult or impossible to reconstruct the analyses you have completed.

Program mode is the recommended approach. You compile the commands into a program and then run it. These program files can be saved and used to reconstruct any analyses you have carried out. Key tips: name files logically, structure the program to be easy to follow, use sequential indents, and document the file thoroughly with comments.

Critical Rule

Do all of the analyses in your statistical program. Don’t start doing basic statistics in a spreadsheet. You are going to need the statistical program eventually, and it will be much easier to keep track of all your analyses if they are all done there.

Data Editing

Before beginning any analyses, spend time editing your data. The most important components are:

Labelling Variables

Labelling Categories

Missing Value Codes
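
The three editing components (labelling variables, labelling categories, and standardising missing-value codes) might look like this in a cleaning script. This is a minimal sketch in base R on a toy data frame: the codes 9 and 999, the mapping 0 = male and 1 = female, and the use of a "label" attribute are assumptions made for illustration.

```r
# Raw data as it might arrive: numeric codes and mixed missing indicators
raw <- data.frame(
  id       = 1:6,
  sex      = c(0, 1, 1, 0, 9, 1),            # 9 = not recorded (assumed code)
  systolic = c(128, 999, 141, 135, 119, NA)  # 999 = missing (assumed code)
)

clean <- raw   # never edit the raw object in place

# 1. Missing-value codes: map every indicator to NA, R's missing value
clean$sex[clean$sex %in% 9] <- NA
clean$systolic[clean$systolic %in% 999] <- NA

# 2. Category labels: numeric codes become a labelled factor
#    (0 = male, 1 = female is an assumed coding)
clean$sex <- factor(clean$sex, levels = c(0, 1),
                    labels = c("male", "female"))

# 3. Variable labels: stored in a "label" attribute
#    (a common convention, not a built-in base-R feature)
attr(clean$sex, "label")      <- "Sex of participant"
attr(clean$systolic, "label") <- "Systolic blood pressure (mmHg)"

table(clean$sex, useNA = "ifany")
summary(clean$systolic)
```

Keeping `raw` untouched and writing every change to `clean` means the edits can be re-run or reviewed line by line.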

Data Verification

Before you start any analyses, you must verify that your data are correct. This can be combined with data processing and involves going through all of your variables, one-by-one.

For continuous variables

Determine the number of valid observations and the number of missing values
Check the maximum and minimum values (or the 5 smallest and 5 largest) to make sure they are reasonable; if they are not, find the error, correct it, and repeat the process
Prepare a histogram of the data to get an idea of the distribution and see if it looks reasonable

For categorical variables

Determine the number of valid observations and the number of missing values
Obtain a frequency distribution to see if the counts in each category look reasonable (and to make sure there are no unexpected categories)
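
The verification steps above can be sketched in a few lines of base R. The toy data frame `dat` is invented for illustration and deliberately contains one implausible age and one unexpected category; `expected` is a hypothetical list of valid codes.

```r
# Toy dataset with one implausible value and one unexpected category
dat <- data.frame(
  age = c(34, 51, 47, 29, 430, 62, NA, 58, 45, 39, 71, 26),
  sex = c("male", "female", "female", "male", "female", "Female",
          "male", NA, "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# Continuous variable: valid and missing counts
n_valid   <- sum(!is.na(dat$age))
n_missing <- sum(is.na(dat$age))

# 5 smallest and 5 largest values (sort() drops NA by default)
age_sorted <- sort(dat$age)
head(age_sorted, 5)
tail(age_sorted, 5)

# Histogram to judge the shape of the distribution
hist(dat$age, main = "Age (years)", xlab = "age")

# Categorical variable: frequency table, including missing values
sex_freq <- table(dat$sex, useNA = "ifany")
sex_freq

# Any category outside the expected set is flagged for follow-up
expected   <- c("male", "female")
unexpected <- setdiff(unique(na.omit(dat$sex)), expected)
unexpected
```

Here the age of 430 shows up among the largest values and "Female" appears as an unexpected category; both would be traced back to the data-collection sheet, corrected in the script, and the checks repeated.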

Knowledge check: this section

1. Why should program mode be preferred over interactive mode for “real” data analysis?

Because interactive mode is slower
Because program mode creates a record of all steps, making analyses reproducible
Because interactive mode cannot perform complex analyses

Program mode compiles commands into a program file that can be saved and reused, making it possible to reconstruct and reproduce all analyses. Interactive mode makes it difficult to keep a clear record of steps taken.

2. When verifying continuous variables, what should you examine first?

The number of valid observations, missing values, and the minimum/maximum values
The regression coefficients
The correlation with the outcome variable

For continuous variables, the first verification step is to determine the number of valid observations and missing values, then check the minimum and maximum values (or the 5 smallest and 5 largest) to make sure they are reasonable.

3. What is the purpose of attaching labels to categorical variable values?

To increase the file size for better storage
To allow the data to be used in a different statistical program
To provide meaningful descriptions of what each numeric code represents

Categorical variables should have meaningful labels attached to each category (e.g., sex coded as 0 or 1 should have labels “male” and “female” attached) so that output is immediately interpretable.

Reflection

Imagine you receive a dataset where a colleague entered data interactively in a spreadsheet with no documentation. What steps would you take to clean, verify, and prepare the data for analysis? What problems might you encounter?

Model answer: Steps: (1) Inventory: list all variables, their declared types, missing-value patterns, and value ranges; flag any variable with no obvious purpose. (2) Range checks: identify implausible values (age = 999, blood pressure = 0) and recode to NA with documentation. (3) Consistency checks: cross-variable plausibility (pregnancy in men, dates of birth after dates of death). (4) De-duplication: check for duplicated participant IDs and resolve them. (5) Recoding: standardise binary and categorical variables to consistent encodings. (6) Audit trail: log every cleaning decision in a script that re-creates the cleaned dataset from the raw file. (7) Documentation: build a codebook from scratch. Problems likely to be encountered: free-text entries in numeric fields ("unknown", "N/A", "-"), mixed date formats, spreadsheet auto-conversion errors (e.g., Excel converting gene names to dates), inconsistent missing-data codes (., -, 99, 999), and variables you cannot interpret. The audit trail is the single most valuable artefact: it makes the cleaning reproducible and reviewable.

Section 4

Data Processing & Unconditional Associations

⏱ Estimated time: 20 minutes

Section 4 of 4

Processing outcomes and predictors, multilevel structure, and first associations before multivariable modelling.

Outcome processing

Matching the variable to the model class

Categorical

Sparse categories may need collapsing before multinomial regression.

Continuous

Approximately normal? If not, explore transformations. Normality of residuals is what ultimately matters.

Count / rate

Poisson assumption

\[ \mathbb{E}[Y] \approx \operatorname{Var}(Y) \]

where E[Y] is the mean of the count outcome and Var(Y) is its variance.

Overdispersion signals negative binomial.

Time-to-event

Check proportion censored. Plot the empirical hazard before choosing a parametric or Cox model.

Predictor processing

Missing values, distributions, and sparse categories

Missing values

Drop the predictor, run parallel analyses, or use multiple imputation. The choice must be documented and defended.

Continuous spread

Reasonable variation across the range? If most values cluster near a boundary, consider transformation or categorisation.

Sparse categories

Combine thin cells. Sparse categories distort estimates and cause convergence failures in logistic models.

Multilevel structure

Characterise the hierarchy before modelling

Ignoring clustering produces standard errors that are too small. Mixed models address this in later lessons.

Unconditional associations

One predictor at a time, before the multivariable model

Variable types	Method	What to examine
Two continuous	Correlation, scatterplot, simple linear regression	Direction, strength, linearity
Continuous + categorical	One-way ANOVA, simple regression	Group differences, effect size
Two categorical	Cross-tabulation, chi-squared	Cell counts, unexpected categories

Carry forward

The analytic log and the road to modelling

Analyse in blocks

Univariate summaries, then bivariate associations, then multivariable models. Each block in a labelled script section.

Log every decision

Date-stamp outputs. Record what each run produced and what it decided. This log makes the analysis defensible under review.

The structured workflow ends here. The modelling begins in the next lesson, on a dataset that is clean, documented, and understood.

Introduction and Overview

Earlier sections produced a verified, well-documented dataset. This section finally takes the step every student is impatient to make: the actual analysis. We start with how to process outcome and predictor variables for analysis, handle multilevel data structure, and run the “unconditional associations” (single-predictor descriptions) that are the necessary first look at any dataset before you touch a multivariable model.

Learning Objectives

Process outcome variables to fit the planned analysis (categorical, continuous, count, or rate).
Process predictor variables, including categorisation, scaling, and recoding decisions.
Recognise multilevel structure in your dataset and the implications for later modelling.
Run and interpret unconditional (single-predictor) associations as the first analytic step.
Keep an analytic log that allows you to reconstruct every decision you made.

Processing the Outcome Variable(s)

While verifying data, you can also start processing your outcome variable(s). Review the stated goals of the study to determine the format(s) that best suit the goal(s). Consider the following based on outcome type:

Categorical outcomes

Is the distribution of outcomes across categories acceptable? For example, if you planned a multinomial regression with a 3-category outcome, but very few observations fall in one of the categories, you might want to recode it to a 2-category variable.

Continuous outcomes

Does the variable have the characteristics necessary for the planned analysis? If linear regression is planned, is the distribution approximately normal? If not, explore transformations. Note: It is the normality of the residuals which is ultimately important, but if the original variable is far from normal and there are no strong predictors, the residuals are unlikely to be normal.
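
As a rough illustration of this check, the sketch below simulates a right-skewed outcome, compares it with its log transform, and then looks at the residuals from a simple model. Everything here is an assumption for demonstration: the outcome `y` is simulated, `x` is a hypothetical predictor, and the log is only one of several possible transformations.

```r
# Simulated right-skewed outcome (illustrative; e.g. a lab concentration)
set.seed(1)
y <- rlnorm(200, meanlog = 1, sdlog = 0.8)

# Crude skewness cue: a mean well above the median suggests right skew
c(mean = mean(y), median = median(y))

# Compare the raw and log-transformed distributions side by side
op <- par(mfrow = c(1, 2))
hist(y,      main = "Raw outcome",     xlab = "y")
hist(log(y), main = "Log-transformed", xlab = "log(y)")
par(op)

# What ultimately matters: residuals from the planned model
x   <- rnorm(200)              # hypothetical predictor
fit <- lm(log(y) ~ x)
qqnorm(resid(fit)); qqline(resid(fit))
```

A transformation changes the scale on which effects are interpreted, so the choice belongs in the analytic log alongside the plots that motivated it.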

Count / rate outcomes

If Poisson regression is planned, are the mean and variance of the distribution approximately equal? The Poisson model assumes they are; when the variance runs well above the mean (a pattern called overdispersion), the fitted standard errors come out too small and the p-values look stronger than the data warrant. If that happens, consider negative binomial regression or alternative analytic approaches.
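
A quick first look at the mean-variance relationship might be coded as below. This is a sketch on simulated counts: `dispersion_check()` is a hypothetical helper, and the comparison is of the raw (unconditional) mean and variance, which is only a screening step before any model is fitted.

```r
# Simulated counts: one Poisson, one overdispersed (illustrative only)
set.seed(2)
y_pois <- rpois(300, lambda = 4)
y_over <- rnbinom(300, mu = 4, size = 0.8)   # variance = mu + mu^2 / size

# Hypothetical helper: mean, variance, and their ratio
dispersion_check <- function(y) {
  c(mean = mean(y), variance = var(y), ratio = var(y) / mean(y))
}

round(dispersion_check(y_pois), 2)  # ratio near 1 is consistent with Poisson
round(dispersion_check(y_over), 2)  # ratio well above 1 signals overdispersion
```

Strong predictors can explain some of the extra variance, so a ratio above 1 at this stage is a flag to re-check dispersion after fitting, not proof that Poisson regression is inappropriate.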

Time-to-event outcomes

What proportion of the observations are censored? You might also want to generate a simple graph of the empirical hazard function to get an idea what shape it has.
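
Both checks can be approximated in base R before reaching for a survival package. The sketch below uses simulated follow-up data; the variable names, the 6-month intervals, and the coding status 1 = event, 0 = censored are assumptions. The hazard shown is a crude events-per-person-time estimate within each interval, not a smoothed estimator.

```r
# Simulated follow-up (illustrative): time in months,
# status 1 = event observed, 0 = censored
set.seed(3)
event_time  <- rexp(200, rate = 0.05)
censor_time <- runif(200, min = 0, max = 36)
time   <- pmin(event_time, censor_time)
status <- as.integer(event_time <= censor_time)

# Proportion of observations that are censored
prop_censored <- mean(status == 0)
prop_censored

# Crude empirical hazard per 6-month interval:
# events in the interval divided by person-time at risk in the interval
breaks <- seq(0, 36, by = 6)
hazard <- sapply(seq_len(length(breaks) - 1), function(i) {
  lo <- breaks[i]
  hi <- breaks[i + 1]
  events      <- sum(status == 1 & time > lo & time <= hi)
  person_time <- sum(pmin(pmax(time - lo, 0), hi - lo))
  events / person_time
})

plot(breaks[-1] - 3, hazard, type = "b",
     xlab = "Months (interval midpoint)", ylab = "Events per person-month")
```

Late intervals rest on few people at risk, so their hazard estimates are unstable and should be read with caution.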

Processing Predictor Variables

It is important to go through all predictor variables to determine how they will be handled:

Missing values: Are there many? If so, you might need to abandon plans to use that predictor, or conduct 2 analyses (one on the subset where the predictor is present and one on the full dataset ignoring the predictor).
Distribution: For continuous variables, is there a reasonable representation over the whole range of values? If not, it might be necessary to categorise the variable.
Categorical variables: Are all categories reasonably well represented? If not, you might have to combine categories.
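
The three predictor checks above can be scripted in a few lines. The data frame `pred` is a toy example, and grouping every non-"never" category into "ever" is an illustrative analytic choice, not a recommendation; the `_c2` suffix follows the naming convention for a 2-category recode.

```r
# Toy predictors (illustrative): one with missing values,
# one with sparse categories
pred <- data.frame(
  bmi    = c(22.1, NA, 27.4, 31.0, NA, 24.8, 29.3, NA, 26.0, 23.5),
  smoker = c("never", "never", "former", "never", "current", "never",
             "former", "never", "pipe only", "never"),
  stringsAsFactors = FALSE
)

# Missing values: proportion missing per predictor
prop_missing <- colMeans(is.na(pred))
round(prop_missing, 2)

# Distribution: spread of a continuous predictor across its range
quantile(pred$bmi, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Categorical: frequency of each category
table(pred$smoker)

# Combine thin categories into a broader one, keeping the original variable
pred$smoker_c2 <- ifelse(pred$smoker == "never", "never", "ever")
table(pred$smoker, pred$smoker_c2)
```

The final cross-tabulation of the old and new variable confirms that every original category landed where intended.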

Multilevel Data

If your data are multilevel (e.g., blood pressure measurements within individuals within centres), evaluate the hierarchical structure:

Key Questions for Multilevel Data

What is the average number (and range) of observations at one level within each higher-level unit?
Are individuals uniquely identified within a hierarchical level? It is often useful to create one unique identifier for each observation in the dataset.

Why this matters: observations from the same person, or the same centre, resemble one another more than observations drawn at random, so a model that ignores the clustering behaves as if it holds more independent information than it really does. The result is standard errors that are too small and confidence intervals that are too narrow. Mixed models, which arrive later in the course, are built for exactly this structure.
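
A first description of the hierarchy might look like the sketch below. The data frame `ml` is a toy example in which person IDs restart within each centre, a common situation; the names `person_uid` and `obs_id` are hypothetical.

```r
# Toy hierarchy (illustrative): measurements within individuals within centres
ml <- data.frame(
  centre = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  person = c(1, 1, 2, 2, 2, 1, 1, 2, 3),   # IDs restart within each centre
  sbp    = c(128, 131, 140, 138, 142, 119, 122, 150, 133)
)

# Person IDs are only unique within centre, so build a study-wide person key
ml$person_uid <- paste(ml$centre, ml$person, sep = "_")

# One unique identifier per observation (row)
ml$obs_id <- seq_len(nrow(ml))

# Observations per person: average and range
obs_per_person <- table(ml$person_uid)
c(mean = mean(obs_per_person),
  min  = min(obs_per_person),
  max  = max(obs_per_person))

# Persons per centre: count distinct person keys within each centre
persons_per_centre <- tapply(ml$person_uid, ml$centre,
                             function(x) length(unique(x)))
persons_per_centre
```

Without the combined key, person 1 in centre A and person 1 in centre B would be treated as the same individual in any grouped summary.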

Unconditional Associations

Before proceeding with any multivariable analyses, it is important to evaluate unconditional associations within the data, that is, the crude one-predictor-at-a-time relationships examined with nothing else held constant. These serve as the foundation for building more complex models.

Variable Types	Analytical Approach
Two continuous variables	Correlation coefficient, scatterplot, simple linear regression
One continuous + one categorical	One-way ANOVA, simple linear or logistic regression
Two categorical variables	Cross-tabulation and χ² test

When evaluating unconditional associations, pay attention to:

Associations between predictors and outcome: Determine if there is any association at all; determine the functional form (is it linear?); get a simple picture of the strength and direction.
Associations between pairs of predictors: Look for potential collinearity problems (highly correlated predictors).
Confounding variables: Evaluate the associations of each potential confounder with both the key predictors of interest and the outcome.
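
The three pairings in the table above map onto standard base-R functions. The sketch below runs each on simulated data; all variables are invented for illustration, and the effect sizes built into the simulation carry no substantive meaning.

```r
# Simulated data (illustrative only)
set.seed(4)
n   <- 200
age <- round(runif(n, 25, 75))
sbp <- 100 + 0.6 * age + rnorm(n, sd = 12)
htn <- factor(sbp >= 140, levels = c(FALSE, TRUE),
              labels = c("normotensive", "hypertensive"))
age_c3 <- cut(age, c(0, 35, 55, Inf),
              labels = c("young", "middle", "older"))

# Two continuous variables: scatterplot, correlation, simple linear regression
plot(age, sbp)
r_age_sbp <- cor(age, sbp)
coef(summary(lm(sbp ~ age)))

# One continuous + one categorical: one-way ANOVA
summary(aov(sbp ~ age_c3))

# Two categorical variables: cross-tabulation and chi-squared test
tab <- table(age_c3, htn)
tab
chi <- chisq.test(tab)
chi$expected   # small expected counts are a cue to combine categories
```

The scatterplot comes before the correlation on purpose: a single coefficient cannot reveal whether the relationship is linear.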

Keeping Track of Your Analyses

Before starting the more substantial analysis, set up a system for keeping track of your results:

Analyse in Blocks

Keep Log Files

Label & Date
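
One lightweight way to combine these three habits is a date-stamped text log written from the analysis script itself. The sketch below is an assumption-heavy illustration: `log_file` points to a temporary folder so the example is self-contained, and `log_line()` is a hypothetical helper, not a function from any package.

```r
# Hypothetical log location; a temporary folder keeps the sketch self-contained
log_file <- file.path(
  tempdir(),
  paste0("analysis_log_", format(Sys.Date(), "%Y%m%d"), ".txt")
)

# Hypothetical helper: append one date-stamped line to the log
log_line <- function(block, note) {
  cat(format(Sys.time(), "%Y-%m-%d %H:%M"), "|", block, "|", note, "\n",
      file = log_file, append = TRUE)
}

# ---- Block 1: univariate summaries ----
x <- c(128, 131, 140, 138, 142, 119)   # toy systolic values
log_line("block1_univariate", paste("sbp mean =", round(mean(x), 1)))

# ---- Block 2: unconditional associations ----
log_line("block2_bivariate",
         "age vs sbp: linear, positive; keep age continuous")

readLines(log_file)
```

In a real project the log would live in the project folder next to the scripts, so that each entry can be traced to the labelled block that produced it.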

Knowledge check: this section

1. Why should you evaluate unconditional associations before multivariable analyses?

Because multivariable analysis is not necessary if unconditional associations exist
Because they reveal the strength and direction of associations and identify potential collinearity and confounding
Because unconditional associations are the final step in data analysis

Evaluating unconditional associations before multivariable analyses helps you understand the basic relationships in your data, identify collinearity, detect confounding, and determine the functional form of relationships, all of which inform the complex models you will subsequently build.

2. If a continuous outcome variable is far from normally distributed, what should you do?

Explore transformations that might normalise the distribution
Remove the variable from the analysis
Always convert it to a categorical variable

If the continuous outcome is not approximately normally distributed, you should explore transformations which might normalise the distribution. It is ultimately the normality of the residuals that is important, but a far-from-normal variable with no strong predictors will produce non-normal residuals.

3. What is the appropriate analytical approach for evaluating the association between two categorical variables?

Correlation coefficient and scatterplot
One-way ANOVA
Cross-tabulation and chi-squared test

For associations between two categorical variables, cross-tabulation and the chi-squared (χ²) test are the appropriate analytical approaches. These are particularly useful for identifying unexpected observations.

Reflection

Consider a dataset with 15 predictor variables and one continuous outcome. Describe the sequence of unconditional analyses you would carry out before fitting any multivariable models. How would you handle a predictor that has 30% missing values?

Model answer: For 15 predictors and a continuous outcome, the unconditional analyses before multivariable modelling are: (1) Univariate summary for each predictor (mean, SD, range, n missing) and for the outcome (histogram, Q-Q plot for normality). (2) Bivariate associations: scatterplot or boxplot of each predictor against the outcome, with Pearson/Spearman correlation or an appropriate group-comparison test; report effect size and 95% CI. (3) Pairwise correlations among predictors to detect collinearity. (4) Stratified plots of the key associations by potential effect modifiers (e.g., sex, age group). For the predictor with 30% missing: (a) characterise the missingness pattern (random, related to other variables, related to the outcome); (b) if MCAR is plausible, complete-case analysis is unbiased but loses power; (c) if MAR, use multiple imputation by chained equations (MICE) with sensitivity analyses; (d) if MNAR is plausible, run pattern-mixture or selection models to bound the inference. Never silently drop the variable or impute the mean: the first hides an analytic decision, and the second understates variability and distorts associations.

HSCI 410 · Lesson 1
