HSCI 410 — Lesson 1

A Structured Approach to Data Analysis

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Construct a causal diagram before beginning data analysis
  • Establish a system for managing data-collection sheets, files, and variables
  • Apply best practices for data coding, entry, and verification
  • Process outcome and predictor variables appropriately for analysis
  • Evaluate unconditional associations between variables
  • Set up a systematic approach for keeping track of analyses

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary — Key Terms, People & Concepts

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas
Research Question: A focused, answerable statement that frames an analysis. Usually structured around a population, exposure (or determinant), comparator, and outcome (PECO/PICO). A clear research question drives the entire analytic plan.
Analytic Plan: A pre-specified roadmap that links the research question to data collection, variables, statistical models, and decision rules. Reduces ad hoc decisions and protects against fishing expeditions.
Hypothesis: A testable prediction about the relationship between variables. The null hypothesis (H₀) typically states no association; the alternative (H₁) states an association exists.
Exposure: The factor whose effect on the outcome is of primary interest (e.g., a treatment, behaviour, or environmental agent). Sometimes called the predictor or independent variable.
Outcome: The health state or event being predicted or explained (e.g., disease, recovery, death). Also called the dependent variable or response.
Covariate: Any variable other than the primary exposure that may influence the outcome. May be a confounder, mediator, effect modifier, or simply a precision variable.
Confounder: A variable associated with both the exposure and the outcome that is not on the causal pathway between them. Failing to adjust for confounders biases effect estimates.
DAG (Directed Acyclic Graph): A diagram of variables (nodes) and directed causal arrows (edges) with no cycles. Used to encode assumed causal structure and identify which variables to adjust for.
Data Dictionary / Codebook: A document listing every variable in a dataset with its name, definition, type, allowable values, units, and coding rules. Essential for reproducibility and collaboration.
Reproducibility: The ability of others (or future-you) to re-run an analysis on the same data and obtain the same results, given the code and documentation.
Data Verification: Checking entered data against source records to confirm accuracy. Includes double-entry, range checks, consistency checks, and cross-tabulation against expected patterns.
Multilevel Data: Observations nested within higher-level units (e.g., patients within clinics, students within schools). Requires methods that account for clustering and within-unit correlation.
Methods & Statistical Concepts
MCAR (Missing Completely at Random): Missingness is unrelated to any observed or unobserved variable. Complete-case analysis is unbiased but loses efficiency.
MAR (Missing at Random): Missingness depends only on observed variables. Multiple imputation and likelihood-based methods can give unbiased estimates.
MNAR (Missing Not at Random): Missingness depends on the unobserved value itself. Requires sensitivity analyses or explicit modeling assumptions; cannot be fixed by standard imputation.
Unconditional Association: The crude (unadjusted) relationship between two variables, ignoring other covariates. Useful as a screening step before multivariable modeling.
Program File / Script: A saved file containing analysis code (e.g., a .R script) that can be re-executed to reproduce results. Preferred over interactive (point-and-click) processing.
Data Coding: Translating raw responses (e.g., “Yes”, “No”, “Don't know”) into numeric or categorical values suitable for analysis, following pre-specified rules.
Section 1

Introduction & Data Collection

⏱ Estimated time: 15 minutes

Introduction and Overview

Welcome to HSCI 410. If you've come from HSCI 230 and 341, you've spent two courses learning how to read epidemiological evidence (230) and how to design and conduct epidemiologic studies (341). HSCI 410 picks up the next link in the chain: how to analyse the data those studies produce. The R skills you've been practicing in the orange boxes throughout 230 and 341 scale up here into the full statistical machinery of modern public-health analysis — linear, logistic, multinomial, ordinal, count, and survival regression; mixed models for clustered and longitudinal data; and increasingly causal-inference tools for observational data. This first lesson sets the foundation: a disciplined workflow for taking raw data from collection through to analysis-ready files. The structured approach has become especially important since the wider scientific community recognised a reproducibility crisis in published research (Ioannidis, 2005; Open Science Collaboration, 2015; Wikipedia: Replication crisis) and called for a manifesto of reproducible practice (Munafò et al., 2017). Across four content sections we walk through this in order: introduction and data collection (Section 1), data coding, entry, and file management (Section 2), program files, editing, and verification (Section 3), and data processing plus the first unconditional associations (Section 4).

Learning Objectives

  • Explain why a structured, iterative analytic workflow outperforms diving straight into modelling.
  • Sketch a causal diagram that distinguishes outcomes, predictors, confounders, and intervening variables.
  • Identify the components of a usable data-collection sheet for primary data.
  • Recognise where Section 1 sits in the larger pipeline that runs through Sections 2–4.

Why a Structured Approach?

When starting the analysis of a complex dataset, it helps to have a structured approach in mind. Most people are tempted to jump straight into the sophisticated analysis that will provide the ultimate answer. This rarely works out: results built on skipped preliminary steps are often wrong.

Key Principle

Data analysis is an iterative process which often requires that you back up several steps as you gain more insight into your data — an idea Tukey championed when he distinguished exploratory from confirmatory work (Peng, 2011). A structured template, while not the only approach, will be applicable in most situations and will serve to guide your initial efforts; tidy-data conventions (Wickham, 2014) make each iteration cheaper.

Start with a Causal Diagram

Before you start any work with your data, it is essential to construct a plausible causal diagram of the problem you are about to investigate. This will help identify:

  • Which variables are important outcomes and predictors
  • Which are potential confounders
  • Which might be intervening variables between your main predictors and outcomes

Practical Tip

Keep this causal diagram in mind throughout the entire data-analysis process. With large datasets, it will not be possible to include all predictors as separate entities. This can be handled by including blocks of variables (e.g., demographic characteristics) in the diagram instead of listing each variable.

Quick Refresher — DAGs from HSCI 341

When we say “causal diagram” in 410, we mean a directed acyclic graph (DAG): nodes for variables, directed arrows for direct causal effects, no cycles. You met these in HSCI 341 Lesson 1 — the three structural pieces still do all the work:

  • Fork (X ← C → Y) — C is a confounder. Adjust for it.
  • Chain (X → M → Y) — M is a mediator. Do not adjust if you want the total effect.
  • Collider (X → Z ← Y) — Do not adjust, and watch for it in selection.

The DAG fixes your estimand (what causal quantity you are estimating) and your adjustment set (what goes on the right-hand side of the regression) before you fit anything. In 410 we use it as the bridge from a research question to a regression model.
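The fork and collider rules above can be checked with a small simulation. This is a minimal base-R sketch; the variable names, effect sizes (0.8), sample size, and seed are illustrative assumptions rather than course data. In both scenarios the true effect of x on y is zero, so any non-zero coefficient on x is bias.

```r
# Simulated illustration of fork vs collider (all values are assumptions).
set.seed(341)
n <- 5000

# Fork: conf -> x and conf -> y; x has NO effect on y.
conf   <- rnorm(n)
x_fork <- 0.8 * conf + rnorm(n)
y_fork <- 0.8 * conf + rnorm(n)
fork_crude    <- unname(coef(lm(y_fork ~ x_fork))["x_fork"])
fork_adjusted <- unname(coef(lm(y_fork ~ x_fork + conf))["x_fork"])

# Collider: x -> z <- y; x and y are independent.
x_col <- rnorm(n)
y_col <- rnorm(n)
z     <- x_col + y_col + rnorm(n)
col_crude    <- unname(coef(lm(y_col ~ x_col))["x_col"])
col_adjusted <- unname(coef(lm(y_col ~ x_col + z))["x_col"])

# Compare the four estimates of an effect that is truly zero
round(c(fork_crude = fork_crude, fork_adjusted = fork_adjusted,
        col_crude = col_crude, col_adjusted = col_adjusted), 2)
```

Adjusting for the common cause removes the spurious association in the fork, while adjusting for the common effect creates one in the collider. The same covariate list can therefore help or harm depending on the structure the DAG asserts.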

From DAG to Regression: A Mediation Example

One of the clearest places to see this bridge is mediation. A DAG of the form X → M → Y with a residual direct path X → Y is the qualitative claim; the Baron & Kenny (1986) procedure introduced in 341 puts numbers on the direct and indirect components. Below we do that fitting in R, on simulated data so you can verify the answer against the truth.

R Fitting a mediation model in R (Baron & Kenny + the mediation package)

The DAG: education → income → health, with education → health directly. We will (i) run Baron & Kenny’s three regressions by hand, then (ii) replicate the result with the mediation package, which gives proper bootstrap confidence intervals for the indirect effect.

# install.packages(c("mediation", "dagitty"))
library(mediation)

# 1. Simulate data that match the DAG: education -> income -> health,
#    with a smaller direct path education -> health.
set.seed(410)
n         <- 800
education <- rnorm(n)
income    <- 0.6 * education + rnorm(n)              # path a
health    <- 0.3 * education + 0.5 * income + rnorm(n)  # direct + path b
dat       <- data.frame(education, income, health)

# 2. Baron & Kenny by hand --------------------------------------------------
#    Step 1: total effect c  (health on education)
coef(lm(health ~ education, data = dat))["education"]

#    Step 2: a (income on education)
fit_M  <- lm(income ~ education, data = dat)

#    Step 3: direct c' (education) and b (income), from health on both
fit_Y  <- lm(health ~ education + income, data = dat)
coef(fit_Y)                     # c' on education, b on income

# Indirect effect = a * b  (or equivalently c - c')
a <- coef(fit_M)["education"]
b <- coef(fit_Y)["income"]
a * b

# 3. Same answer, with bootstrap CIs, via the mediation package -------------
med <- mediate(fit_M, fit_Y,
                treat    = "education",
                mediator = "income",
                boot     = TRUE, sims = 1000)
summary(med)
Console output (truncated)
                  Estimate  95% CI Lower  95% CI Upper  p-value
ACME (indirect)      0.301         0.244         0.358   <0.001
ADE (direct)         0.296         0.225         0.366   <0.001
Total Effect         0.597         0.521         0.671   <0.001
Prop. Mediated       0.504         0.418         0.588   <0.001

How to read this. The ACME (Average Causal Mediation Effect) is the indirect path through income; the ADE is the direct path; their sum is the total effect. About half of education’s effect on health travels through income. Two cautions worth carrying forward: (1) the estimate of the indirect effect is only as credible as the DAG — if there is an unmeasured confounder of income and health, the indirect estimate is biased even though the regressions converge cleanly; (2) Baron & Kenny assumes no exposure-mediator interaction, which is why mediation::mediate() is the preferred tool when in doubt.
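Caution (1) can be made concrete with a simulation in which the confounder is known. The sketch below is base R only; the variable u, its coefficients (0.5), and the larger sample size are assumptions added for illustration. The "oracle" model is possible only because the data are simulated; in a real study u would be unobserved.

```r
# Sketch: an unmeasured confounder (u) of income and health biases a * b.
# All coefficients below are assumptions chosen for illustration.
set.seed(410)
n         <- 5000
u         <- rnorm(n)                                              # unmeasured in practice
education <- rnorm(n)
income    <- 0.6 * education + 0.5 * u + rnorm(n)                  # true a = 0.6
health    <- 0.3 * education + 0.5 * income + 0.5 * u + rnorm(n)   # true b = 0.5

# Path a is unaffected because u is independent of education
a_hat <- unname(coef(lm(income ~ education))["education"])

# The analysis you could actually run (u not observed)
b_naive <- unname(coef(lm(health ~ education + income))["income"])

# The analysis only a simulation allows (u included)
b_oracle <- unname(coef(lm(health ~ education + income + u))["income"])

indirect_naive  <- a_hat * b_naive
indirect_oracle <- a_hat * b_oracle
round(c(truth = 0.6 * 0.5, naive = indirect_naive, oracle = indirect_oracle), 2)
```

Both regressions fit without complaint, yet only the one that includes u recovers the true indirect effect. Nothing in the naive output signals the problem, which is why the DAG has to carry that assumption explicitly.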

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. Compare the total effect c from lm(health ~ education) with the direct effect c' (the education coefficient in fit_Y). Which is larger, and what does the difference imply about the role of income?

Model answer: The total effect c from lm(health ~ education) is roughly 0.60, while the direct effect c' from fit_Y is about 0.30, so the total is larger. The difference (~0.30) is the indirect effect operating through income; income carries about half of education's effect on health. This is the classical Baron & Kenny mediation signal: a substantial drop from c to c' when the mediator is added to the model.

2. Multiply a * b by hand. How close is this product to the bootstrapped ACME from summary(med)? Does the 95% CI for ACME exclude zero?

Model answer: The true values are a = 0.6 (income on education) and b = 0.5 (the income coefficient in fit_Y, i.e. health on income adjusting for education), so a × b = 0.30; your estimates will be close to these. The bootstrapped ACME from summary(med) is reported as 0.301 with 95% CI (0.244, 0.358), almost exactly the product, with the CI clearly excluding zero. The agreement shows that the by-hand and package estimates match; the CI indicates that the indirect effect is statistically significant.

3. The output reports a Prop. Mediated of about 0.50. Translate that into a sentence about education, income, and health. What would change about your interpretation if the 95% CI for ACME crossed zero?

Model answer: Prop. Mediated ≈ 0.50: about half of the total effect of education on health flows through income, while the other half is a direct effect (other mechanisms such as health-information access, health-system navigation, and health-promoting environments). If the 95% CI for ACME crossed zero, the indirect path through income would not be statistically significant; the interpretation would shift to "we cannot rule out that income contributes nothing to the education-health link in this dataset." Substantive caution: ACME's credibility is bounded by the no-unmeasured-confounding assumption on the M→Y path.
R A reproducible project skeleton in RStudio

The structured approach starts with structured files. The convention below — one RStudio project, one folder per stage, one numbered script per task — scales from a homework assignment to a journal-ready paper.

# Create directories from R (or by hand). Run once at project start.
dir.create("data/raw",        recursive = TRUE)
dir.create("data/processed",  recursive = TRUE)
dir.create("R");  dir.create("output/figures", recursive = TRUE)

# tidyverse: dplyr (manipulation), ggplot2 (graphics), readr (file IO),
# tidyr (reshape), stringr (text). Install once.
# install.packages(c("tidyverse", "here"))   # here is not part of the tidyverse
library(tidyverse)
library(here)                                # robust file paths

# A canonical pipeline: read -> clean -> save -> analyse
raw <- read_csv(here("data/raw/cohort.csv"))
clean <- raw |>
  filter(!is.na(outcome)) |>
  mutate(age_grp = cut(age, c(0, 30, 50, 70, Inf)),
         smoker  = factor(smoker, levels = c("No", "Yes")))
write_csv(clean, here("data/processed/cohort_clean.csv"))

# Sketch a DAG to anchor the analysis (HSCI 341 Lesson 1)
# library(dagitty)
# g <- dagitty("dag { smoker -> outcome ; age -> smoker ; age -> outcome }")
Conventions worth defending
.
+-- R/                <- analysis scripts (numbered: 01_load.R, 02_clean.R)
+-- data/raw/         <- never overwritten; treated as read-only
+-- data/processed/   <- generated, fully reproducible from R/
+-- output/figures/   <- numbered .png/.pdf for the paper
+-- project.Rproj     <- one click reopens the whole environment

The pipe operator |> (or %>%) is the workhorse of the tidyverse: it chains verb -> verb -> verb so analysis reads top to bottom. Combined with here() for paths, your project is movable, shareable, and version-control-friendly out of the box.
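To see what the pipe does mechanically, compare a nested call with its piped form. This is a small base-R sketch with made-up numbers; the native pipe |> needs R 4.1 or later. The second part uses file.path(), a base function, only to show the idea of building paths from parts; here() additionally puts the project root in front.

```r
# The native pipe passes the left-hand value as the first argument
# of the call on the right, so these two lines compute the same thing.
nested <- sum(sqrt(c(1, 4, 9)))
piped  <- c(1, 4, 9) |> sqrt() |> sum()
identical(nested, piped)

# file.path() joins path parts with "/" on every platform.
rel_path <- file.path("data", "raw", "cohort.csv")
rel_path
```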

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. After running dir.create() four times, what folder structure now exists in your project? Why is keeping data/raw/ separate from data/processed/ a defensible choice?

Model answer: The four dir.create() calls create data/raw/, data/processed/, R/, and output/figures/. Keeping data/raw/ separate from data/processed/ is defensible because raw data is the irreplaceable artefact: the source of truth that should never be modified. Processed data are derived from raw; if a cleaning bug is discovered, you can re-derive from raw, but if you overwrote the raw file, the bug is permanent. The separation enforces the rule "raw is read-only; processed is regenerable."

2. Trace the pipeline raw |> filter(...) |> mutate(...). Which two columns does mutate() write to in the clean object, and which of them did not exist in raw?

Model answer: mutate() writes to two columns. age_grp is new: a factor that bins continuous age into the intervals (0,30], (30,50], (50,70], and (70,Inf]. smoker already existed in raw; it is overwritten with a factor whose level order is fixed as "No" then "Yes", so "No" is the reference level and regression coefficients align with that choice.

3. Why does the script use here("data/raw/cohort.csv") instead of an absolute path like "C:/Users/.../cohort.csv"? Give one practical scenario where this matters.

Model answer: here() resolves paths relative to the project root, so the same script works on any machine, in any user account, in any operating system, as long as the project structure is unchanged. Absolute paths break instantly when the project is moved, shared with a collaborator on a different OS, or run inside a Docker container. Concrete scenario: you send your code to a co-author for review. They unzip the project on their Mac at ~/projects/this_study/; your hardcoded C:/Users/.../ would fail immediately, while here("data/raw/cohort.csv") just works.

Managing Data-Collection Sheets

It is important to establish a permanent storage system for all original data-collection sheets (survey forms, data-collection forms, etc.) that makes it easy to retrieve individual sheets if they are needed during the analysis.

Protect your originals

Do not remove originals from your file. If you need a specific sheet for use at another location, make a photocopy. Never ship the original to another location without first making copies of all forms.

Track collection progress

Set up a system for recording the insertion of data-collection sheets into the file so that you know how many remain to be collected before further work begins.

Scan for completeness

Once all forms have been collected, scan through all sheets for their completeness before doing anything else. If there are omissions, returning to the data source to complete the data will be more likely to succeed if done soon after collection rather than weeks or months later.
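Once forms are keyed in, the tracking and completeness scan described above can be done in a few lines. This is a minimal base-R sketch; the data frame, column names, and the expected number of forms are hypothetical.

```r
# Hypothetical entered sheets: one row per data-collection form.
sheets <- data.frame(
  form_id = c("F001", "F002", "F003", "F004"),
  age     = c(34, NA, 51, 47),
  smoker  = c("No", "Yes", NA, "No"),
  sbp     = c(118, 142, NA, 131)
)
expected_forms <- 5   # assumed number of forms handed out

# How many forms are still outstanding?
n_outstanding <- expected_forms - nrow(sheets)

# Missing items per question, then per form
missing_by_item  <- colSums(is.na(sheets))
sheets$n_missing <- rowSums(is.na(sheets[, c("age", "smoker", "sbp")]))

# Forms to follow up while the source can still fill the gaps
follow_up <- sheets$form_id[sheets$n_missing > 0]
follow_up
```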

Knowledge Check — Section 1

1. What should you construct before beginning any work with your data?

Before working with your data, you should construct a plausible causal diagram. This identifies which variables are important outcomes and predictors, which are potential confounders, and which might be intervening variables.

2. Why is data analysis described as an “iterative process”?

Data analysis is iterative because as you gain more insight into your data, you often need to revisit earlier steps, revise your approach, and re-examine your variables and models.

3. What should you do if you find omissions in data-collection sheets?

Return to the data source promptly to fill the gaps. Completing missing data is more likely to succeed soon after the data were collected than weeks or months later, once the analysis has begun.

Reflection

Think of a research question you are interested in. Sketch out (describe) a causal diagram showing the key outcome, main predictors, potential confounders, and any intervening variables. How does this diagram help you plan your analysis?

Model answer: Pick a question (e.g., does air-pollution exposure during pregnancy lower birth weight?). DAG: prenatal PM2.5 exposure → birth weight, with confounders maternal age, SES, smoking, prenatal-care utilisation, neighbourhood environment, and pre-pregnancy BMI all pointing into both exposure and outcome. Intervening variables: placental function and maternal hypertension on the causal path. The DAG helps plan the analysis by (a) identifying the minimal sufficient adjustment set (for example with dagitty: the set of confounders to control for to identify the total effect of PM2.5); (b) flagging mediators that must NOT be adjusted for if the total effect is the question (placental function); (c) flagging potential colliders (e.g., gestational age) that must be considered carefully; (d) supporting a mediation analysis if the indirect path through placental function is the question. The DAG turns a list of "things to control for" into a structured causal claim.
Section 2

Data Coding, Entry & File Management

⏱ Estimated time: 20 minutes

Introduction and Overview

Section 1 set up the conceptual workflow and the discipline of starting from a causal diagram before you touch any data. Section 2 turns to the practical side: how do you encode raw responses, enter them into a workable file, organise files across a project, and keep track of what every variable in your dataset means? These are unglamorous tasks, but they determine whether the analysis you eventually run is reproducible. Organising data so that each variable is a column and each observation a row — the "tidy data" convention — makes downstream analysis dramatically easier (Wickham, 2014; Wikipedia: Tidy data).

Learning Objectives

  • Apply coding conventions for missing values, numeric codes, and avoiding compound codes.
  • Plan a data-entry workflow that minimises transcription error.
  • Lay out a project folder structure that distinguishes raw, processed, and analysis files.
  • Build and maintain a variable codebook that any collaborator could open and use.

Data Coding

Before entering data into a computer, careful coding is essential. Good coding practices prevent errors that can cascade throughout an entire analysis.

Missing Values
Click to learn more
🔢
Numeric Codes
Click to learn more
No Compound Codes
Click to learn more
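The three coding practices can be sketched in base R. Everything specific here is an assumption made for illustration: the code scheme (1 = Yes, 0 = No, -9 = Don't know), the variable names, and the compound "group" code (1 = male non-smoker, 2 = female non-smoker, 3 = male smoker, 4 = female smoker).

```r
# Hypothetical raw responses as they appear on the forms.
raw <- data.frame(
  id             = 1:5,
  vaccinated_txt = c("Yes", "No", "Don't know", "Yes", ""),
  group          = c(1, 2, 3, 4, 2)   # compound code: sex and smoking together
)

# Numeric codes: translate text with a pre-specified lookup (assumed scheme).
code_map <- c("Yes" = 1, "No" = 0, "Don't know" = -9)
raw$vaccinated_code <- unname(code_map[raw$vaccinated_txt])  # blank becomes NA

# Missing values: keep the coded column, and make an analysis copy
# where the "Don't know" code is a true missing value.
raw$vaccinated <- raw$vaccinated_code
raw$vaccinated[raw$vaccinated %in% -9] <- NA

# No compound codes: split the combined code into one variable per fact.
raw$sex    <- ifelse(raw$group %in% c(1, 3), "male", "female")
raw$smoker <- as.integer(raw$group %in% c(3, 4))

raw[, c("id", "vaccinated_code", "vaccinated", "sex", "smoker")]
```

Keeping both vaccinated_code and vaccinated preserves the distinction between "Don't know" and a blank answer in case it matters later.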

Data Entry

Some important issues to consider when entering your data into a computer file:

Double-data entry

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry. This dramatically reduces the error rate in your dataset.
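Comparing the two entry files is easy to script. The sketch below uses base R and two small hypothetical data frames standing in for the two keyed files; column names and values are made up.

```r
# Two independent keyings of the same three forms (hypothetical values).
entry_a <- data.frame(form_id = c("F001", "F002", "F003"),
                      age = c(34, 29, 51), sbp = c(118, 142, 127))
entry_b <- data.frame(form_id = c("F001", "F002", "F003"),
                      age = c(34, 92, 51), sbp = c(118, 142, 121))

# Align both files on the form identifier before comparing
entry_a <- entry_a[order(entry_a$form_id), ]
entry_b <- entry_b[order(entry_b$form_id), ]
stopifnot(identical(entry_a$form_id, entry_b$form_id))

# Locate every cell where the two files disagree
vars  <- c("age", "sbp")
mat_a <- as.matrix(entry_a[vars])
mat_b <- as.matrix(entry_b[vars])
diffs <- which(mat_a != mat_b, arr.ind = TRUE)

# One row per discrepancy, ready to check against the paper form
discrepancies <- data.frame(
  form_id  = entry_a$form_id[diffs[, "row"]],
  variable = vars[diffs[, "col"]],
  entry_a  = mat_a[diffs],
  entry_b  = mat_b[diffs]
)
discrepancies
```

One limitation of this sketch: a cell that is missing in one file and filled in the other compares as NA and is skipped by which(), so missing-versus-value mismatches need a separate is.na() comparison.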

Caution with spreadsheets

Spreadsheets are a convenient tool for initial data entry, but they must be used with extreme caution. It is possible to sort individual columns, which could destroy your entire dataset with one inappropriate “sort” command. Custom data-entry software provides a greater margin of safety.
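The single-column sort problem can be reproduced in R, which makes the danger easy to see. This is a base-R sketch on a hypothetical three-person dataset.

```r
# Hypothetical three-person dataset
people <- data.frame(id  = c(1, 2, 3),
                     age = c(52, 31, 44),
                     sbp = c(150, 112, 128))

# What a single-column spreadsheet sort does: age moves, sbp does not
broken <- people
broken$age <- sort(broken$age)

# Safe alternative: reorder whole rows together
sorted <- people[order(people$age), ]

# Is each person's (age, sbp) pair still intact?
pairs_original <- paste(people$age, people$sbp)
rows_intact_sorted <- all(paste(sorted$age, sorted$sbp) %in% pairs_original)
rows_intact_broken <- all(paste(broken$age, broken$sbp) %in% pairs_original)
c(sorted = rows_intact_sorted, broken = rows_intact_broken)
```

The scrambled version still looks like a valid dataset, with plausible values in every column, so the error would not be caught by range checks alone.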

Save and back up immediately

As soon as the data-entry process has been completed, save the original data files in a safe location. In large, expensive trials, keep a copy of all originals stored in another location. Convert your data to the format your statistical software uses as soon as possible.

Keeping Track of Files

It is important to have a system for keeping track of all your files. Key recommendations:

  • Assign a logical name with a 2-digit numerical suffix (e.g., brazil01). A 2-digit suffix allows you to have 99 versions that still sort correctly when listed alphabetically.
  • When data manipulations are carried out, save the file with a new name (the next available number). Do not change data and then overwrite the file.
  • Keep a simple log of files created with information about the contents (e.g., number of observations and variables).
Example: File Log for a Blood Pressure Study

bp01.odc (27/09/07) — Original blood pressure study data; spreadsheet; 1 record per measurement. 1092 obs, 8 vars.

bp01.dta (28/09/07) — Original file; Stata format. 1092 obs, 8 vars.

bp02.dta (30/09/07) — 45 records with missing values dropped. 1047 obs, 8 vars.
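The reason for the 2-digit suffix shows up as soon as a series reaches version 10. The sketch below is base R; the .csv extension and the next_version() helper are hypothetical, written only for this illustration.

```r
# Zero-padded suffixes keep versions in order when listed alphabetically.
unpadded <- paste0("bp", c(1, 2, 10), ".csv")
padded   <- sprintf("bp%02d.csv", c(1, 2, 10))

sort(unpadded, method = "radix")   # version 10 lands before version 2
sort(padded,   method = "radix")   # versions stay in numeric order

# Hypothetical helper: the next free name in a padded series.
next_version <- function(existing, stem = "bp", ext = ".csv") {
  # take the two characters that follow the stem as the version number
  suffix <- substr(existing, nchar(stem) + 1, nchar(stem) + 2)
  sprintf("%s%02d%s", stem, max(as.integer(suffix)) + 1L, ext)
}
next_version(c("bp01.csv", "bp02.csv"))
```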

R A reproducible recoding pipeline (no overwrites, ever)

The "save a new version, don't overwrite" rule is automatic if your transformations live in a script. The file log becomes the script's git history.

library(tidyverse)

# Read raw, never modify in place
bp_raw <- read_csv("data/raw/bp01.csv")

# Tidy: drop incomplete rows, build derived variables, lock factors
bp_clean <- bp_raw |>
  drop_na(systolic, diastolic, age) |>
  mutate(
    age_ct    = age - mean(age),                              # centred
    age_ctsq  = age_ct^2,                                       # quadratic term
    age_c3    = cut(age, c(0, 35, 55, Inf),
                    labels = c("young", "middle", "older")),
    htn       = factor(systolic >= 140 | diastolic >= 90,
                       levels = c(FALSE, TRUE),
                       labels = c("normotensive", "hypertensive"))
  )

# Persist as a new versioned file - and a small log line
write_csv(bp_clean, "data/processed/bp02.csv")
cat("bp02.csv", format(Sys.Date()), nrow(bp_clean), "obs",
    "\n", file = "data/file_log.txt", append = TRUE)

## At any time you can rebuild bp02 from bp01 by re-running this script.

This is the difference between “data analysis” and “data engineering”. A clicked-together SPSS workflow can't be re-derived. A script can — six months from now, by a colleague, on a different computer.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. The mutate() call creates four new variables (age_ct, age_ctsq, age_c3, htn). For each, state in one phrase what kind of variable it is (continuous, categorical, derived) and why a future analyst would want it pre-built.

Model answer: age_ct is a continuous variable (mean-centred age), useful so a regression intercept refers to a person of average age in the sample rather than to age zero. age_ctsq is a derived continuous variable (centred age squared) that lets the model fit non-linear age effects with less collinearity with linear age than an uncentred square would have. age_c3 is a categorical (factor) variable (three age groups), convenient for stratified summaries and clinical-grouping interpretation. htn is a derived categorical (binary) variable that allows easy contingency analyses and clinically intuitive subgroup reports. Pre-building these in the cleaning script means downstream analysis code references them by name instead of duplicating the recoding logic.

2. The threshold for htn is systolic >= 140 | diastolic >= 90. If you raised the systolic cutoff to 150, would the prevalence of "hypertensive" go up or down? What does this tell you about the sensitivity of categorical recodes to threshold choice?

Model answer: Raising the systolic cutoff to 150 would lower the prevalence of the hypertensive classification, or at most leave it unchanged: fewer people meet the systolic threshold, although anyone with diastolic >= 90 still qualifies. This illustrates how sensitive categorical recodes are to threshold choice; how far prevalence moves depends on how many people fall between the old and new cutoffs. The reproducibility lesson: any categorical cut-point should be defended by reference to clinical guidelines (or explicit alternative cut-points checked in sensitivity analyses), and the analysis script should expose the threshold as a named constant rather than buried in a formula.

3. The script writes bp02.csv rather than overwriting bp01.csv. Describe one error this rule would prevent that an SPSS point-and-click workflow would not.

Model answer: Writing bp02.csv instead of overwriting bp01.csv preserves an audit trail of derivations. SPSS point-and-click workflows commonly overwrite the working dataset, so a bug discovered three steps later cannot be undone: the original derived state is gone. With versioned outputs, you can re-run from any intermediate point, diff one version against another, and verify the consequence of a single cleaning decision. Concrete bug-prevention: a labelling error in bp02 (e.g., reversed levels) is recoverable because bp01 remains intact to be re-derived.

Keeping Track of Variables

Even a relatively focused study can give rise to a large number of variables once transformed and recoded variables have been created. Recommendations include:

Use short but informative names and have all related variables start with the same name. Long names can be shortened by removing vowels (e.g., wtr_cstrn for “water cistern”). If your statistics program is case sensitive, use ONLY lower-case letters. At some point, prepare a master list of all variables.

Variable    Description
age         Original data (in years)
age_ct      Age after centring by subtraction of the mean
age_ctsq    Quadratic term (age_ct squared)
age_c2      Age categorised into 2 categories (young vs old)
age_c3      Age categorised into 3 categories
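A first draft of the master variable list can be generated from the data itself, leaving only the descriptions to be typed by hand. This is a base-R sketch on a hypothetical four-row dataset; the young/old cut-point of 45 is an assumption.

```r
# Hypothetical analysis file using the naming scheme above
dat <- data.frame(age = c(25, 40, 61, NA))
dat$age_ct   <- dat$age - mean(dat$age, na.rm = TRUE)
dat$age_ctsq <- dat$age_ct^2
dat$age_c2   <- cut(dat$age, c(0, 45, Inf), labels = c("young", "old"))

# One row per variable: name, storage type, and missing count
master_list <- data.frame(
  variable  = names(dat),
  type      = unname(vapply(dat, function(v) class(v)[1], character(1))),
  n_missing = unname(vapply(dat, function(v) sum(is.na(v)), integer(1)))
)

# Descriptions are typed by hand; they cannot be derived from the data
master_list$description <- c(
  "Original data (in years)",
  "Age after centring by subtraction of the mean",
  "Quadratic term (age_ct squared)",
  "Age categorised into 2 categories (young vs old)"
)
master_list
```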
Knowledge Check — Section 2

1. Why should you never use compound codes?

Only code one piece of information in a single variable. Compound codes (e.g., 1=male Caucasian, 2=female Caucasian) make it extremely difficult to separate and analyse each characteristic independently.

2. What is the advantage of double-data entry?

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry because it dramatically reduces the entry error rate.

3. When data manipulations are carried out, what should you do with the file?

Save the file with a new name (the next available number in your naming convention) so you always have a record of all versions and can trace back to the original data if needed.

Reflection

Describe a file-naming and version-control system you would use for a dataset in your own research area. How would you organise the variable names for a study with demographic, clinical, and outcome variables?

Model answer: File naming: YYYYMMDD_studyname_dataset_version.ext (e.g., 20260516_smoking_cohort_clean_v03.csv). Version control with Git for code and DVC (Data Version Control) for data, plus weekly snapshots to a cloud backup. Variable naming: lowercase snake_case throughout; demographic prefix dem_ (e.g., dem_age, dem_sex, dem_education); clinical prefix cli_ (cli_bp_systolic, cli_hba1c); outcome prefix out_ (out_mi_5y, out_cvd_death). Avoid spaces and special characters; use a version suffix for re-derived variables (out_mi_5y_v2); a codebook (markdown + machine-readable JSON/YAML) accompanies the dataset documenting type, valid range, missing-data codes, and derivation logic. This is the difference between a dataset a stranger can reproduce in 6 months and one only the original analyst can navigate.
Section 3

Program Files, Data Editing & Verification

⏱ Estimated time: 15 minutes

Introduction and Overview

Section 2 organised your files and variables. Section 3 turns to the analyses themselves: working in program mode rather than interactively (so the work is reproducible), editing data systematically rather than ad hoc, and verifying that the dataset behaves as expected before you draw any inferences from it. The discipline you build here is what separates a defensible analysis from a fragile one, and it has been argued that reproducibility is now the minimum standard for evaluating computational research (Peng, 2011).

Learning Objectives

  • Contrast interactive and program-mode workflows and justify why programs are required for reproducible work.
  • Edit data through scripted, documented steps rather than ad hoc fixes to the raw file.
  • Run systematic verification checks on ranges, types, and internal consistency.
  • Decide when verification is sufficient to move on to substantive analysis.

Program Mode vs. Interactive Processing

Statistical programs can be used in an interactive mode (selecting items from menus or typing in a command) or in program mode (compiling a series of commands into a program and then running it).

Interactive mode is very useful for exploring your data and trying out analyses. However, it should not be used for any of the “real” processing and/or analysis because it is very difficult to keep a clear record of steps taken. Consequently, it is difficult or impossible to reconstruct the analyses you have completed.

Program mode is the recommended approach. You compile the commands into a program and then run it. These program files can be saved and used to reconstruct any analyses you have carried out. Key tips: name files logically, structure the program to be easy to follow, use sequential indents, and document the file thoroughly with comments.

Critical Rule

Do all of the analyses in your statistical program. Don’t start doing basic statistics in a spreadsheet. You are going to need the statistical program eventually, and it will be much easier to keep track of all your analyses if they are all done there.

Data Editing

Before beginning any analyses, spend time editing your data. The most important components are:

  • Labelling variables
  • Labelling categories
  • Missing value codes
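The three editing components listed above can be sketched in a few lines. This is a minimal illustration rather than a prescribed method: the variable names (dem_sex, dem_age), the label dictionaries, and the use of −999 as the missing-value code are assumptions chosen to match examples used elsewhere in this lesson. A statistical package will normally provide its own labelling commands.

```python
# Hypothetical codebook entries: variable labels, category labels, missing codes.
VARIABLE_LABELS = {"dem_sex": "Sex of participant", "dem_age": "Age in years"}
CATEGORY_LABELS = {"dem_sex": {0: "male", 1: "female"}}
MISSING_CODES = {"dem_sex": {-999}, "dem_age": {-999}}


def edit_record(raw):
    """Return a copy of one raw record with missing-value codes set to None."""
    clean = {}
    for name, value in raw.items():
        if value in MISSING_CODES.get(name, set()):
            clean[name] = None
        else:
            clean[name] = value
    return clean


def display_value(name, value):
    """Translate a stored code into its category label for output."""
    if value is None:
        return "missing"
    return CATEGORY_LABELS.get(name, {}).get(value, value)


# Illustrative raw records (made up for this sketch).
raw_rows = [
    {"dem_sex": 0, "dem_age": 34},
    {"dem_sex": 1, "dem_age": -999},
    {"dem_sex": -999, "dem_age": 58},
]
clean_rows = [edit_record(row) for row in raw_rows]
for row in clean_rows:
    print({VARIABLE_LABELS[k]: display_value(k, v) for k, v in row.items()})
```

Keeping the labels and missing codes in one place (here, three dictionaries) means the same definitions can feed both the cleaning script and the codebook.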

Data Verification

Before you start any analyses, you must verify that your data are correct. This can be combined with data processing and involves going through all of your variables, one-by-one.

For continuous variables
  • Determine the number of valid observations and the number of missing values
  • Check the maximum and minimum values (or the 5 smallest and 5 largest) to make sure they are reasonable; if they are not, find the error, correct it, and repeat the process
  • Prepare a histogram of the data to get an idea of the distribution and see if it looks reasonable
For categorical variables
  • Determine the number of valid observations and the number of missing values
  • Obtain a frequency distribution to see if the counts in each category look reasonable (and to make sure there are no unexpected categories)
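The checks above translate directly into small helper functions. The sketch below uses only the Python standard library; the function names and the example data (including the implausible age of 999 and the misspelled category "femle") are invented for illustration, and None is assumed to mark a missing value.

```python
from collections import Counter


def verify_continuous(values, n_extremes=5):
    """Summarise one continuous variable; None marks a missing value."""
    valid = sorted(v for v in values if v is not None)
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "smallest": valid[:n_extremes],
        "largest": valid[-n_extremes:],
    }


def text_histogram(values, bin_width):
    """Print a crude text histogram so the distribution can be eyeballed."""
    bins = Counter((v // bin_width) * bin_width for v in values if v is not None)
    for start in sorted(bins):
        print(f"{start:>6} to {start + bin_width:<6} {'#' * bins[start]}")
    return bins


def verify_categorical(values, expected):
    """Frequency table plus any categories that were not expected."""
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "counts": dict(counts),
        "unexpected": sorted(set(counts) - set(expected)),
    }


# Illustrative data with one implausible age and one misspelled category.
ages = [34, 41, None, 29, 999, 52, 47, 38, None, 61]
sexes = ["male", "female", "female", None, "male", "femle", "male"]

age_check = verify_continuous(ages)
sex_check = verify_categorical(sexes, expected={"male", "female"})
print(age_check)
age_bins = text_histogram(ages, bin_width=10)
print(sex_check)
```

Once an error such as the age of 999 is traced and corrected in the cleaning script, re-running the same checks repeats the verification, as the text recommends.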
Knowledge Check — Section 3

1. Why should program mode be preferred over interactive mode for “real” data analysis?

Program mode compiles commands into a program file that can be saved and reused, making it possible to reconstruct and reproduce all analyses. Interactive mode makes it difficult to keep a clear record of steps taken.

2. When verifying continuous variables, what should you examine first?

For continuous variables, the first verification step is to determine the number of valid observations and missing values, then check the minimum and maximum values (or the 5 smallest and 5 largest) to make sure they are reasonable.

3. What is the purpose of attaching labels to categorical variable values?

Categorical variables should have meaningful labels attached to each category (e.g., sex coded as 0 or 1 should have labels “male” and “female” attached) so that output is immediately interpretable.

Reflection

Imagine you receive a dataset where a colleague entered data interactively in a spreadsheet with no documentation. What steps would you take to clean, verify, and prepare the data for analysis? What problems might you encounter?

Model answer: Steps: (1) Inventory: list all variables, their declared types, missing-value patterns, and value ranges; flag any variable with no obvious purpose. (2) Range checks: identify implausible values (age = 999, blood pressure = 0) and recode to NA with documentation. (3) Consistency checks: cross-variable plausibility (pregnancy in men, dates of birth after dates of death). (4) De-duplication: check for duplicated participant IDs and resolve. (5) Recoding: standardise binary and categorical variables to consistent encodings. (6) Audit trail: log every cleaning decision in a script that re-creates the cleaned dataset from the raw file. (7) Documentation: build a codebook from scratch. Problems likely encountered: free-text entries in numeric fields ("unknown", "N/A", "-"), mixed date formats, spreadsheet auto-conversion errors (e.g., Excel converting gene names to dates), inconsistent missing-data codes (., -, 99, 999), and variables you cannot interpret. The audit trail is the single most valuable artefact: it makes the cleaning reproducible and reviewable.
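Two of the steps in the model answer, de-duplication and cross-variable consistency checks, are easy to script. The sketch below is one possible form under stated assumptions: the field names (id, sex, pregnant, birth, death) and the three records are hypothetical, and the checks only flag problems; resolving them still requires going back to the source.

```python
from collections import Counter
from datetime import date


def duplicate_ids(rows):
    """Participant IDs that appear on more than one row."""
    counts = Counter(r["id"] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def consistency_flags(row):
    """Cross-variable plausibility checks for a single record."""
    flags = []
    if row["sex"] == "male" and row["pregnant"]:
        flags.append("pregnancy recorded for a male participant")
    if row["death"] is not None and row["birth"] > row["death"]:
        flags.append("date of birth after date of death")
    return flags


# Illustrative records (made up for this sketch).
rows = [
    {"id": 101, "sex": "male", "pregnant": False,
     "birth": date(1980, 5, 1), "death": None},
    {"id": 102, "sex": "male", "pregnant": True,
     "birth": date(1975, 2, 9), "death": None},
    {"id": 102, "sex": "female", "pregnant": False,
     "birth": date(1990, 7, 3), "death": date(1989, 1, 1)},
]

dupes = duplicate_ids(rows)
# Map row position -> list of problems, keeping only rows that have any.
flagged = {i: f for i, f in ((i, consistency_flags(r)) for i, r in enumerate(rows)) if f}
print(dupes)
print(flagged)
```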
Reflection saved!
* Complete the quiz and reflection to continue.
Section 4

Data Processing & Unconditional Associations

⏱ Estimated time: 20 minutes

Introduction and Overview

Sections 1–3 produced a verified, well-documented dataset. Section 4 finally takes the step every student is impatient to make: the actual analysis. We start with how to process outcome and predictor variables for analysis, handle multilevel data structure, and run the “unconditional associations” (single-predictor descriptions) that are the necessary first look at any dataset before you touch a multivariable model.

Learning Objectives

  • Process outcome variables to fit the planned analysis (categorical, continuous, count, or rate).
  • Process predictor variables, including categorisation, scaling, and recoding decisions.
  • Recognise multilevel structure in your dataset and the implications for later modelling.
  • Run and interpret unconditional (single-predictor) associations as the first analytic step.
  • Keep an analytic log that allows you to reconstruct every decision you made.

Processing the Outcome Variable(s)

While verifying data, you can also start processing your outcome variable(s). Review the stated goals of the study to determine the format(s) which best suits the goal(s). Consider the following based on outcome type:

Categorical outcomes

Is the distribution of outcomes across categories acceptable? For example, if you planned a multinomial regression with a 3-category outcome, but very few observations fall in one of the categories, you might want to recode it to a 2-category variable.
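A quick way to see whether a category is too sparse, and to collapse it if so, is sketched below. The 5% cut-off, the category names, and the counts are assumptions made for illustration only; what counts as "very few" depends on the study and the planned model.

```python
from collections import Counter


def sparse_categories(outcomes, min_share=0.05):
    """Return categories holding less than min_share of the observations."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total < min_share)


def collapse(outcomes, mapping):
    """Recode outcome categories using an explicit, documented mapping."""
    return [mapping.get(o, o) for o in outcomes]


# Hypothetical 3-category outcome with very few "severe" cases.
outcome = ["none"] * 60 + ["mild"] * 37 + ["severe"] * 3
sparse = sparse_categories(outcome)

# Merge "mild" and "severe" to give a 2-category outcome ("none" vs "any").
outcome_2cat = collapse(outcome, {"mild": "any", "severe": "any"})
print(sparse, Counter(outcome_2cat))
```

Writing the recode as an explicit mapping keeps the decision visible in the program file, so it can be revisited if the research goal changes.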

Continuous outcomes

Does the variable have the characteristics necessary for the planned analysis? If linear regression is planned, is the distribution approximately normal? If not, explore transformations. Note: It is the normality of the residuals which is ultimately important, but if the original variable is far from normal and there are no strong predictors, the residuals are unlikely to be normal.
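As a rough first screen, you can compare the skewness of the outcome before and after a candidate transformation. The sketch below uses a log transformation on made-up, right-skewed values; skewness is only one informal indicator, and, as noted above, the residuals from the fitted model are what ultimately need checking.

```python
import math
import statistics


def skewness(values):
    """Skewness (third standardised moment); near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed outcome (e.g., a biomarker concentration).
raw = [1.2, 1.5, 1.9, 2.1, 2.4, 3.0, 3.8, 5.5, 9.7, 24.0]
logged = [math.log(v) for v in raw]  # log requires strictly positive values

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

In this example the log transformation reduces the skewness but does not remove it, which is a reminder to follow the numeric check with a histogram.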

Count / rate outcomes

If Poisson regression is planned, are the mean and variance of the distribution approximately equal? If not, consider negative binomial regression or alternative analytic approaches.
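The mean-variance comparison can be reduced to a single ratio. The counts below are invented for illustration. Note that this is an unconditional check: the apparent overdispersion may shrink once predictors are in the model.

```python
import statistics


def dispersion_ratio(counts):
    """Variance-to-mean ratio; a value near 1 is what a Poisson model expects."""
    return statistics.variance(counts) / statistics.fmean(counts)


# Hypothetical counts of clinic visits per participant.
visits = [0, 0, 1, 1, 1, 2, 2, 3, 5, 15]
ratio = dispersion_ratio(visits)
print(
    f"mean={statistics.fmean(visits):.1f}, "
    f"variance={statistics.variance(visits):.1f}, ratio={ratio:.1f}"
)
```

A ratio well above 1, as here, is the situation in which the text suggests considering negative binomial regression.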

Time-to-event outcomes

What proportion of the observations are censored? You might also want to generate a simple graph of the empirical hazard function to get an idea what shape it has.
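Both quantities can be computed by hand for a first look. The sketch below assumes events are coded 1 = event observed and 0 = censored, and estimates a crude life-table style hazard (events per unit of person-time) within fixed-width intervals. The follow-up times, the interval width, and the function names are hypothetical; a survival-analysis package would normally be used for anything beyond this exploratory check.

```python
def proportion_censored(events):
    """Share of observations that are censored (1 = event, 0 = censored)."""
    return events.count(0) / len(events)


def interval_hazards(times, events, width):
    """Crude hazard per interval: events divided by person-time at risk."""
    hazards = {}
    start = 0
    while start < max(times):
        end = start + width
        # Person-time each subject still at risk contributes inside the interval.
        person_time = sum(min(t, end) - start for t in times if t > start)
        n_events = sum(
            1 for t, e in zip(times, events) if e == 1 and start < t <= end
        )
        hazards[(start, end)] = n_events / person_time if person_time else None
        start = end
    return hazards


# Hypothetical follow-up times (years) and event indicators.
times = [2, 3, 3, 5, 6, 8, 8, 9, 10, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]

censored = proportion_censored(events)
hazards = interval_hazards(times, events, width=5)
print(f"censored: {censored:.0%}")
for (start, end), h in hazards.items():
    print(f"({start}, {end}]: hazard = {h:.3f} per year")
```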

Processing Predictor Variables

It is important to go through all predictor variables to determine how they will be handled:

  • Missing values: Are there many? If so, you might need to abandon plans to use that predictor, or conduct 2 analyses (one on the subset where the predictor is present and one on the full dataset ignoring the predictor).
  • Distribution: For continuous variables, is there a reasonable representation over the whole range of values? If not, it might be necessary to categorise the variable.
  • Categorical variables: Are all categories reasonably well represented? If not, you might have to combine categories.
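The three predictor checks above can be scripted as follows. This is a sketch under stated assumptions: the variable names (bmi, education), the minimum count of 5 per category, and the use of quartiles as cut-points are illustrative choices, not rules from the text.

```python
from collections import Counter


def missing_share(values):
    """Proportion of observations that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def sparse_levels(values, min_count=5):
    """Categories with fewer than min_count valid observations."""
    counts = Counter(v for v in values if v is not None)
    return {level: n for level, n in counts.items() if n < min_count}


def quartile_groups(values):
    """Categorise a continuous predictor into groups 1-4 (None stays None)."""
    valid = sorted(v for v in values if v is not None)
    # Simple empirical cut-points at roughly the 25th, 50th, and 75th percentiles.
    cuts = [valid[len(valid) * q // 4] for q in (1, 2, 3)]
    return [None if v is None else 1 + sum(v >= c for c in cuts) for v in values]


# Illustrative predictors (made up for this sketch).
bmi = [19, 21, 22, 23, 24, 25, 27, 31, None, None, 44, 22]
education = ["secondary"] * 12 + ["tertiary"] * 9 + ["none"] * 2 + [None]

bmi_missing = missing_share(bmi)
bmi_groups = quartile_groups(bmi)
edu_sparse = sparse_levels(education)
print(f"bmi missing: {bmi_missing:.0%}", bmi_groups, edu_sparse)
```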

Multilevel Data

If your data are multilevel (e.g., blood pressure measurements within individuals within centres), evaluate the hierarchical structure:

Key Questions for Multilevel Data

What is the average number (and range) of observations at one level within each higher-level unit? Are individuals uniquely identified within a hierarchical level? It is often useful to create one unique identifier for each observation in the dataset.
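These questions can be answered with a few counts. The sketch below assumes a hypothetical three-level structure (blood pressure measurements within persons within centres) in which person_id restarts within each centre, a common source of identifier problems. All names and values are invented for illustration.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical records: (centre_id, person_id, systolic_bp).
# person_id restarts at 1 within each centre, so it is NOT unique on its own.
records = [
    ("A", 1, 128), ("A", 1, 131), ("A", 2, 140),
    ("B", 1, 117), ("B", 1, 119), ("B", 1, 121), ("B", 2, 135),
]


def cluster_sizes(keys):
    """Average and range of the number of rows sharing each key."""
    sizes = list(Counter(keys).values())
    return {"mean": statistics.fmean(sizes), "min": min(sizes), "max": max(sizes)}


# Measurements per person: the person key must combine centre and person_id.
per_person = cluster_sizes((c, p) for c, p, _ in records)

# Persons per centre.
persons_by_centre = defaultdict(set)
for centre, person, _ in records:
    persons_by_centre[centre].add(person)
per_centre = {c: len(p) for c, p in persons_by_centre.items()}

# person_id values that occur in more than one centre are ambiguous alone.
ambiguous_ids = sorted(
    p for p in {p for _, p, _ in records}
    if sum(p in ids for ids in persons_by_centre.values()) > 1
)

# One unique identifier per observation.
observations = [
    {"obs_id": i, "centre": c, "person": p, "sbp": v}
    for i, (c, p, v) in enumerate(records, start=1)
]
print(per_person, per_centre, ambiguous_ids)
```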

Unconditional Associations

Before proceeding with any multivariable analyses, it is important to evaluate unconditional associations within the data. These serve as the foundation for building more complex models.

Variable types | Analytical approach
Two continuous variables | Correlation coefficient, scatterplot, simple linear regression
One continuous + one categorical | One-way ANOVA, simple linear or logistic regression
Two categorical variables | Cross-tabulation and χ² test

When evaluating unconditional associations, pay attention to:

  • Associations between predictors and outcome: Determine if there is any association at all; determine the functional form (is it linear?); get a simple picture of the strength and direction.
  • Associations between pairs of predictors: Look for potential collinearity problems (highly correlated predictors).
  • Confounding variables: Evaluate associations between the confounding variables and the key predictors of interest and the outcome.
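To make the three approaches in the table concrete, the sketch below computes the test statistic for each one using only the Python standard library: Pearson's correlation coefficient, the one-way ANOVA F statistic, and Pearson's χ² statistic (without continuity correction). The data are made up, and the functions return statistics only; in practice you would use your statistical package, which also supplies p-values and confidence intervals.

```python
import math
import statistics
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation coefficient for two continuous variables."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of continuous values."""
    all_values = [v for g in groups for v in g]
    grand = statistics.fmean(all_values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def chi_square(pairs):
    """Pearson chi-square statistic and degrees of freedom from (row, col) pairs."""
    observed = Counter(pairs)
    rows = Counter(r for r, _ in pairs)
    cols = Counter(c for _, c in pairs)
    n = len(pairs)
    stat = sum(
        (observed[(r, c)] - rows[r] * cols[c] / n) ** 2 / (rows[r] * cols[c] / n)
        for r in rows for c in cols
    )
    return stat, (len(rows) - 1) * (len(cols) - 1)


# Illustrative data (invented for this sketch).
age = [30, 40, 50, 60, 70]
sbp = [118, 121, 129, 135, 142]
sbp_by_group = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
smoking_by_status = (
    [("smoker", "case")] * 20 + [("smoker", "control")] * 10
    + [("non-smoker", "case")] * 10 + [("non-smoker", "control")] * 20
)

r = pearson_r(age, sbp)
f_stat = anova_f(sbp_by_group)
chi2, df = chi_square(smoking_by_status)
print(f"r={r:.3f}, F={f_stat:.2f}, chi2={chi2:.2f} (df={df})")
```

The same functions can be applied to pairs of predictors to screen for collinearity, and to each potential confounder against both the key predictor and the outcome.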

Keeping Track of Your Analyses

Before starting the more substantial analysis, set up a system for keeping track of your results:

  • Analyse in blocks
  • Keep log files
  • Label and date
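One way to put these habits into practice is to have each program open a log file that shares its name and records the run date. The sketch below is an assumption about how this could look in Python, using the standard logging module; the program name brazil03.py is hypothetical, and a temporary folder is used only so the example can run anywhere. Statistical packages usually have their own log commands.

```python
import logging
import tempfile
from datetime import date
from pathlib import Path


def open_log(program_file):
    """Create a logger that writes to <program name>.log beside the program."""
    program = Path(program_file)
    log_path = program.with_suffix(".log")
    logger = logging.getLogger(program.stem)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    # Label and date the log so printed output can be traced to its program.
    logger.info("Program: %s | run date: %s", program.name, date.today().isoformat())
    return logger, log_path


# Hypothetical program file in a temporary working folder.
workdir = Path(tempfile.mkdtemp())
logger, log_path = open_log(workdir / "brazil03.py")

# Analyse in blocks: mark where each block of analyses starts.
logger.info("Block 1: unconditional associations, n=%d", 250)
logger.info("Block 2: collinearity checks among predictors")
print(log_path.name)
```

Because the log shares the program's name and differs only in its extension, it is easy to match any set of results to the program that produced them.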
Knowledge Check — Section 4

1. Why should you evaluate unconditional associations before multivariable analyses?

Evaluating unconditional associations before multivariable analyses helps you understand the basic relationships in your data, identify collinearity, detect confounding, and determine the functional form of relationships—all of which inform the complex models you will subsequently build.

2. If a continuous outcome variable is far from normally distributed, what should you do?

If the continuous outcome is not approximately normally distributed, you should explore transformations which might normalise the distribution. It is ultimately the normality of the residuals that is important, but a far-from-normal variable with no strong predictors will produce non-normal residuals.

3. What is the appropriate analytical approach for evaluating the association between two categorical variables?

For associations between two categorical variables, cross-tabulation and the chi-squared (χ²) test are the appropriate analytical approaches. These are particularly useful for identifying unexpected observations.

Reflection

Consider a dataset with 15 predictor variables and one continuous outcome. Describe the sequence of unconditional analyses you would carry out before fitting any multivariable models. How would you handle a predictor that has 30% missing values?

Model answer: For 15 predictors and a continuous outcome, the unconditional analyses before multivariable modelling are: (1) Univariate summary for each predictor (mean, SD, range, n missing) and the outcome (histogram, Q-Q plot for normality). (2) Bivariate associations: scatterplot or boxplot of each predictor against the outcome, with Pearson/Spearman correlation or an appropriate group-comparison test; report effect size and 95% CI. (3) Pairwise correlations among predictors to detect collinearity. (4) Predictor-outcome plots stratified by potential effect modifiers (e.g., sex, age group). For the predictor with 30% missing: (a) characterise the missingness pattern (random, related to other variables, related to outcome); (b) if MCAR, complete-case analysis may suffice but loses power; (c) if MAR, use multiple imputation by chained equations (MICE) with sensitivity analyses; (d) if MNAR is plausible, run pattern-mixture or selection models to bound the inference. Never silently drop the variable or impute the mean; both can bias the result.
Final Assessment

A Structured Approach to Data Analysis — Final Assessment

15 questions • 100% required to pass

Bringing It All Together

This lesson laid out a structured workflow for taking a dataset from collection to first analysis. Section 1 insisted that you start with a causal diagram and a clear separation of outcomes, predictors, confounders, and intervening variables — before you touch the data. Section 2 turned that intent into discipline at the level of files: thoughtful coding, a project folder you could hand to a collaborator, and a codebook that captures every variable. Section 3 made the workflow reproducible by moving you out of point-and-click interactive mode into program-mode scripts, with systematic editing and verification. Section 4 closed the loop by processing outcomes and predictors, surfacing multilevel structure, and producing the unconditional associations that should always precede a multivariable model.

The thread running through all four sections is that an analysis is only as trustworthy as the steps that came before it. Every later lesson in HSCI 410 — linear regression, model building, logistic regression, count data, survival, mixed models — assumes you arrive at the modelling step with a clean, documented, and well-understood dataset. The structured approach is what makes the rest of the course possible. It also positions you to interpret p-values, confidence intervals, and effect sizes responsibly when you eventually report them (Wasserstein & Lazar, 2016; Greenland et al., 2016), and to avoid the analytic flexibility that drives false-positive results (Simmons, Nelson, & Simonsohn, 2011; see also Wikipedia: John Tukey on the historical roots of exploratory data analysis).

Key Takeaways from Lesson 1

  • Always start with a causal diagram: it forces you to declare your outcome, predictors, confounders, and intervening variables before the data can mislead you.
  • Data analysis is iterative; expect to back up several steps as you learn more about your data.
  • Coding and file-management decisions made in the first hour shape every analysis you run afterward — treat them as part of the analysis, not pre-work.
  • Use program mode, not interactive clicks: a script you can re-run is the only honest record of what you did.
  • Verify the dataset (ranges, types, consistency) before you trust any descriptive or inferential output.
  • Run unconditional associations first; they reveal data problems and effect sizes that a multivariable model will hide.

Final Reflection

Reflecting on the entire lesson, what do you consider the most important step in the structured approach to data analysis? How would you apply this structured approach to a dataset you are currently working with or plan to work with in the future?

Model answer: The most important step is the causal-question specification before any modelling: deciding what relationship you want to estimate (total vs. direct vs. mediated effect), drawing the DAG that encodes assumed structure, and identifying the minimal adjustment set before looking at the data. Everything downstream (variable cleaning, transformation, model class, software choice, reporting) is conditional on this decision. In practice, applying it to a current dataset: write the question as a single sentence, draw the DAG (using dagitty.net), compute the adjustment set, lock the analytic specification in a pre-registered protocol, then proceed to cleaning and modelling without further peeking at outcome distributions. This sequence is what separates confirmatory from exploratory analysis; both are valuable, but only the confirmatory analysis supports causal claims, and only this structured approach makes the distinction enforceable.
Final Assessment — Lesson 1 (15 Questions)

1. What is the first step recommended before beginning any data analysis?

The first step is to construct a plausible causal diagram of the problem, identifying outcomes, predictors, confounders, and intervening variables.

2. Why should you avoid starting analyses in a spreadsheet?

Doing all analyses in the statistical program makes it easier to keep track of all analyses and simplifies tracking modifications to the data.

3. What coding value should NOT be assigned to missing data?

The specific number assigned to missing values must not be a legitimate value for any of the responses. Common conventions include large negative numbers like −999.

4. Why is a 2-digit numerical suffix recommended for file names?

A 2-digit suffix allows you to have 99 versions of a file that will sort correctly when listed alphabetically (e.g., brazil01, brazil02, ... brazil99).

5. What is the danger of using the “sort” command in a spreadsheet for data entry?

In spreadsheets, it is possible to sort individual columns independently, which can destroy your entire dataset with one inappropriate “sort” command by misaligning records across columns.

6. What is the primary purpose of evaluating unconditional associations between pairs of predictors?

Associations between pairs of predictors are evaluated to detect potential collinearity problems, where highly correlated predictors can cause instability in multivariable models.

7. When processing a categorical outcome with 3 categories, when might you recode it to 2 categories?

If you planned a multinomial regression with a 3-category outcome, but there are very few observations in 1 of the 3 categories, you might want to recode it to a 2-category variable.

8. What approach should be used to document what a program file does?

All statistical programs allow you to add comments to the program files. These should document what the program does and, in some cases, record key results within the file itself.

9. For verifying a continuous variable, what visual tool is recommended?

For continuous variables, preparing a histogram gives you an idea of the distribution and allows you to see if it looks reasonable before proceeding with further analysis.

10. What is the appropriate analysis for the association between one continuous and one categorical variable?

For the association between one continuous and one categorical variable, one-way ANOVA, simple linear regression, or logistic regression are appropriate analytical approaches.

11. If a predictor variable has many missing values, what options are available?

If many values are missing, you might abandon plans to use that predictor, or conduct 2 analyses: one on the subset in which the predictor is present and one on the full dataset ignoring the predictor.

12. What should you do with log files from your analyses?

Give log files the same name as the program file (except with a different extension) so that it is easy to match the program that generated a particular set of results.

13. Why is interactive mode still useful despite its limitations?

Interactive mode is very useful for exploring your data and trying out analyses. However, the “real” processing and analysis should be done in program mode for reproducibility.

14. When evaluating confounding variables, what should you specifically look for?

Special attention needs to be paid to potential confounding variables by evaluating the associations between these variables and the key predictors of interest and the outcome, particularly if there is a strong association with both.

15. What does the chapter suggest you should do if a count/rate outcome’s mean and variance are not approximately equal?

If the mean and variance of a count/rate outcome are not approximately equal, Poisson regression assumptions may be violated. Consider negative binomial regression or alternative analytic approaches.

Lesson 1 Complete!

You have successfully completed A Structured Approach to Data Analysis. Your responses have been downloaded.

Lesson 2 — Data Cleaning and Descriptive Analyses — takes the next concrete step. With data collected and organized, the next job is detecting outliers, addressing missing values, and producing the descriptive summaries (Table 1) that every analysis report eventually needs. The structured workflow you built here makes that work tractable.