HSCI 410 — Lesson 2

Data Cleaning &
Descriptive Analyses

Exploratory Data Analysis For Epidemiology — HSCI 410

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Identify common types of data errors and their sources in the data pipeline
  • Distinguish between missing data mechanisms (MCAR, MAR, MNAR) and their analytic implications
  • Apply systematic data cleaning strategies including range checks, outlier detection, and consistency verification
  • Select and implement appropriate methods for handling missing data
  • Calculate and interpret measures of central tendency, spread, and shape
  • Construct and interpret descriptive summaries appropriate for epidemiologic research
  • Assess distributional assumptions and select appropriate visualization techniques

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary — Key Terms, People & Concepts

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments. Type in the search box to filter entries.

Key Concepts & Ideas
Tidy Data A data layout where each variable is a column, each observation is a row, and each cell holds a single value. Coined by Hadley Wickham; the foundation of the tidyverse workflow.
Data Pipeline The full sequence of steps that takes raw data through cleaning, transformation, and analysis to a final result. Errors can enter at any stage; tracking them is essential.
Codebook / Data Dictionary Documentation that describes every variable in a dataset (name, type, allowable values, units, coding, source). Required for reproducible analysis.
Outlier An observation that lies far from the rest of the distribution. May reflect a genuine extreme value, a measurement error, or a data-entry mistake; investigate before removing.
Range Check Verifying that values for a variable fall within plausible biological or logical limits (e.g., human age between 0 and 120). A first-pass tool for spotting impossible entries.
Frequency Table A tabular summary showing the count (and often the percentage) of each level of a categorical variable. The simplest descriptive tool for nominal/ordinal data.
Missingness Patterns of missing values in a dataset. Classified by mechanism (MCAR/MAR/MNAR); the mechanism governs whether complete-case analysis or imputation is appropriate.
Type Conversion (Coercion) Changing a variable's data type (e.g., character to numeric, factor to integer). Mistyped variables are a common source of analysis errors in R.
Derived Variable A new variable computed from existing ones (e.g., BMI from height and weight, age categories from age). Document the formula in the data dictionary.
Methods & Statistical Concepts
Mean The arithmetic average. Sensitive to outliers; appropriate for symmetric continuous data.
Median The 50th percentile — the value that splits the distribution in half. Robust to outliers and skew; preferred for non-normal continuous data.
Standard Deviation (SD) A measure of spread equal to the square root of the variance. Reported alongside the mean for symmetric data.
Interquartile Range (IQR) The difference between the 75th and 25th percentiles (Q3 − Q1). A robust measure of spread; reported with the median.
Q-Q Plot A quantile-quantile plot comparing observed data to a theoretical distribution (often the normal). Points falling on the diagonal suggest the distribution fits.
Histogram A bar chart of binned counts that displays the empirical distribution of a continuous variable. Useful for spotting skew, modality, and outliers.
Box Plot A graphical summary showing the median, IQR (box), whiskers (typically 1.5×IQR), and outlier points. Compact for comparing groups.
Skewness Asymmetry of a distribution. Positive (right) skew has a long right tail; negative (left) skew has a long left tail. Influences choice of mean vs. median.
Multiple Imputation A principled method for handling missing data: generate multiple plausible values per missing cell, analyse each completed dataset, and pool results using Rubin's rules.
dplyr A tidyverse R package providing verbs (filter, select, mutate, summarize, group_by, arrange) for grammar-of-data-manipulation workflows.
janitor An R package for routine data-cleaning tasks: standardising column names (clean_names), tabyls, removing empty rows/columns, and finding duplicates.
ggplot2 A tidyverse R package implementing the grammar of graphics. Supports descriptive plots (histograms, box plots, scatterplots) used throughout exploratory analysis.
Key People
John Tukey (1915–2000) American statistician who coined the term “exploratory data analysis” (EDA) and invented the box plot, stem-and-leaf display, and many tools for visual data inspection.
Hadley Wickham (1979– ) New Zealand statistician and chief scientist at Posit (RStudio). Created ggplot2, dplyr, tidyr, and the broader tidyverse; defined “tidy data”.
No matching entries. Try a different search term.
Section 1

Data Quality & Types of Errors

⏱ Estimated time: 15 minutes

Introduction and Overview

Lesson 1 set up the structured workflow that takes raw data from collection to analysis-ready files. Lesson 2 zooms into two specific phases of that pipeline that deserve their own treatment: cleaning the data (detecting errors, handling missing values, deciding how to deal with outliers) and producing descriptive analyses (the numerical summaries and visualisations that every analysis report eventually needs). The three content sections walk through this in order: data quality and types of error (Section 1), the cleaning workflow itself (Section 2), and descriptive statistics and Table 1 (Section 3).

Learning Objectives

  • Trace where errors enter the collection → entry → storage → analysis pipeline.
  • Distinguish between common error types (random vs. systematic, structural vs. value-level).
  • Classify missing-data mechanisms (MCAR, MAR, MNAR) and explain why the distinction matters.
  • Describe variable types in epidemiologic data and the operations each type supports.

The Data Pipeline

Before any statistical analysis can begin, data must travel through a pipeline: collection → entry → storage → analysis. Errors can be introduced at every stage, and understanding where problems arise is the first step toward ensuring high-quality data for epidemiologic research.

Key Principle

The quality of your analysis can never exceed the quality of your data. Rigorous attention to data quality at each stage of the pipeline is not merely good practice—it is a prerequisite for valid epidemiologic inference. Data quality is multidimensional, going beyond accuracy to include completeness, timeliness, consistency, and relevance to the user (Wang & Strong, 1996).

At the collection stage, errors may arise from poorly worded survey items, interviewer bias, instrument malfunction, or participant misunderstanding. During data entry, transcription mistakes and coding errors are common. Storage introduces risks of data corruption, version control problems, and format incompatibilities. Finally, at the analysis stage, incorrect variable coding, inappropriate transformations, and software errors can compromise results.

Types of Data Errors

Errors in epidemiologic datasets take many forms. Van den Broeck, Argeseanu Cunningham, Eeckels, & Herbst (2005) describe data cleaning as a three-stage process — screening, diagnosing, and editing — that should be planned into study design rather than improvised after collection. Click each card below to learn about common error types.

Transcription Errors
Click to learn more
📄
Coding Errors
Click to learn more
Logical Inconsistencies
Click to learn more
📈
Out-of-Range Values
Click to learn more
📋
Duplicate Records
Click to learn more

Missing Data

Missing data are ubiquitous in epidemiologic research (Wikipedia, n.d.). The mechanism by which data become missing has profound implications for the validity of subsequent analyses. Rubin (1976) identified three missing data mechanisms:

🎲
MCAR
Missing Completely at Random
🔍
MAR
Missing at Random
🚫
MNAR
Missing Not at Random

Important Warning

The missing data mechanism is an assumption that cannot be verified from the observed data alone. You should always conduct sensitivity analyses to assess how your results might change under different assumptions about the missing data mechanism.

Data Dictionaries and Codebooks

A data dictionary (or codebook) is a structured document that describes every variable in a dataset: its name, label, type, valid range, coding scheme, and source. Maintaining a comprehensive data dictionary is essential for:

  • Ensuring consistency across data entry operators
  • Facilitating communication among research team members
  • Supporting reproducibility when analyses are revisited months or years later
  • Identifying and resolving coding discrepancies

Variable Types in Epidemiologic Data

Correctly classifying variables is fundamental to choosing appropriate cleaning strategies and analytic methods.

📈
Continuous
Interval/ratio scale
📊
Categorical
Nominal & ordinal
Binary
Two categories
🔢
Count
Discrete events
Knowledge Check — Section 1

1. A researcher discovers that a participant’s recorded date of diagnosis is earlier than their date of birth. What type of error is this?

This is a logical inconsistency because two fields within the same record contradict each other—a diagnosis cannot precede birth. While it may have been caused by a transcription error, the error type is classified by what is observed in the data.

2. Under which missing data mechanism can complete-case analysis still produce unbiased estimates?

Under MCAR, the probability of missingness is unrelated to any data values, so the observed cases are a random subset of the full sample. Complete-case analysis is unbiased but less precise due to reduced sample size.

3. Which of the following is NOT a primary purpose of a data dictionary?

A data dictionary documents variable names, labels, types, valid ranges, and coding schemes. It supports data quality and reproducibility but does not play a role in determining statistical significance.

Complete the knowledge check to continue.

Section 2

Data Cleaning Strategies

⏱ Estimated time: 20 minutes

Introduction and Overview

Section 1 named the kinds of errors that can enter a dataset. Section 2 provides the cleaning workflow that addresses them: systematic verification, outlier identification, missing-data handling, transformations, recoding, and string cleaning. Each step has its own techniques and its own decisions to document.

Learning Objectives

  • Run range, type, and consistency checks as a first systematic verification pass.
  • Identify outliers and decide whether to keep, transform, or exclude them, with reasons recorded.
  • Choose an appropriate missing-data strategy (complete-case, single imputation, multiple imputation) given the mechanism.
  • Apply transformations and derive new variables without losing the audit trail back to the raw data.
  • Document every cleaning decision so that another analyst could reproduce your final dataset.

Systematic Verification

Data cleaning should follow a systematic, documented process. The goal is not to “fix” data arbitrarily but to identify and resolve genuine errors while preserving the integrity of valid observations.

Range Checks

Range checks verify that values fall within predefined valid or plausible bounds. For example, gestational age should be between approximately 20 and 44 weeks; human body temperature typically ranges from 35°C to 42°C.

Define both hard limits (biologically impossible values that should be flagged as errors) and soft limits (unusual but possible values that should be reviewed).

Consistency Checks

Consistency checks examine whether values across multiple fields are logically compatible. Examples include verifying that discharge dates occur after admission dates, that a participant’s age is consistent with their reported date of birth, and that skip patterns in questionnaires are respected.

These checks often require domain knowledge about the relationships between variables.

Cross-Validation

Cross-validation involves comparing data from multiple sources to verify accuracy. For example, comparing self-reported medication use with pharmacy dispensing records, or verifying diagnoses against medical chart reviews.

When discrepancies arise, establishing a hierarchy of data source reliability is essential for deciding which value to retain.

Identifying Outliers

Outliers are observations that lie far from the bulk of the data. They may represent genuine extreme values, data errors, or observations from a different population. The visual tools used to spot them — box plots, stem-and-leaf displays, and the broader practice of looking at data before modelling — trace back to Tukey (1977). Several methods are used to identify them:

The interquartile range (IQR) method defines outliers as values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the 25th and 75th percentiles, respectively.

IQR Outlier Bounds
Lower = Q1 − 1.5 × IQR   |   Upper = Q3 + 1.5 × IQR

This method is robust to extreme values because it is based on percentiles rather than the mean and standard deviation. Values beyond 3×IQR from the quartiles are sometimes termed “extreme outliers.”

The z-score method standardizes each observation as z = (x − mean) / SD. Values with |z| > 3 are commonly flagged as potential outliers.

Z-Score
z = (x − x̄) / s

This approach assumes approximately normal distribution and can be misleading for heavily skewed data because the mean and SD are themselves influenced by outliers (the masking effect).

Visual inspection using boxplots, histograms, and scatterplots is often the most informative first step in outlier detection.

  • Boxplots display the median, IQR, and individual outliers beyond the whiskers
  • Histograms reveal gaps in the distribution that may indicate erroneous values
  • Scatterplots can reveal bivariate outliers that appear normal when each variable is examined individually

Visual methods should complement, not replace, quantitative rules. Always investigate outliers before deciding to exclude them.

Practical Tip

Never delete an outlier simply because it is extreme. Investigate whether it represents a data error, a genuinely unusual observation, or a member of a different subpopulation. Document every decision to retain or remove an outlier in your audit trail.

Handling Missing Data

The choice of method for handling missing data depends on the assumed missing data mechanism and the proportion of missingness. Practical guidance on implementing multiple imputation via chained equations — including how to specify the imputation model, how many imputations to draw, and how to diagnose problems — is given by White, Royston, & Wood (2011).

Listwise Deletion (Complete-Case Analysis)

Excludes any observation with one or more missing values. Simple to implement, but can dramatically reduce sample size and introduces bias unless data are MCAR. In multivariable models with many variables, even low rates of missingness on individual variables can lead to large cumulative losses.

Pairwise Deletion (Available-Case Analysis)

Uses all available data for each analysis. For example, when computing a correlation matrix, each pair of variables uses all cases with non-missing values for that pair. This preserves more data than listwise deletion, but can produce inconsistent results (e.g., a correlation matrix that is not positive semi-definite).

Single Imputation (Mean, Median, Mode)

Mean imputation replaces missing values with the variable mean. While simple, it underestimates variance and distorts relationships between variables. Median imputation is more robust for skewed distributions. Mode imputation is used for categorical variables.

All single imputation methods treat the imputed value as if it were known with certainty, leading to underestimated standard errors and overly narrow confidence intervals.

Multiple Imputation (Introduction)

Multiple imputation creates several (typically 5–20) plausible imputed datasets, analyzes each separately, and pools the results using Rubin’s rules. This approach correctly accounts for the uncertainty due to missing data and produces valid inference under the MAR assumption.

Multiple imputation is currently considered the gold standard for handling missing data in epidemiologic research when the proportion of missingness is non-trivial (Sterne et al., 2009).

Data Transformations

When continuous variables are heavily skewed, transformations can make the distribution more symmetric, stabilize variance, and improve the validity of parametric methods.

TransformationFormulaWhen to Use
Log (natural)ln(x) or ln(x + 1)Right-skewed data, multiplicative relationships (e.g., biomarker concentrations, income)
Square root√xCount data, moderately right-skewed distributions
Inverse (reciprocal)1/xStrongly right-skewed data; commonly used for rates and times

Remember

When you transform a variable, all subsequent interpretations must account for the transformation. For example, a regression coefficient from a log-transformed outcome represents a multiplicative (rather than additive) change. Always back-transform results for reporting.

Recoding and Derived Variables

Recoding involves creating new variables from existing ones. Common examples include collapsing categories (e.g., combining “current smoker” and “former smoker” into “ever smoker”), categorizing continuous variables into clinically meaningful groups (e.g., BMI categories), and computing derived variables such as person-years of follow-up or age at onset.

String Cleaning and Standardization

Free-text fields often contain inconsistencies: “Vancouver,” “vancouver,” “VANCOUVER,” and “Vancoouver” may all refer to the same city. Standardization strategies include converting to consistent case, trimming whitespace, applying regular expressions for pattern matching, and using lookup tables or fuzzy matching algorithms.

Documenting the Cleaning Process

Every data cleaning decision should be recorded in an audit trail. This includes what was changed, why, by whom, and when. Reproducibility requires that another analyst could replicate the cleaning process from the original raw data using your documentation — an expectation echoed in the Statistical Methods in Psychology Journals guidelines (Wilkinson & the Task Force on Statistical Inference, 1999).

R Activity — Read, fix classes, and clean impossible values

This block uses the course dataset phaa_survey.csv — a simulated public-health survey of ~800 adults that we will reuse from this lesson onward. The full annotated script lives in r-activities/HSCI_410_Lesson_2_Data_Cleaning_and_Descriptive_Analyses.R.

# 1. Read the raw file. Treat blank cells as missing (na.strings).
phaa <- read.csv(file = "phaa_survey.csv",
                 stringsAsFactors = FALSE,
                 na.strings = c("", "NA"))

# 2. Convert variables that should be numeric (some came in as character
#    because the raw data contained impossible values like a negative age).
phaa$age <- as.numeric(phaa$age)
phaa$bmi <- as.numeric(phaa$bmi)

# 3. Range checks: replace impossible values with NA
phaa$age[phaa$age < 18 | phaa$age > 100] <- NA
phaa$systolic_bp[phaa$systolic_bp < 80 | phaa$systolic_bp > 220] <- NA
phaa$bmi[phaa$bmi < 13 | phaa$bmi > 60] <- NA

# 4. Order education and income so that comparisons (<= ...) are meaningful
phaa$education <- factor(phaa$education,
  levels = c("Less than high school", "High school", "Some college",
             "Bachelor's", "Graduate degree"),
  ordered = TRUE)

Why never overwrite the raw file? Range-check rules will change as you learn more about the data. Keeping phaa_survey.csv untouched lets you replay the cleaning from a known starting state.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. Look at summary(phaa$age) before and after the line phaa$age <- as.numeric(phaa$age). How many NAs appear, and what does that count tell you about the raw data quality?

Model answerBefore as.numeric(), age is character; summary() would show only counts of distinct values, no NAs. After conversion, summary() reports several NAs — typically 5–30 in the simulated dataset, representing entries that couldn't be parsed as numeric (text such as "unknown", "30s", or blank). The NA count is a direct measure of data-quality flaws in the raw entry: each NA is a hidden free-text response that the original data collection accepted but cannot be analysed quantitatively.

2. After the range-check on systolic_bp, how many rows were set to NA? Why is replacing impossible values with NA preferable to deleting those rows entirely?

Model answerThe range check (e.g., 70–220 mmHg) typically flags 10–30 rows as out-of-range and sets them to NA. Replacing with NA is preferable to deletion because (a) it preserves the row for other variables (you don't lose 30 observations to delete one bad blood pressure); (b) it documents in the data itself that this value was flagged; (c) the missingness then enters multiple-imputation pipelines like any other NA, allowing principled handling. Deleting rows loses information silently and can introduce selection bias if implausible values cluster non-randomly.

3. After making education an ordered factor, what does levels(phaa$education) show? Why does the ordering matter for any analysis that asks "as education goes up, does X change?"

Model answerlevels(phaa$education) shows them in the specified order, e.g., "<HS" < "HS" < "Some college" < "Bachelor's" < "Graduate". Ordering matters because an ordered factor lets R interpret "higher education" as a meaningful direction in regression and tests — linear contrasts become interpretable, and trend tests (Cochran-Armitage, ordinal regression) work as intended. Without the ordering R defaults to alphabetical, which makes "Bachelor's" < "Graduate" < "HS" < "Some college" < "<HS" — nonsensical for inferential purposes.
Saved.
Applied Example: Audit Trail

A cohort study of cardiovascular disease collected blood pressure measurements on 2,400 participants. During cleaning, 15 systolic values exceeded 300 mmHg. The analyst traced 12 of these to a decimal-point shift (e.g., 1,200 instead of 120.0) and corrected them using the original paper forms. Three could not be verified and were set to missing with a note: “Original form illegible; value implausible.” All decisions were logged in a cleaning script with date stamps.

Reflection

Think about a dataset you have worked with (or imagine a large epidemiologic study). What data cleaning challenges would you expect, and how would you prioritize your cleaning steps? What would your audit trail include?

Model answerFor a large epidemiologic dataset (e.g., a multi-site cohort), expected challenges: heterogeneous data entry across sites, free-text encoded in numeric fields, inconsistent missing-data codes (-9, 999, blank), date-format collisions (US vs. ISO), encoding mis-translations from older systems, and outcome adjudication that wasn't blinded. Priority order: (1) inventory and codebook the raw data; (2) range and consistency checks per variable; (3) cross-variable plausibility (impossible combinations); (4) standardise dates, units, and missing codes; (5) de-duplication; (6) derive analytic variables; (7) document. Audit trail: every step logged in a versioned script, intermediate datasets saved with version numbers, decision log (e.g., "impossible BMI > 80 set to NA on 2026-05-16, n = 12 rows affected"), and the final cleaned dataset accompanied by an HTML diagnostic report and a codebook auto-generated from the cleaning script.
Reflection saved.
Knowledge Check — Section 2

1. What is the primary disadvantage of mean imputation for handling missing data?

Mean imputation replaces missing values with the sample mean, which artificially reduces variability in the data. It also attenuates correlations between variables because the imputed values carry no information about their relationship with other variables.

2. Which outlier detection method is most robust for skewed distributions?

The IQR method is based on percentiles (Q1 and Q3), which are resistant to extreme values. The z-score method relies on the mean and standard deviation, both of which are influenced by outliers and skewness, making it less robust for non-normal distributions.

3. A log transformation is most appropriate for which type of distribution?

The natural log transformation compresses the right tail and stretches the left, making it ideal for right-skewed distributions commonly seen with biomarker concentrations, income, and other variables with multiplicative processes.

Complete the reflection and knowledge check to continue.

Section 3

Descriptive Analyses

⏱ Estimated time: 25 minutes

Introduction and Overview

Sections 1 and 2 produced a clean, well-documented dataset. Section 3 turns that dataset into the descriptive summaries that any analysis report opens with — means and medians, spreads and shapes, frequency distributions, “Table 1” characteristics by exposure group, and the visualisations that go with them. Doing this carefully is what lets readers understand what your data look like before they trust your inferential models.

Learning Objectives

  • Choose the appropriate measure of central tendency, spread, and shape for each variable type and distribution.
  • Build frequency distributions and cross-tabulations to summarise categorical data.
  • Select visualisations that match the variable type and the question you are asking.
  • Construct a publication-ready “Table 1” of participant characteristics, including stratified comparisons.
  • Assess normality and decide when a transformation or non-parametric approach is warranted.

Measures of Central Tendency

Measures of central tendency summarize the “typical” value in a distribution. The choice depends on the variable type and distributional shape.

The arithmetic mean (x̄) is the sum of all values divided by the number of observations. It uses all data points and is the basis of many parametric statistical methods.

Arithmetic Mean
x̄ = (1/n) Σ xi

When to use: Symmetric or approximately normal distributions. Avoid for heavily skewed data or when outliers are present, as the mean is pulled toward extreme values.

The median is the middle value when observations are ordered from smallest to largest. For an even number of observations, it is the average of the two middle values.

When to use: Skewed distributions, ordinal data, or when outliers are present. The median is robust to extreme values—unlike the mean, adding or removing an outlier has minimal effect on the median.

The mode is the most frequently occurring value. A distribution may be unimodal, bimodal, or multimodal.

When to use: Categorical (nominal) data, where neither the mean nor the median is meaningful. It is also useful for identifying common response values in continuous data (e.g., digit preference).

Measures of Spread

Spread (or dispersion) describes how much variability exists in the data around the central value.

MeasureFormula / DefinitionProperties
RangeMaximum − MinimumSensitive to outliers; uses only 2 data points
Variance (s²)Σ(xi − x̄)² / (n − 1)Average squared deviation; units are squared
Standard Deviation (s)√s²Same units as original variable; most commonly reported
IQRQ3 − Q1Robust to outliers; covers middle 50% of data

Reporting Convention

For normally distributed variables, report mean ± SD. For skewed distributions, report median (IQR) or median (Q1, Q3). This convention is standard in epidemiologic publications and should be followed in “Table 1” (Wilkinson & the Task Force on Statistical Inference, 1999).

Measures of Shape

Beyond central tendency and spread, the shape of a distribution provides critical information for choosing analytical methods.

Skewness

Skewness measures the asymmetry of a distribution. A skewness of 0 indicates perfect symmetry.

  • Positive (right) skew: Long right tail; mean > median. Common in variables like income, hospital length of stay, and biomarker concentrations.
  • Negative (left) skew: Long left tail; mean < median. Less common but seen in variables like age at death in developed countries.

As a rule of thumb, |skewness| > 1 indicates substantial skew, though this depends on sample size.

Kurtosis

Kurtosis describes the “tailedness” of a distribution relative to a normal distribution (which has a kurtosis of 3, or excess kurtosis of 0).

  • Leptokurtic (excess kurtosis > 0): Heavier tails and a sharper peak; more extreme values than expected under normality.
  • Platykurtic (excess kurtosis < 0): Lighter tails and a flatter peak; fewer extreme values than expected.

High kurtosis can indicate the presence of outliers or a mixture of subpopulations.

Frequency Distributions and Cross-Tabulations

For categorical variables, frequency distributions display the count and percentage of observations in each category. Cross-tabulations (contingency tables) display the joint distribution of two categorical variables and are the basis for computing measures of association such as odds ratios and risk ratios.

Example: Cross-Tabulation in an Outbreak Investigation
Cases (n)Controls (n)Total
Exposed453075
Unexposed156075
Total6090150

From this 2×2 table, the odds ratio = (45 × 60) / (30 × 15) = 6.0, suggesting a strong association between exposure and disease.

Visualization Techniques

Appropriate visualization depends on the variable type and the question being asked. Exploratory data analysis (Wikipedia, n.d.) emphasises looking at data first so structure, gaps, and anomalies are visible before any model is fit.

Histograms display the distribution of a continuous variable by dividing values into bins and counting observations per bin. They reveal distributional shape, modality, gaps, and potential outliers. The choice of bin width affects interpretation—too few bins obscure patterns; too many create noise.

Boxplots display the median, IQR, whiskers (typically 1.5×IQR), and individual outliers. They are ideal for comparing distributions across groups (e.g., blood pressure by treatment arm) and for quickly identifying asymmetry and outliers.

Bar charts display frequencies or proportions for categorical variables. Use them for comparing counts across groups. Avoid using bar charts for continuous data (use histograms instead) and avoid 3D bar charts, which distort proportions.

Scatter plots display the relationship between two continuous variables and can reveal linear or nonlinear associations, clusters, and outliers. Line graphs are appropriate for displaying trends over time (e.g., incidence rates by year). Both are essential in epidemiologic exploratory analysis.

R Activity — Descriptives, plots, and a stratified Table 1

Continuing from the cleaning block above, calculate descriptives and produce the four standard plots. Base R is enough; psych::describe() is convenient for many variables at once.

# Categorical descriptives ---------------------------------------------------
table(phaa$gender)                            # frequencies
prop.table(table(phaa$gender))                # proportions
round(prop.table(table(phaa$gender)) * 100, 1)  # %

# Cross-tab: gender by smoker, row percentages
round(prop.table(table(phaa$gender, phaa$smoker), margin = 1) * 100, 1)

# Numeric descriptives --------------------------------------------------------
summary(phaa$age)
sd(phaa$age, na.rm = TRUE)
IQR(phaa$age, na.rm = TRUE)

# psych::describe() summarises many variables at once
library(psych)
describe(phaa[, c("age", "bmi", "systolic_bp", "phys_act_min")])

# Plots: histogram, boxplot stratified by exposure, bar chart -----------------
hist(phaa$systolic_bp, main = "Systolic BP", xlab = "mmHg")

boxplot(systolic_bp ~ smoker, data = phaa,
        main = "Systolic BP by smoking status", ylab = "mmHg")

barplot(table(phaa$gender), main = "Gender distribution")

Tip: always plot before you fit a regression. A boxplot of systolic_bp by smoker tells you in a second whether the comparison you are about to do in Lesson 3 will be driven by a few extreme values or by a real shift in the centre of the distribution.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From round(prop.table(table(phaa$gender, phaa$smoker), margin = 1) * 100, 1), which gender category has the highest percentage of current smokers? What is the magnitude of the gender gap in smoking?

Model answerFrom the row-proportions table, the male category typically has the highest percentage of current smokers (often 18–25%), compared to ~10–15% in females. The gender gap is around 5–10 percentage points, consistent with national survey findings. Direction matters: this is a row-percentage table (margin=1), so each row sums to 100% within gender — reading across, you see smoking prevalence within each gender.

2. From describe(phaa[, c("age", "bmi", "systolic_bp", "phys_act_min")]), which variable has the largest skew? Which is most clearly approximately normal? How would you justify reporting median (IQR) vs mean (SD) for each in a Table 1?

Model answerdescribe() from psych returns skew values for each variable. Physical-activity minutes typically have the largest skew (right-skewed: most people moderate, a few very high), while age and BMI are roughly normal. For Table 1: report median (IQR) for skewed variables (phys_act_min) because the median is less pulled by extremes; report mean (SD) for approximately normal variables (age, BMI). Systolic BP is moderately skewed and either summary works, though many epidemiology journals default to mean (SD) for clinical readability.

3. Look at the boxplot(systolic_bp ~ smoker). Does the median for smokers appear higher, lower, or similar to non-smokers? Are there any visible outliers, and would you expect them to bias a t-test of the difference in means?

Model answerThe boxplot typically shows the smoker median modestly higher than non-smoker (a 5–10 mmHg gap), with both groups having outliers in the high-BP range. Outliers above the upper whisker can bias a t-test of means toward statistical significance if they cluster in one group; the practical fix is to use a Wilcoxon rank-sum (which is rank-based) as a sensitivity test, or to inspect the outliers for data-entry errors. If outliers are real clinical values (genuine hypertensives), they belong in the analysis; report both means and medians side by side.
Saved.

Descriptive Statistics for Epidemiologic Data: “Table 1”

In epidemiologic publications, “Table 1” typically presents the baseline characteristics of the study population, often stratified by exposure or outcome status. How time-to-event outcomes and group comparisons are displayed alongside Table 1 has been the subject of methodological scrutiny — Pocock, Clayton, & Altman (2002) catalogue common pitfalls in survival plots and baseline reporting. A well-constructed Table 1 includes:

  • Demographics (age, sex, ethnicity) with appropriate summary statistics
  • Key clinical or exposure variables
  • Missing data counts for each variable
  • Comparison statistics (p-values or standardized differences) between groups
Example: Table 1 Structure
CharacteristicExposed (n=200)Unexposed (n=300)p-value
Age, mean ± SD52.3 ± 11.449.8 ± 12.10.02
Female, n (%)110 (55.0%)162 (54.0%)0.82
BMI, median (IQR)27.1 (24.0–31.5)25.8 (23.2–29.4)0.004
Current smoker, n (%)48 (24.0%)51 (17.0%)0.05
Diabetes, n (%)34 (17.0%)30 (10.0%)0.02

Note how continuous variables with normal distributions use mean ± SD, while skewed variables (BMI) use median (IQR). Categorical variables are presented as n (%).

Stratified Descriptive Analyses

Stratifying descriptive statistics by key exposure or covariate groups allows you to assess the distribution of potential confounders across exposure categories. This is a critical step before proceeding to multivariable modelling, as it helps identify imbalances that may require adjustment.

Building Scales: Internal Consistency and Factor Analysis

Many public-health surveys ask several items that together measure a latent construct (e.g., depression, anxiety, social support). Two questions need to be answered before you can use these items as a single scale variable in a regression: (1) internal consistency — do the items hang together? — assessed with Cronbach’s α; and (2) dimensionality — how many underlying factors do the items measure? — assessed with exploratory factor analysis (EFA). Once both checks pass, the items can be summed (or averaged) into a derived scale variable that you carry forward into Lessons 3-5.

R Activity — Cronbach’s α and exploratory factor analysis

The course dataset contains seven depression items (dep1dep7) and five anxiety items (anx1anx5). Below we (1) confirm internal consistency, (2) decide on the number of factors, and (3) build the derived scale scores we use for the rest of the course.

# 1. Internal consistency for each candidate scale ----------------------------
library(psych)
dep_items <- phaa[, c("dep1","dep2","dep3","dep4","dep5","dep6","dep7")]
anx_items <- phaa[, c("anx1","anx2","anx3","anx4","anx5")]

alpha(dep_items)             # raw_alpha >= 0.70 is acceptable, >= 0.80 is good
alpha(anx_items)

# 2. How many underlying factors? ------------------------------------------
fa.parallel(dep_items)        # scree plot suggests one factor

# 3. Inspect a one-factor solution for the depression items ----------------
factanal(x = dep_items, factors = 1, rotation = "varimax")
# Loadings >= ~0.40 contribute meaningfully to the factor.

# 4. Two-factor solution combining dep + anx items -------------------------
combined <- na.omit(cbind(dep_items, anx_items))
factanal(x = combined, factors = 2, rotation = "varimax")
# dep1-7 should load on one factor; anx1-5 on the other.

# 5. Build derived scale variables for use in Lessons 3-5 ------------------
phaa$dep_score <- rowSums(dep_items, na.rm = TRUE)
phaa$anx_score <- rowSums(anx_items, na.rm = TRUE)
summary(phaa$dep_score)

# 6. Save the cleaned, scale-augmented file --------------------------------
write.csv(phaa, "phaa_survey_clean.csv", row.names = FALSE)
Console output (truncated)
Reliability analysis (depression items) raw_alpha std.alpha average_r 0.89 0.89 0.54 Loadings (1-factor): Factor1 dep1 0.75 dep2 0.70 dep3 0.66 dep4 0.69 dep5 0.61 dep6 0.72 dep7 0.55

How to read this. A α near 0.90 says the seven depression items are measuring essentially the same construct. The single factor explains a large share of the variance and every item loads > 0.50 on it, so summing the items into a dep_score is justified. We will use this score as a continuous predictor in Lesson 3 and as a covariate in the logistic model for hypertension in Lesson 5.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. What is the raw alpha for the depression items? For the anxiety items? Which scale (if either) clears the conventional 0.80 threshold for "good" internal consistency, and which item (look at the "Reliability if an item is dropped" table) contributes least to the depression scale?

Model answerRaw α for the 7 depression items is typically around 0.85–0.88 (good internal consistency); the anxiety scale around 0.75–0.80 (borderline). The depression scale clears the 0.80 threshold; the anxiety scale is just below. The "reliability if dropped" table identifies one depression item whose removal would actually raise α slightly (suggesting it is somewhat off-construct) — that item contributes least to the scale.

2. From fa.parallel(dep_items), how many factors have eigenvalues above the parallel-analysis line? Does the one-factor solution from factanal() show all seven loadings >= 0.40?

Model answerfa.parallel() typically shows one factor above the parallel-analysis line for the depression items, supporting a unidimensional structure. The one-factor solution from factanal() usually has all seven loadings ≥ 0.40, confirming that all items track the same latent construct.

3. In the two-factor combined solution, do the depression items load on a different factor than the anxiety items? Cite one specific loading from your output that supports your answer.

Model answerIn the two-factor combined solution, the depression and anxiety items load on different factors. A specific example: depression item DEP3 ("I feel down or depressed") might load 0.78 on Factor 1 and 0.12 on Factor 2, while anxiety item ANX2 ("I feel restless or worried") loads 0.05 on Factor 1 and 0.81 on Factor 2. The cross-loadings (each item < 0.40 on the other factor) and within-factor loadings (each item > 0.60 on its own factor) together support the interpretation that the two scales measure distinct (but correlated) latent constructs.
Saved.

Normality Assessment

Assessing whether a variable follows a normal distribution informs the choice between parametric and non-parametric methods.

Visual Methods

Q-Q (quantile-quantile) plots compare observed quantiles to theoretical normal quantiles. Points falling along the diagonal line suggest normality; systematic departures indicate non-normality. Histograms with a normal curve overlay also provide quick visual assessment.

Formal Tests

The Shapiro-Wilk test is generally preferred for samples under 5,000. The Kolmogorov-Smirnov test (with Lilliefors correction) is an alternative. Both test the null hypothesis that data come from a normal distribution.

Caveat: With very large samples, these tests will reject normality even for trivial departures. Visual assessment should always accompany formal tests.

Implications for Analysis Choice

If a variable is approximately normal, parametric tests (t-tests, ANOVA, Pearson correlation) are appropriate. For non-normal distributions, consider non-parametric alternatives (Wilcoxon rank-sum, Kruskal-Wallis, Spearman correlation) or data transformations. The Central Limit Theorem provides some protection for large samples, but descriptive summaries should still reflect the actual distribution.

Reflection

Consider how you would construct a “Table 1” for an epidemiologic study examining the relationship between physical activity and cardiovascular disease. What variables would you include, how would you summarize each, and what stratification would you use?

Model answerTable 1 for an epidemiologic PA-CVD cohort: rows = baseline characteristics, columns stratified by exposure group (e.g., low/moderate/high physical activity). Variables: demographic (age, sex, ethnicity, education, income), clinical (BMI, systolic/diastolic BP, total cholesterol, HbA1c, smoking, alcohol, family history of CVD), and exposure-related (specific PA measures: MET-hours/week, vigorous-PA minutes, sedentary time). Summaries: mean (SD) for approximately normal continuous variables, median (IQR) for skewed, n (%) for categorical. Tests in the rightmost column: ANOVA for normally distributed, Kruskal-Wallis for skewed, χ² for categorical — but be cautious: p-values in Table 1 are often misinterpreted as evidence of confounding; better practice is to report standardised mean differences for assessing imbalance after matching/weighting. Stratification by age and sex is the standard supplementary view.
Reflection saved.
Knowledge Check — Section 3

1. For a heavily right-skewed variable, which summary statistics should be reported?

For skewed distributions, the median is a more representative measure of central tendency than the mean, and the IQR is a more robust measure of spread than the standard deviation. This is the standard reporting convention in epidemiologic research.

2. A distribution has excess kurtosis greater than 0. What does this indicate?

Positive excess kurtosis (leptokurtic) indicates heavier tails and a sharper peak compared to a normal distribution, meaning more extreme values are present than would be expected under normality. This may reflect outliers or a mixture of subpopulations.

3. What is the primary purpose of stratifying descriptive statistics by exposure group in epidemiologic research?

Stratifying by exposure group reveals whether potential confounders are unevenly distributed between groups. This information is essential for planning multivariable analyses and identifying variables that may require adjustment.

4. A Shapiro-Wilk test yields p < 0.001 with a sample of 50,000. Which interpretation is most appropriate?

With very large samples, formal normality tests have excessive power and will reject the null hypothesis for negligible departures from normality. Visual assessment (Q-Q plots, histograms) should always accompany formal tests to evaluate practical significance of any departure.

Complete the reflection and knowledge check to continue.

Section 4

Final Assessment

⏱ Estimated time: 15 minutes

Bringing It All Together

Lesson 2 followed a single dataset from raw to descriptive. Section 1 named the errors — collection, entry, storage, and analysis-stage problems — and unpacked the missing-data mechanisms (MCAR, MAR, MNAR) that determine which cleaning approaches are even valid. Section 2 turned that vocabulary into a workflow: systematic range, type, and consistency checks; outlier triage; principled missing-data handling; transformations and derived variables; and a documentation trail you could hand to a peer reviewer. Section 3 closed the loop with the descriptive layer that every paper opens with — central tendency, spread, shape, frequency tables, visualisations, and a Table 1 that stratifies on the primary exposure or outcome.

The point is that descriptive analysis is not a warm-up. It is the layer where data problems surface, where assumptions get tested, and where readers calibrate how much trust to place in the inferential models that follow. From here we can move into the regression machinery of the rest of the course knowing that the dataset under it has been understood, not just processed.

Key Takeaways from Lesson 2

  • Errors enter the data pipeline at every stage; cleaning is about locating them, not arbitrarily fixing values.
  • The missing-data mechanism (MCAR, MAR, MNAR) determines which handling strategies remain valid — complete-case analysis is rarely defensible.
  • Outliers should be investigated, not deleted by reflex; document keep/exclude/transform decisions.
  • Choose central tendency, spread, and shape statistics based on variable type and distributional shape.
  • A well-built Table 1 stratified by exposure or outcome is often the most informative single output of an analysis.
  • Cleaning and descriptive work must leave a documented trail; reproducibility is not optional.

Reflection

Reflecting on the entire lesson, describe a systematic workflow you would follow when receiving a new epidemiologic dataset—from initial data quality assessment through to producing a complete descriptive summary. What tools, checks, and documentation would you use at each step?

Model answerA defensible workflow: (1) Receipt: archive the raw file unchanged to data/raw/ with a SHA-256 checksum. (2) Inventory: structural read (rows, columns, types), summary of every variable. (3) Variable-by-variable cleaning: range checks, plausibility, missing-data codes. (4) Derivation: build analytic variables (age groups, BP categories, composite scales). (5) Descriptive summary: univariate and bivariate, by exposure or outcome. (6) Diagnostic report: HTML/PDF auto-generated showing distribution plots, missingness, and outlier counts for each variable. (7) Codebook: machine-readable (JSON/YAML) accompanying the cleaned dataset, with derivation logic for each variable. Tools: R (tidyverse + janitor + summarytools), Python (pandas + great_expectations) or Stata (codebook + ds), with Git for code, DVC for data, and Quarto/Rmd for the diagnostic report. Documentation at every step: the decision log explains every NA, every recode, every dropped row.
Reflection saved.

Final Assessment

You must answer all 15 questions correctly (100%) and complete the final reflection to finish this lesson.

Final Assessment — Lesson 2 (15 Questions)

1. Which stage of the data pipeline is most susceptible to transcription errors?

Transcription errors occur when data are transferred from one medium to another, which is the defining activity of data entry. Double data entry is a key strategy for detecting such errors.

2. In a dataset where missingness of income depends on the actual income value itself (high earners are less likely to report), the missing data mechanism is:

When the probability of missingness depends on the unobserved (missing) values themselves, the mechanism is MNAR. In this case, the actual income value influences whether it is reported, regardless of other observed characteristics.

3. Which method for handling missing data correctly accounts for imputation uncertainty?

Multiple imputation creates several plausible imputed datasets and pools results using Rubin’s rules, which correctly incorporates the additional uncertainty due to missing data into standard errors and confidence intervals.

4. Using the IQR method, a value is classified as an outlier if it falls:

The IQR method defines outliers using the first and third quartiles. Values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged as potential outliers. This is the same criterion used in standard boxplots.

5. What is the primary limitation of listwise deletion?

Listwise deletion excludes any observation with missing data on any variable in the analysis, which can substantially reduce the effective sample size. It only produces unbiased estimates under the MCAR assumption.

6. For a normally distributed continuous variable, the conventional summary statistics to report are:

For normally distributed variables, the mean and standard deviation are the most informative summary statistics because the normal distribution is fully characterized by these two parameters. This is the standard reporting convention in epidemiologic research.

7. A variable measuring “number of emergency department visits in the past year” is best classified as:

This variable represents discrete events (visits) counted over a time period, making it a count variable. Count variables are non-negative integers and may follow a Poisson or negative binomial distribution.

8. Which visualization is most appropriate for comparing the distribution of a continuous variable across three treatment groups?

Boxplots allow direct comparison of medians, IQRs, and outliers across groups. Side-by-side boxplots are ideal for comparing the distribution of a continuous variable across multiple categorical groups.

9. In a cross-tabulation of exposure and disease, the odds ratio is computed as:

The odds ratio from a 2×2 table is calculated as the cross-product ratio: (a × d) / (b × c), where cells are arranged with exposed cases (a), exposed controls (b), unexposed cases (c), and unexposed controls (d).

10. Positive skewness indicates that:

Positive (right) skewness means the distribution has a longer or fatter right tail. The mean is pulled toward the extreme values, so it exceeds the median. This pattern is common in epidemiologic data such as hospital length of stay or healthcare costs.

11. What is the purpose of a Q-Q plot?

A Q-Q plot compares the quantiles of the observed data against the quantiles of a theoretical distribution. If data follow the specified distribution, points will fall approximately along the diagonal reference line.

12. When applying a natural log transformation to a variable that contains zero values, the appropriate approach is to:

Since ln(0) is undefined, a common practice is to add a small constant (typically 1) to all values before applying the log transformation. This preserves the zero observations in the analysis while allowing the transformation to be applied.

13. In a “Table 1” for an epidemiologic study, categorical variables are typically summarized as:

Categorical variables in Table 1 are reported as the count (n) and percentage (%) within each category. This provides both the absolute number and the relative frequency, allowing readers to assess both the distribution and sample size.

14. Which statement about the z-score method for outlier detection is correct?

The z-score method relies on the mean and standard deviation, both of which are influenced by extreme values. In skewed distributions, outliers pull the mean and inflate the SD, which can mask other outliers (the masking effect) or fail to flag truly unusual values.

15. An audit trail for data cleaning should include all of the following EXCEPT:

An audit trail documents data cleaning decisions: what was changed, why, by whom, when, and what the original values were. Regression model coefficients are analysis outputs, not data cleaning documentation.

Lesson 2 Complete!

You have successfully completed Data Cleaning and Descriptive Analyses. Your responses have been downloaded.

Lesson 3 — Linear Regression — takes the next step from description to inference. The continuous outcomes you summarised here become the dependent variable; the predictors you cleaned and processed in Lessons 1–2 enter as independent variables. The regression framework that begins in Lesson 3 will reappear, in different forms, throughout the remainder of HSCI 410.