Data Cleaning & Descriptive Analyses

Exploratory Data Analysis For Epidemiology

Learning objectives for this lesson:

Identify common types of data errors and their sources in the data pipeline
Distinguish between missing data mechanisms (MCAR, MAR, MNAR) and their analytic implications
Apply systematic data cleaning strategies including range checks, outlier detection, and consistency verification
Select and implement appropriate methods for handling missing data
Calculate and interpret measures of central tendency, spread, and shape
Construct and interpret descriptive summaries appropriate for epidemiologic research
Assess distributional assumptions and select appropriate visualization techniques

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University, and is based on Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Tidy Data A data layout where each variable is a column, each observation is a row, and each cell holds a single value. Coined by Hadley Wickham; the foundation of the tidyverse workflow.

Data Pipeline The full sequence of steps that takes raw data through cleaning, transformation, and analysis to a final result. Errors can enter at any stage; tracking them is essential.

Codebook / Data Dictionary Documentation that describes every variable in a dataset (name, type, allowable values, units, coding, source). Required for reproducible analysis.

Outlier An observation that lies far from the rest of the distribution. May reflect a genuine extreme value, a measurement error, or a data-entry mistake; investigate before removing.

Range Check Verifying that values for a variable fall within plausible biological or logical limits (e.g., human age between 0 and 120). A first-pass tool for spotting impossible entries.

Frequency Table A tabular summary showing the count (and often the percentage) of each level of a categorical variable. The simplest descriptive tool for nominal/ordinal data.

Missingness Patterns of missing values in a dataset. Classified by mechanism (MCAR/MAR/MNAR); the mechanism governs whether complete-case analysis or imputation is appropriate.

Type Conversion (Coercion) Changing a variable's data type (e.g., character to numeric, factor to integer). Mistyped variables are a common source of analysis errors in R.

Derived Variable A new variable computed from existing ones (e.g., BMI from height and weight, age categories from age). Document the formula in the data dictionary.

Methods & Statistical Concepts

Mean The arithmetic average. Sensitive to outliers; appropriate for symmetric continuous data.

Median The 50th percentile, the value that splits the distribution in half. Robust to outliers and skew; preferred for non-normal continuous data.

Standard Deviation (SD) A measure of spread equal to the square root of the variance. Reported alongside the mean for symmetric data.

Interquartile Range (IQR) The difference between the 75th and 25th percentiles (Q3 − Q1). A robust measure of spread; reported with the median.

Q-Q Plot A quantile-quantile plot comparing observed data to a theoretical distribution (often the normal). Points falling on the diagonal suggest the distribution fits.

Histogram A bar chart of binned counts that displays the empirical distribution of a continuous variable. Useful for spotting skew, modality, and outliers.

Box Plot A graphical summary showing the median, IQR (box), whiskers (typically reaching the most extreme observation within 1.5×IQR of the box), and outlier points. Compact for comparing groups.

Skewness Asymmetry of a distribution. Positive (right) skew has a long right tail; negative (left) skew has a long left tail. Influences choice of mean vs. median.

Multiple Imputation A principled method for handling missing data: generate multiple plausible values per missing cell, analyse each completed dataset, and pool results using Rubin's rules.

dplyr A tidyverse R package providing verbs (filter, select, mutate, summarize, group_by, arrange) for grammar-of-data-manipulation workflows.

janitor An R package for routine data-cleaning tasks: standardising column names (clean_names), tabyls, removing empty rows/columns, and finding duplicates.

ggplot2 A tidyverse R package implementing the grammar of graphics. Supports descriptive plots (histograms, box plots, scatterplots) used throughout exploratory analysis.

Key People

John Tukey (1915–2000) American statistician who coined the term “exploratory data analysis” (EDA) and invented the box plot, stem-and-leaf display, and many tools for visual data inspection.

Hadley Wickham (1979– ) New Zealand statistician and chief scientist at Posit (RStudio). Created ggplot2, dplyr, tidyr, and the broader tidyverse; defined “tidy data”.


Section 2

Data Cleaning Strategies

⏱ Estimated time: 20 minutes


Systematic verification, outlier triage, missing-data methods, transformations, and the audit trail that ties it all together.

Verification

Three types of systematic checks

Range checks

Hard limits (impossible) and soft limits (unusual but possible). Define from the codebook before you look at the data.

Consistency checks

Cross-field logic: discharge after admission, age matches date of birth, skip patterns respected.

Cross-validation

Compare across data sources. Establish a reliability hierarchy when sources disagree.

Outliers

Three ways to find them; one rule for all of them

IQR method

\[ \color{#0B7B6B}{\text{Low}} = \color{#C2410C}{Q_1} - 1.5 \times IQR \quad \color{#1D4ED8}{\text{High}} = \color{#BE185D}{Q_3} + 1.5 \times IQR \]

where Q1 = 25th percentile, Q3 = 75th percentile, and Low / High = outlier fences.

Z-score

\[ \color{#0B7B6B}{z} = \frac{\color{#C2410C}{x} - \color{#6D28D9}{\bar{x}}}{\color{#1D4ED8}{s}} \quad \text{flag if } |z| > 3 \]

where z = standardized score, x = observed value, x̄ = sample mean, and s = standard deviation.

Visual inspection (box plots, histograms, scatter plots) should accompany any quantitative rule.

The rule: investigate before you delete. An extreme value may be a genuine clinical finding or a data-entry shift of one decimal place.

Tukey (1977): look at data before you model it.

Missing data

Methods ranked by how well they handle uncertainty

Listwise deletion

Simple but biased unless MCAR. Large cumulative losses in multivariable models.

Pairwise deletion

Uses all available cases per calculation. Preserves data but results may not fit together coherently.

Single imputation

Mean / median / mode. Underestimates variance; standard errors too small and confidence intervals too narrow.

Multiple imputation

Gold standard (Sterne et al., 2009). Generates 5–20 plausible datasets; pools results with Rubin’s rules.

Transformations

Addressing skew without discarding data

Transformation	Formula	When to use
Log (natural)	$\ln(x)$ or $\ln(x+1)$	Right-skewed; multiplicative relationships
Square root	$\sqrt{x}$	Count data; moderate skew
Reciprocal	$1/x$	Strongly skewed rates and times

Back-transform all results for reporting. A coefficient from a log-transformed outcome is multiplicative, not additive.

Recoding & strings

New variables from old; order from messy text

Recoding

Collapse categories (current + former → ever smoker); band continuous variables (BMI groups); derive quantities (person-years, age at onset). Every recode goes in the dictionary.

String cleaning

“Vancouver”, “vancouver”, “VANCOOUVER” → one standard value. Consistent case, trimmed whitespace, pattern rules, lookup tables, fuzzy matching.

Documentation

The audit trail is non-negotiable

Every cleaning decision needs to record:

What changed (original value and replacement)
Why (biological implausibility, logical contradiction, etc.)
Who and when

Reproducibility means another analyst can rebuild the cleaned dataset from the raw file using only your documentation.

Do not edit values by hand in a spreadsheet. Write a script. The script is the audit trail (Wilkinson et al., 1999).

Introduction and Overview

An earlier section named the kinds of errors that can enter a dataset. This section provides the cleaning workflow that addresses them: systematic verification, outlier identification, missing-data handling, transformations, recoding, and string cleaning. Each step has its own techniques and its own decisions to document.

Learning Objectives

Run range, type, and consistency checks as a first systematic verification pass.
Identify outliers and decide whether to keep, transform, or exclude them, with reasons recorded.
Choose an appropriate missing-data strategy (complete-case, single imputation, multiple imputation) given the mechanism.
Apply transformations and derive new variables without losing the audit trail back to the raw data.
Document every cleaning decision so that another analyst could reproduce your final dataset.

Systematic Verification

Data cleaning should follow a systematic, documented process. The goal is not to “fix” data arbitrarily but to identify and resolve genuine errors while preserving the integrity of valid observations.

Range Checks

Range checks verify that values fall within predefined valid or plausible bounds. For example, gestational age should be between approximately 20 and 44 weeks; human body temperature typically ranges from 35°C to 42°C.

Define both hard limits (biologically impossible values that should be flagged as errors) and soft limits (unusual but possible values that should be reviewed).
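As a minimal sketch of the hard-limit and soft-limit idea, the base R snippet below classifies invented gestational-age values. The soft limits follow the 20 to 44 week range mentioned above; the hard limits (10 and 50 weeks) and the data are assumptions chosen only for illustration, and in practice both sets of limits would come from the codebook.

```r
# Hypothetical gestational ages in weeks (values invented for illustration)
ga_weeks <- c(39, 41, 12, 38, 45, 40, 400, 36)

# Assumed limits; real limits should be taken from the codebook
hard_min <- 10; hard_max <- 50   # outside this range: treat as an error
soft_min <- 20; soft_max <- 44   # outside this range: unusual, needs review

# Classify each value: hard-limit violations first, then soft-limit violations
status <- ifelse(ga_weeks < hard_min | ga_weeks > hard_max, "error",
          ifelse(ga_weeks < soft_min | ga_weeks > soft_max, "review", "ok"))

data.frame(ga_weeks, status)   # value-by-value listing
table(status)                  # how many values fall in each class
```

Keeping "error" and "review" as separate labels means the unusual-but-possible values are queued for a human decision rather than silently overwritten.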

Consistency Checks

Consistency checks examine whether values across multiple fields are logically compatible. Examples include verifying that discharge dates occur after admission dates, that a participant’s age is consistent with their reported date of birth, and that skip patterns in questionnaires are respected.

These checks often require domain knowledge about the relationships between variables.
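The two cross-field checks named above (discharge after admission, age consistent with date of birth) could be sketched in base R as follows. The data frame, its column names, and the one-year tolerance are all hypothetical.

```r
# Hypothetical admission records (invented for illustration)
visits <- data.frame(
  id        = 1:4,
  dob       = as.Date(c("1980-03-15", "1975-07-01", "1990-11-30", "1962-01-20")),
  admit     = as.Date(c("2024-01-10", "2024-02-05", "2024-03-12", "2024-04-01")),
  discharge = as.Date(c("2024-01-14", "2024-02-01", "2024-03-15", "2024-04-09")),
  age       = c(43, 48, 33, 52)
)

# Check 1: discharge must not precede admission
visits$bad_dates <- visits$discharge < visits$admit

# Check 2: reported age should be within 1 year (assumed tolerance)
# of the age implied by date of birth at admission
age_from_dob <- as.numeric(difftime(visits$admit, visits$dob, units = "days")) / 365.25
visits$bad_age <- abs(visits$age - age_from_dob) > 1

# Rows that fail at least one check, for manual review
visits[visits$bad_dates | visits$bad_age, ]
```

Here record 2 fails the date check and record 4 fails the age check; both are flagged rather than changed, since the check alone cannot say which field is wrong.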

Cross-Validation

Cross-validation involves comparing data from multiple sources to verify accuracy. For example, comparing self-reported medication use with pharmacy dispensing records, or verifying diagnoses against medical chart reviews.

When discrepancies arise, establishing a hierarchy of data source reliability is essential for deciding which value to retain.

Identifying Outliers

Outliers are observations that lie far from the bulk of the data. They may represent genuine extreme values, data errors, or observations from a different population. The visual tools used to spot them, including box plots, stem-and-leaf displays, and the broader practice of looking at data before modelling, trace back to Tukey (1977). Several methods are used to identify them:

The interquartile range (IQR) method defines outliers as values falling below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the 25th and 75th percentiles, respectively.

IQR outlier bounds

\[ \color{#0B7B6B}{\text{Lower}} = \color{#C2410C}{Q_1} - 1.5 \times \color{#6D28D9}{IQR} \qquad \color{#1D4ED8}{\text{Upper}} = \color{#BE185D}{Q_3} + 1.5 \times \color{#6D28D9}{IQR} \]

An observation is an outlier if it falls below the lower bound (the 25th percentile minus 1.5 times the interquartile range) or above the upper bound (the 75th percentile plus 1.5 times that same range).

This method is robust to extreme values because it is based on percentiles rather than the mean and standard deviation. Values beyond 3×IQR from the quartiles are sometimes termed “extreme outliers.”
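A short base R sketch of the IQR fences, using invented systolic blood pressure values. Note that quantile() uses R's default interpolation rule (type 7), so other software may give slightly different quartiles for the same data.

```r
# Hypothetical systolic blood pressures in mmHg (invented for illustration)
sbp <- c(112, 118, 121, 124, 126, 129, 131, 135, 138, 142, 198)

q   <- quantile(sbp, c(0.25, 0.75))   # Q1 and Q3 (default type 7)
iqr <- q[[2]] - q[[1]]                # interquartile range

lower <- q[[1]] - 1.5 * iqr           # lower fence
upper <- q[[2]] + 1.5 * iqr           # upper fence

flagged <- sbp[sbp < lower | sbp > upper]
flagged                               # values to investigate, not to delete
```

With these values Q1 = 122.5 and Q3 = 136.5, so the fences sit at 101.5 and 157.5 and only the reading of 198 is flagged.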

The z-score method standardizes each observation as z = (x − mean) / SD. Values with |z| > 3 are commonly flagged as potential outliers.

Z-score

\[ \color{#0B7B6B}{z} = \frac{\color{#C2410C}{x} - \color{#6D28D9}{\bar{x}}}{\color{#1D4ED8}{s}} \]

The z-score is the distance of a value from the mean, in units of the standard deviation. Values with an absolute z-score above 3 are commonly flagged.

This approach assumes an approximately normal distribution and can be misleading for heavily skewed data because the mean and SD are themselves influenced by outliers (the masking effect).
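The masking effect can be seen in a small invented example: one gross error inflates the mean and SD so much that its own z-score stays below 3, while the IQR fences still catch it. The data below are hypothetical.

```r
# Nine plausible systolic readings plus one decimal-shift error (invented data)
x <- c(118, 120, 121, 122, 124, 125, 126, 128, 130, 1200)

# z-scores computed from the contaminated mean and SD
z <- (x - mean(x)) / sd(x)
max(abs(z))            # stays below 3, so the |z| > 3 rule flags nothing

# IQR fences on the same data
q     <- quantile(x, c(0.25, 0.75))
upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
x[x > upper]           # the value 1200 is flagged
```

In a sample this small the |z| > 3 rule cannot fire at all, which is one reason to pair it with a percentile-based rule and a plot.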

Visual inspection using boxplots, histograms, and scatterplots is often the most informative first step in outlier detection.

Boxplots display the median, IQR, and individual outliers beyond the whiskers
Histograms reveal gaps in the distribution that may indicate erroneous values
Scatterplots can reveal bivariate outliers that appear normal when each variable is examined individually

Visual methods should complement, not replace, quantitative rules. Always investigate outliers before deciding to exclude them.

Practical Tip

Never delete an outlier simply because it is extreme. Investigate whether it represents a data error, a genuinely unusual observation, or a member of a different subpopulation. Document every decision to retain or remove an outlier in your audit trail.

Figure: Box plot of systolic blood pressure with the IQR outlier fences marked; five points fall beyond the fences and are flagged as outliers. The box spans the interquartile range; the dashed lines are the lower and upper fences (Q1 − 1.5×IQR and Q3 + 1.5×IQR). Points beyond them, shown in orange, are flagged for investigation, not automatic deletion.

Handling Missing Data

The choice of method for handling missing data depends on the assumed missing data mechanism and the proportion of missingness. Practical guidance on implementing multiple imputation via chained equations, including how to specify the imputation model, how many imputations to draw, and how to diagnose problems, is given by White, Royston, & Wood (2011).

Listwise Deletion (Complete-Case Analysis)

Excludes any observation with one or more missing values. Simple to implement, but can dramatically reduce sample size and introduces bias unless data are MCAR. In multivariable models with many variables, even low rates of missingness on individual variables can lead to large cumulative losses.

Pairwise Deletion (Available-Case Analysis)

Uses all available data for each analysis. For example, when computing a correlation matrix, each pair of variables uses all cases with non-missing values for that pair. This preserves more data than listwise deletion, but can produce inconsistent results (e.g., a correlation matrix that is not positive semi-definite).

Single Imputation (Mean, Median, Mode)

Mean imputation replaces missing values with the variable mean. While simple, it underestimates variance and distorts relationships between variables. Median imputation is more robust for skewed distributions. Mode imputation is used for categorical variables.

All single imputation methods treat the imputed value as if it were known with certainty, leading to underestimated standard errors and overly narrow confidence intervals.
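To make the variance problem concrete, the sketch below simulates a variable, deletes 30% of its values completely at random, and fills the gaps with the observed mean. The simulated data and the seed are arbitrary choices for illustration.

```r
set.seed(410)                            # arbitrary seed, for reproducibility only
x <- rnorm(200, mean = 27, sd = 5)       # simulated BMI-like values

x_miss <- x
x_miss[sample(200, 60)] <- NA            # 60 of 200 values set to missing

# Mean imputation: every missing cell receives the observed mean
x_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)

sd_observed <- sd(x_miss, na.rm = TRUE)  # SD of the 140 observed values
sd_imputed  <- sd(x_imp)                 # SD after imputation

c(observed = sd_observed, imputed = sd_imputed)
```

The imputed values add nothing to the sum of squared deviations but do increase n, so the SD shrinks by a factor of sqrt(139 / 199) here, whatever the seed.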

Multiple Imputation (Introduction)

Multiple imputation creates several (typically 5–20) plausible imputed datasets, analyzes each separately, and pools the results using Rubin’s rules. This approach correctly accounts for the uncertainty due to missing data and produces valid inference under the MAR assumption.

Multiple imputation is currently considered the gold standard for handling missing data in epidemiologic research when the proportion of missingness is non-trivial (Sterne et al., 2009).
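The pooling step can be written out by hand to show what Rubin's rules do. The five estimates and standard errors below are invented numbers standing in for the same coefficient fitted on five imputed datasets; in practice a package such as mice would generate the imputations and do the pooling.

```r
# Hypothetical results from m = 5 imputed datasets (numbers invented)
est <- c(0.42, 0.47, 0.39, 0.45, 0.44)   # e.g. a regression coefficient
se  <- c(0.11, 0.12, 0.11, 0.12, 0.11)   # its standard error in each dataset

m       <- length(est)
pooled  <- mean(est)                     # pooled point estimate
within  <- mean(se^2)                    # average within-imputation variance
between <- var(est)                      # between-imputation variance
total   <- within + (1 + 1 / m) * between

pooled_se <- sqrt(total)
c(estimate = pooled, se = pooled_se)
```

The between-imputation term is what single imputation leaves out: it is the extra uncertainty that comes from not knowing the missing values.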

Data Transformations

When continuous variables are heavily skewed, transformations can make the distribution more symmetric, stabilize variance, and improve the validity of parametric methods.

Figure: Two histograms: a right-skewed biomarker distribution on the left, and the same data after a natural-log transform on the right, now approximately symmetric. A natural-log transform pulls a right-skewed variable toward symmetry, which often improves the fit of methods that assume approximate normality. Interpretation then shifts to the multiplicative scale.

Transformation	Formula	When to Use
Log (natural)	ln(x) or ln(x + 1)	Right-skewed data, multiplicative relationships (e.g., biomarker concentrations, income)
Square root	√x	Count data, moderately right-skewed distributions
Inverse (reciprocal)	1/x	Strongly right-skewed data; commonly used for rates and times

Remember

When you transform a variable, all subsequent interpretations must account for the transformation. For example, a regression coefficient from a log-transformed outcome represents a multiplicative (rather than additive) change. Always back-transform results for reporting.
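A brief sketch of what back-transformation means in practice, using toy values and a hypothetical coefficient of 0.10 on the log scale.

```r
# Toy right-skewed values (invented for illustration)
x <- c(2, 4, 8, 16, 32)

mean_log  <- mean(log(x))    # mean on the natural-log scale
geo_mean  <- exp(mean_log)   # back-transformed: the geometric mean
arith_mean <- mean(x)        # arithmetic mean, pulled up by the right tail

c(geometric = geo_mean, arithmetic = arith_mean)

# A hypothetical coefficient from a model with a log-transformed outcome
beta <- 0.10
pct_change <- (exp(beta) - 1) * 100   # percent change in the outcome per unit
pct_change
```

Back-transforming the log-scale mean gives the geometric mean (8 here), not the arithmetic mean (12.4), and a coefficient of 0.10 corresponds to roughly a 10.5% increase rather than an additive change of 0.10 units.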

Recoding and Derived Variables

Recoding involves creating new variables from existing ones. Common examples include collapsing categories (e.g., combining “current smoker” and “former smoker” into “ever smoker”), categorizing continuous variables into clinically meaningful groups (e.g., BMI categories), and computing derived variables such as person-years of follow-up or age at onset.
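The two recodes mentioned above might look like this in base R. The data frame and variable names are hypothetical, and the BMI cut-points (18.5, 25, 30) are the conventional adult categories, stated here as an assumption about what the data dictionary would specify.

```r
# Hypothetical participants (invented for illustration)
d <- data.frame(
  smoking = c("never", "current", "former", "never"),
  bmi     = c(17.9, 24.2, 27.5, 33.0)
)

# Collapse categories: current + former -> ever smoker
d$ever_smoker <- d$smoking %in% c("current", "former")

# Band a continuous variable; right = FALSE makes intervals [lower, upper)
d$bmi_cat <- cut(d$bmi,
                 breaks = c(-Inf, 18.5, 25, 30, Inf),
                 labels = c("Underweight", "Normal", "Overweight", "Obese"),
                 right  = FALSE)

d   # original columns are kept alongside the derived ones
```

Writing the recode into new columns, rather than over the originals, keeps the path back to the raw values open.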

String Cleaning and Standardization

Free-text fields often contain inconsistencies: “Vancouver,” “vancouver,” “VANCOUVER,” and “Vancoouver” may all refer to the same city. Standardization strategies include converting to consistent case, trimming whitespace, applying regular expressions for pattern matching, and using lookup tables or fuzzy matching algorithms.
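A minimal base R sketch of these strategies applied to the city example: case and whitespace are normalised first, then each entry is matched to the nearest value in a small lookup list using edit distance. The lookup list and the maximum distance of 2 are assumptions; a real threshold needs checking against the data.

```r
# Hypothetical free-text entries (invented for illustration)
city_raw <- c("Vancouver", " vancouver ", "VANCOUVER", "Vancoouver", "Burnaby")

# Step 1: consistent case and trimmed whitespace
city <- tolower(trimws(city_raw))

# Step 2: fuzzy match against an assumed list of valid values
canonical <- c("vancouver", "burnaby")
dist_mat  <- adist(city, canonical)        # edit distance to each valid value
best      <- apply(dist_mat, 1, which.min) # index of the closest valid value
best_dist <- dist_mat[cbind(seq_along(city), best)]

# Accept the match only if it is close enough (assumed threshold of 2 edits)
city_clean <- ifelse(best_dist <= 2, canonical[best], NA)

data.frame(city_raw, city_clean)
```

Entries that are not close to any valid value come back as NA, so they surface for manual review instead of being forced into the wrong category.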

Documenting the Cleaning Process

Every data cleaning decision should be recorded in an audit trail. This includes what was changed, why, by whom, and when. Reproducibility requires that another analyst could replicate the cleaning process from the original raw data using your documentation, an expectation echoed in the Statistical Methods in Psychology Journals guidelines (Wilkinson & the Task Force on Statistical Inference, 1999).
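One way to capture what, why, who, and when inside the cleaning script itself is to write each change to a log table before applying it. The sketch below is one possible layout, with invented data, a hypothetical rule, and placeholder analyst initials.

```r
# Hypothetical raw data (invented); the raw object is never modified
raw   <- data.frame(id = 1:5, sbp = c(120, 1200, 135, 310, 128))
clean <- raw

# Rule (assumed for illustration): systolic BP above 300 mmHg is implausible
idx <- which(clean$sbp > 300)

# Record the change before making it
change_log <- data.frame(
  id       = clean$id[idx],
  variable = "sbp",
  old      = clean$sbp[idx],
  new      = NA_real_,
  reason   = "Implausible: systolic BP > 300 mmHg",
  by       = "ABC",                        # placeholder analyst initials
  date     = as.character(Sys.Date())
)

clean$sbp[idx] <- NA                       # apply the change

change_log
```

Because the log is built from the same index that drives the edit, the record and the change cannot drift apart.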

R Activity: read, fix classes, and clean impossible values

This block uses the course dataset phaa_survey.csv, a simulated public-health survey of ~800 adults that we will reuse from this lesson onward. The full annotated script lives in r-activities/HSCI_410_Lesson_2_Data_Cleaning_and_Descriptive_Analyses.R.

# 1. Read the raw file. Treat blank cells as missing (na.strings).
phaa <- read.csv(file = "phaa_survey.csv",
                 stringsAsFactors = FALSE,
                 na.strings = c("", "NA"))

# 2. Convert variables that should be numeric (some came in as character
#    because the raw data contained non-numeric entries such as "unknown").
#    Entries that cannot be parsed become NA, with a coercion warning.
phaa$age <- as.numeric(phaa$age)
phaa$bmi <- as.numeric(phaa$bmi)

# 3. Range checks: replace impossible values with NA
phaa$age[phaa$age < 18 | phaa$age > 100] <- NA
phaa$systolic_bp[phaa$systolic_bp < 80 | phaa$systolic_bp > 220] <- NA
phaa$bmi[phaa$bmi < 13 | phaa$bmi > 60] <- NA

# 4. Make education an ordered factor so that comparisons (<= ...) are meaningful.
#    (Income follows the same pattern; its levels are not shown here.)
phaa$education <- factor(phaa$education,
  levels = c("Less than high school", "High school", "Some college",
             "Bachelor's", "Graduate degree"),
  ordered = TRUE)

Why never overwrite the raw file? Range-check rules will change as you learn more about the data. Keeping phaa_survey.csv untouched lets you replay the cleaning from a known starting state.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. Look at summary(phaa$age) before and after the line phaa$age <- as.numeric(phaa$age). How many NAs appear, and what does that count tell you about the raw data quality?

Model answer: Before as.numeric(), age is character, so summary() reports only its length, class, and mode, not quartiles or an NA count. After conversion, summary() reports several NAs, typically 5–30 in the simulated dataset, representing entries that couldn't be parsed as numeric (text such as "unknown", "30s", or blank). The NA count is a direct measure of data-quality flaws in the raw entry: each NA is a hidden free-text response that the original data collection accepted but cannot be analysed quantitatively.

2. After the range-check on systolic_bp, how many rows were set to NA? Why is replacing impossible values with NA preferable to deleting those rows entirely?

Model answer: The range check (e.g., 80–220 mmHg) typically flags 10–30 rows as out-of-range and sets them to NA. Replacing with NA is preferable to deletion because (a) it preserves the row for other variables (you don't lose 30 observations to delete one bad blood pressure); (b) it documents in the data itself that this value was flagged; (c) the missingness then enters multiple-imputation pipelines like any other NA, allowing principled handling. Deleting rows loses information silently and can introduce selection bias if implausible values cluster non-randomly.

3. After making education an ordered factor, what does levels(phaa$education) show? Why does the ordering matter for any analysis that asks "as education goes up, does X change?"

Model answer: levels(phaa$education) shows the levels in the specified order: "Less than high school" < "High school" < "Some college" < "Bachelor's" < "Graduate degree". Ordering matters because an ordered factor lets R interpret "higher education" as a meaningful direction in regression and tests, so linear contrasts become interpretable, and trend tests (Cochran-Armitage, ordinal regression) work as intended. Without explicit levels, R sorts them alphabetically ("Bachelor's", "Graduate degree", "High school", "Less than high school", "Some college"), which is nonsensical for inferential purposes.


Applied Example: Audit Trail

A cohort study of cardiovascular disease collected blood pressure measurements on 2,400 participants. During cleaning, 15 systolic values exceeded 300 mmHg. The analyst traced 12 of these to a decimal-point shift (e.g., 1,200 instead of 120.0) and corrected them using the original paper forms. Three could not be verified and were set to missing with a note: “Original form illegible; value implausible.” All decisions were logged in a cleaning script with date stamps.

Reflection

Think about a dataset you have worked with (or imagine a large epidemiologic study). What data cleaning challenges would you expect, and how would you prioritize your cleaning steps? What would your audit trail include?

Model answer: For a large epidemiologic dataset (e.g., a multi-site cohort), expected challenges: heterogeneous data entry across sites, free-text encoded in numeric fields, inconsistent missing-data codes (-9, 999, blank), date-format collisions (US vs. ISO), encoding mis-translations from older systems, and outcome adjudication that wasn't blinded. Priority order: (1) inventory and codebook the raw data; (2) range and consistency checks per variable; (3) cross-variable plausibility (impossible combinations); (4) standardise dates, units, and missing codes; (5) de-duplication; (6) derive analytic variables; (7) document. Audit trail: every step logged in a versioned script, intermediate datasets saved with version numbers, decision log (e.g., "impossible BMI > 80 set to NA on 2026-05-16, n = 12 rows affected"), and the final cleaned dataset accompanied by an HTML diagnostic report and a codebook auto-generated from the cleaning script.


Section 3

Descriptive Analyses

⏱ Estimated time: 25 minutes


Measures of central tendency, spread, and shape; visualisations; scale scores; Table 1; normality assessment.

Central tendency

Three measures; one rule for choosing

Arithmetic mean

\[ \color{#0B7B6B}{\bar{x}} = \frac{1}{n} \sum_{i=1}^{n} \color{#1D4ED8}{x_i} \]

where x̄ = mean, x_i = each observed value, and n = number of observations.

Mean

Uses all data; sensitive to skew and outliers. Report with SD for normal data.

Median

Resistant to extremes. Report with IQR for skewed data or when outliers are present.

Mode is appropriate for nominal categorical data where neither mean nor median is meaningful.

Spread and shape

How far and how symmetric

Sample standard deviation

\[ \color{#0B7B6B}{s} = \sqrt{\frac{\sum_{i=1}^{n}(\color{#1D4ED8}{x_i} - \color{#6D28D9}{\bar{x}})^2}{\color{#C2410C}{n}-1}} \quad \color{#047857}{\text{IQR}} = Q_3 - Q_1 \]

where s = standard deviation, x_i = each value, x̄ = mean, n = sample size, and IQR = interquartile range.

Categorical data

Cross-tabulations and the odds ratio

Odds ratio from a 2×2 table

\[ \color{#0B7B6B}{OR} = \frac{\color{#C2410C}{a}\,\color{#6D28D9}{d}}{\color{#1D4ED8}{b}\,\color{#BE185D}{c}} \]

where OR = odds ratio, a = exposed cases, b = exposed non-cases, c = unexposed cases, and d = unexposed non-cases.

	Cases	Controls
Exposed	45 (a)	30 (b)
Unexposed	15 (c)	60 (d)

OR = (45 × 60) / (30 × 15) = 6.0. A strong association, visible from the descriptive table alone.

Visualisation

Match the plot to the question

Histograms

Distribution of a continuous variable: shape, modality, gaps, outliers.

Boxplots

Median + IQR + outlier points. Best for comparing groups side by side.

Bar charts

Frequencies for categorical variables. Not for continuous data.

Scatter / line

Continuous relationships and trends over time.

Building scales

From questionnaire items to a score

Internal consistency

Cronbach's α: ≥ 0.70 acceptable, ≥ 0.80 good. The seven depression items in the course data reach α ≈ 0.89.

Dimensionality

Exploratory factor analysis + parallel analysis. Loadings ≥ ~0.40 contribute meaningfully; depression and anxiety items load on separate factors.

Both checks pass → sum the items into a derived score (dep_score) and carry it into later lessons.

Table 1

The publication-ready summary of your cohort

Characteristic	Exposed	Unexposed	p
Age, mean ± SD	52.3 ± 11.4	49.8 ± 12.1	0.02
Female, n (%)	110 (55%)	162 (54%)	0.82
BMI, median (IQR)	27.1 (24.0–31.5)	25.8 (23.2–29.4)	0.004

Normal variables: mean ± SD. Skewed variables: median (IQR). Categorical: n (%). Missing-data counts belong in the table too.

Normality

Is it close enough to normal?

Visual methods

Q-Q plot: points along the diagonal suggest normality; systematic departures signal skew or heavy tails. Histogram with normal overlay for a quick read.

Formal tests

Shapiro-Wilk (preferred for n < 5,000); Kolmogorov-Smirnov as an alternative. In very large samples they flag trivial departures, so pair them with a plot.

Approximately normal → t-tests, ANOVA, Pearson. Otherwise → transform, or use rank-based tests (Wilcoxon rank-sum, Kruskal-Wallis, Spearman).

Carry forward

What this lesson set up

Choose central tendency and spread based on variable type and distributional shape.
A well-stratified Table 1 is often the most informative single output of an analysis.
Normality assessment should combine visual tools (Q-Q plot) with formal tests; neither alone is sufficient.
The clean, described dataset is the foundation on which every regression in later lessons rests.

Introduction and Overview

Earlier sections produced a clean, well-documented dataset. This section turns that dataset into the descriptive summaries that any analysis report opens with: means and medians, spreads and shapes, frequency distributions, “Table 1” characteristics by exposure group, and the visualisations that go with them. Doing this carefully is what lets readers understand what your data look like before they trust your inferential models.

Learning Objectives

Choose the appropriate measure of central tendency, spread, and shape for each variable type and distribution.
Build frequency distributions and cross-tabulations to summarise categorical data.
Select visualisations that match the variable type and the question you are asking.
Construct a publication-ready “Table 1” of participant characteristics, including stratified comparisons.
Assess normality and decide when a transformation or non-parametric approach is warranted.

Measures of Central Tendency

Measures of central tendency summarize the “typical” value in a distribution. The choice depends on the variable type and distributional shape.

The arithmetic mean (x̄) is the sum of all values divided by the number of observations. It uses all data points and is the basis of many parametric statistical methods.

Arithmetic mean

\[ \color{#0B7B6B}{\bar{x}} = \frac{1}{\color{#C2410C}{n}} \sum_{i=1}^{\color{#C2410C}{n}} \color{#1D4ED8}{x_i} \]

The mean is the sum of all observed values divided by the number of observations.

When to use: Symmetric or approximately normal distributions. Avoid for heavily skewed data or when outliers are present, as the mean is pulled toward extreme values.

The median is the middle value when observations are ordered from smallest to largest. For an even number of observations, it is the average of the two middle values.

When to use: Skewed distributions, ordinal data, or when outliers are present. The median is robust to extreme values; unlike the mean, adding or removing an outlier has minimal effect on the median.
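A quick illustration with invented values: adding one extreme observation moves the mean substantially and the median hardly at all.

```r
# Toy values (invented for illustration), e.g. length of stay in days
los <- c(3, 4, 5, 6, 7)
c(mean = mean(los), median = median(los))               # both equal 5

# Add a single extreme value
los_out <- c(los, 60)
c(mean = mean(los_out), median = median(los_out))       # mean jumps, median barely moves
```

The mean rises from 5 to about 14.2, while the median shifts only from 5 to 5.5.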

The mode is the most frequently occurring value. A distribution may be unimodal, bimodal, or multimodal.

When to use: Categorical (nominal) data, where neither the mean nor the median is meaningful. It is also useful for identifying common response values in continuous data (e.g., digit preference).

Measures of Spread

Spread (or dispersion) describes how much variability exists in the data around the central value.

Measure	Formula / Definition	Properties
Range	Maximum − Minimum	Sensitive to outliers; uses only 2 data points
Variance (s²)	Σ(x_i − x̄)² / (n − 1)	Average squared deviation; units are squared
Standard Deviation (s)	√s²	Same units as original variable; most commonly reported
IQR	Q3 − Q1	Robust to outliers; covers middle 50% of data

Two of these formulas often trip up beginners. The variance divides by n − 1 rather than n because the deviations are measured from the sample mean, not the unknown true mean; dividing by n would systematically understate the spread, and using the slightly smaller n − 1 corrects for it. The standard deviation is then simply its square root, which brings the value back from squared units (squared mmHg, for instance) to the original scale so it can sit beside the mean.
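The arithmetic can be checked directly on a handful of invented values; var() and sd() in R use the n − 1 divisor.

```r
# Toy systolic blood pressures in mmHg (invented for illustration)
x <- c(118, 124, 131, 127, 140)
n <- length(x)

ss <- sum((x - mean(x))^2)   # sum of squared deviations from the sample mean

var_n1 <- ss / (n - 1)       # sample variance (divides by n - 1)
var_n  <- ss / n             # dividing by n gives a smaller value
sd_n1  <- sqrt(var_n1)       # back to the original units (mmHg)

c(var_n1 = var_n1, var_n = var_n, sd = sd_n1)
c(var(x), sd(x))             # built-in functions agree with the n - 1 versions
```

Here the mean is 128 and the squared deviations sum to 270, giving a variance of 67.5 squared mmHg with n − 1 (54 with n) and an SD of about 8.2 mmHg.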

Reporting Convention

For normally distributed variables, report mean ± SD. For skewed distributions, report median (IQR) or median (Q1, Q3). This convention is standard in epidemiologic publications and should be followed in “Table 1” (Wilkinson & the Task Force on Statistical Inference, 1999).

Measures of Shape

Beyond central tendency and spread, the shape of a distribution provides critical information for choosing analytical methods.

Skewness

Skewness measures the asymmetry of a distribution. A skewness of 0 indicates perfect symmetry.

Positive (right) skew: Long right tail; mean > median. Common in variables like income, hospital length of stay, and biomarker concentrations.
Negative (left) skew: Long left tail; mean < median. Less common but seen in variables like age at death in developed countries.

As a rule of thumb, |skewness| > 1 indicates substantial skew, though this depends on sample size.

Kurtosis

Kurtosis describes the “tailedness” of a distribution relative to a normal distribution (which has a kurtosis of 3, or excess kurtosis of 0).

Leptokurtic (excess kurtosis > 0): Heavier tails, often with a sharper peak; more extreme values than expected under normality.
Platykurtic (excess kurtosis < 0): Lighter tails, often with a flatter peak; fewer extreme values than expected.

High kurtosis can indicate the presence of outliers or a mixture of subpopulations.
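Base R has no built-in skewness or kurtosis function, so the sketch below computes simple moment-based versions by hand. The function names are made up for this example, and add-on packages may apply small-sample corrections that give slightly different numbers.

```r
# Moment-based skewness: third central moment over (second central moment)^1.5
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over squared second, minus 3
excess_kurtosis_m <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Toy data (invented): one symmetric vector, one with a long right tail
sym  <- c(1, 2, 3, 4, 5)
skew <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)

c(skewness_m(sym),  excess_kurtosis_m(sym))    # zero skew, negative excess kurtosis
c(skewness_m(skew), excess_kurtosis_m(skew))   # strong right skew, heavy tail
```

The single value of 20 is enough to push skewness well above 1 and excess kurtosis above 0, which matches the point that high kurtosis can signal outliers.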

Frequency Distributions and Cross-Tabulations

For categorical variables, frequency distributions display the count and percentage of observations in each category. Cross-tabulations (contingency tables) display the joint distribution of two categorical variables and are the basis for computing measures of association such as odds ratios and risk ratios.

Example: Cross-Tabulation in an Outbreak Investigation

	Cases (n)	Controls (n)	Total
Exposed	45	30	75
Unexposed	15	60	75
Total	60	90	150

From this 2×2 table, the odds ratio = (45 × 60) / (30 × 15) = 2,700 / 450 = 6.0, suggesting a strong association between exposure and disease.
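The arithmetic above can be reproduced in R. This sketch enters the four cell counts from the example table as a matrix; the object names are arbitrary.

```r
# Cell counts from the outbreak example above
tab <- matrix(c(45, 30,
                15, 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(Exposure = c("Exposed", "Unexposed"),
                              Status   = c("Case", "Control")))

addmargins(tab)   # adds the row and column totals (75, 75, 60, 90, 150)

# Cross-product odds ratio: (a * d) / (b * c)
odds_ratio <- (tab["Exposed", "Case"] * tab["Unexposed", "Control"]) /
              (tab["Exposed", "Control"] * tab["Unexposed", "Case"])
odds_ratio        # 6

# Column percentages: exposure prevalence among cases and among controls
round(prop.table(tab, margin = 2) * 100, 1)
```

In a case-control layout the column percentages are the natural ones to read, because cases and controls are sampled separately and the exposure distribution is what is being compared.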

Visualization Techniques

Appropriate visualization depends on the variable type and the question being asked. Exploratory data analysis emphasises looking at data first so structure, gaps, and anomalies are visible before any model is fit.

Histograms display the distribution of a continuous variable by dividing values into bins and counting observations per bin. They reveal distributional shape, modality, gaps, and potential outliers. The choice of bin width affects interpretation: too few bins obscure patterns, too many create noise.
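To see the bin-width effect, the sketch below draws the same simulated variable three times. The data are a made-up mixture of two groups (an assumption chosen so that a second mode exists to be hidden or revealed). Note that a single number passed to breaks is only a suggestion; hist() rounds it to "pretty" cut points.

```r
set.seed(410)
# Simulated systolic BP: a main group plus a smaller, higher group (illustrative only)
sbp <- c(rnorm(300, mean = 120, sd = 10), rnorm(100, mean = 165, sd = 8))

op <- par(mfrow = c(1, 3))                      # three panels side by side
hist(sbp, breaks = 3,   main = "Too few bins",  xlab = "mmHg")
hist(sbp, breaks = 20,  main = "Moderate",      xlab = "mmHg")
hist(sbp, breaks = 200, main = "Too many bins", xlab = "mmHg")
par(op)                                         # restore plotting settings

# The counts behind a histogram can be inspected without drawing it
h <- hist(sbp, breaks = 20, plot = FALSE)
data.frame(bin_start = head(h$breaks, -1), count = h$counts)
```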

Boxplots display the median, the IQR (the box), whiskers (typically extending to the most extreme observation within 1.5×IQR of the box), and individual outliers beyond the whiskers. They are ideal for comparing distributions across groups (e.g., blood pressure by treatment arm) and for quickly identifying asymmetry and outliers.
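The 1.5×IQR rule can be computed directly. The sketch below uses twelve hypothetical readings, one of them deliberately extreme, and compares a hand calculation with boxplot.stats(), the function boxplot() relies on. The two can differ slightly at the margins because boxplot.stats() uses Tukey's hinges rather than the default quantile() definition.

```r
# Hypothetical systolic BP values (mmHg); 210 is a deliberately extreme reading
sbp <- c(108, 112, 115, 118, 120, 122, 125, 128, 131, 135, 140, 210)

q1  <- quantile(sbp, 0.25, names = FALSE)
q3  <- quantile(sbp, 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences are the points a boxplot draws individually
sbp[sbp < lower_fence | sbp > upper_fence]   # 210

boxplot.stats(sbp)$out                       # 210 (same point, hinge-based fences)
```

Flagging a value this way is a prompt to investigate it, not a reason to delete it.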

Bar charts display frequencies or proportions for categorical variables. Use them for comparing counts across groups. Avoid using bar charts for continuous data (use histograms instead) and avoid 3D bar charts, which distort proportions.

Scatter plots display the relationship between two continuous variables and can reveal linear or nonlinear associations, clusters, and outliers. Line graphs are appropriate for displaying trends over time (e.g., incidence rates by year). Both are essential in epidemiologic exploratory analysis.

R Activity: descriptives, plots, and a stratified Table 1

Continuing from the cleaning block above, calculate descriptives and produce three of the standard plots (histogram, boxplot, bar chart). Base R is enough; psych::describe() is convenient for many variables at once.

# Categorical descriptives ---------------------------------------------------
table(phaa$gender)                            # frequencies
prop.table(table(phaa$gender))                # proportions
round(prop.table(table(phaa$gender)) * 100, 1)  # %

# Cross-tab: gender by smoker, row percentages
round(prop.table(table(phaa$gender, phaa$smoker), margin = 1) * 100, 1)

# Numeric descriptives --------------------------------------------------------
summary(phaa$age)
sd(phaa$age, na.rm = TRUE)
IQR(phaa$age, na.rm = TRUE)

# psych::describe() summarises many variables at once
library(psych)
describe(phaa[, c("age", "bmi", "systolic_bp", "phys_act_min")])

# Plots: histogram, boxplot stratified by exposure, bar chart -----------------
hist(phaa$systolic_bp, main = "Systolic BP", xlab = "mmHg")

boxplot(systolic_bp ~ smoker, data = phaa,
        main = "Systolic BP by smoking status", ylab = "mmHg")

barplot(table(phaa$gender), main = "Gender distribution")

Tip: always plot before you fit a regression. A boxplot of systolic_bp by smoker tells you in a second whether the comparison you are about to do in a later lesson will be driven by a few extreme values or by a real shift in the centre of the distribution.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. From round(prop.table(table(phaa$gender, phaa$smoker), margin = 1) * 100, 1), which gender category has the highest percentage of current smokers? What is the magnitude of the gender gap in smoking?

Model answer: From the row-proportions table, the male category typically has the highest percentage of current smokers (often 18–25%), compared to ~10–15% in females. The gender gap is around 5–10 percentage points, consistent with national survey findings. Direction matters: this is a row-percentage table (margin=1), so each row sums to 100% within gender; reading across, you see smoking prevalence within each gender.

2. From describe(phaa[, c("age", "bmi", "systolic_bp", "phys_act_min")]), which variable has the largest skew? Which is most clearly approximately normal? How would you justify reporting median (IQR) vs mean (SD) for each in a Table 1?

Model answer: describe() from psych returns skew values for each variable. Physical-activity minutes typically have the largest skew (right-skewed: most people moderate, a few very high), while age and BMI are roughly normal. For Table 1: report median (IQR) for skewed variables (phys_act_min) because the median is less pulled by extremes; report mean (SD) for approximately normal variables (age, BMI). Systolic BP is moderately skewed and either summary works, though many epidemiology journals default to mean (SD) for clinical readability.

3. Look at the boxplot(systolic_bp ~ smoker). Does the median for smokers appear higher, lower, or similar to non-smokers? Are there any visible outliers, and would you expect them to bias a t-test of the difference in means?

Model answer: The boxplot typically shows the smoker median modestly higher than the non-smoker median (a 5–10 mmHg gap), with both groups having outliers in the high-BP range. Outliers above the upper whisker can bias a t-test of means toward statistical significance if they cluster in one group; the practical fix is to run a Wilcoxon rank-sum test (which is rank-based) as a sensitivity analysis, or to inspect the outliers for data-entry errors. If the outliers are real clinical values (genuine hypertensives), they belong in the analysis; report both means and medians side by side.


Descriptive Statistics for Epidemiologic Data: “Table 1”

In epidemiologic publications, “Table 1” typically presents the baseline characteristics of the study population, often stratified by exposure or outcome status. How time-to-event outcomes and group comparisons are displayed alongside Table 1 has been the subject of methodological scrutiny. Pocock, Clayton, & Altman (2002) catalogue common pitfalls in survival plots and baseline reporting. A well-constructed Table 1 includes:

Demographics (age, sex, ethnicity) with appropriate summary statistics
Key clinical or exposure variables
Missing data counts for each variable
Comparison statistics (p-values or standardized differences) between groups

Example: Table 1 Structure

Characteristic	Exposed (n=200)	Unexposed (n=300)	p-value
Age, mean ± SD	52.3 ± 11.4	49.8 ± 12.1	0.02
Female, n (%)	110 (55.0%)	162 (54.0%)	0.82
BMI, median (IQR)	27.1 (24.0–31.5)	25.8 (23.2–29.4)	0.004
Current smoker, n (%)	48 (24.0%)	51 (17.0%)	0.05
Diabetes, n (%)	34 (17.0%)	30 (10.0%)	0.02

Note how continuous variables with normal distributions use mean ± SD, while skewed variables (BMI) use median (IQR). Categorical variables are presented as n (%).

Stratified Descriptive Analyses

Stratifying descriptive statistics by key exposure or covariate groups allows you to assess the distribution of potential confounders across exposure categories. This is a critical step before proceeding to multivariable modelling, as it helps identify imbalances that may require adjustment.
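A stratified summary of this kind can be assembled in base R. The sketch below is one possible approach, not the course's own method: the data frame cohort is simulated, and the helpers mean_sd, med_iqr, and n_pct are hypothetical names introduced here for illustration.

```r
set.seed(410)
n <- 500
# Simulated cohort (illustrative only): 200 exposed, 300 unexposed
cohort <- data.frame(
  exposed = factor(rep(c("Exposed", "Unexposed"), times = c(200, 300))),
  age     = rnorm(n, mean = 50, sd = 12),               # roughly normal
  bmi     = exp(rnorm(n, mean = log(26), sd = 0.2)),    # right-skewed
  smoker  = rbinom(n, size = 1, prob = 0.2)             # 1 = current smoker
)

# Hypothetical formatting helpers, one per reporting convention
mean_sd <- function(x) sprintf("%.1f (%.1f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
med_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
  sprintf("%.1f (%.1f, %.1f)", q[2], q[1], q[3])
}
n_pct <- function(x) sprintf("%d (%.1f%%)", sum(x == 1, na.rm = TRUE),
                             100 * mean(x == 1, na.rm = TRUE))

# One row per characteristic, one column per exposure group
table1 <- rbind(
  "Age, mean (SD)"        = tapply(cohort$age,    cohort$exposed, mean_sd),
  "BMI, median (Q1, Q3)"  = tapply(cohort$bmi,    cohort$exposed, med_iqr),
  "Current smoker, n (%)" = tapply(cohort$smoker, cohort$exposed, n_pct),
  "Missing BMI, n"        = tapply(is.na(cohort$bmi), cohort$exposed, sum)
)
table1
```

Because the simulated groups are drawn from the same distributions, any differences between the columns here are sampling noise; with real data, large gaps in a row are the imbalances to consider adjusting for.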

Building Scales: Internal Consistency and Factor Analysis

Many public-health surveys ask several items that together measure a latent construct (e.g., depression, anxiety, social support). Two questions need to be answered before you can use these items as a single scale variable in a regression: (1) internal consistency (do the items hang together?), assessed with Cronbach’s α; and (2) dimensionality (how many underlying factors do the items measure?), assessed with exploratory factor analysis (EFA). Once both checks pass, the items can be summed (or averaged) into a derived scale variable that you carry forward into later lessons.
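To make the internal-consistency idea concrete, the sketch below computes Cronbach's α from its textbook formula, α = k/(k − 1) × (1 − Σ item variances / variance of the total score). The four items are simulated as a shared latent trait plus independent noise, and the helper cronbach_alpha is a hypothetical name; on complete data its result should be close to the raw alpha reported by a packaged routine.

```r
set.seed(410)
n      <- 300
latent <- rnorm(n)                                   # simulated underlying construct
# Four hypothetical items: each is the latent trait plus independent noise
items  <- sapply(1:4, function(i) latent + rnorm(n, sd = 1))
colnames(items) <- paste0("item", 1:4)

# Hypothetical helper implementing the textbook formula on complete cases
cronbach_alpha <- function(x) {
  x <- x[complete.cases(x), , drop = FALSE]
  k         <- ncol(x)
  item_var  <- apply(x, 2, var)       # variance of each item
  total_var <- var(rowSums(x))        # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

cronbach_alpha(items)                 # near 0.80 for this simulation design

# Items that share nothing give an alpha near 0
noise <- matrix(rnorm(n * 4), ncol = 4)
cronbach_alpha(noise)
```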

R Activity: Cronbach’s α and exploratory factor analysis

The course dataset contains seven depression items (dep1 … dep7) and five anxiety items (anx1 … anx5). Below we (1) confirm internal consistency, (2) decide on the number of factors, and (3) build the derived scale scores we use for the rest of the course.

# 1. Internal consistency for each candidate scale ----------------------------
library(psych)
dep_items <- phaa[, c("dep1","dep2","dep3","dep4","dep5","dep6","dep7")]
anx_items <- phaa[, c("anx1","anx2","anx3","anx4","anx5")]

alpha(dep_items)             # raw_alpha >= 0.70 is acceptable, >= 0.80 is good
alpha(anx_items)

# 2. How many underlying factors? ------------------------------------------
fa.parallel(dep_items)        # scree plot suggests one factor

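# 3. Inspect a one-factor solution for the depression items ----------------
# factanal() stops on missing values when given a data frame or matrix,
# so drop incomplete rows first (as step 4 does).
factanal(x = na.omit(dep_items), factors = 1, rotation = "varimax")
# Loadings >= ~0.40 contribute meaningfully to the factor.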

# 4. Two-factor solution combining dep + anx items -------------------------
combined <- na.omit(cbind(dep_items, anx_items))
factanal(x = combined, factors = 2, rotation = "varimax")
# dep1-7 should load on one factor; anx1-5 on the other.

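# 5. Build derived scale variables for use in later lessons ------------------
# Caution: with na.rm = TRUE a skipped item counts as 0, so respondents with
# missing items get artificially low totals; check item missingness first.
phaa$dep_score <- rowSums(dep_items, na.rm = TRUE)
phaa$anx_score <- rowSums(anx_items, na.rm = TRUE)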
summary(phaa$dep_score)

# 6. Save the cleaned, scale-augmented file --------------------------------
write.csv(phaa, "phaa_survey_clean.csv", row.names = FALSE)

Console output (truncated)

Reliability analysis (depression items)
raw_alpha  std.alpha  average_r
0.89       0.89       0.54

Loadings (1-factor):
      Factor1
dep1  0.75
dep2  0.70
dep3  0.66
dep4  0.69
dep5  0.61
dep6  0.72
dep7  0.55

How to read this. An α near 0.90 indicates that the seven depression items are strongly inter-correlated, consistent with their measuring essentially the same construct. The single factor explains a large share of the variance and every item loads above 0.50 on it (the weakest, dep7, at 0.55), so summing the items into a dep_score is justified. Later lessons use this score as a continuous predictor and as a covariate in the logistic model for hypertension.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / plot before answering.

1. What is the raw alpha for the depression items? For the anxiety items? Which scale (if either) clears the conventional 0.80 threshold for "good" internal consistency, and which item (look at the "Reliability if an item is dropped" table) contributes least to the depression scale?

Model answer: Raw α for the seven depression items is about 0.89 here (good internal consistency), matching the console output above; the anxiety scale is lower, around 0.75–0.80 (borderline). The depression scale clears the 0.80 threshold; the anxiety scale is just below. In the "reliability if an item is dropped" table, the item that contributes least is the one whose alpha-if-dropped value is highest, equivalently the one with the lowest item-total correlation, here the item loading only about 0.55.

2. From fa.parallel(dep_items), how many factors have eigenvalues above the parallel-analysis line? Does the one-factor solution from factanal() show all seven loadings >= 0.40?

Model answer: fa.parallel() typically shows one factor above the parallel-analysis line for the depression items, supporting a unidimensional structure. The one-factor solution from factanal() usually has all seven loadings ≥ 0.40, confirming that all items track the same latent construct.

3. In the two-factor combined solution, do the depression items load on a different factor than the anxiety items? Cite one specific loading from your output that supports your answer.

Model answer: In the two-factor combined solution, the depression and anxiety items load on different factors. A specific example: depression item DEP3 ("I feel down or depressed") might load 0.78 on Factor 1 and 0.12 on Factor 2, while anxiety item ANX2 ("I feel restless or worried") loads 0.05 on Factor 1 and 0.81 on Factor 2. The cross-loadings (each item < 0.40 on the other factor) and within-factor loadings (each item > 0.60 on its own factor) together support the interpretation that the two scales measure distinct latent constructs. Note that varimax is an orthogonal rotation, so it keeps the two factors uncorrelated by construction; an oblique rotation (e.g., rotation = "promax") would be needed to estimate how strongly the constructs correlate.


Normality Assessment

Assessing whether a variable follows a normal distribution informs the choice between parametric and non-parametric methods.

Visual Methods

Q-Q (quantile-quantile) plots compare observed quantiles to theoretical normal quantiles. Points falling along the diagonal line suggest normality; systematic departures indicate non-normality. Histograms with a normal curve overlay also provide quick visual assessment.
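A short sketch of both visual checks using base R graphics. The two variables are simulated (one roughly normal, one right-skewed) purely to show what the contrasting patterns look like.

```r
set.seed(410)
sbp <- rnorm(400, mean = 125, sd = 15)     # simulated, roughly normal (mmHg)
pa  <- rexp(400, rate = 1 / 150)           # simulated, right-skewed (minutes/week)

op <- par(mfrow = c(1, 3))
qqnorm(sbp, main = "Q-Q: roughly normal"); qqline(sbp)   # points hug the line
qqnorm(pa,  main = "Q-Q: right-skewed");   qqline(pa)    # points curve away in the upper tail

# Histogram on the density scale with a fitted normal curve overlaid
hist(sbp, freq = FALSE, main = "Histogram + normal curve", xlab = "mmHg")
curve(dnorm(x, mean = mean(sbp), sd = sd(sbp)), add = TRUE)
par(op)
```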

Formal Tests

The Shapiro-Wilk test is generally preferred for samples under 5,000. The Kolmogorov-Smirnov test (with Lilliefors correction) is an alternative. Both test the null hypothesis that data come from a normal distribution.

Caveat: With very large samples, these tests will reject normality even for trivial departures. Visual assessment should always accompany formal tests.
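The sample-size caveat can be seen by drawing a small and a large sample from the same mildly skewed distribution. The gamma distribution below is an assumption chosen only because its departure from normality is modest. shapiro.test() accepts between 3 and 5,000 observations; Lilliefors-corrected tests are provided by add-on packages (for example, lillie.test() in nortest) and are not shown.

```r
set.seed(410)
# Same mildly right-skewed population, two sample sizes (simulated)
small <- rgamma(50,   shape = 20, rate = 1)
large <- rgamma(5000, shape = 20, rate = 1)   # 5000 is the upper limit for shapiro.test()

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

c(n50 = p_small, n5000 = p_large)   # the large sample rejects normality decisively
```

The shape of the population is identical in both draws; only the power of the test has changed, which is why the Q-Q plot should be read alongside the p-value.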

Implications for Analysis Choice

If a variable is approximately normal, parametric tests (t-tests, ANOVA, Pearson correlation) are appropriate. For non-normal distributions, consider non-parametric alternatives (Wilcoxon rank-sum, Kruskal-Wallis, Spearman correlation) or data transformations. The Central Limit Theorem provides some protection for large samples, but descriptive summaries should still reflect the actual distribution.
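As an illustration of running a parametric test and its rank-based counterpart side by side, the sketch below compares a skewed outcome between two groups. All data are simulated and the variable names are placeholders; with real data the two p-values may agree or disagree, and disagreement is a cue to look at the distribution.

```r
set.seed(410)
# Simulated, right-skewed physical-activity minutes by smoking status
dat <- data.frame(
  smoker   = factor(rep(c("No", "Yes"), each = 100)),
  phys_act = c(rexp(100, rate = 1 / 180), rexp(100, rate = 1 / 120)),
  age      = round(runif(200, min = 20, max = 80))
)

tt <- t.test(phys_act ~ smoker, data = dat)        # Welch two-sample t-test (compares means)
wt <- wilcox.test(phys_act ~ smoker, data = dat)   # Wilcoxon rank-sum (rank-based)
c(t_test = tt$p.value, wilcoxon = wt$p.value)

# Correlation: Pearson (linear) versus Spearman (rank-based)
c(pearson  = cor(dat$age, dat$phys_act, method = "pearson"),
  spearman = cor(dat$age, dat$phys_act, method = "spearman"))
```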

Reflection

Consider how you would construct a “Table 1” for an epidemiologic study examining the relationship between physical activity and cardiovascular disease. What variables would you include, how would you summarize each, and what stratification would you use?

Model answer: Table 1 for an epidemiologic PA-CVD cohort: rows = baseline characteristics, columns stratified by exposure group (e.g., low/moderate/high physical activity). Variables: demographic (age, sex, ethnicity, education, income), clinical (BMI, systolic/diastolic BP, total cholesterol, HbA1c, smoking, alcohol, family history of CVD), and exposure-related (specific PA measures: MET-hours/week, vigorous-PA minutes, sedentary time). Summaries: mean (SD) for approximately normal continuous variables, median (IQR) for skewed, n (%) for categorical. Tests in the rightmost column: ANOVA for normally distributed, Kruskal-Wallis for skewed, χ² for categorical, but be cautious: p-values in Table 1 are often misinterpreted as evidence of confounding; better practice is to report standardised mean differences for assessing imbalance after matching/weighting. Stratification by age and sex is the standard supplementary view.


Transformation	Formula	When to use
Log (natural)	\(\ln(x)\) or \(\ln(x+1)\)	Right-skewed; multiplicative relationships
Square root	\(\sqrt{x}\)	Count data; moderate skew
Reciprocal	\(1/x\)	Strongly skewed rates and times
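A quick way to see what each transformation in the table does is to apply it to simulated skewed data and compare the skewness before and after. The helper skew_moment (a plain moment-based skewness) and both simulated variables are illustrative assumptions.

```r
# Hypothetical helper: moment-based sample skewness
skew_moment <- function(x) {
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

set.seed(410)
conc   <- rlnorm(1000, meanlog = 3, sdlog = 0.8)   # simulated biomarker, strongly right-skewed
visits <- rpois(1000, lambda = 2)                  # simulated counts, including zeros

# Skewness shrinks from raw to square root to log
round(c(raw  = skew_moment(conc),
        sqrt = skew_moment(sqrt(conc)),
        log  = skew_moment(log(conc))), 2)

# log(0) is -Inf, so variables containing zeros need ln(x + 1), i.e. log1p()
any(visits == 0)
range(log1p(visits))

# The reciprocal reverses the ordering of values, which changes how results read
cor(conc, 1 / conc, method = "spearman")           # -1
```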

HSCI 410 · Lesson 2
