HSCI 410 — Lesson 1

A Structured Approach to Data Analysis

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Construct a causal diagram before beginning data analysis
  • Establish a system for managing data-collection sheets, files, and variables
  • Apply best practices for data coding, entry, and verification
  • Process outcome and predictor variables appropriately for analysis
  • Evaluate unconditional associations between variables
  • Set up a systematic approach for keeping track of analyses

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Section 1

Introduction & Data Collection

⏱ Estimated time: 15 minutes

Why a Structured Approach?

When starting the analysis of a complex dataset, it is very helpful to have a structured approach in mind. For most people, there is a strong tendency to jump straight into the sophisticated analysis that will provide the ultimate answer. This rarely works out well: the results are likely to be wrong because important preliminary steps were skipped.

Key Principle

Data analysis is an iterative process which often requires that you back up several steps as you gain more insight into your data. A structured template, while not the only approach, will be applicable in most situations and will serve to guide your initial efforts.

Start with a Causal Diagram

Before you start any work with your data, it is essential to construct a plausible causal diagram of the problem you are about to investigate. This will help identify:

  • Which variables are important outcomes and predictors
  • Which are potential confounders
  • Which might be intervening variables between your main predictors and outcomes
Practical Tip

Keep this causal diagram in mind throughout the entire data-analysis process. With large datasets, it will not be possible to include all predictors as separate entities. This can be handled by including blocks of variables (e.g., demographic characteristics) in the diagram instead of listing each variable.
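Even a rough diagram can be kept alongside your analysis as code. The sketch below (all variable names are hypothetical illustrations) represents a diagram as a directed edge list and flags two of the roles described above: a variable with arrows into both the exposure and the outcome is a confounder candidate, and a variable on a directed path from exposure to outcome is intervening.

```python
# Minimal sketch of a causal diagram as a directed edge list.
# All variable names here are hypothetical illustrations.
edges = {
    ("age", "exercise"), ("age", "blood_pressure"),
    ("exercise", "body_mass"), ("body_mass", "blood_pressure"),
    ("exercise", "blood_pressure"),
}
exposure, outcome = "exercise", "blood_pressure"

def parents(v):
    """Variables with an arrow pointing into v."""
    return {a for a, b in edges if b == v}

def children(v):
    """Variables that v points into."""
    return {b for a, b in edges if a == v}

# Arrows into both exposure and outcome: confounder candidate.
confounder_candidates = parents(exposure) & parents(outcome)
# Child of the exposure that also points into the outcome: intervening.
intervening = children(exposure) & parents(outcome)

print(confounder_candidates)  # {'age'}
print(intervening)            # {'body_mass'}
```

This is only a bookkeeping aid; identifying a sufficient adjustment set in a larger diagram requires the full graphical criteria rather than this simple parent/child check.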

Managing Data-Collection Sheets

It is important to establish a permanent storage system for all original data-collection sheets (survey forms, data-collection forms, etc.) that makes it easy to retrieve individual sheets if they are needed during the analysis.

Protect your originals

Do not remove originals from your file. If you need a specific sheet for use at another location, make a photocopy. Never ship the original to another location without first making copies of all forms.

Track collection progress

Set up a system for recording the insertion of data-collection sheets into the file so that you know how many remain to be collected before further work begins.

Scan for completeness

Once all forms have been collected, scan every sheet for completeness before doing anything else. If there are omissions, returning to the data source to complete the data is more likely to succeed soon after collection than weeks or months later.

Section 1 Knowledge Check

1. What should you construct before beginning any work with your data?

Before working with your data, you should construct a plausible causal diagram. This identifies which variables are important outcomes and predictors, which are potential confounders, and which might be intervening variables.

2. Why is data analysis described as an “iterative process”?

Data analysis is iterative because as you gain more insight into your data, you often need to revisit earlier steps, revise your approach, and re-examine your variables and models.

3. What should you do if you find omissions in data-collection sheets?

Returning to the data source to complete missing data will more likely be successful if done soon after the data were initially collected, rather than weeks or months later when the analysis has begun.

Reflection

Think of a research question you are interested in. Sketch out (describe) a causal diagram showing the key outcome, main predictors, potential confounders, and any intervening variables. How does this diagram help you plan your analysis?

Section 2

Data Coding, Entry & File Management

⏱ Estimated time: 20 minutes

Data Coding

Before entering data into a computer, careful coding is essential. Good coding practices prevent errors that can cascade throughout an entire analysis.

Key coding practices:

  • Missing values: assign a specific code that is not a legitimate value for any response (e.g., −999)
  • Numeric codes: record categorical responses with numeric codes rather than free text
  • No compound codes: code only one piece of information in each variable

Data Entry

Some important issues to consider when entering your data into a computer file:

Double-data entry

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry. This dramatically reduces the error rate in your dataset.
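The comparison step can be automated. A minimal sketch, assuming both entry passes were saved as rows of strings in the same column order (the example rows are hypothetical):

```python
# Sketch: comparing two independent data-entry passes cell by cell.
# The rows below are hypothetical; real data would be read from two files.
entry_a = [["id", "age", "sex"], ["01", "34", "F"], ["02", "57", "M"]]
entry_b = [["id", "age", "sex"], ["01", "34", "F"], ["02", "75", "M"]]

discrepancies = []
for row_num, (row_a, row_b) in enumerate(zip(entry_a, entry_b)):
    for col, (a, b) in zip(entry_a[0], zip(row_a, row_b)):
        if a != b:
            discrepancies.append((row_num, col, a, b))

print(discrepancies)  # [(2, 'age', '57', '75')]
```

Each discrepancy is then resolved by checking the original data-collection sheet, which is one more reason to keep the originals retrievable.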

Caution with spreadsheets

Spreadsheets are a convenient tool for initial data entry, but they must be used with extreme caution. It is possible to sort individual columns, which could destroy your entire dataset with one inappropriate “sort” command. Custom data-entry software provides a greater margin of safety.

Save and back up immediately

As soon as the data-entry process has been completed, save the original data files in a safe location. In large, expensive trials, keep a copy of all originals stored in another location. Convert your data to the format your statistical software uses as soon as possible.

Keeping Track of Files

It is important to have a system for keeping track of all your files. Key recommendations:

  • Assign a logical name with a 2-digit numerical suffix (e.g., brazil01). A 2-digit suffix allows you to have 99 versions that still sort correctly when listed alphabetically.
  • When data manipulations are carried out, save the file with a new name (the next available number). Do not change data and then overwrite the file.
  • Keep a simple log of files created with information about the contents (e.g., number of observations and variables).
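Choosing the next file name can even be scripted so the convention is never broken by hand. A sketch, assuming the brazil##-style naming described above:

```python
import re

def next_version(existing, stem="brazil", ext=".dta"):
    """Return the next 2-digit version name, e.g. brazil03.dta.

    `existing` is a list of file names already in the project folder.
    """
    pattern = re.compile(rf"{re.escape(stem)}(\d{{2}}){re.escape(ext)}$")
    numbers = [int(m.group(1)) for f in existing if (m := pattern.match(f))]
    return f"{stem}{max(numbers, default=0) + 1:02d}{ext}"

print(next_version(["brazil01.dta", "brazil02.dta"]))  # brazil03.dta
print(next_version([]))                                # brazil01.dta
```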
Example: File Log for a Blood Pressure Study

bp01.odc (27/09/07) — Original blood pressure study data; spreadsheet; 1 record per measurement. 1092 obs, 8 vars.

bp01.dta (28/09/07) — Original file; Stata format. 1092 obs, 8 vars.

bp02.dta (30/09/07) — 45 records with missing values dropped. 1047 obs, 8 vars.

Keeping Track of Variables

Even a relatively focused study can give rise to a large number of variables once transformed and recoded variables have been created. Recommendations include:

  • Use short but informative names, and have all related variables start with the same stem.
  • Shorten long names by removing vowels (e.g., wtr_cstrn for "water cistern").
  • If your statistics program is case sensitive, use only lower-case letters.
  • At some point, prepare a master list of all variables.

Example: Master List for an Age Variable

  • age: Original data (in years)
  • age_ct: Age after centring by subtraction of the mean
  • age_ctsq: Quadratic term (age_ct squared)
  • age_c2: Age categorised into 2 categories (young vs old)
  • age_c3: Age categorised into 3 categories

Section 2 Knowledge Check

1. Why should you never use compound codes?

Only code one piece of information in a single variable. Compound codes (e.g., 1=male Caucasian, 2=female Caucasian) make it extremely difficult to separate and analyse each characteristic independently.

2. What is the advantage of double-data entry?

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry because it dramatically reduces the entry error rate.

3. When data manipulations are carried out, what should you do with the file?

Save the file with a new name (the next available number in your naming convention) so you always have a record of all versions and can trace back to the original data if needed.

Reflection

Describe a file-naming and version-control system you would use for a dataset in your own research area. How would you organise the variable names for a study with demographic, clinical, and outcome variables?

Section 3

Program Files, Data Editing & Verification

⏱ Estimated time: 15 minutes

Program Mode vs. Interactive Processing

Statistical programs can be used in an interactive mode (selecting items from menus or typing in a command) or in program mode (compiling a series of commands into a program and then running it).

Interactive mode is very useful for exploring your data and trying out analyses. However, it should not be used for any of the “real” processing and/or analysis because it is very difficult to keep a clear record of steps taken. Consequently, it is difficult or impossible to reconstruct the analyses you have completed.

Program mode is the recommended approach. You compile the commands into a program and then run it. These program files can be saved and used to reconstruct any analyses you have carried out. Key tips: name files logically, structure the program to be easy to follow, use sequential indents, and document the file thoroughly with comments.

Critical Rule

Do all of the analyses in your statistical program. Don’t start doing basic statistics in a spreadsheet. You are going to need the statistical program eventually, and it will be much easier to keep track of all your analyses if they are all done there.

Data Editing

Before beginning any analyses, spend time editing your data. The most important components are:

  • Labelling variables: attach an informative label to each variable
  • Labelling categories: attach meaningful labels to the values of categorical variables (e.g., 0 = "male", 1 = "female")
  • Missing value codes: convert codes used for missing data (e.g., −999) into your software's missing-value designation
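In code, these edits amount to mapping value labels and converting the missing-value code before any analysis runs. A stdlib-Python sketch with hypothetical codes:

```python
# Sketch: attach category labels and convert a -999 missing code.
# The codes and labels here are hypothetical illustrations.
MISSING_CODE = -999
sex_labels = {0: "male", 1: "female"}

raw_sex = [0, 1, -999, 1, 0]
sex = [None if v == MISSING_CODE else sex_labels[v] for v in raw_sex]

print(sex)  # ['male', 'female', None, 'female', 'male']
```

Doing this conversion once, in a saved program file, guarantees the sentinel value can never leak into a mean or a regression later.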

Data Verification

Before you start any analyses, you must verify that your data are correct. This can be combined with data processing and involves going through all of your variables, one-by-one.

For continuous variables
  • Determine the number of valid observations and the number of missing values
  • Check the maximum and minimum values (or the 5 smallest and 5 largest) to make sure they are reasonable; if they are not, find the error, correct it, and repeat the process
  • Prepare a histogram of the data to get an idea of the distribution and see if it looks reasonable
For categorical variables
  • Determine the number of valid observations and the number of missing values
  • Obtain a frequency distribution to see if the counts in each category look reasonable (and to make sure there are no unexpected categories)
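These checks translate directly into a few lines of code. A sketch over made-up values, in which one continuous entry and one category are deliberately suspicious:

```python
from collections import Counter

# Hypothetical values for illustration only.
ages = [34, 57, 41, None, 29, 62, 410, 45]              # 410 is a likely entry error
region = ["north", "south", "north", "soutth", "east"]  # 'soutth' is an unexpected category

# Continuous variable: valid/missing counts, then the extremes.
valid = sorted(v for v in ages if v is not None)
n_missing = len(ages) - len(valid)
print(len(valid), n_missing)   # 7 valid, 1 missing
print(valid[:5], valid[-5:])   # 5 smallest / 5 largest; 410 stands out

# Categorical variable: the frequency distribution exposes the typo category.
print(Counter(region))
```

A histogram (e.g., from your statistical package) would complement the extremes check for the continuous variable.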

Section 3 Knowledge Check

1. Why should program mode be preferred over interactive mode for “real” data analysis?

Program mode compiles commands into a program file that can be saved and reused, making it possible to reconstruct and reproduce all analyses. Interactive mode makes it difficult to keep a clear record of steps taken.

2. When verifying continuous variables, what should you examine first?

For continuous variables, the first verification step is to determine the number of valid observations and missing values, then check the minimum and maximum values (or the 5 smallest and 5 largest) to make sure they are reasonable.

3. What is the purpose of attaching labels to categorical variable values?

Categorical variables should have meaningful labels attached to each category (e.g., sex coded as 0 or 1 should have labels “male” and “female” attached) so that output is immediately interpretable.

Reflection

Imagine you receive a dataset where a colleague entered data interactively in a spreadsheet with no documentation. What steps would you take to clean, verify, and prepare the data for analysis? What problems might you encounter?

Section 4

Data Processing & Unconditional Associations

⏱ Estimated time: 20 minutes

Processing the Outcome Variable(s)

While verifying data, you can also start processing your outcome variable(s). Review the stated goals of the study to determine the format(s) which best suits the goal(s). Consider the following based on outcome type:

Categorical outcomes

Is the distribution of outcomes across categories acceptable? For example, if you planned a multinomial regression with a 3-category outcome, but very few observations fall in one of the categories, you might want to recode it to a 2-category variable.

Continuous outcomes

Does the variable have the characteristics necessary for the planned analysis? If linear regression is planned, is the distribution approximately normal? If not, explore transformations. Note: It is the normality of the residuals which is ultimately important, but if the original variable is far from normal and there are no strong predictors, the residuals are unlikely to be normal.
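One quick way to see whether a transformation helps is to compare a skewness measure before and after. A sketch on hypothetical right-skewed values, using a log transformation:

```python
from math import log
from statistics import mean, pstdev

# Hypothetical right-skewed outcome values.
y = [1.2, 1.5, 1.8, 2.1, 2.4, 3.0, 4.5, 9.8, 15.2, 40.1]

def skewness(data):
    """Standardised third moment: positive values mean a long right tail."""
    m, s = mean(data), pstdev(data)
    return sum(((v - m) / s) ** 3 for v in data) / len(data)

y_log = [log(v) for v in y]
print(round(skewness(y), 2), round(skewness(y_log), 2))  # skew shrinks after log
```

As the text notes, the final judgement rests on the residuals of the fitted model, not on the raw outcome alone.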

Count / rate outcomes

If Poisson regression is planned, are the mean and variance of the distribution approximately equal? If not, consider negative binomial regression or alternative analytic approaches.
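The check itself is one line. A sketch with hypothetical counts whose variance sits well above the mean:

```python
from statistics import mean, pvariance

# Hypothetical count outcome (e.g., number of clinic visits).
counts = [0, 1, 2, 0, 3, 1, 0, 8, 12, 0, 1]
m, v = mean(counts), pvariance(counts)
print(round(m, 2), round(v, 2))  # variance far above the mean suggests overdispersion
```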

Time-to-event outcomes

What proportion of the observations are censored? You might also want to generate a simple graph of the empirical hazard function to get an idea what shape it has.

Processing Predictor Variables

It is important to go through all predictor variables to determine how they will be handled:

  • Missing values: Are there many? If so, you might need to abandon plans to use that predictor, or conduct 2 analyses (one on the subset where the predictor is present and one on the full dataset ignoring the predictor).
  • Distribution: For continuous variables, is there a reasonable representation over the whole range of values? If not, it might be necessary to categorise the variable.
  • Categorical variables: Are all categories reasonably well represented? If not, you might have to combine categories.

Multilevel Data

If your data are multilevel (e.g., blood pressure measurements within individuals within centres), evaluate the hierarchical structure:

Key Questions for Multilevel Data

What is the average (and range) number of observations at one level in each higher-level unit? Are individuals uniquely identified within a hierarchical level? It is often useful to create one unique identifier for each observation in the dataset.
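Both questions are easy to answer in code. A sketch over hypothetical two-level data (measurements within individuals):

```python
from collections import Counter

# Hypothetical two-level data: blood pressure measurements within individuals.
records = [("p1", 118), ("p1", 121), ("p2", 135),
           ("p2", 132), ("p2", 140), ("p3", 126)]

cluster_sizes = Counter(pid for pid, _ in records)
sizes = list(cluster_sizes.values())
print(min(sizes), max(sizes), sum(sizes) / len(sizes))  # range and mean cluster size

# One unique identifier per observation across the whole dataset.
uids = [f"{pid}_{i}" for i, (pid, _) in enumerate(records, start=1)]
print(uids[:3])
```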

Unconditional Associations

Before proceeding with any multivariable analyses, it is important to evaluate unconditional associations within the data. These serve as the foundation for building more complex models.

  • Two continuous variables: correlation coefficient, scatterplot, simple linear regression
  • One continuous + one categorical: one-way ANOVA, simple linear or logistic regression
  • Two categorical variables: cross-tabulation and χ² test

When evaluating unconditional associations, pay attention to:

  • Associations between predictors and outcome: Determine if there is any association at all; determine the functional form (is it linear?); get a simple picture of the strength and direction.
  • Associations between pairs of predictors: Look for potential collinearity problems (highly correlated predictors).
  • Confounding variables: Evaluate associations between the confounding variables and the key predictors of interest and the outcome.
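For two of the combinations above, the calculations can be written out from first principles. A sketch with hypothetical data: a Pearson correlation for two continuous variables, and a χ² statistic for a 2×2 cross-tabulation:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

r = pearson([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])

# Chi-squared statistic for a 2x2 table of hypothetical counts.
table = [[20, 10], [15, 25]]
row = [sum(row_) for row_ in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)
chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
print(round(r, 3), round(chi2, 3))  # strength/direction, and evidence of association
```

In practice your statistical package provides these; the point of hand-computing them once is to see exactly what the unconditional screen is measuring.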

Keeping Track of Your Analyses

Before starting the more substantial analysis, set up a system for keeping track of your results:

  • Analyse in blocks: divide the work into logical blocks of related analyses
  • Keep log files: save output in log files named to match the program files that generated them
  • Label and date: label and date all saved output so results can be traced later

Section 4 Knowledge Check

1. Why should you evaluate unconditional associations before multivariable analyses?

Evaluating unconditional associations before multivariable analyses helps you understand the basic relationships in your data, identify collinearity, detect confounding, and determine the functional form of relationships—all of which inform the complex models you will subsequently build.

2. If a continuous outcome variable is far from normally distributed, what should you do?

If the continuous outcome is not approximately normally distributed, you should explore transformations which might normalise the distribution. It is ultimately the normality of the residuals that is important, but a far-from-normal variable with no strong predictors will produce non-normal residuals.

3. What is the appropriate analytical approach for evaluating the association between two categorical variables?

For associations between two categorical variables, cross-tabulation and the chi-squared (χ²) test are the appropriate analytical approaches. These are particularly useful for identifying unexpected observations.

Reflection

Consider a dataset with 15 predictor variables and one continuous outcome. Describe the sequence of unconditional analyses you would carry out before fitting any multivariable models. How would you handle a predictor that has 30% missing values?

Final Assessment

A Structured Approach to Data Analysis — Final Assessment

15 questions • 100% required to pass

Final Reflection

Reflecting on the entire lesson, what do you consider the most important step in the structured approach to data analysis? How would you apply this structured approach to a dataset you are currently working with or plan to work with in the future?

Final Assessment

1. What is the first step recommended before beginning any data analysis?

The first step is to construct a plausible causal diagram of the problem, identifying outcomes, predictors, confounders, and intervening variables.

2. Why should you avoid starting analyses in a spreadsheet?

Doing all analyses in the statistical program makes it easier to keep track of all analyses and simplifies tracking modifications to the data.

3. What coding value should NOT be assigned to missing data?

The specific number assigned to missing values must not be a legitimate value for any of the responses. Common conventions include large negative numbers like −999.

4. Why is a 2-digit numerical suffix recommended for file names?

A 2-digit suffix allows you to have 99 versions of a file that will sort correctly when listed alphabetically (e.g., brazil01, brazil02, ... brazil99).

5. What is the danger of using the “sort” command in a spreadsheet for data entry?

In spreadsheets, it is possible to sort individual columns independently, which can destroy your entire dataset with one inappropriate “sort” command by misaligning records across columns.

6. What is the primary purpose of evaluating unconditional associations between pairs of predictors?

Associations between pairs of predictors are evaluated to detect potential collinearity problems, where highly correlated predictors can cause instability in multivariable models.

7. When processing a categorical outcome with 3 categories, when might you recode it to 2 categories?

If you planned a multinomial regression with a 3-category outcome, but there are very few observations in 1 of the 3 categories, you might want to recode it to a 2-category variable.

8. What approach should be used to document what a program file does?

All statistical programs allow you to add comments to the program files. These should document what the program does and, in some cases, record key results within the file itself.

9. For verifying a continuous variable, what visual tool is recommended?

For continuous variables, preparing a histogram gives you an idea of the distribution and allows you to see if it looks reasonable before proceeding with further analysis.

10. What is the appropriate analysis for the association between one continuous and one categorical variable?

For the association between one continuous and one categorical variable, one-way ANOVA, simple linear regression, or logistic regression are appropriate analytical approaches.

11. If a predictor variable has many missing values, what options are available?

If many values are missing, you might abandon plans to use that predictor, or conduct 2 analyses: one on the subset in which the predictor is present and one on the full dataset ignoring the predictor.

12. What should you do with log files from your analyses?

Give log files the same name as the program file (except with a different extension) so that it is easy to match the program that generated a particular set of results.

13. Why is interactive mode still useful despite its limitations?

Interactive mode is very useful for exploring your data and trying out analyses. However, the “real” processing and analysis should be done in program mode for reproducibility.

14. When evaluating confounding variables, what should you specifically look for?

Special attention needs to be paid to potential confounding variables by evaluating the associations between these variables and the key predictors of interest and the outcome, particularly if there is a strong association with both.

15. What does the chapter suggest you should do if a count/rate outcome’s mean and variance are not approximately equal?

If the mean and variance of a count/rate outcome are not approximately equal, Poisson regression assumptions may be violated. Consider negative binomial regression or alternative analytic approaches.

Lesson 1 Complete!

You have successfully completed A Structured Approach to Data Analysis.