A Structured Approach to Data Analysis
Exploratory Data Analysis For Epidemiology
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Construct a causal diagram before beginning data analysis
- Establish a system for managing data-collection sheets, files, and variables
- Apply best practices for data coding, entry, and verification
- Process outcome and predictor variables appropriately for analysis
- Evaluate unconditional associations between variables
- Set up a systematic approach for keeping track of analyses
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction & Data Collection
Why a Structured Approach?
When starting the analysis of a complex dataset, it is very helpful to have a structured approach in mind. For most people, there is a strong tendency to jump straight into the sophisticated analysis that is expected to provide the ultimate answer. This temptation should be resisted: if important preliminary steps are skipped, the results are very likely to be wrong.
Data analysis is an iterative process which often requires that you back up several steps as you gain more insight into your data. A structured template, while not the only approach, will be applicable in most situations and will serve to guide your initial efforts.
Start with a Causal Diagram
Before you start any work with your data, it is essential to construct a plausible causal diagram of the problem you are about to investigate. This will help identify:
- Which variables are important outcomes and predictors
- Which are potential confounders
- Which might be intervening variables between your main predictors and outcomes
Keep this causal diagram in mind throughout the entire data-analysis process. With large datasets, it will not be possible to include all predictors as separate entities. This can be handled by including blocks of variables (e.g., demographic characteristics) in the diagram instead of listing each variable.
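A causal diagram can also be kept alongside the analysis as data. The sketch below (with hypothetical variable names) stores a diagram as a directed edge list and flags potential confounders, that is, variables with arrows into both the main exposure and the outcome; this is an illustrative assumption about how one might encode the diagram, not a method prescribed by the text.

```python
# Hypothetical causal diagram stored as a set of directed edges (cause, effect).
edges = {
    ("age", "exposure"), ("age", "outcome"),
    ("sex", "exposure"),
    ("exposure", "mediator"), ("mediator", "outcome"),
    ("exposure", "outcome"),
}

def parents(node):
    """All variables with a direct arrow into `node`."""
    return {a for (a, b) in edges if b == node}

# Potential confounders: common causes of exposure and outcome.
# Here only "age" qualifies; "mediator" is an intervening variable, not a confounder.
confounders = parents("exposure") & parents("outcome")
print(confounders)  # {'age'}
```

Even this toy encoding makes the distinction in the bullet list explicit: "mediator" sits on the causal path from exposure to outcome, so it must not be treated as a confounder.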
Managing Data-Collection Sheets
It is important to establish a permanent storage system for all original data-collection sheets (survey forms, data-collection forms, etc.) that makes it easy to retrieve individual sheets if they are needed during the analysis.
Do not remove originals from your file. If you need a specific sheet for use at another location, make a photocopy. Never ship the original to another location without first making copies of all forms.
Set up a system for recording the insertion of data-collection sheets into the file so that you know how many remain to be collected before further work begins.
Once all forms have been collected, scan all sheets for completeness before doing anything else. If there are omissions, returning to the data source to complete the data will be more likely to succeed if done soon after collection rather than weeks or months later.
Section 1 Knowledge Check
1. What should you construct before beginning any work with your data?
2. Why is data analysis described as an “iterative process”?
3. What should you do if you find omissions in data-collection sheets?
Reflection
Think of a research question you are interested in. Sketch out (describe) a causal diagram showing the key outcome, main predictors, potential confounders, and any intervening variables. How does this diagram help you plan your analysis?
Data Coding, Entry & File Management
Data Coding
Before entering data into a computer, careful coding is essential. Good coding practices prevent errors that can cascade through an entire analysis. In particular, avoid compound codes (a single code that carries more than one piece of information), and never represent missing data with a value that could be mistaken for a real observation.
Data Entry
Some important issues to consider when entering your data into a computer file:
Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry. This dramatically reduces the error rate in your dataset.
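The comparison step can be automated. The sketch below (hypothetical file contents and field names) compares two independently entered datasets cell by cell and reports every inconsistency for manual resolution against the original sheets.

```python
# Minimal double-entry comparison: two independently keyed CSV datasets
# (hypothetical blood pressure data) compared cell by cell.
import csv
import io

entry1 = "id,age,sbp\n1,34,120\n2,41,135\n3,29,118\n"
entry2 = "id,age,sbp\n1,34,120\n2,44,135\n3,29,118\n"

def compare_entries(text_a, text_b):
    """Return (record id, field, value in file A, value in file B) for each mismatch."""
    rows_a = list(csv.DictReader(io.StringIO(text_a)))
    rows_b = list(csv.DictReader(io.StringIO(text_b)))
    mismatches = []
    for ra, rb in zip(rows_a, rows_b):
        for field in ra:
            if ra[field] != rb[field]:
                mismatches.append((ra["id"], field, ra[field], rb[field]))
    return mismatches

print(compare_entries(entry1, entry2))  # [('2', 'age', '41', '44')]
```

Each mismatch is then checked against the original data-collection sheet, which is why those sheets must remain retrievable.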
Spreadsheets are a convenient tool for initial data entry, but they must be used with extreme caution. It is possible to sort individual columns, which could destroy your entire dataset with one inappropriate “sort” command. Custom data-entry software provides a greater margin of safety.
As soon as the data-entry process has been completed, save the original data files in a safe location. In large, expensive trials, keep a copy of all originals stored in another location. Convert your data to the format your statistical software uses as soon as possible.
Keeping Track of Files
It is important to have a system for keeping track of all your files. Key recommendations:
- Assign a logical name with a 2-digit numerical suffix (e.g., brazil01). A 2-digit suffix allows you to have 99 versions that still sort correctly when listed alphabetically.
- When data manipulations are carried out, save the file with a new name (the next available number). Do not change data and then overwrite the file.
- Keep a simple log of files created with information about the contents (e.g., number of observations and variables).
Example file log:
- bp01.odc (27/09/07): Original blood pressure study data; spreadsheet; 1 record per measurement. 1092 obs, 8 vars.
- bp01.dta (28/09/07): Original file; Stata format. 1092 obs, 8 vars.
- bp02.dta (30/09/07): 45 records with missing values dropped. 1047 obs, 8 vars.
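The "next available number" rule is mechanical enough to script. This sketch (stem and file names hypothetical) generates the next name in a 2-digit versioning scheme so versions keep sorting correctly.

```python
# Generate the next file name in a 2-digit versioning scheme (bp01, bp02, ...).
import re

def next_version(stem, existing):
    """Return stem + next 2-digit suffix, given the names already in use."""
    pattern = re.compile(re.escape(stem) + r"(\d{2})$")
    numbers = [int(m.group(1)) for name in existing
               if (m := pattern.match(name))]
    return f"{stem}{max(numbers, default=0) + 1:02d}"

print(next_version("bp", ["bp01", "bp02"]))  # bp03
print(next_version("brazil", []))            # brazil01
```

Because manipulated data are always saved under the new name, no existing version is ever overwritten.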
Keeping Track of Variables
Even a relatively focused study can give rise to a large number of variables once transformed and recoded variables have been created. Recommendations include:
Use short but informative names and have all related variables start with the same name. Long names can be shortened by removing vowels (e.g., wtr_cstrn for “water cistern”). If your statistics program is case sensitive, use ONLY lower-case letters. At some point, prepare a master list of all variables.
| Variable | Description |
|---|---|
| age | Original data (in years) |
| age_ct | Age after centring by subtraction of the mean |
| age_ctsq | Quadratic term (age_ct squared) |
| age_c2 | Age categorised into 2 categories (young vs old) |
| age_c3 | Age categorised into 3 categories |
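The derived variables in the table above can be computed in a few lines. The sketch below uses a toy list of ages and hypothetical cut-points for the categorised versions.

```python
# Derived age variables (toy data; cut-points for age_c3 are hypothetical).
from statistics import mean

age = [25, 40, 55, 70]
m = mean(age)  # 47.5

age_ct = [a - m for a in age]                  # centred age
age_ctsq = [c ** 2 for c in age_ct]            # quadratic term
age_c2 = ["young" if a < m else "old" for a in age]                    # 2 categories
age_c3 = ["<40" if a < 40 else "40-59" if a < 60 else "60+" for a in age]  # 3 categories

print(age_ct)  # [-22.5, -7.5, 7.5, 22.5]
```

Naming all of these with the `age` prefix keeps the family of transformed variables together in any alphabetical variable list, as recommended above.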
Section 2 Knowledge Check
1. Why should you never use compound codes?
2. What is the advantage of double-data entry?
3. When data manipulations are carried out, what should you do with the file?
Reflection
Describe a file-naming and version-control system you would use for a dataset in your own research area. How would you organise the variable names for a study with demographic, clinical, and outcome variables?
Program Files, Data Editing & Verification
Program Mode vs. Interactive Processing
Statistical programs can be used in an interactive mode (selecting items from menus or typing in a command) or in program mode (compiling a series of commands into a program and then running it).
Interactive mode is very useful for exploring your data and trying out analyses. However, it should not be used for any of the “real” processing and/or analysis because it is very difficult to keep a clear record of steps taken. Consequently, it is difficult or impossible to reconstruct the analyses you have completed.
Program mode is the recommended approach. You compile the commands into a program and then run it. These program files can be saved and used to reconstruct any analyses you have carried out. Key tips: name files logically, structure the program to be easy to follow, use sequential indents, and document the file thoroughly with comments.
Do all of the analyses in your statistical program. Don’t start doing basic statistics in a spreadsheet. You are going to need the statistical program eventually, and it will be much easier to keep track of all your analyses if they are all done there.
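What a well-documented program file looks like will depend on your statistical package; the sketch below (file and variable names hypothetical) shows the general shape in Python: a comment header stating inputs and outputs, each step commented, and results written to a log rather than read off the screen.

```python
# Hypothetical program file: bp_analysis01, blood pressure study, descriptive step.
# Input: cleaned data (toy list here). Output: written to `log`.
import io

def run_analysis(data, log):
    # Step 1: record what was analysed, so the log documents itself.
    log.write(f"n = {len(data)} observations\n")
    # Step 2: the actual analysis (here, just a mean).
    mean_sbp = sum(data) / len(data)
    log.write(f"mean sbp = {mean_sbp}\n")
    return mean_sbp

log = io.StringIO()
print(run_analysis([120, 130, 125], log))  # 125.0
```

Saved alongside its numbered data file and log, a script like this lets you reconstruct exactly which commands produced which results.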
Data Editing
Before beginning any analyses, spend time editing your data. The most important components are data verification (described below) and the processing of outcome and predictor variables (covered in the next section).
Data Verification
Before you start any analyses, you must verify that your data are correct. This can be combined with data processing and involves going through all of your variables, one by one.
For continuous variables:
- Determine the number of valid observations and the number of missing values
- Check the maximum and minimum values (or the 5 smallest and 5 largest) to make sure they are reasonable; if they are not, find the error, correct it, and repeat the process
- Prepare a histogram of the data to get an idea of the distribution and see if it looks reasonable
For categorical variables:
- Determine the number of valid observations and the number of missing values
- Obtain a frequency distribution to see if the counts in each category look reasonable (and to make sure there are no unexpected categories)
- Attach labels to the values so that the categories are self-documenting in all output
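These per-variable checks are easy to wrap in small helpers. The sketch below (toy data; `None` marks missing) produces exactly the summaries listed: valid/missing counts, the extremes of a continuous variable, and a frequency table for a categorical one.

```python
# One-variable verification summaries (hypothetical data; None marks missing).
from collections import Counter

def verify_continuous(values, k=5):
    valid = sorted(v for v in values if v is not None)
    return {"n_valid": len(valid),
            "n_missing": len(values) - len(valid),
            "smallest": valid[:k],       # k smallest values
            "largest": valid[-k:]}       # k largest values

def verify_categorical(values):
    valid = [v for v in values if v is not None]
    return {"n_valid": len(valid),
            "n_missing": len(values) - len(valid),
            "frequencies": Counter(valid)}

sbp = [118, 135, None, 420, 122]  # 420 is an obvious entry error to chase down
print(verify_continuous(sbp, k=2))
```

Here the maximum of 420 mmHg immediately fails the "is it reasonable?" test, which sends you back to the original data-collection sheet for that record.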
Section 3 Knowledge Check
1. Why should program mode be preferred over interactive mode for “real” data analysis?
2. When verifying continuous variables, what should you examine first?
3. What is the purpose of attaching labels to categorical variable values?
Reflection
Imagine you receive a dataset where a colleague entered data interactively in a spreadsheet with no documentation. What steps would you take to clean, verify, and prepare the data for analysis? What problems might you encounter?
Data Processing & Unconditional Associations
Processing the Outcome Variable(s)
While verifying data, you can also start processing your outcome variable(s). Review the stated goals of the study to determine the format(s) which best suit the goal(s). Consider the following, depending on the type of outcome:
- Categorical outcomes: Is the distribution of outcomes across categories acceptable? For example, if you planned a multinomial regression with a 3-category outcome, but very few observations fall in one of the categories, you might want to recode it to a 2-category variable.
- Continuous outcomes: Does the variable have the characteristics necessary for the planned analysis? If linear regression is planned, is the distribution approximately normal? If not, explore transformations. Note: it is the normality of the residuals which is ultimately important, but if the original variable is far from normal and there are no strong predictors, the residuals are unlikely to be normal.
- Count outcomes: If Poisson regression is planned, are the mean and variance of the distribution approximately equal? If not, consider negative binomial regression or alternative analytic approaches.
- Survival (time-to-event) outcomes: What proportion of the observations are censored? You might also want to generate a simple graph of the empirical hazard function to get an idea of its shape.
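The mean-variance check for a count outcome is a one-liner. The sketch below uses hypothetical case counts; a variance/mean ratio well above 1 signals overdispersion, pointing toward negative binomial regression rather than Poisson.

```python
# Mean-variance check for a count outcome (hypothetical case counts).
from statistics import mean, pvariance

counts = [0, 1, 1, 2, 2, 3, 9, 14]

m = mean(counts)        # 4
v = pvariance(counts)   # 21
dispersion = v / m      # 5.25: far above 1, so Poisson looks inappropriate
print(dispersion)
```

For a Poisson-distributed outcome this ratio should be close to 1; here the variance is more than five times the mean.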
Processing Predictor Variables
It is important to go through all predictor variables to determine how they will be handled:
- Missing values: Are there many? If so, you might need to abandon plans to use that predictor, or conduct 2 analyses (one on the subset where the predictor is present and one on the full dataset ignoring the predictor).
- Distribution: For continuous variables, is there a reasonable representation over the whole range of values? If not, it might be necessary to categorise the variable.
- Categorical variables: Are all categories reasonably well represented? If not, you might have to combine categories.
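Combining sparse categories can also be scripted. The sketch below (hypothetical data; the threshold of 5 is an arbitrary illustration, not a rule from the text) collapses any category below a minimum count into a single "other" category.

```python
# Collapse sparsely populated categories of a predictor (hypothetical data).
from collections import Counter

def collapse_sparse(values, min_count=5, other="other"):
    """Replace any category with fewer than min_count observations by `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

breed = ["holstein"] * 8 + ["jersey"] * 6 + ["rare_breed"] * 2
print(Counter(collapse_sparse(breed)))  # holstein: 8, jersey: 6, other: 2
```

Whether such a pooled "other" category is scientifically sensible depends on the variable; the causal diagram should guide that judgment.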
Multilevel Data
If your data are multilevel (e.g., blood pressure measurements within individuals within centres), evaluate the hierarchical structure:
- What is the average (and range of the) number of observations at one level within each higher-level unit?
- Are individuals uniquely identified within a hierarchical level? It is often useful to create one unique identifier for each observation in the dataset.
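Both checks can be sketched for a two-level structure (measurements within individuals; IDs hypothetical): summarise group sizes, then build one unique identifier per observation.

```python
# Summarise a two-level structure and build unique row identifiers (toy IDs).
from collections import Counter

individual = ["a", "a", "a", "b", "b", "c"]  # higher-level unit for each row

sizes = Counter(individual)                  # observations per individual
avg = sum(sizes.values()) / len(sizes)
print(avg, min(sizes.values()), max(sizes.values()))  # 2.0 1 3

# Unique row identifier: individual ID plus a within-individual sequence number.
seen = Counter()
row_id = []
for ind in individual:
    seen[ind] += 1
    row_id.append(f"{ind}-{seen[ind]}")
print(row_id)  # ['a-1', 'a-2', 'a-3', 'b-1', 'b-2', 'c-1']
```

A wide range of group sizes (here 1 to 3) matters later when choosing and interpreting multilevel models.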
Unconditional Associations
Before proceeding with any multivariable analyses, it is important to evaluate unconditional associations within the data. These serve as the foundation for building more complex models.
| Variable Types | Analytical Approach |
|---|---|
| Two continuous variables | Correlation coefficient, scatterplot, simple linear regression |
| One continuous + one categorical | One-way ANOVA, simple linear or logistic regression |
| Two categorical variables | Cross-tabulation and χ² test |
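For the first row of the table, the correlation coefficient can be computed directly from its definition. The sketch below (hypothetical data) implements the Pearson correlation by hand rather than relying on any particular statistics package.

```python
# Pearson correlation between two continuous variables (hypothetical data).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfect linear association)
```

A scatterplot should always accompany the coefficient, since r summarises only the linear component of the association.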
When evaluating unconditional associations, pay attention to:
- Associations between predictors and outcome: Determine if there is any association at all; determine the functional form (is it linear?); get a simple picture of the strength and direction.
- Associations between pairs of predictors: Look for potential collinearity problems (highly correlated predictors).
- Confounding variables: Evaluate associations between the confounding variables and the key predictors of interest and the outcome.
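The collinearity screen in the second bullet amounts to checking every predictor pair. The sketch below (hypothetical predictors; the 0.8 cut-off is a common but arbitrary choice, not one prescribed by the text) flags pairs with high absolute correlation.

```python
# Screen predictor pairs for collinearity (hypothetical data; 0.8 threshold assumed).
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

predictors = {
    "weight_kg": [60, 70, 80, 90],
    "weight_lb": [132, 154, 176, 198],  # same information in other units
    "age":       [30, 25, 41, 28],
}

flagged = [(a, b) for a, b in combinations(predictors, 2)
           if abs(pearson_r(predictors[a], predictors[b])) > 0.8]
print(flagged)  # [('weight_kg', 'weight_lb')]
```

Flagged pairs carry essentially the same information, so typically only one member of each pair goes forward into the multivariable model.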
Keeping Track of Your Analyses
Before starting the more substantial analysis, set up a system for keeping track of your results. At a minimum, save the log files from your analyses, named to match the program files that generated them, so that every result can be traced back to the commands that produced it.
Section 4 Knowledge Check
1. Why should you evaluate unconditional associations before multivariable analyses?
2. If a continuous outcome variable is far from normally distributed, what should you do?
3. What is the appropriate analytical approach for evaluating the association between two categorical variables?
Reflection
Consider a dataset with 15 predictor variables and one continuous outcome. Describe the sequence of unconditional analyses you would carry out before fitting any multivariable models. How would you handle a predictor that has 30% missing values?
A Structured Approach to Data Analysis — Final Assessment
Final Reflection
Reflecting on the entire lesson, what do you consider the most important step in the structured approach to data analysis? How would you apply this structured approach to a dataset you are currently working with or plan to work with in the future?
Final Assessment
You must complete the final reflection above before submitting. Answer all 15 questions. 100% is required to pass.
1. What is the first step recommended before beginning any data analysis?
2. Why should you avoid starting analyses in a spreadsheet?
3. What coding value should NOT be assigned to missing data?
4. Why is a 2-digit numerical suffix recommended for file names?
5. What is the danger of using the “sort” command in a spreadsheet for data entry?
6. What is the primary purpose of evaluating unconditional associations between pairs of predictors?
7. When processing a categorical outcome with 3 categories, when might you recode it to 2 categories?
8. What approach should be used to document what a program file does?
9. For verifying a continuous variable, what visual tool is recommended?
10. What is the appropriate analysis for the association between one continuous and one categorical variable?
11. If a predictor variable has many missing values, what options are available?
12. What should you do with log files from your analyses?
13. Why is interactive mode still useful despite its limitations?
14. When evaluating confounding variables, what should you specifically look for?
15. What does the chapter suggest you should do if a count/rate outcome’s mean and variance are not approximately equal?