HSCI 410 — Lesson 1

A Structured Approach to Data Analysis

Exploratory Data Analysis For Epidemiology

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Construct a causal diagram before beginning data analysis
  • Establish a system for managing data-collection sheets, files, and variables
  • Apply best practices for data coding, entry, and verification
  • Process outcome and predictor variables appropriately for analysis
  • Evaluate unconditional associations between variables
  • Set up a systematic approach for keeping track of analyses

This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.

Section 1

Introduction & Data Collection

⏱ Estimated time: 15 minutes

Why a Structured Approach?

When starting the analysis of a complex dataset, it is very helpful to have a structured approach in mind. For most people, there is a strong tendency to jump straight into the sophisticated analysis that will provide the ultimate answer. This rarely works out well: the results are likely to be wrong because important preliminary steps were skipped.

Key Principle

Data analysis is an iterative process which often requires that you back up several steps as you gain more insight into your data. A structured template, while not the only approach, will be applicable in most situations and will serve to guide your initial efforts.

Start with a Causal Diagram

Before you start any work with your data, it is essential to construct a plausible causal diagram of the problem you are about to investigate. This will help identify:

  • Which variables are important outcomes and predictors
  • Which are potential confounders
  • Which might be intervening variables between your main predictors and outcomes
Practical Tip

Keep this causal diagram in mind throughout the entire data-analysis process. With large datasets, it will not be possible to include all predictors as separate entities. This can be handled by including blocks of variables (e.g., demographic characteristics) in the diagram instead of listing each variable.
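Even a rough diagram can be kept alongside your analysis as code. The sketch below (all variable names are hypothetical illustrations) represents a diagram as a directed edge list and flags two of the roles described above: a variable with arrows into both the exposure and the outcome is a confounder candidate, and a variable on a directed path from exposure to outcome is intervening.

```python
# Minimal sketch of a causal diagram as a directed edge list.
# All variable names here are hypothetical illustrations.
edges = {
    ("age", "exercise"), ("age", "blood_pressure"),
    ("exercise", "body_mass"), ("body_mass", "blood_pressure"),
    ("exercise", "blood_pressure"),
}
exposure, outcome = "exercise", "blood_pressure"

def parents(v):
    """Variables with an arrow pointing into v."""
    return {a for a, b in edges if b == v}

def children(v):
    """Variables that v points into."""
    return {b for a, b in edges if a == v}

# Arrows into both exposure and outcome: confounder candidate.
confounder_candidates = parents(exposure) & parents(outcome)
# Child of the exposure that also points into the outcome: intervening.
intervening = children(exposure) & parents(outcome)

print(confounder_candidates)  # {'age'}
print(intervening)            # {'body_mass'}
```

This is only a bookkeeping aid; identifying a sufficient adjustment set in a larger diagram requires the full graphical criteria rather than this simple parent/child check.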

Managing Data-Collection Sheets

It is important to establish a permanent storage system for all original data-collection sheets (survey forms, data-collection forms, etc.) that makes it easy to retrieve individual sheets if they are needed during the analysis.

Protect your originals

Do not remove originals from your file. If you need a specific sheet for use at another location, make a photocopy. Never ship the original to another location without first making copies of all forms.

Track collection progress

Set up a system for recording the insertion of data-collection sheets into the file so that you know how many remain to be collected before further work begins.

Scan for completeness

Once all forms have been collected, scan every sheet for completeness before doing anything else. If there are omissions, returning to the data source to complete the data is more likely to succeed soon after collection than weeks or months later.

Section 1 Knowledge Check

1. What should you construct before beginning any work with your data?

Before working with your data, you should construct a plausible causal diagram. This identifies which variables are important outcomes and predictors, which are potential confounders, and which might be intervening variables.

2. Why is data analysis described as an “iterative process”?

Data analysis is iterative because as you gain more insight into your data, you often need to revisit earlier steps, revise your approach, and re-examine your variables and models.

3. What should you do if you find omissions in data-collection sheets?

Returning to the data source to complete missing data will more likely be successful if done soon after the data were initially collected, rather than weeks or months later when the analysis has begun.

Reflection

Think of a research question you are interested in. Sketch out (describe) a causal diagram showing the key outcome, main predictors, potential confounders, and any intervening variables. How does this diagram help you plan your analysis?

Section 2

Data Coding, Entry & File Management

⏱ Estimated time: 20 minutes

Data Coding

Before entering data into a computer, careful coding is essential. Good coding practices prevent errors that can cascade throughout an entire analysis.

Key coding practices:

  • Missing values: assign a specific code that is not a legitimate value for any response (e.g., −999)
  • Numeric codes: record categorical responses with numeric codes rather than free text
  • No compound codes: code only one piece of information in each variable

Data Entry

Some important issues to consider when entering your data into a computer file:

Double-data entry

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry. This dramatically reduces the error rate in your dataset.
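The comparison step can be automated. A minimal sketch, assuming both entry passes were saved as rows of strings in the same column order (the example rows are hypothetical):

```python
# Sketch: comparing two independent data-entry passes cell by cell.
# The rows below are hypothetical; real data would be read from two files.
entry_a = [["id", "age", "sex"], ["01", "34", "F"], ["02", "57", "M"]]
entry_b = [["id", "age", "sex"], ["01", "34", "F"], ["02", "75", "M"]]

discrepancies = []
for row_num, (row_a, row_b) in enumerate(zip(entry_a, entry_b)):
    for col, (a, b) in zip(entry_a[0], zip(row_a, row_b)):
        if a != b:
            discrepancies.append((row_num, col, a, b))

print(discrepancies)  # [(2, 'age', '57', '75')]
```

Each discrepancy is then resolved by checking the original data-collection sheet, which is one more reason to keep the originals retrievable.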

Caution with spreadsheets

Spreadsheets are a convenient tool for initial data entry, but they must be used with extreme caution. It is possible to sort individual columns, which could destroy your entire dataset with one inappropriate “sort” command. Custom data-entry software provides a greater margin of safety.

Save and back up immediately

As soon as the data-entry process has been completed, save the original data files in a safe location. In large, expensive trials, keep a copy of all originals stored in another location. Convert your data to the format your statistical software uses as soon as possible.

Keeping Track of Files

It is important to have a system for keeping track of all your files. Key recommendations:

  • Assign a logical name with a 2-digit numerical suffix (e.g., brazil01). A 2-digit suffix allows you to have 99 versions that still sort correctly when listed alphabetically.
  • When data manipulations are carried out, save the file with a new name (the next available number). Do not change data and then overwrite the file.
  • Keep a simple log of files created with information about the contents (e.g., number of observations and variables).
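Choosing the next file name can even be scripted so the convention is never broken by hand. A sketch, assuming the brazil##-style naming described above:

```python
import re

def next_version(existing, stem="brazil", ext=".dta"):
    """Return the next 2-digit version name, e.g. brazil03.dta.

    `existing` is a list of file names already in the project folder.
    """
    pattern = re.compile(rf"{re.escape(stem)}(\d{{2}}){re.escape(ext)}$")
    numbers = [int(m.group(1)) for f in existing if (m := pattern.match(f))]
    return f"{stem}{max(numbers, default=0) + 1:02d}{ext}"

print(next_version(["brazil01.dta", "brazil02.dta"]))  # brazil03.dta
print(next_version([]))                                # brazil01.dta
```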
Example: File Log for a Blood Pressure Study

bp01.odc (27/09/07) — Original blood pressure study data; spreadsheet; 1 record per measurement. 1092 obs, 8 vars.

bp01.dta (28/09/07) — Original file; Stata format. 1092 obs, 8 vars.

bp02.dta (30/09/07) — 45 records with missing values dropped. 1047 obs, 8 vars.

Keeping Track of Variables

Even a relatively focused study can give rise to a large number of variables once transformed and recoded variables have been created. Recommendations include:

  • Use short but informative names, and have all related variables start with the same stem.
  • Shorten long names by removing vowels (e.g., wtr_cstrn for "water cistern").
  • If your statistics program is case sensitive, use only lower-case letters.
  • At some point, prepare a master list of all variables.

Example: Master List for an Age Variable

  • age: Original data (in years)
  • age_ct: Age after centring by subtraction of the mean
  • age_ctsq: Quadratic term (age_ct squared)
  • age_c2: Age categorised into 2 categories (young vs old)
  • age_c3: Age categorised into 3 categories

Section 2 Knowledge Check

1. Why should you never use compound codes?

Only code one piece of information in a single variable. Compound codes (e.g., 1=male Caucasian, 2=female Caucasian) make it extremely difficult to separate and analyse each characteristic independently.

2. What is the advantage of double-data entry?

Double-data entry, followed by comparison of the 2 files to detect any inconsistencies, is preferable to single-data entry because it dramatically reduces the entry error rate.

3. When data manipulations are carried out, what should you do with the file?

Save the file with a new name (the next available number in your naming convention) so you always have a record of all versions and can trace back to the original data if needed.

Reflection

Describe a file-naming and version-control system you would use for a dataset in your own research area. How would you organise the variable names for a study with demographic, clinical, and outcome variables?

Section 3

Program Files, Data Editing & Verification

⏱ Estimated time: 15 minutes

Program Mode vs. Interactive Processing

Statistical programs can be used in an interactive mode (selecting items from menus or typing in a command) or in program mode (compiling a series of commands into a program and then running it).

Interactive mode is very useful for exploring your data and trying out analyses. However, it should not be used for any of the “real” processing and/or analysis because it is very difficult to keep a clear record of steps taken. Consequently, it is difficult or impossible to reconstruct the analyses you have completed.

Program mode is the recommended approach. You compile the commands into a program and then run it. These program files can be saved and used to reconstruct any analyses you have carried out. Key tips: name files logically, structure the program to be easy to follow, use sequential indents, and document the file thoroughly with comments.

Critical Rule

Do all of the analyses in your statistical program. Don’t start doing basic statistics in a spreadsheet. You are going to need the statistical program eventually, and it will be much easier to keep track of all your analyses if they are all done there.

Data Editing

Before beginning any analyses, spend time editing your data. The most important components are:

  • Labelling variables: attach an informative label to each variable
  • Labelling categories: attach meaningful labels to the values of categorical variables (e.g., 0 = "male", 1 = "female")
  • Missing value codes: convert codes used for missing data (e.g., −999) into your software's missing-value designation
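In code, these edits amount to mapping value labels and converting the missing-value code before any analysis runs. A stdlib-Python sketch with hypothetical codes:

```python
# Sketch: attach category labels and convert a -999 missing code.
# The codes and labels here are hypothetical illustrations.
MISSING_CODE = -999
sex_labels = {0: "male", 1: "female"}

raw_sex = [0, 1, -999, 1, 0]
sex = [None if v == MISSING_CODE else sex_labels[v] for v in raw_sex]

print(sex)  # ['male', 'female', None, 'female', 'male']
```

Doing this conversion once, in a saved program file, guarantees the sentinel value can never leak into a mean or a regression later.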

Data Verification

Before you start any analyses, you must verify that your data are correct. This can be combined with data processing and involves going through all of your variables, one-by-one.

For continuous variables
  • Determine the number of valid observations and the number of missing values
  • Check the maximum and minimum values (or the 5 smallest and 5 largest) to make sure they are reasonable; if they are not, find the error, correct it, and repeat the process
  • Prepare a histogram of the data to get an idea of the distribution and see if it looks reasonable
For categorical variables
  • Determine the number of valid observations and the number of missing values
  • Obtain a frequency distribution to see if the counts in each category look reasonable (and to make sure there are no unexpected categories)
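These checks translate directly into a few lines of code. A sketch over made-up values, in which one continuous entry and one category are deliberately suspicious:

```python
from collections import Counter

# Hypothetical values for illustration only.
ages = [34, 57, 41, None, 29, 62, 410, 45]              # 410 is a likely entry error
region = ["north", "south", "north", "soutth", "east"]  # 'soutth' is an unexpected category

# Continuous variable: valid/missing counts, then the extremes.
valid = sorted(v for v in ages if v is not None)
n_missing = len(ages) - len(valid)
print(len(valid), n_missing)   # 7 valid, 1 missing
print(valid[:5], valid[-5:])   # 5 smallest / 5 largest; 410 stands out

# Categorical variable: the frequency distribution exposes the typo category.
print(Counter(region))
```

A histogram (e.g., from your statistical package) would complement the extremes check for the continuous variable.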

Section 3 Knowledge Check

1. Why should program mode be preferred over interactive mode for “real” data analysis?

Program mode compiles commands into a program file that can be saved and reused, making it possible to reconstruct and reproduce all analyses. Interactive mode makes it difficult to keep a clear record of steps taken.

2. When verifying continuous variables, what should you examine first?

For continuous variables, the first verification step is to determine the number of valid observations and missing values, then check the minimum and maximum values (or the 5 smallest and 5 largest) to make sure they are reasonable.

3. What is the purpose of attaching labels to categorical variable values?

Categorical variables should have meaningful labels attached to each category (e.g., sex coded as 0 or 1 should have labels “male” and “female” attached) so that output is immediately interpretable.

Reflection

Imagine you receive a dataset where a colleague entered data interactively in a spreadsheet with no documentation. What steps would you take to clean, verify, and prepare the data for analysis? What problems might you encounter?

Section 4

Data Processing & Unconditional Associations

⏱ Estimated time: 20 minutes

Processing the Outcome Variable(s)

While verifying data, you can also start processing your outcome variable(s). Review the stated goals of the study to determine the format(s) which best suits the goal(s). Consider the following based on outcome type:

Categorical outcomes

Is the distribution of outcomes across categories acceptable? For example, if you planned a multinomial regression with a 3-category outcome, but very few observations fall in one of the categories, you might want to recode it to a 2-category variable.

Continuous outcomes

Does the variable have the characteristics necessary for the planned analysis? If linear regression is planned, is the distribution approximately normal? If not, explore transformations. Note: It is the normality of the residuals which is ultimately important, but if the original variable is far from normal and there are no strong predictors, the residuals are unlikely to be normal.
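One quick way to see whether a transformation helps is to compare a skewness measure before and after. A sketch on hypothetical right-skewed values, using a log transformation:

```python
from math import log
from statistics import mean, pstdev

# Hypothetical right-skewed outcome values.
y = [1.2, 1.5, 1.8, 2.1, 2.4, 3.0, 4.5, 9.8, 15.2, 40.1]

def skewness(data):
    """Standardised third moment: positive values mean a long right tail."""
    m, s = mean(data), pstdev(data)
    return sum(((v - m) / s) ** 3 for v in data) / len(data)

y_log = [log(v) for v in y]
print(round(skewness(y), 2), round(skewness(y_log), 2))  # skew shrinks after log
```

As the text notes, the final judgement rests on the residuals of the fitted model, not on the raw outcome alone.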

Count / rate outcomes

If Poisson regression is planned, are the mean and variance of the distribution approximately equal? If not, consider negative binomial regression or alternative analytic approaches.
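The check itself is one line. A sketch with hypothetical counts whose variance sits well above the mean:

```python
from statistics import mean, pvariance

# Hypothetical count outcome (e.g., number of clinic visits).
counts = [0, 1, 2, 0, 3, 1, 0, 8, 12, 0, 1]
m, v = mean(counts), pvariance(counts)
print(round(m, 2), round(v, 2))  # variance far above the mean suggests overdispersion
```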

Time-to-event outcomes

What proportion of the observations are censored? You might also want to generate a simple graph of the empirical hazard function to get an idea what shape it has.

Processing Predictor Variables

It is important to go through all predictor variables to determine how they will be handled:

  • Missing values: Are there many? If so, you might need to abandon plans to use that predictor, or conduct 2 analyses (one on the subset where the predictor is present and one on the full dataset ignoring the predictor).
  • Distribution: For continuous variables, is there a reasonable representation over the whole range of values? If not, it might be necessary to categorise the variable.
  • Categorical variables: Are all categories reasonably well represented? If not, you might have to combine categories.

Multilevel Data

If your data are multilevel (e.g., blood pressure measurements within individuals within centres), evaluate the hierarchical structure:

Key Questions for Multilevel Data

What is the average (and range) number of observations at one level in each higher-level unit? Are individuals uniquely identified within a hierarchical level? It is often useful to create one unique identifier for each observation in the dataset.
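Both questions are easy to answer in code. A sketch over hypothetical two-level data (measurements within individuals):

```python
from collections import Counter

# Hypothetical two-level data: blood pressure measurements within individuals.
records = [("p1", 118), ("p1", 121), ("p2", 135),
           ("p2", 132), ("p2", 140), ("p3", 126)]

cluster_sizes = Counter(pid for pid, _ in records)
sizes = list(cluster_sizes.values())
print(min(sizes), max(sizes), sum(sizes) / len(sizes))  # range and mean cluster size

# One unique identifier per observation across the whole dataset.
uids = [f"{pid}_{i}" for i, (pid, _) in enumerate(records, start=1)]
print(uids[:3])
```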

Unconditional Associations

Before proceeding with any multivariable analyses, it is important to evaluate unconditional associations within the data. These serve as the foundation for building more complex models.

  • Two continuous variables: correlation coefficient, scatterplot, simple linear regression
  • One continuous + one categorical: one-way ANOVA, simple linear or logistic regression
  • Two categorical variables: cross-tabulation and χ² test

When evaluating unconditional associations, pay attention to:

  • Associations between predictors and outcome: Determine if there is any association at all; determine the functional form (is it linear?); get a simple picture of the strength and direction.
  • Associations between pairs of predictors: Look for potential collinearity problems (highly correlated predictors).
  • Confounding variables: Evaluate associations between the confounding variables and the key predictors of interest and the outcome.
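For two of the combinations above, the calculations can be written out from first principles. A sketch with hypothetical data: a Pearson correlation for two continuous variables, and a χ² statistic for a 2×2 cross-tabulation:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

r = pearson([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])

# Chi-squared statistic for a 2x2 table of hypothetical counts.
table = [[20, 10], [15, 25]]
row = [sum(row_) for row_ in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)
chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
print(round(r, 3), round(chi2, 3))  # strength/direction, and evidence of association
```

In practice your statistical package provides these; the point of hand-computing them once is to see exactly what the unconditional screen is measuring.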

Keeping Track of Your Analyses

Before starting the more substantial analysis, set up a system for keeping track of your results:

  • Analyse in blocks: divide the work into logical blocks of related analyses
  • Keep log files: save output in log files named to match the program files that generated them
  • Label and date: label and date all saved output so results can be traced later

Section 4 Knowledge Check

1. Why should you evaluate unconditional associations before multivariable analyses?

Evaluating unconditional associations before multivariable analyses helps you understand the basic relationships in your data, identify collinearity, detect confounding, and determine the functional form of relationships—all of which inform the complex models you will subsequently build.

2. If a continuous outcome variable is far from normally distributed, what should you do?

If the continuous outcome is not approximately normally distributed, you should explore transformations which might normalise the distribution. It is ultimately the normality of the residuals that is important, but a far-from-normal variable with no strong predictors will produce non-normal residuals.

3. What is the appropriate analytical approach for evaluating the association between two categorical variables?

For associations between two categorical variables, cross-tabulation and the chi-squared (χ²) test are the appropriate analytical approaches. These are particularly useful for identifying unexpected observations.

Reflection

Consider a dataset with 15 predictor variables and one continuous outcome. Describe the sequence of unconditional analyses you would carry out before fitting any multivariable models. How would you handle a predictor that has 30% missing values?

Final Assessment

A Structured Approach to Data Analysis — Final Assessment

15 questions • 100% required to pass

Final Reflection

Reflecting on the entire lesson, what do you consider the most important step in the structured approach to data analysis? How would you apply this structured approach to a dataset you are currently working with or plan to work with in the future?

Final Assessment

1. What is the first step recommended before beginning any data analysis?

The first step is to construct a plausible causal diagram of the problem, identifying outcomes, predictors, confounders, and intervening variables.

2. Why should you avoid starting analyses in a spreadsheet?

Doing all analyses in the statistical program makes it easier to keep track of all analyses and simplifies tracking modifications to the data.

3. What coding value should NOT be assigned to missing data?

The specific number assigned to missing values must not be a legitimate value for any of the responses. Common conventions include large negative numbers like −999.

4. Why is a 2-digit numerical suffix recommended for file names?

A 2-digit suffix allows you to have 99 versions of a file that will sort correctly when listed alphabetically (e.g., brazil01, brazil02, ... brazil99).

5. What is the danger of using the “sort” command in a spreadsheet for data entry?

In spreadsheets, it is possible to sort individual columns independently, which can destroy your entire dataset with one inappropriate “sort” command by misaligning records across columns.

6. What is the primary purpose of evaluating unconditional associations between pairs of predictors?

Associations between pairs of predictors are evaluated to detect potential collinearity problems, where highly correlated predictors can cause instability in multivariable models.

7. When processing a categorical outcome with 3 categories, when might you recode it to 2 categories?

If you planned a multinomial regression with a 3-category outcome, but there are very few observations in 1 of the 3 categories, you might want to recode it to a 2-category variable.

8. What approach should be used to document what a program file does?

All statistical programs allow you to add comments to the program files. These should document what the program does and, in some cases, record key results within the file itself.

9. For verifying a continuous variable, what visual tool is recommended?

For continuous variables, preparing a histogram gives you an idea of the distribution and allows you to see if it looks reasonable before proceeding with further analysis.

10. What is the appropriate analysis for the association between one continuous and one categorical variable?

For the association between one continuous and one categorical variable, one-way ANOVA, simple linear regression, or logistic regression are appropriate analytical approaches.

11. If a predictor variable has many missing values, what options are available?

If many values are missing, you might abandon plans to use that predictor, or conduct 2 analyses: one on the subset in which the predictor is present and one on the full dataset ignoring the predictor.

12. What should you do with log files from your analyses?

Give log files the same name as the program file (except with a different extension) so that it is easy to match the program that generated a particular set of results.

13. Why is interactive mode still useful despite its limitations?

Interactive mode is very useful for exploring your data and trying out analyses. However, the “real” processing and analysis should be done in program mode for reproducibility.

14. When evaluating confounding variables, what should you specifically look for?

Special attention needs to be paid to potential confounding variables by evaluating the associations between these variables and the key predictors of interest and the outcome, particularly if there is a strong association with both.

15. What does the chapter suggest you should do if a count/rate outcome’s mean and variance are not approximately equal?

If the mean and variance of a count/rate outcome are not approximately equal, Poisson regression assumptions may be violated. Consider negative binomial regression or alternative analytic approaches.

Lesson 1 Complete!

You have successfully completed A Structured Approach to Data Analysis.