Sampling
Fundamental Epidemiological Concepts and Approaches
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Distinguish between a census and a sample, and between descriptive and analytic studies
- Describe the hierarchy of populations and the concept of a sampling frame
- Explain types of error, including Type I and Type II errors, and the concept of statistical power
- Compare non-probability sampling methods (judgement, convenience, purposive)
- Describe probability sampling methods (simple random, systematic, stratified, cluster, multistage, targeted)
- Understand the implications of complex sampling designs on data analysis
- Compute required sample sizes for common analytic objectives
This course was developed by Kiffer G. Card, PhD, as a companion to Dohoo, I. R., Martin, S. W., & Stryhn, H. (2012). Methods in Epidemiologic Research. VER Inc.
Introduction to Sampling
⏱ Estimated reading time: 22 minutes
Introduction and Overview
Lesson 1 closed with the counterfactual framework: causal inference requires us to compare groups under different exposure conditions. This lesson asks the immediate next question: which people should be in those groups? Sampling sits at the foundation of every study you'll design in HSCI 341 because the choice of who ends up in the study determines almost everything else — the validity of the results, the generalisability of conclusions, and the bias inventory you spent HSCI 230 learning to recognise. The four content sections move from broad principles to concrete formulas: Section 1 sets up the population hierarchy, sampling frame, and the probability theory that makes inference from a sample possible; Section 2 covers types of error and non-probability sampling; Section 3 details the major probability sampling designs; Section 4 closes with how to analyse data from complex surveys and how to determine sample size in advance.
Learning Objectives
- Distinguish between a census and a sample.
- Contrast descriptive and analytic studies.
- Describe the hierarchy of populations (target, source, study sample).
- Explain the concepts of internal and external validity.
- Define a sampling frame and explain its importance.
- Describe the foundational ideas of probability theory — random variables, expected value, and variance — that underpin statistical inference from a sample.
- Recognise common probability distributions (Bernoulli, Binomial, Poisson, Normal, Exponential, Uniform) and identify the public-health phenomena they typically describe.
- Explain what a sampling distribution is and state the Central Limit Theorem in plain language.
Census vs. Sample
When we conduct research, we need data from either all individuals in a population or a subset of them. The process of obtaining this data is called measurement.
In a census, every individual in the population is evaluated. In a sample, data are collected from only a subset. Sampling is generally more convenient and less costly than conducting a full census. Interestingly, even a census can be viewed as a kind of sample — it captures the population at one point in time, making it a "sample" of the population over time.
Key Distinction
In a census, the only source of error is the measurement itself. With a sample, you contend with both measurement error and sampling error. However, a well-planned sample can provide virtually the same information as a census at a fraction of the cost.
Canadian Examples: Census vs. National Health Surveys
Canada runs both kinds of data collection at the population scale, and you will encounter all of them in public health practice:
- Census of Population (Statistics Canada, every 5 years; most recent 2021). A near-complete enumeration of every household in Canada. The short-form census goes to all households; the long-form census goes to a 25% mandatory sample. Provides the denominators behind almost every population health rate you will calculate.
- Canadian Community Health Survey (CCHS) — Statistics Canada / Health Canada / PHAC. A continuous cross-sectional sample survey (~65,000 respondents per cycle) covering self-reported health, behaviours, and health-care use. The flagship descriptive survey for population health surveillance.
- Canadian Health Measures Survey (CHMS) — Statistics Canada / Health Canada / PHAC. A multi-stage sample survey that adds direct physical measurements (blood pressure, biomarkers, fitness) and a biobank to self-report data. Smaller (~5,700 respondents per cycle) but anchors objective measurement of population health.
- National Population Health Survey (NPHS) — the longitudinal predecessor (1994–2011) to the CCHS, still used for life-course research.
The Census gives you a denominator and demographic context; CCHS and CHMS give you population estimates of health states with sampling error attached. Choosing among them is the first applied sampling decision a public-health analyst makes.
Descriptive vs. Analytic Studies
Samples support two fundamental types of studies:
Descriptive Studies (Surveys)
A descriptive study aims to describe population attributes such as the frequency of disease or the prevalence of an exposure. Surveys answer questions like: "What proportion of people had diarrhea over a 1-month period?" or "What is the average BMI of students in Grade 12?"
The focus is on characterizing the current state of a population rather than establishing cause-and-effect relationships.
Analytic Studies
An analytic study is designed to estimate the magnitude of an association between exposures and outcomes. These studies contrast groups and seek explanations for differences between them.
Examples: "Is water source associated with the incidence of diarrhea?" or "How does time spent playing video games affect the BMI of Grade 12 students?"
Establishing an association is the first step to inferring causation, as discussed in Lesson 1.
Choosing between a census and a sample is the first decision; choosing between a descriptive and an analytic purpose is the second. Both decisions sit on top of an even more fundamental concept — the relationship among the different populations a single study touches.
Hierarchy of Populations
Understanding the different populations involved in a study is essential for evaluating validity. There are three key populations to consider, each nested inside the next. The diagram and accordion below define them in turn; the labels that appear next to the arrows (external validity, sampling frame, internal validity) are the technical vocabulary you will use to talk about how a sample inherits or loses information from the broader population it was meant to represent.
Figure 3.1 — Hierarchy of populations in epidemiologic research. The target population is the broadest; the source population is the accessible subset; the study sample consists of those who actually participate.
The target population is the population to which you want to extrapolate your results. It is often not clearly defined and may vary depending on the perspective of the person interpreting the study. For example, researchers studying rainwater cisterns in Pernambuco State, Brazil might define the target as that state, while someone else may want to generalize the findings to all semi-arid regions of Brazil.
The source population is the population from which study subjects are actually drawn. All units in the source population should be "listable" and have a non-zero probability of being included in the study. For example, in a diarrhea study in Brazil, the source population included families from households participating in the One Million Cisterns Project (OMCP).
The study sample (or study group) consists of the individuals who actually end up in the study. It is typically a subset drawn from the source population. Researchers determine the necessary sample size, draw their sample, collect data from eligible subjects, and the final study sample consists of those who agreed to participate and whose data met quality requirements.
Validity: Internal and External
Internal validity refers to whether the study results are valid for members of the source population. It indicates whether the study obtained the "correct" answer for that population. Much of epidemiology is dedicated to methods that ensure internal validity.
External validity involves a subjective assessment of whether results can be generalized to the broader target population. It is generally easier to generalize results from analytic studies (which evaluate associations) than from descriptive studies (which estimate prevalence).
The Sampling Frame
The sampling frame is the list of all sampling units in the source population. Sampling units are the basic elements that will be sampled (e.g., households, individuals). A complete list of all sampling units is required for drawing a simple random sample, though some other methods do not require such a complete listing.
Example: Brazil Diarrhea Study
In a study of water cisterns and diarrhea in Brazil, a suitable sampling frame was the list of all households eligible for the One Million Cisterns Project. Once households were selected, a separate strategy was used for selecting individuals within each household.
Canadian Examples: Sampling Frames You Will Actually Use
National-scale Canadian surveys rarely have a single tidy list of every person. Instead, they assemble a frame from administrative listings:
- Statistics Canada Address Register (AR) — the dwelling-level frame used by the Census and many StatCan household surveys.
- Labour Force Survey (LFS) area frame — CCHS draws part of its sample from the LFS area frame (which is itself based on Census enumeration areas).
- Provincial health insurance registries (e.g., the BC Medical Services Plan client registry) — close to a population census of residents and the backbone of administrative-data research at Population Data BC (PopData BC).
- Disease registries such as the Canadian Cancer Registry or provincial reportable-disease lists serve as case frames for surveillance.
Notice how each frame has different coverage: a registry-based frame misses people without provincial coverage; an LFS area frame excludes residents on First Nations reserves and in institutions. The frame — not the questionnaire — is usually where exclusions and selection bias enter.
The vocabulary of populations and sampling frames is necessary but not sufficient. The deeper question is why we can learn about a whole population from a small sample at all. The answer comes from probability theory.
Probability Theory: Why Sampling Works
Every quantity we estimate from a sample — a prevalence, a mean, a risk ratio — is the value of a random variable. If we drew a different sample tomorrow, the value would be slightly different. Probability theory is the formal language we use to describe how those values vary, and it is what lets us turn a single sample into a defensible statement about a population.
Three concepts do most of the work in introductory biostatistics:
Three Foundational Ideas
1. A random variable is a numerical outcome of a random process — for example, the number of new TB cases reported in a health region next week, or the systolic blood pressure of the next adult who walks into a clinic.
2. The expected value (also called the mean, written μ or E[X]) is the long-run average of a random variable across many repetitions of the random process. It is the value we are usually trying to estimate.
3. The variance (σ²) and its square root, the standard deviation (σ), measure how spread out the random variable is around its mean. Spread, not the mean alone, is what controls how precisely we can estimate things from a sample.
A second idea worth naming is independence. Two observations are independent when knowing the value of one tells you nothing about the value of the other. Independence is the assumption that lets a small random sample stand in for a much larger population — and it is the assumption most often violated in real public-health data (clustered households, repeated measures on the same person, contagion in infectious disease).
Why You Should Care
Whenever you compute a confidence interval, run a hypothesis test, or quote a margin of error, you are doing arithmetic on a probability distribution — usually a Normal distribution — that describes what would happen if you repeated the study many times. If you don’t know what distribution your statistic comes from, you can’t honestly attach uncertainty to it.
Types of Probability Distributions
A probability distribution describes the values a random variable can take and how likely each value is. Distributions split first into discrete (countable outcomes — 0, 1, 2 cases) and continuous (any value on a range — height, blood pressure, time). The handful below are the ones you will encounter again and again in public health.
Discrete distributions
| Distribution | What it models | Public-health example |
|---|---|---|
| Bernoulli(p) | A single yes/no trial with success probability p. Mean = p, variance = p(1−p). | Whether one randomly chosen adult currently smokes. |
| Binomial(n, p) | Number of "successes" in n independent Bernoulli trials. Mean = np, variance = np(1−p). | Number of smokers in a CCHS sample of 1,000 adults. |
| Poisson(λ) | Number of rare events in a fixed interval of time, area, or person-time. Mean = variance = λ. | New cases of measles per week in a public-health unit; ER visits per hour. |
Continuous distributions
| Distribution | What it models | Public-health example |
|---|---|---|
| Uniform(a, b) | Every value between a and b is equally likely. The "default ignorance" distribution. | Random-digit dialling within an area code; random selection from a list. |
| Normal(μ, σ²) | The classic bell curve: symmetric, with most mass within 2σ of the mean. The default for many continuous biological measurements and, crucially, for sample means (see CLT below). | Adult height; systolic blood pressure; standardized test scores. |
| Exponential(λ) | Time between independent events occurring at a constant rate λ. Right-skewed, memoryless. Mean = 1/λ. | Time between successive ED arrivals; survival time under a constant hazard. |
| Log-normal / Right-skewed | Variables that are positive and span several orders of magnitude. The log of the variable is approximately Normal. | Household income; hospital length of stay; viral loads. |
How to Read a Distribution
Every distribution is summarized by two things: a shape (symmetric? right-skewed? bimodal?) and a small set of parameters that control its location and spread. When you see "BMI ~ Normal(27, 4²)", read it as: BMI is approximately Normal, centred at 27 kg/m², with a standard deviation of 4 (the second parameter is the variance, 4² = 16). About 95% of the population falls within 2 SDs of the mean, roughly 19 to 35 kg/m².
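As a quick check of that reading, the sketch below uses Python's standard-library `statistics.NormalDist` to recompute the 2-SD range and the share of the population inside it. The mean of 27 and SD of 4 come from the BMI example; the variable names are illustrative only.

```python
from statistics import NormalDist

# BMI modelled as Normal with mean 27 and standard deviation 4
bmi = NormalDist(mu=27, sigma=4)

# Range covering the mean plus or minus 2 standard deviations
lower = bmi.mean - 2 * bmi.stdev
upper = bmi.mean + 2 * bmi.stdev

# Probability that a randomly chosen person falls inside that range
share_within_2sd = bmi.cdf(upper) - bmi.cdf(lower)

print(f"2-SD range: {lower:.0f} to {upper:.0f} kg/m^2")
print(f"Share within 2 SDs: {share_within_2sd:.3f}")
```

The computed share is a little above 95%, which is why "about 95% within 2 SDs" is a rule of thumb and the exact 95% multiplier is 1.96 rather than 2.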
Figure 3.2 — Four common distributions you will meet in public-health data. Discrete distributions assign probability to whole-number outcomes (left two); continuous distributions describe a smooth curve over a real-valued measurement (right two).
🔥 Try it Yourself: Distribution Simulator
What you'll do: use the simulator below to play with each distribution's parameters and watch the shape change in real time, then run the Central Limit Theorem demo to see why sample means from any population eventually look Normal. What to take away: distributions are not just textbook abstractions — they describe specific public-health phenomena, and the CLT is what lets us trust confidence intervals and power calculations even when the underlying data are skewed.
Aim for at least 10–15 minutes of play; this is the kind of intuition you will draw on for every confidence interval and power calculation in the rest of the course. Use the tabs to switch between exploring single distributions, the CLT demonstration, and a side-by-side comparison.
How to use this
Choose any of the six common distributions on the left. Adjust its parameters to see how the shape, mean, and spread respond. Then click "Draw a sample" to take a random draw and watch the empirical histogram converge on the theoretical curve as your sample grows.
About Binomial — n = 20, p = 0.30
Binomial(n, p) counts the number of successes in n independent Bernoulli trials with the same success probability p.
Mean = np = 6.00, SD = √(np(1−p)) ≈ 2.05.
- For large n and moderate p, the binomial looks Normal — this is one of the oldest examples of the CLT.
- Used for sample-size formulas around proportions and prevalence estimates.
- Assumes independence and constant p — both can fail in clustered data (households, schools).
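If you want to verify the simulator's numbers by hand, the following minimal sketch recomputes the mean and SD for n = 20, p = 0.30 and builds the full probability mass function from the binomial formula. The helper name `binomial_pmf` is an illustrative choice, not part of any course software.

```python
from math import comb, sqrt

n, p = 20, 0.30  # parameters from the simulator example

mean = n * p
sd = sqrt(n * p * (1 - p))

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of each possible count, 0 through n
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
most_likely = max(range(n + 1), key=lambda k: pmf[k])

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
print(f"most likely count = {most_likely}")
print(f"probabilities sum to {sum(pmf):.6f}")
```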
Central Limit Theorem demonstration
Pick a population shape (try the heavily skewed ones — that's where the CLT feels most magical). Set a sample size n, then click Draw 1,000 sample means. Each sample of size n is summarised by its mean, and we plot those means. Watch the histogram of means converge on a Normal curve as n grows — even when the underlying population is wildly non-Normal.
Side-by-side: shapes at a glance
Six distributions, plotted on the same axes. Use this view to remember which shape goes with which name, and to compare how parameters change appearance. Hover for tooltips.
How to choose a distribution in practice
- Binary outcome (yes/no for one person)? Bernoulli. Count of "yes"s in a fixed sample? Binomial.
- Counting rare events in time/space (cases per week, ER visits per hour)? Poisson.
- Continuous, symmetric biological measurement (BP, height, lab values)? Normal — or check first whether the variable is approximately Normal.
- Time until next event under a constant hazard? Exponential. Survival under varying hazard? Weibull or Gamma (advanced).
- Strictly positive, right-skewed, multiplicative process (income, length of stay, viral load)? Log-normal — analyze on the log scale.
- No prior information about likely values? Uniform on a sensible range.
Each distribution above describes the underlying behaviour of a single random variable. But when we draw a sample from a population, the quantity we typically care about is a summary of that sample — a mean, proportion, or rate. To make inferences from such summaries, we need one more layer of theory.
Sampling Distributions and the Central Limit Theorem
So far we have talked about distributions of individuals in a population. Now consider the distribution of a statistic — for example, the mean of a random sample of n people. Because each sample of size n would give a slightly different mean, the mean itself has a distribution. We call it the sampling distribution of the mean.
Two facts about that sampling distribution drive almost all of frequentist inference:
The Two Pillars
1. Standard error. If individual observations have standard deviation σ, then the means of samples of size n have standard deviation σ/√n. This quantity, the SD of a statistic, is called the standard error (SE). Because of the square root, quadrupling your sample size halves the standard error.
2. The Central Limit Theorem (CLT). For sufficiently large n, the sampling distribution of the mean is approximately Normal, no matter what the underlying population looks like — even if the population is heavily skewed, bimodal, or discrete. "Sufficiently large" is often around n = 30 for moderately skewed distributions, and much smaller for symmetric ones.
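The square-root relationship in the first pillar is easy to see numerically. The sketch below assumes a hypothetical population SD of 10 and evaluates σ/√n at three sample sizes, each four times the last; the function name is illustrative.

```python
from math import sqrt

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sigma / sqrt(n)

sigma = 10.0  # hypothetical population SD, chosen for round numbers
for n in (25, 100, 400):
    print(f"n = {n:>3}: SE = {standard_error(sigma, n):.2f}")
```

Each fourfold increase in n cuts the standard error in half, so precision gets progressively more expensive to buy with sample size alone.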
The CLT is the hidden engine behind the bell curve that shows up everywhere in statistics. It is the reason a 95% confidence interval can be written as estimate ± 1.96 · SE: the 1.96 comes from the Normal distribution that the CLT promises us applies to the estimate, even when we have no idea what shape the underlying population has. The theorem traces from de Moivre (1733) through Laplace's 1812 binomial approximation to Lyapunov's 1901 general proof (Central limit theorem, Wikipedia).
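To make the estimate ± 1.96 · SE recipe concrete, here is a minimal sketch using Python's standard library. The blood-pressure readings are made up for illustration, and the sample SD is used in place of the unknown σ. With only 20 observations a t-multiplier would give a slightly wider interval; the Normal multiplier is used here only to mirror the formula in the text.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical systolic blood pressure readings (mmHg), illustrative only
readings = [118, 126, 131, 122, 140, 135, 128, 119, 133, 127,
            124, 138, 121, 130, 129, 125, 136, 123, 132, 127]

n = len(readings)
estimate = mean(readings)
se = stdev(readings) / sqrt(n)     # sample SD stands in for sigma

z = NormalDist().inv_cdf(0.975)    # Normal multiplier for a 95% interval
ci_low = estimate - z * se
ci_high = estimate + z * se

print(f"multiplier z = {z:.2f}")
print(f"mean = {estimate:.1f} mmHg, SE = {se:.2f}")
print(f"95% CI: {ci_low:.1f} to {ci_high:.1f} mmHg")
```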
Figure 3.3 — The Central Limit Theorem in action. The population can be wildly skewed, but as n grows the sampling distribution of the mean becomes increasingly Normal and increasingly narrow (its SD = σ/√n).
Caveat: The CLT Is About Means, Not Individuals
A common student misconception is that “a large enough sample makes the data Normal.” It does not. Income data with 1,000 observations is still right-skewed. What becomes Normal is the distribution of the sample mean across hypothetical repeated samples — that is what we use to construct confidence intervals around the mean. Inference for medians, proportions, ratios, or extreme values relies on the CLT or its analogues in different ways and may need larger n or different methods (bootstrapping, exact methods).
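A small simulation can make this distinction tangible. The sketch below assumes a right-skewed Exponential population with mean 1 (so σ = 1), then compares the skewness of the raw observations with the skewness of means from repeated samples of size 50. The seed, sample sizes, and the simple moment-based `skewness` helper are all illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: near 0 for symmetric data, > 0 for right skew."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(2024)  # fixed seed so the run is repeatable
n = 50                     # size of each sample
n_samples = 2000           # number of repeated samples

# Raw observations from the skewed population: more data does not fix the skew
raw = [rng.expovariate(1.0) for _ in range(5000)]

# Means of repeated samples of size n from the same population
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(n_samples)
]

skew_raw = skewness(raw)
skew_means = skewness(sample_means)
sd_means = pstdev(sample_means)

print(f"skewness of raw data:     {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
print(f"SD of sample means: {sd_means:.3f} (theory: {1 / sqrt(n):.3f})")
```

The raw data stay strongly right-skewed no matter how many observations are collected, while the sample means are close to symmetric and their spread matches σ/√n.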
Key Takeaways
- A census measures everyone; a sample measures a subset — both involve measurement, but samples also introduce sampling error.
- Descriptive studies characterize populations; analytic studies evaluate associations between exposures and outcomes.
- The three populations (target, source, study sample) form a hierarchy, each linked to different aspects of study validity.
- The sampling frame is the list of all units from which the sample is drawn — in Canadian practice this often means a StatCan address register, an LFS area frame, or a provincial health insurance registry.
- Random variables, expected values, and variances are the language we use to describe how sample-based estimates vary — they are the foundation of every confidence interval and hypothesis test that follows.
- A small set of probability distributions (Bernoulli, Binomial, Poisson, Normal, Exponential, log-normal) describes the bulk of public-health phenomena you will encounter.
- The standard error of the mean is σ/√n, and by the Central Limit Theorem the sampling distribution of the mean is approximately Normal for sufficiently large n — this is the engine behind frequentist inference.
1. What is the key difference between a census and a sample?
2. An analytic study differs from a descriptive study in that it:
3. Internal validity refers to whether:
4. Which distribution would best describe the number of new measles cases reported per week in a public-health unit, where cases are rare and arrive roughly independently?
5. The Central Limit Theorem says that, for sufficiently large n:
6. If a measurement has population standard deviation σ = 12, and you draw a random sample of n = 144, what is the standard error of the sample mean?
Types of Error & Non-Probability Sampling
⏱ Estimated reading time: 12 minutes
Introduction and Overview
Section 1 set up the population hierarchy and the probability theory behind sampling. Section 2 turns to two practical consequences. First: when we draw conclusions from a sample, we can make two well-defined kinds of mistake (Type I and Type II errors), and the probability of each is quantifiable. Second: there are sampling strategies that abandon probability theory altogether (convenience samples, judgement samples, snowball samples), with predictable consequences for inference. Together, the two halves of this section give you the vocabulary to evaluate when those approaches are acceptable and when they are not.
Learning Objectives
- Explain the two types of statistical error (Type I and Type II).
- Define the null hypothesis, P-values, and statistical power.
- Describe three non-probability sampling methods and their limitations.
Types of Error
In any study based on a sample, the variability of the outcome, measurement error, and sample-to-sample variability all affect results, so any inference drawn from sample data is subject to error. Within hypothesis testing in analytic studies, there are two key types of error:
Table 3.1 — Types of Error
| Conclusion of Analysis | Effect Truly Present | Effect Truly Absent |
|---|---|---|
| Effect present (reject null) | Correct | Type I (α) error |
| No effect (fail to reject null) | Type II (β) error | Correct |
A Type I error occurs when you conclude that the outcomes in the groups are different (i.e., that an association exists), when in fact they are not. In other words, you falsely reject the null hypothesis. The probability of a Type I error is denoted α.
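Statistical tests assess the evidence against the null hypothesis (that there is no difference between groups). By convention, we reject the null when P ≤ 0.05, meaning that a result at least as extreme as the one observed would arise no more than 5% of the time if the null were true. Setting α = 0.05 therefore means accepting a 5% chance of a Type I error whenever the null hypothesis is in fact true.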
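One way to see what α means in practice is to simulate many studies in which the null hypothesis is true and count how often a test at the 0.05 level rejects it anyway. The sketch below is a simplified illustration: it assumes Normal data with a known SD of 1 and uses a two-sided z-test, and the seed, sample size, and number of simulated studies are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(341)  # fixed seed so the run is repeatable
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

n = 30            # observations per simulated study
n_studies = 2000  # number of simulated studies
false_positives = 0

for _ in range(n_studies):
    # The null is true by construction: population mean 0, SD 1
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(sample) / n) * sqrt(n)  # z statistic with known sigma = 1
    if abs(z) > z_crit:
        false_positives += 1         # rejected a true null: Type I error

type_i_rate = false_positives / n_studies
print(f"Share of true-null studies rejected: {type_i_rate:.3f}")
```

The rejection rate should land near 0.05, which is exactly what choosing α = 0.05 commits you to.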
A Type II error occurs when you conclude that there is no association between the exposure and outcome, when in fact there is. You fail to reject the null hypothesis when you should have. The probability of a Type II error is denoted β.
Reasons a study might find no effect include: the exposure truly had no effect (in which case no error has been made), the study design was inappropriate, the sample size was too small (low power), or simply bad luck.
Power is the probability that you will find a statistically significant difference when a real difference of a defined magnitude exists. Mathematically, power = 1 − β.
For example, if a study has 80% power, it has an 80% chance of detecting a true effect of the specified size. For a given effect size, outcome variability, and α, the most direct way to increase power is to increase the sample size. So-called negative findings (failure to find a difference) are less commonly reported in the literature, partly because many studies lack adequate power.
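The link between sample size and power can be sketched with a Normal approximation. The function below is an illustrative simplification, not a formula taken from the course text: it assumes a two-sided comparison of two group means with equal group sizes and a common known SD, and it ignores the negligible chance of rejecting in the wrong direction. The function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power to detect a true mean difference `delta`
    between two equal-sized groups with common SD `sigma`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    se_diff = sigma * sqrt(2 / n_per_group)     # SE of the difference in means
    return std_normal.cdf(abs(delta) / se_diff - z_crit)

# Hypothetical scenario: a true difference of half a standard deviation
for n in (16, 32, 64, 128):
    print(f"n per group = {n:>3}: power = {power_two_means(0.5, 1.0, n):.2f}")
```

Under these assumptions, roughly 64 participants per group gives about 80% power for a half-SD difference, and power climbs steadily as n grows.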
🎲 Interactive: Sample Size & the Law of Large Numbers
What you'll do: pick a population, set a sample size n, draw repeated samples, and watch the distribution of sample means concentrate around the true population mean as n grows. What to take away: the standard error shrinks as 1/√n — this is the formal reason why “more data” means better precision and higher statistical power. The intuition you build here drives every confidence interval and sample-size calculation in the rest of the course.
Population Distribution
The "true" distribution we're sampling from. Red line = true mean (μ).
Sampling Distribution of the Mean
Each bar is the count of sample means falling in that range. Yellow = most recent sample's mean.
The error framework above assumes a probability sample. The next subsection covers what happens when investigators forgo formal random selection altogether — and what that costs them.
Non-Probability Sampling
Samples drawn without an explicit method for determining each individual's probability of selection are known as non-probability samples. Whenever there is no formal process for random selection, the sample should be considered non-probability. Sample selection that is unrelated to the outcome of interest leaves inference intact, but selection that depends on unmeasured determinants of the outcome produces selection bias — a form of specification error formalised by Heckman (1979) and reviewed for hidden populations by Sudman & Kalton (1986). There are three main types:
- Judgement sample
- Convenience sample
- Purposive sample
Important Limitation
Non-probability samples are generally inappropriate for descriptive studies because you cannot generalize prevalence estimates to the source population without knowing each individual's probability of being included. However, non-probability methods are commonly used in analytical studies where comparing exposure groups is the priority.
Chain-Referral and Hybrid Designs
Snowball sampling — first formalised by Goodman (1961) — recruits hidden-population members through peer referrals and is widely used when no sampling frame exists. Two newer hybrids partially recover probability-style inference: respondent-driven sampling (RDS), introduced by Heckathorn (1997) and extended with unbiased estimators by Salganik & Heckathorn (2004); and time-location (venue-based) sampling, applied at national scale for HIV behavioural surveillance by MacKellar and colleagues (2007). Magnani, Sabin, Saidel, & Heckathorn (2005) review when each design is appropriate for hard-to-reach populations.
Key Takeaways
- Type I (α) error means falsely concluding there is an effect; Type II (β) error means missing a real effect.
- Power (1 − β) is the probability of detecting a true effect; increasing sample size increases power.
- Non-probability samples (judgement, convenience, purposive) lack a formal random selection process and are primarily used in analytic studies.
1. A Type I (α) error occurs when you:
2. Statistical power is defined as:
3. Why are non-probability samples generally inappropriate for descriptive studies?
Probability Sampling Methods
⏱ Estimated reading time: 15 minutes
Introduction and Overview
Section 2 closed by walking through what non-probability sampling looks like and why it forfeits inferential validity. Section 3 returns to probability sampling and details the major variants you'll encounter in real surveys. The six tabs below are not interchangeable — each design buys a different combination of cost, complexity, and statistical efficiency. Read the comparison table at the end of the section as a decision aid you can return to whenever you're choosing a design.
Learning Objectives
- Define a probability sample and explain why random selection is essential.
- Describe simple random, systematic random, stratified random, cluster, multistage, and targeted (risk-based) sampling methods.
- Identify the advantages and disadvantages of each method.
What Is a Probability Sample?
A probability sample is one in which every element in the population has a known, non-zero probability of being included. This implies that a formal process of random selection has been applied to the sampling frame. The key advantage is that probability samples allow for valid statistical inferences about the source population.
Random ≠ Haphazard
Random selection uses a formal, reproducible process (e.g., computer-generated random numbers, random number tables) — it is not the same as selecting participants haphazardly or arbitrarily.
Types of Probability Sampling
Simple Random Sample
In a simple random sample, every study subject in the source population has an equal probability of being included. A complete list of the source population is required, and a formal random process is used to select individuals.
Example: To study wait times in a hospital emergency room, you need 1,000 records from 13,000 admissions over the past year. You randomly generate 1,000 distinct numbers between 1 and 13,000 (sampling without replacement) and pull those records.
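The emergency-room example can be sketched in a few lines of Python. This assumes the admission records are numbered 1 to 13,000; the seed is fixed only so the draw can be reproduced and audited.

```python
import random

rng = random.Random(13)  # fixed seed so the draw is reproducible

n_admissions = 13_000    # size of the sampling frame
sample_size = 1_000      # records needed

# random.sample draws without replacement, so no record is picked twice
record_numbers = sorted(rng.sample(range(1, n_admissions + 1), sample_size))

print(f"records selected: {len(record_numbers)}")
print(f"first five record numbers: {record_numbers[:5]}")
```

Every record has the same selection probability, 1,000/13,000, which is what makes the result a simple random sample.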
Advantage: Conceptually simple; all standard statistical analyses apply directly.
Limitation: Requires a complete list of the entire source population.
Systematic Random Sample
In a systematic random sample, a complete list is not required — you only need an estimate of the total population and sequential access to individuals. The sampling interval (j) is computed as the population size divided by the desired sample size.
How it works: Randomly pick a starting point between 1 and j, then select every jth subject after that.
Example: To sample 1,000 from 13,000 emergency patients, the sampling interval is 13. Randomly pick a number between 1 and 13 for your starting patient, then select every 13th patient thereafter.
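The same example as a minimal sketch, assuming patients can be numbered 1 to 13,000 in arrival order. Only the starting point is random; everything after it is determined by the interval.

```python
import random

rng = random.Random(7)  # fixed seed so the draw is reproducible

population_size = 13_000
sample_size = 1_000

interval = population_size // sample_size  # sampling interval j = 13
start = rng.randint(1, interval)           # random start between 1 and j

# Select the starting patient and every j-th patient thereafter
selected = list(range(start, population_size + 1, interval))

print(f"interval j = {interval}, random start = {start}")
print(f"patients selected: {len(selected)}")
```

Because the selections are evenly spaced, any pattern in the list that repeats every 13 patients (or a multiple of 13) would be either always hit or always missed.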
Caution: Bias may occur if the factor you are studying is related to the sampling interval (e.g., periodic patterns in admissions).
Stratified Random Sample
The population is divided into mutually exclusive strata based on factors likely to affect the outcome. Then, within each stratum, a simple or systematic random sample is chosen. The mathematical foundations of stratification — including the now-standard optimum (Neyman) allocation rule for assigning sample size across strata — were laid out by Neyman (1934) in his landmark Royal Statistical Society paper that effectively founded probability sampling theory.
In proportional stratified sampling, the number sampled from each stratum is proportional to that stratum's share of the total population.
Three key advantages:
- Ensures all strata are represented in the sample.
- Can produce more precise overall estimates than a simple random sample because between-strata variation is removed.
- Allows estimation of stratum-specific outcomes.
Example: If hospital wait times differ between males and females, stratify records by sex and randomly sample within each group.
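Proportional allocation can be sketched as follows. The stratum sizes (6,800 female and 6,200 male records) are hypothetical numbers chosen only so the arithmetic is easy to follow, and the object names are illustrative.

```r
# Proportional stratified sample of 1,000 records, stratified by sex.
set.seed(341)                                    # arbitrary seed
records <- data.frame(id  = 1:13000,
                      sex = rep(c("F", "M"), times = c(6800, 6200)))
n <- 1000

# Stratum sizes and proportional allocation n_h = n * N_h / N
N_h <- sapply(split(records$id, records$sex), length)
n_h <- round(n * N_h / sum(N_h))

# Simple random sample within each stratum
strat_sample <- do.call(rbind, lapply(names(n_h), function(s) {
  stratum <- records[records$sex == s, ]
  stratum[sample.int(nrow(stratum), n_h[[s]]), ]
}))

n_h                                              # allocation per stratum
table(strat_sample$sex)                          # matches the allocation
```

Rounding the allocation can leave the total one or two records away from n in other settings; here the hypothetical sizes happen to sum to exactly 1,000.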
Cluster Sampling
A cluster is a natural grouping of study subjects with one or more common characteristics (e.g., a household is a cluster of people; a classroom is a cluster of students; a clinic is a cluster of patients).
In cluster sampling, the primary sampling unit (PSU) is the cluster itself, and it is often larger than the unit of concern. Every individual within a selected cluster is included in the sample.
Example: To estimate smoking prevalence among Grade 12 students, randomly select 10 of 47 Grade 12 classes and survey all students in those 10 classes.
Advantage: Easier when getting a list of clusters is simpler than listing all individuals. Often cheaper to visit fewer locations.
Limitation: Individuals within a cluster tend to be more alike, increasing sampling variation for a given sample size compared to SRS.
Important: A sample is only a "cluster sample" if the group is the sampling unit and the individuals within it are the unit of concern. If the group itself is the unit of concern (e.g., "does anyone in the household smoke indoors?"), it is not a cluster sample.
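The Grade 12 example can be sketched in base R as below. Class enrolments are simulated (hypothetical values between 18 and 32), so the resulting sample size is not fixed in advance, which is typical of one-stage cluster sampling.

```r
# One-stage cluster sample: 10 of 47 classes, every student in a chosen class.
set.seed(341)                                         # arbitrary seed
class_size <- sample(18:32, 47, replace = TRUE)       # hypothetical enrolments
students   <- data.frame(class = rep(1:47, times = class_size))
students$student_id <- seq_len(nrow(students))

chosen         <- sample(1:47, 10)                    # classes are the PSUs
cluster_sample <- students[students$class %in% chosen, ]

length(unique(cluster_sample$class))                  # number of classes sampled
nrow(cluster_sample)                                  # depends on which classes were drawn
nrow(cluster_sample) == sum(class_size[chosen])       # TRUE: no within-class subsampling
```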
Multistage Sampling
Multistage sampling is similar to cluster sampling, except that after selecting primary sampling units (PSUs), a sample of secondary sampling units (individuals) is drawn within each PSU rather than surveying everyone.
Example: To study smoking among students, first randomly select 10 classes (PSUs), then randomly select 5 students from each class rather than surveying all students in every class. Within-household selection in face-to-face surveys is most often done using a Kish grid — the objective respondent-selection procedure introduced by Kish (1949).
To ensure all individuals have the same probability of being selected, either select PSUs with probability proportional to their size (PPS) and then sample a fixed number of individuals within each selected PSU, or select PSUs by simple random sampling and then apply a constant sampling proportion within each PSU.
The number of individuals per cluster (ni) can be optimized by balancing within-cluster and between-cluster variance against the costs of sampling groups versus individuals.
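The arithmetic behind equal-probability two-stage designs can be checked directly. The sketch below uses five hypothetical classes and treats the PPS inclusion probability of a class as k × size / total, which is an idealised assumption; all names and numbers are illustrative.

```r
# Overall selection probability = P(PSU selected) * P(individual selected | PSU).
class_size <- c(20, 25, 30, 35, 40)      # hypothetical PSU sizes (total 150)
K <- length(class_size)                  # PSUs in the source population
k <- 2                                   # PSUs to select

# Route A: simple random sample of PSUs, constant sampling fraction within each
f   <- 0.2
p_A <- rep((k / K) * f, K)

# Route B: PPS selection of PSUs, fixed number of individuals within each
m        <- 6
p_psu    <- k * class_size / sum(class_size)   # larger classes more likely chosen
p_within <- m / class_size                     # but each student less likely within
p_B      <- p_psu * p_within                   # size cancels: k * m / sum(class_size)

# Route C (for contrast): simple random sample of PSUs, fixed number within each
p_C <- (k / K) * (m / class_size)              # unequal: students in small classes favoured

data.frame(class_size, p_A, p_B, p_C)
```

Routes A and B give every student the same probability, whereas Route C does not and would need sampling weights in the analysis.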
Targeted (Risk-Based) Sampling
Targeted sampling stratifies the source population based on characteristics associated with the probability of disease occurrence, then focuses sampling on strata where disease is most likely to be found.
Individuals are assigned point values based on their probability of having the disease of interest, and sampling proceeds until a predetermined number of points have been sampled. This is an unequal probability sampling strategy — some individuals may even have a zero probability of inclusion.
Advantage: Requires a much smaller sample to detect rare diseases when key risk characteristics can be identified.
Limitation: Key epidemiological parameters (e.g., risk ratios) may not be known for the study population and must be estimated from other evidence.
Comparison of Sampling Methods
| Method | Requires Complete List? | Key Advantage | Key Limitation |
|---|---|---|---|
| Simple Random | Yes | Simple; all standard analyses apply | Needs complete population list |
| Systematic | No (needs estimate) | Practical; easy to implement | Periodic bias if factor linked to interval |
| Stratified | Yes (within strata) | More precise; ensures representation | Needs to know stratum membership |
| Cluster | List of clusters only | Cheaper; no need to list individuals | Higher variance than SRS for same n |
| Multistage | List of PSUs only | Flexible; cost-effective | Complex design; needs more subjects |
| Targeted | No (risk-based) | Efficient for rare diseases | Needs prior knowledge of risk factors |
Worked Example: How the CCHS Combines These Methods
The Canadian Community Health Survey illustrates a real multistage probability design in action:
- Stratification: the population is first stratified by health region (about 110 health regions across Canada), and a target sample size is allocated to each so that every region produces stable estimates.
- Clustering: within each health region, clusters of dwellings (groups of dwellings that share a geographic boundary) are sampled from the Labour Force Survey (LFS) area frame. This is the cluster stage.
- Selection within cluster: one person is randomly selected from each chosen household to complete the interview.
- Top-up samples: a random digit dialling (RDD) telephone frame fills in coverage for areas where the area frame is sparse.
The result is a probability sample where every Canadian resident has a known, non-zero chance of selection — but where the selection probability differs by region, household size, and frame. That is why CCHS data must be analysed with survey weights and bootstrap replicate weights (covered on the next page).
Reflection
Think of a health research question you are interested in. Which sampling method would be most appropriate, and why? What practical constraints (cost, time, available lists) would influence your choice?
Key Takeaways
- Probability samples give every element a known, non-zero chance of selection, enabling valid statistical inference.
- Simple random sampling requires a complete list; systematic sampling needs only sequential access.
- Stratified sampling improves precision by removing between-strata variation.
- Cluster and multistage sampling are practical when listing all individuals is impractical, but they require more subjects for the same precision.
- Targeted sampling is efficient for rare outcomes but requires prior knowledge of risk characteristics.
1. What defines a probability sample?
2. A key advantage of stratified random sampling over simple random sampling is that it:
3. In cluster sampling, why is sampling variation typically greater than in simple random sampling for the same sample size?
4. Targeted (risk-based) sampling is most useful when:
✦ Complete the reflection and pass the knowledge check with 100% to continue
Analysing Survey Data & Sample Size
⏱ Estimated reading time: 15 minutes
Introduction and Overview
Section 3 walked through the major probability designs. Real surveys (the CCHS being the canonical Canadian example) typically combine several of these — stratification at the top level, clustering within strata, multistage selection within clusters. That combination is precisely what makes the analysis non-trivial: a complex sample design demands a complex analysis. The first half of this section covers how to do that analysis correctly. The second half closes the loop by walking through how to determine sample size before the data are collected.
Learning Objectives
- Explain how stratification, sampling weights, and clustering affect the analysis of survey data.
- Define the design effect and the finite population correction.
- Describe the key factors that determine sample size.
- Apply basic sample-size formulae for estimating proportions and means.
Analysing Complex Survey Data
When data come from a complex sampling design (involving stratification, weighting, or clustering), the analysis must account for these features. Ignoring them can lead to incorrect point estimates and underestimated standard errors.
Accounting for Stratification
If the population was divided into strata before sampling, this must be reflected in the analysis. Stratification provides stratum-specific estimates and can reduce the standard error of the overall estimate if the stratifying variable is related to the outcome.
However, stratification alone does not change the overall point estimate — it primarily affects precision. The total population size in each stratum must be known to compute appropriate sampling weights.
Sampling Weights
Not all individuals in a probability sample necessarily have the same probability of selection. The sampling weight for each individual is the inverse of their overall selection probability — this inverse-probability weighting underlies the Horvitz-Thompson estimator introduced by Horvitz & Thompson (1952), which produces unbiased totals and means from any probability sample with known inclusion probabilities.
The probability of selection depends on multiple stages. For example, in a household survey:
p(selection) = (n/N) × (m/M)
where n = households in sample, N = households in source population, m = individuals selected per household, and M = total people in that household.
The sampling weight = 1/p(selection). This weight reflects how many people in the source population each sampled individual "represents." Incorporating weights may change both the point estimate and the standard error.
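A small numeric sketch of the household formula follows. The population and sample counts (5,000 and 250 households), the household sizes, and the outcome vector y are hypothetical values chosen for illustration.

```r
# Two-stage selection probability and sampling weight for a household survey.
N <- 5000                          # households in the source population (hypothetical)
n <- 250                           # households in the sample (hypothetical)
m <- 1                             # individuals selected per household
M <- c(1, 2, 4, 6)                 # people living in four example households

p_selection <- (n / N) * (m / M)   # p(selection) = (n/N) x (m/M)
weight      <- 1 / p_selection     # people in the source population "represented"

data.frame(household_size = M, p_selection, weight)

# Effect on a point estimate: a hypothetical binary outcome for the four respondents
y <- c(0, 0, 1, 1)
mean(y)                            # unweighted prevalence
weighted.mean(y, weight)           # weighted prevalence
```

Respondents from larger households carry larger weights because each had a smaller chance of being the one person selected, which is why the weighted and unweighted prevalences differ here.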
Accounting for Clustering
In cluster and multistage sampling, individuals within groups are usually more alike than randomly chosen individuals. This means observations are not independent, and standard errors must be adjusted upward.
The most common approach is to identify the primary sampling unit (PSU) and adjust all standard error calculations for clustering at that level. The technique called variance linearisation is widely used for this purpose and requires a large number of PSUs to be reliable.
The Design Effect (deff)
The design effect (deff) summarizes the overall impact of the sampling plan on precision. It is the ratio of the variance from the complex sampling design to the variance that would have been obtained from a simple random sample of the same size. The concept and the term were coined by Leslie Kish in his classic textbook Survey Sampling (Kish, 1965) and remain the standard summary statistic for complex-design efficiency.
Interpreting the Design Effect
A deff > 1 means the complex design produces less precise (larger variance) estimates than a simple random sample would. For example, in the Brazil diarrhea study, the deff was 4.43, meaning the variance of the incidence estimate was 4.43 times larger than what a simple random sample of the same size would have produced.
Example: Impact of Survey Design on Estimates
| Type of Analysis | Incidence Estimate | SE |
|---|---|---|
| Simple random sample (assumed) | 0.1462 | 0.0061 |
| + Stratification | 0.1462 | 0.0059 |
| + Stratification + Weights | 0.1751 | 0.0091 |
| + Clustering | 0.1462 | 0.0088 |
| All features combined | 0.1751 | 0.0128 |
Notice how incorporating all features of the sampling plan both shifts the point estimate (from 14.62% to 17.51%) and roughly doubles the standard error (from 0.0061 to 0.0128). Ignoring the sampling design would give a misleadingly precise, and potentially incorrect, result.
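As a rough check, a design effect can be approximated from the table by squaring the ratio of the full-design SE to the assumed simple-random-sample SE. The sketch below uses only the rounded values printed in the table, so it should land near, not exactly on, the 4.43 reported in the text.

```r
# Approximate design effect from the tabulated standard errors.
se_srs  <- 0.0061                      # SE assuming a simple random sample
se_full <- 0.0128                      # SE with strata, weights and clustering

deff <- (se_full / se_srs)^2           # ratio of variances = squared ratio of SEs
round(deff, 2)

# 95% confidence intervals under each analysis
ci_naive <- 0.1462 + c(-1, 1) * 1.96 * se_srs
ci_full  <- 0.1751 + c(-1, 1) * 1.96 * se_full
round(rbind(naive = ci_naive, full_design = ci_full), 4)
```

The full-design interval is about twice as wide as the naive one and is centred on a different estimate.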
Canadian Practice: Bootstrap Weights for the CCHS and CHMS
Statistics Canada distributes the CCHS and CHMS with a set of 500 bootstrap replicate weights rather than releasing the underlying cluster identifiers (which would risk re-identification). The rescaling-bootstrap method that produces these weights was developed by Rao & Wu (1988). To get correct standard errors you re-run your analysis 500 times — once with each replicate weight — and combine the results.
Most analysts use survey or srvyr in R, svy commands in Stata, or SAS PROC SURVEY* procedures. If you ignore the bootstrap weights and just analyse the CCHS as if it were a simple random sample, your standard errors will typically be 30–80% too small — and your confidence intervals and p-values become meaningless.
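The "re-run and combine" logic can be sketched in base R on simulated data. Everything below is an assumption made for illustration: the data are simulated, there is a single stratum, design weights are equal, and the replicate weights are built here with a simplified rescaling bootstrap (draw n − 1 PSUs with replacement and rescale). With real CCHS or CHMS files the replicate weights are supplied by Statistics Canada and are not constructed by the analyst.

```r
set.seed(341)                                      # arbitrary seed
n_psu <- 40                                        # hypothetical PSUs, 25 respondents each
dat   <- data.frame(psu = rep(seq_len(n_psu), each = 25))
psu_effect <- rnorm(n_psu, mean = 0, sd = 0.6)     # shared within-PSU effect (clustering)
dat$smoker <- rbinom(nrow(dat), 1, plogis(-1.5 + psu_effect[dat$psu]))
dat$weight <- 1000                                 # equal design weights for simplicity

# Build B replicate weights: resample n_psu - 1 PSUs with replacement, then rescale
B <- 500
rep_w <- sapply(seq_len(B), function(b) {
  times_drawn <- tabulate(sample.int(n_psu, n_psu - 1, replace = TRUE), nbins = n_psu)
  dat$weight * (n_psu / (n_psu - 1)) * times_drawn[dat$psu]
})

# Full-sample estimate, then the same estimate once per replicate weight
theta_hat <- weighted.mean(dat$smoker, dat$weight)
theta_rep <- apply(rep_w, 2, function(w) weighted.mean(dat$smoker, w))

# Combine: bootstrap variance is the mean squared deviation from the full-sample estimate
se_boot  <- sqrt(mean((theta_rep - theta_hat)^2))
se_naive <- sqrt(theta_hat * (1 - theta_hat) / nrow(dat))   # ignores clustering
c(estimate = theta_hat, se_boot = se_boot, se_naive = se_naive)
```

In practice the survey package can take supplied replicate weights directly, so this hand-rolled loop is only meant to show what "combine the results" means.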
Finite Population Correction (FPC)
When the proportion of the population sampled is relatively large (>10%), precision improves beyond what would be expected from an "infinite" population. The finite population correction adjusts the estimated variance downward:
FPC Formula
FPC = (N − n) / (N − 1)
where N is the population size and n is the sample size. The FPC should not be applied in multistage sampling even if the number of PSUs sampled exceeds 10% of the total PSUs. It is only applicable to descriptive studies using simple or stratified random sampling.
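A quick numeric sketch of the correction follows, using a hypothetical simple random sample of 2,000 from 13,000 records (about 15% of the population) and a hypothetical proportion of 0.15.

```r
# Finite population correction for a simple random sample.
N <- 13000                           # population size
n <- 2000                            # sample size (n / N is about 0.15)
p <- 0.15                            # hypothetical sample proportion

fpc <- (N - n) / (N - 1)             # multiplies the variance
se_infinite <- sqrt(p * (1 - p) / n) # SE ignoring the finite population
se_fpc      <- se_infinite * sqrt(fpc)

round(c(fpc = fpc, se_infinite = se_infinite, se_fpc = se_fpc), 4)
```

Because the FPC applies to the variance, the standard error shrinks by the square root of the FPC.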
Analysing data correctly is necessary but not sufficient. Just as important is making sure you collected enough data to begin with — an underpowered study cannot be rescued by clever analysis. The remainder of this section covers sample-size calculation.
Sample-Size Determination
Choosing the right sample size involves both statistical and non-statistical considerations. Non-statistical factors include available resources (time, money, personnel) and the nature of the sampling frame. Statistical considerations include:
- Precision (L): The more precise you need your estimate to be, the larger the sample you need. If you want to know diarrhea prevalence within ±5%, you need more subjects than if ±10% is acceptable. Precision is denoted L (the "allowable error", or half the desired confidence interval width).
- Expected variance: For proportions, variance = p × q (where q = 1 − p). You need a rough estimate of the proportion to calculate the required sample size. For continuous variables like BMI, you need an estimate of the population variance (σ²). One approach: estimate the range that covers 95% of values, divide by 4 to get σ, then square it for σ².
- Confidence level: The confidence level (typically 95%) determines how sure you want to be that the confidence interval includes the true population value. This is linked to the Z-value: for 95% confidence, Zα = 1.96. Higher confidence requires a larger sample.
- Power: In analytical studies, you also need to specify the desired power (often 80%). Power determines the sample size needed to detect a specific effect size. For 80% power, Zβ = −0.84; the negative sign matters because the formulae below subtract Zβ, so the two Z terms add. Greater power requires a larger sample.
Key Sample-Size Formulae
| Objective | Formula | Variables |
|---|---|---|
| Estimate a proportion | n = Zα² × p × q / L² | p = expected proportion; L = precision |
| Estimate a mean | n = Zα² × σ² / L² | σ² = population variance; L = precision |
| Compare 2 proportions | n = [Zα√(2pq) − Zβ√(p1q1 + p2q2)]² / (p1−p2)² | p = (p1+p2)/2; n = per group |
| Compare 2 means | n = 2[(Zα−Zβ)² × σ²] / (μ1−μ2)² | σ² = population variance; n = per group |
| FPC adjustment | n′ = 1 / (1/n + 1/N) | n = initial estimate; N = population size |
| Clustering adjustment | n′ = n × [1 + ρ(m−1)] | ρ = intra-class correlation; m = cluster size |
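The descriptive formulae in the table translate directly into small R functions. The function names and the input values below (p = 0.15, a BMI range of 18 to 38 covering 95% of values, a source population of 1,000) are hypothetical choices for illustration.

```r
# Sample size to estimate a proportion, a mean, and the FPC adjustment.
n_proportion <- function(p, L, z_alpha = 1.96) z_alpha^2 * p * (1 - p) / L^2
n_mean       <- function(sigma, L, z_alpha = 1.96) z_alpha^2 * sigma^2 / L^2
n_fpc        <- function(n, N) 1 / (1 / n + 1 / N)

# Proportion: expected prevalence 15%, precision of +/- 5 vs +/- 10 points
n_p05 <- n_proportion(p = 0.15, L = 0.05)
n_p10 <- n_proportion(p = 0.15, L = 0.10)

# Mean: sigma approximated as (range covering 95% of values) / 4
sigma_bmi <- (38 - 18) / 4
n_bmi     <- n_mean(sigma = sigma_bmi, L = 1)

# FPC adjustment when the source population is only 1,000 people
n_p05_adj <- n_fpc(n_p05, N = 1000)

ceiling(c(prop_L05 = n_p05, prop_L10 = n_p10, mean_bmi = n_bmi, prop_L05_fpc = n_p05_adj))
```

Halving L quadruples the required n, since L enters the formula squared.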
Worked Example: Comparing Two Proportions
Suppose you want to determine if rainwater cisterns reduce the monthly risk of diarrhea from 15% to 10%. With 95% confidence and 80% power:
p1 = 0.15, p2 = 0.10, p = 0.125, q = 0.875
Applying the formula yields n = 685 per group, so you would need 1,370 total individuals (685 with cisterns, 685 without).
If the outcome is clustered within households (ρ = 0.45, average household size m = 6), the clustering adjustment gives n′ = 685 × [1 + 0.45 × (6 − 1)] = 685 × 3.25 ≈ 2,227 per group, more than triple the unadjusted estimate.
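The cistern calculation can be reproduced by typing the two-proportion formula from the table into R. The variable names are illustrative; Zβ is entered as −0.84, following the lesson's sign convention.

```r
# Compare two proportions: 15% vs 10% monthly risk, 95% confidence, 80% power.
z_alpha <- 1.96
z_beta  <- -0.84
p1 <- 0.15
p2 <- 0.10
p  <- (p1 + p2) / 2                    # pooled proportion, 0.125
q  <- 1 - p

n_raw <- (z_alpha * sqrt(2 * p * q) -
          z_beta  * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
n_group <- ceiling(n_raw)              # subjects per group

# Clustering adjustment: n' = n * [1 + rho * (m - 1)]
rho <- 0.45                            # intra-class correlation within households
m   <- 6                               # average household size
inflation   <- 1 + rho * (m - 1)
n_clustered <- ceiling(n_group * inflation)

c(n_per_group = n_group, inflation = inflation, n_clustered_per_group = n_clustered)
```

Rounding up at each step gives 685 per group before the adjustment and 2,227 per group after it.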
The companion R script r-activities/HSCI_341_Lesson_3_Sampling.R walks through three blocks: (A) drawing simple random, stratified, and cluster samples in base R; (B) computing weighted prevalence with the survey package; and (C) running sample-size calculations with power.prop.test and power.t.test, then adjusting for clustering via a design effect.
```r
# PART A -- three probability sampling designs from a 1,000-row frame
set.seed(341)
N <- 1000
frame <- data.frame(id        = 1:N,
                    province  = sample(c("BC", "AB", "ON", "QC"), N, replace = TRUE),
                    household = sample(1:300, N, replace = TRUE))
srs    <- frame[sample(N, 100), ]                                   # simple random
strat  <- do.call(rbind, by(frame, frame$province,
                            function(d) d[sample(nrow(d), 25), ]))  # stratified
sel_hh <- sample(unique(frame$household), 30)
clust  <- frame[frame$household %in% sel_hh, ]                      # cluster
c(SRS = nrow(srs), Stratified = nrow(strat), Cluster = nrow(clust))

# PART B -- design-corrected prevalence with the survey package
library(survey)
dat <- data.frame(province = sample(c("BC", "AB", "ON", "QC"), 2000, replace = TRUE),
                  smoker   = rbinom(2000, 1, 0.18),
                  weight   = runif(2000, 800, 2200))
des <- svydesign(ids = ~1, strata = ~province, weights = ~weight, data = dat)
mean(dat$smoker)                           # naive (unweighted)
svymean(~smoker, design = des)             # design-corrected
confint(svymean(~smoker, design = des))    # 95% CI

# PART C -- sample-size calculations + design-effect adjustment
power.prop.test(p1 = 0.15, p2 = 0.10,
                power = 0.80, sig.level = 0.05)    # two proportions
power.t.test(delta = 5, sd = 14,
             power = 0.80, sig.level = 0.05)       # two means (SBP)
n_srs <- 685; rho <- 0.45; m <- 6
ceiling(n_srs * (1 + rho * (m - 1)))               # cluster-adjusted n
```
What you should be able to do after this activity: draw each of the three probability samples, fit a survey design and report a weighted prevalence with its CI, and compute a sample size for two proportions, two means, and a cluster design.
Reflect on what you just ran
Use the questions below to interpret the actual numbers you produced. Look at your console output before answering.
1. The line c(SRS = ..., Stratified = ..., Cluster = ...) printed three sample sizes. What were the three numbers, and which design gave you the most variable sample size on a re-run? Why is the cluster sample size NOT exactly 100 or 200 here?
2. Compare mean(dat$smoker) with svymean(~smoker, design = des). Were they nearly the same, and why does that make sense given that the weights came from runif(2000, 800, 2200) with no relationship to province or smoker?
mean(dat$smoker) and svymean(~smoker, design = des) were nearly identical because the weights drawn from runif(2000, 800, 2200) are independent of both province and smoker status. Weighting shifts a point estimate only when the weights are correlated with the variable being estimated; with random weights that carry no informative structure, the weighted mean equals the unweighted mean in expectation. The simulation's point: weights fix design-induced bias only when there is design-induced bias to fix.
3. power.prop.test(p1 = 0.15, p2 = 0.10, power = 0.80, ...) returned an n per group, and the cluster adjustment (rho = 0.45, m = 6) multiplied 685 by 3.25 to give ~2,227. In one sentence, what does that ratio tell you about the price of cluster sampling vs. SRS?
Reflection
Why do you think it is important to account for clustering when determining sample size? What would happen to your study conclusions if you ignored the clustering effect?
Key Takeaways
- Complex survey analyses must account for stratification, sampling weights, and clustering to produce correct estimates and valid standard errors.
- The design effect (deff) quantifies how much less precise a complex design is relative to a simple random sample.
- Sample size depends on desired precision, expected variance, confidence level, and (for analytic studies) power.
- Clustering can dramatically increase the required sample size, especially when the intra-class correlation is high.
- The finite population correction reduces sample size requirements when sampling a large fraction (>10%) of the population.
1. What does a design effect (deff) of 4.43 indicate?
2. Sampling weights are computed as:
3. Which of the following increases the required sample size?
✦ Complete the reflection and pass the knowledge check with 100% to continue
Lesson Review & Final Assessment
⏱ Estimated time: 15 minutes
Bringing It All Together
This lesson moved sampling from a vague intuition to a structured set of decisions. You worked through the hierarchy of populations — target, source, study — and the way it maps onto internal and external validity. From there you built up the probability machinery (sampling distributions, the central limit theorem, Type I and Type II error, power) that lets a sample stand in for the population it was drawn from.
The second half of the lesson made those ideas operational: when to use simple, systematic, stratified, cluster, or multistage probability sampling; when non-probability designs are defensible; how complex survey data must be weighted and clustered in analysis; and how a sample-size calculation actually gets done. Lesson 4 will turn from who you measure to how — the design of the questionnaires those samples respond to.
Key Takeaways from Lesson 3
- Sampling is the bridge between a research question and feasible data collection: choose a sample so the inference back to the source and target populations is defensible.
- The target → source → study hierarchy is what makes internal vs. external validity a precise distinction rather than a slogan.
- The central limit theorem is what makes inference from a sample to a population work — sampling distributions, standard errors, and confidence intervals all depend on it.
- Type I (α), Type II (β), and power (1−β) are design parameters you set deliberately, not after-the-fact diagnostics.
- Probability designs (simple, systematic, stratified, cluster, multistage) trade off precision, cost, and feasibility; complex designs require weighting and a design effect in analysis.
- Sample-size calculations are explicit assumption documents: precision/effect size, variance, confidence level, and adjustments for clustering, attrition, and finite populations.
Reflection
Imagine you are designing a study to estimate the prevalence of a waterborne disease in a rural region with scattered villages. Describe the sampling strategy you would use, including the type of sampling, how you would define your populations, and what factors would influence your sample-size calculation.
Final Knowledge Assessment
Complete the following 15-question assessment. A score of 100% is required to complete the lesson. You may retake the assessment as many times as needed.
1. In a census, the only source of error is:
2. A descriptive study aims to:
3. The source population is best described as:
4. External validity refers to:
5. A Type II (β) error occurs when:
6. A convenience sample is characterized by:
7. In a simple random sample, every subject has:
8. The sampling interval in systematic random sampling is calculated as:
9. A key advantage of stratified random sampling is:
10. In cluster sampling, the primary sampling unit (PSU) is:
11. Sampling weights reflect:
12. The design effect (deff) is the ratio of:
13. When computing sample size for estimating a proportion, which factor does NOT increase the required sample size?
14. The clustering adjustment formula n′ = n[1 + ρ(m−1)] shows that the required sample size increases when:
15. Which statement best summarizes the importance of understanding sampling methods?
✦ Complete the final reflection above before submitting