Sampling in Qualitative Research
Qualitative Research Methods & Analysis in Public Health
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Distinguish probability and nonprobability sampling and explain when each is the right tool for the job
- Define saturation and contrast classical, empirical, and revised contemporary heuristics for qualitative sample size
- Identify and contrast the six nonprobability sampling strategies covered in Bernard, Wutich & Ryan, Chapter 3
- Distinguish theoretical sampling (Glaserian) from purposive sampling and explain why the difference matters
- Recognize when key-informant sampling is the right tool and how it relates to other strategies
- Defend a qualitative sample in writing — what your methods section owes the reader
- Document the HSCI 841 loneliness dataset's sampling logic and its limits on what your capstone can claim
- Complete the Week 3 capstone milestone: a 600-word sampling memo and a one-page sampling matrix
This course was developed by Kiffer G. Card, PhD, as a companion to Bernard, H. R., Wutich, A., & Ryan, G. W. (2017). Analyzing Qualitative Data: Systematic Approaches (2nd ed.). SAGE.
Two Kinds of Samples — and the Question Each Was Built to Answer
Introduction and Overview
A first-year epidemiology student arriving at HSCI 841 already has a clean mental model of sampling. There is a population. You define it, you draw a probability sample from it, you measure something, and the sample mean is an unbiased estimator of the population mean — with a confidence interval whose width is determined by sample size and design effect. That model is correct, it is powerful, and across HSCI 230, 341, and 410 you have used it to estimate prevalence, to compare exposed and unexposed groups, and to fit regression models. The model is also, as a description of how qualitative researchers actually sample, almost entirely wrong.
Bernard, Wutich, and Ryan (2017, p. 37) open Chapter 3 by drawing the line clearly: there are two kinds of samples in social-science research, and they were built to do different jobs. Probability samples were built so you can estimate population-level magnitudes with calculable error. Nonprobability samples were built so you can characterize a phenomenon — identify its categories, understand its mechanisms, and describe how people make sense of it. These are not different ways of doing the same job badly or well. They are different jobs.
This section unpacks that distinction. We start with what probability sampling is and what its sample-size logic looks like (briefly, because you already know it). We then introduce nonprobability sampling on its own terms, not as the disappointing-cousin-of-real-sampling that some introductory texts make it out to be. The remaining sections of the lesson work through the operational details: the six nonprobability strategies, the empirics of saturation, theoretical sampling, key informants, and the methods-section discipline you will need to defend your eventual capstone sample to a public-health reader.
Learning Objectives for Section 1
- Articulate the operational difference between probability and nonprobability sampling.
- Explain why “a nonprobability sample is a bad probability sample” is the wrong framing.
- Recall the sample-size logic of probability sampling in one sentence each: power, design effect, finite-population correction.
- Recognize that the sample-size question for nonprobability work is fundamentally different from the one you have been trained on.
1.1 What Probability Sampling Is, in One Page
A probability sample is one in which every member of a defined population has a known, non-zero probability of being selected, and that probability is built into the design (Bernard, Wutich & Ryan, 2017, p. 38). Simple random sampling, stratified sampling, cluster sampling, multi-stage sampling, and probability-proportional-to-size designs are all probability designs. The thing that makes them probability samples is not that they involve a random-number generator; it is that the selection probabilities are knowable. That knowability is what makes statistical inference to the population possible.
In probability sampling, sample size is a function of three quantities: the variance of the thing being measured, the precision you want around the estimate (the half-width of the confidence interval), and, for hypothesis-testing applications, the effect size you want to detect with a given power. Bernard, Wutich, and Ryan do not belabour this because if you have made it to a graduate qualitative methods course you have already done sample-size calculations in HSCI 341 or 410. You know that a prevalence study aiming for ±3% around a 50% prevalence at 95% confidence needs about 1,067 respondents; that a two-sample t-test for a moderate effect needs ~64 per arm at 80% power; that a cluster-randomized trial with a design effect of 1.5 needs roughly half as much again. The arithmetic is well understood and the textbooks for it sit in the room next door to this course.
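The arithmetic behind those three figures can be sketched in a few lines. This is a minimal illustration, not a replacement for a proper power analysis: the function names are hypothetical, "moderate effect" is assumed to mean a standardized difference of 0.5, and the per-arm figure uses the normal approximation, which lands at about 63 where the exact t-based calculation gives about 64.

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion(p, half_width, confidence=0.95):
    """Respondents needed to estimate a proportion p within +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z**2 * p * (1 - p) / half_width**2


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation n per arm for a two-sided comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / effect_size**2


def inflate_for_clustering(n, design_effect):
    """A design effect multiplies the simple-random-sample requirement."""
    return n * design_effect


prevalence_n = n_for_proportion(p=0.50, half_width=0.03)
arm_n = n_per_arm(effect_size=0.5)  # assumption: "moderate" means d = 0.5
clustered_arm_n = inflate_for_clustering(arm_n, design_effect=1.5)

print(round(prevalence_n), ceil(arm_n), ceil(clustered_arm_n))
```

Every input here is a property of an estimate (variance, precision, effect size). None of them describes what the cases are expected to say, which is why the same machinery has nothing to offer a qualitative design.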
What matters for our purposes is the shape of the probability-sampling sample-size argument. It is a shape that says: the more cases you collect, the more precisely you can estimate a population-level magnitude. Each additional case purchases a (diminishing) amount of statistical information about the same quantity. The argument is monotone. Twenty is better than ten; two hundred is better than twenty; two thousand is better than two hundred. The only reason to stop is cost.
1.2 What Nonprobability Sampling Is, and What Question It Was Built To Answer
Key insight - The probability sample is the wrong baseline
Qualitative sampling is often defended (or attacked) by comparing it to probability sampling. This is the wrong comparison. Probability samples are designed to support claims about population frequencies; qualitative samples are designed to support claims about meaning, mechanism, or variation. Asking a qualitative sample to deliver population estimates is like asking a microscope to map a continent. Different tools, different jobs.
A nonprobability sample is one in which the selection probabilities are not knowable and inference to a defined population is not the analytic goal. Convenience sampling, purposive sampling, quota sampling, snowball sampling, respondent-driven sampling, theoretical sampling, and key-informant sampling are all nonprobability designs. They differ from each other in important ways — this lesson is largely about how they differ — but they share that defining feature: you cannot calculate the probability that any given member of a hypothetical population was selected for your study.
The temptation, especially for analysts trained on probability sampling, is to read this as a deficiency. It is not. Nonprobability sampling was developed because some research questions are not population-magnitude questions. The questions qualitative researchers most commonly ask — what is loneliness, how does it show up, what configurations of loneliness exist, what are the mechanisms, what do people do about it — are not answered by knowing what proportion of British Columbians experience loneliness. They are answered by deliberately choosing cases that illuminate the phenomenon. The selection logic is curatorial rather than statistical.
Bernard, Wutich, and Ryan put the difference this way (paraphrasing pp. 37–38): probability samples are about extensity (the breadth of a phenomenon in a population); nonprobability samples are about intensity (the depth, detail, and configuration of the phenomenon). You can imagine the same phenomenon — loneliness, say — as having two complementary aspects that the two sampling logics give you access to. A probability sample of 5,000 Canadians would tell you what percentage report frequent loneliness, how the prevalence varies by age and income, and how the rate has changed over time. A nonprobability sample of 20 deliberately chosen interviews would tell you what the experience of loneliness is, what kinds of loneliness exist, what triggers it, how people interpret it, and what they do about it. Each design answers a question the other cannot.
The two-jobs framing in one table
| Feature | Probability sampling | Nonprobability sampling |
|---|---|---|
| Goal | Estimate a population magnitude | Characterize a phenomenon |
| Selection logic | Statistical (probabilities known) | Curatorial (probabilities not knowable) |
| Sample-size logic | Power / precision / variance | Saturation; informational redundancy |
| Inferential target | The defined population | The phenomenon, transferably described |
| Generalization claim | Statistical (CI around a parameter) | Theoretical / analytic (categories, mechanisms) |
| Typical n | Hundreds to thousands | One to several dozen, occasionally more |
1.3 Why “Nonprobability Is a Bad Probability Sample” Is the Wrong Reading
The single most common misreading of qualitative sampling in published health research is the assumption that the researcher tried to draw a probability sample, failed, and ended up with a convenience sample as a fallback. The published methods section often reinforces this misreading by describing the sample in apologetic terms (“owing to resource constraints we recruited a convenience sample of 18 participants”). Bernard, Wutich, and Ryan are emphatic that this framing is wrong and that letting it pass is bad for the field. A defensible nonprobability sample is not a failed probability sample. It is a sample assembled to do a job that probability sampling could not have done.
The HSCI 841 loneliness dataset is a useful concrete case. The 20 transcripts vary deliberately across age (18 to 82), gender (women, men, and non-binary participants), life-stage, immigration status, caregiving role, and identity. The sample is not random; the configuration is engineered. The point of including P14 Kenji (60, late-life coming out after a long heterosexual marriage) is not that he is statistically representative of British Columbian men in his sixties. He almost certainly is not. The point is that his configuration (late-life sexual-identity disclosure interacting with the loss of long-standing social ties) is a kind of loneliness the literature under-describes, and his transcript gives you analytic purchase on that configuration. A randomly drawn sample of 20 BC men in their sixties would almost certainly miss him. A probability sample of 5,000 would catch him as one case, 0.02% of a frequency table, and would not produce 8,000 words of his account.
This is the curatorial logic. You are not trying to be representative of a population; you are trying to be representative of a phenomenon's variation. Bernard, Wutich, and Ryan call this maximum-variation sampling when the variation is the explicit goal (we will come back to it as a kind of purposive sampling in Section 2). What you owe your reader, in such a design, is not a calculation of selection probabilities; it is a defence of the variation you chose to capture and an honest accounting of the variation you did not.
1.4 Why the Sample-Size Question Has To Be Different
The shape of the probability-sampling sample-size argument — more cases purchase more precision around the same quantity — does not apply to nonprobability work. The reason is straightforward. In nonprobability sampling you are not estimating a population-level magnitude. There is no parameter that the additional case is sharpening your estimate of. The relevant question is different: at what point do additional cases stop adding new information about the phenomenon?
This is the question of saturation, and it is the operational concept that has replaced the power calculation in nonprobability work. We give it the full treatment in Section 2. For now, the key idea is that nonprobability sample size is governed by informational redundancy, not by precision. When the twenty-first interview produces nothing you had not already heard in the first twenty — no new themes, no new mechanisms, no new variations on the categories you have developed — you have hit the floor of marginal information return, and the design is telling you that you have enough cases for the analytic claims you can defensibly make.
One thing to set aside before going further
If you were trained to read “n = 20” as an underpowered study, you will need to set that reading aside for the rest of this lesson and the rest of the course. The right question for a qualitative study with 20 transcripts is not “was the power adequate?” The right questions are: what configurations were captured? What variation does the sample cover? What claims can the analyst defensibly make on that basis? Those questions are answered by the methods section, not by arithmetic. Section 5 of this lesson tells you what the methods section has to contain.
Reflection
Think of one qualitative study (published or hypothetical) and one quantitative study addressing the same broad topic. Briefly describe what each sampling logic gives the field that the other cannot. Try to avoid the framing of “the qualitative work is a follow-up to” or “a stepping stone toward” the quantitative work — the two should be co-equal in your answer.
Question 1: What is the defining feature of a probability sample, according to Bernard, Wutich, and Ryan?
Question 2: Which of the following best describes the analytic goal of nonprobability sampling, as Bernard, Wutich, and Ryan frame it?
Question 3: Why does the sample-size argument from probability sampling (“more cases purchase more precision”) not apply to nonprobability work?
Saturation — the Operational Concept and Its Empirical Backbone
Introduction and Overview
Section 1 ended with a claim that needs unpacking: that the sample-size question for nonprobability work is governed by saturation, the point at which additional cases stop adding new information about the phenomenon. Saturation is the most cited and the most contested concept in qualitative sampling. Glaser and Strauss (1967) introduced it as a feature of theoretical sampling in grounded theory; it has since been generalized far beyond that origin to almost every kind of nonprobability design. The concept has empirical legs — there is now a serious literature attempting to measure when saturation actually occurs in practice — and it also has reasonable critics. This section walks through both.
By the end of the section you should be able to define saturation operationally, recite the major empirical findings (Romney/Weller/Batchelder's 4-to-6 rule; Guest, Bunce, & Johnson's (2006) finding that saturation arrives around 12 interviews; Hennink & Kaiser’s (2022) systematic review), and explain what saturation does and does not give you the right to claim about your dataset.
Learning Objectives for Section 2
- Define saturation in operational terms.
- Distinguish theoretical saturation, code saturation, and meaning saturation.
- Recall the Romney/Weller/Batchelder 4-to-6 rule and its underlying logic.
- Cite Guest, Bunce, & Johnson (2006) and Hennink & Kaiser (2022) and explain how their empirical findings constrain general claims about sample size.
2.1 Saturation, Operationally
Saturation is the point at which additional data no longer surface new themes, concepts, or relationships. It is the qualitative analog of stopping rules in sequential testing. The concept is widely used and often poorly operationalized; the three summaries below preview what the empirical literature actually shows.
- The 4-to-6 rule. For cultural domains with high consensus, the Romney-Weller-Batchelder cultural consensus model shows that 4-6 well-chosen informants can recover the shared knowledge of a domain with high reliability. It is most usefully applied to focused, bounded questions in homogeneous populations.
- The 12-interview threshold. Guest, Bunce, & Johnson (2006) analyzed 60 interviews and showed that 92% of codes were identified by interview 12, with diminishing returns thereafter. It is now the most-cited single empirical study on qualitative sample size. Important caveats: their study used a relatively homogeneous sample on a focused question; heterogeneous populations and broader questions require more.
- The systematic review. Hennink & Kaiser’s (2022) systematic review of 23 empirical studies found that 9-17 interviews typically reach code saturation for relatively homogeneous samples, while 20-40+ are often required for meaning saturation and for heterogeneous samples. The 'magic number' is a range, not a single value, and depends on the sample, question, and analytic depth.
The working definition above: a sample is saturated when collecting additional cases stops producing new information relevant to the analytic question (Bernard, Wutich & Ryan, 2017, pp. 40–42). That is the simple statement. The operationalization — how you actually decide that you have hit saturation — requires more care, because what counts as “new information” depends on what kind of analysis you are doing.
Three flavours of saturation get distinguished in the methodological literature, and you should keep them straight because mixing them up is the most common source of methods-section confusion in published qualitative health papers.
Theoretical saturation. The original Glaser/Strauss usage. A sample is theoretically saturated when additional cases stop producing new categories, properties, or relationships in your developing theory. Theoretical saturation is tightly tied to theoretical sampling (Section 4 of this lesson): you sample iteratively, in response to the emerging analysis, until the theoretical structure stops changing. This is the most demanding kind of saturation and the hardest to demonstrate.
Code saturation. A more recent, more operational concept. A sample is code-saturated when additional cases stop producing new codes in your codebook. Code saturation tends to arrive relatively quickly — the major themes show up in the first several transcripts — and is what most empirical saturation studies actually measure.
Meaning saturation. The point at which additional cases stop deepening your understanding of the codes you already have. This is harder and slower than code saturation. You might have identified a code like “loneliness as spatial” after three interviews, but reach a real understanding of the variations within that code — the specific spatial metaphors people deploy, the sense in which space is doing work, the contrasts with non-spatial framings — only after fifteen or twenty.
Hennink, Kaiser, & Marconi (2017) made this distinction matter
In a widely cited methodological study, Monique Hennink and colleagues showed that in a sample of 25 in-depth interviews, code saturation arrived at 9 interviews, but meaning saturation required 16 to 24. The takeaway is that the rosy headline number you sometimes see (“saturation occurs around 9 interviews”) describes code saturation, not the deeper meaning saturation a serious analysis usually requires. When you defend your sample, be explicit about which kind of saturation you are claiming and on what evidence.
2.2 The Romney/Weller/Batchelder 4-to-6 Rule
The earliest empirically grounded heuristic for qualitative sample size comes from a body of work in cultural domain analysis (Bernard, Wutich & Ryan, 2017, p. 41). Romney, Weller, & Batchelder (1986) studied cultural consensus — the degree to which members of a group share an underlying cultural model of a domain. Their mathematical model showed that if cultural agreement within a group is high (which it typically is for shared cultural domains within a more-or-less culturally coherent group), as few as four to six knowledgeable respondents are enough to recover the shared cultural model with high confidence.
The 4-to-6 rule comes out of free-list, pile-sort, and similar cultural-consensus tasks where the analytic goal is to recover a structured, shared cognitive model. It does not generalize without qualification to depth interviewing about contested or variable phenomena (loneliness, for example, is not a tightly shared cultural domain in the way that, say, the taxonomy of edible mushrooms is for an experienced forager). But the rule established a principle that the empirical saturation literature has largely confirmed: when the group is culturally homogeneous and the domain is reasonably structured, sample sizes in the single digits are defensible. Sample sizes that look implausibly small to a quantitatively trained reader are not implausibly small to a cultural-consensus theorist who has done the math.
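A rough intuition for why so few informants can suffice when agreement is high comes from a simple majority-vote calculation. The sketch below is not the Romney, Weller, and Batchelder estimation procedure. It is a simplified illustration that assumes true/false questions, a single shared competence value for every informant (the hypothetical `competence` parameter), random guessing when an informant does not know the answer, and an odd number of informants so that ties cannot occur.

```python
from math import comb


def p_single_correct(competence):
    """Chance one informant answers a true/false item correctly:
    they either know the answer or guess it with probability 1/2."""
    return competence + (1 - competence) / 2


def p_majority_correct(n_informants, competence):
    """Chance that a strict majority of informants gives the correct
    answer to one true/false item (intended for odd n_informants)."""
    p = p_single_correct(competence)
    needed = n_informants // 2 + 1
    return sum(
        comb(n_informants, k) * p**k * (1 - p) ** (n_informants - k)
        for k in range(needed, n_informants + 1)
    )


# Assumed competence of 0.7: each informant knows 70% of the domain.
for n in (1, 3, 5, 7):
    print(n, round(p_majority_correct(n, competence=0.7), 3))
```

Under these assumptions the pooled answer is already right about 97 times in 100 with five informants. Lower the assumed competence and the curve flattens, which is the formal version of the warning that the rule does not transfer to contested or weakly shared domains.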
2.3 Guest, Bunce, and Johnson (2006): The 12-Interview Threshold
The most cited empirical study of saturation in interview research is Greg Guest, Arwen Bunce, & Laura Johnson’s (2006) paper “How many interviews are enough? An experiment with data saturation and variability,” published in Field Methods. Guest and colleagues coded a corpus of 60 women's-health interviews from West African field sites and tracked when new codes were appearing. They found that 73 percent of all codes had been identified after the first six interviews, and 92 percent after the first twelve. After twelve interviews, the new-code-per-interview rate dropped to near zero.
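The kind of tally behind percentages like these can be sketched as a code-accumulation curve: for each interview in order, count the codes not seen before and the running share of the final codebook. The function name and the six-interview corpus below are hypothetical illustrations, not data from Guest and colleagues.

```python
def code_accumulation(codes_by_interview):
    """Return one (new_codes, cumulative_share) pair per interview.

    cumulative_share is relative to all codes found in the whole corpus,
    so it can only be computed after every interview has been coded.
    """
    all_codes = set().union(*codes_by_interview)
    seen = set()
    rows = []
    for codes in codes_by_interview:
        new = set(codes) - seen
        seen |= new
        rows.append((len(new), len(seen) / len(all_codes)))
    return rows


# Hypothetical corpus: each set holds the codes applied to one transcript.
interviews = [
    {"isolation", "stigma", "family"},
    {"stigma", "work", "family"},
    {"isolation", "housing"},
    {"work", "stigma"},
    {"faith", "family"},
    {"housing", "stigma"},
]

rows = code_accumulation(interviews)
for number, (new, share) in enumerate(rows, start=1):
    print(f"interview {number}: {new} new codes, {share:.0%} of codebook")
```

Note that the curve depends on interview order and on the denominator. A share computed against the final codebook says nothing about codes that a differently composed sample would have produced.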
The Guest et al. study gave qualitative researchers the most cited heuristic in the field: if your sample is reasonably homogeneous and your analytic question is reasonably focused, saturation will arrive around 12 interviews. The 12-interview threshold has been embraced (and often over-applied) by graduate students writing methods sections, by IRBs trying to evaluate qualitative protocols, and by editors weighing the credibility of qualitative health papers. It has also been refined and qualified by subsequent work, which we turn to next.
2.4 Hennink and Kaiser (2022): The Systematic Review
Monique Hennink & Bonnie Kaiser (2022) published a systematic review in Social Science & Medicine titled “Sample sizes for saturation in qualitative research: A systematic review of empirical tests.” They identified 23 empirical studies that had measured saturation in interview research and synthesized the findings. Their conclusions are the contemporary state of the art.
First, saturation in interview research is typically reached between 9 and 17 interviews across the studies they reviewed. Twelve is a reasonable central estimate but is neither a floor nor a ceiling. Second, saturation point varies systematically with study design. Studies with narrower research questions, more homogeneous samples, and more experienced interviewers saturate faster. Studies with broader questions, heterogeneous samples, and less-experienced interviewers require more interviews. Third, different kinds of saturation arrive on different schedules — code saturation early, meaning saturation later, theoretical saturation latest of all. Fourth and most importantly, saturation should be operationalized and reported, not asserted. A methods section that simply says “saturation was reached” without indicating how it was defined and assessed has not done its work.
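One way to make a saturation claim reportable rather than asserted is to state the stopping rule explicitly. The sketch below uses a deliberately simple rule, chosen here as an assumption for illustration and not taken from Hennink and Kaiser: saturation is declared at the last interview that added a new code, once a stated number of consecutive interviews (the hypothetical `run_length`) have added none.

```python
def saturation_point(codes_by_interview, run_length=3):
    """Return the 1-based number of the last interview that added a new code,
    once `run_length` consecutive later interviews have added none.
    Returns None if the rule is never satisfied."""
    seen = set()
    quiet = 0  # consecutive interviews with no new codes
    for number, codes in enumerate(codes_by_interview, start=1):
        new = set(codes) - seen
        seen |= new
        quiet = 0 if new else quiet + 1
        if quiet == run_length:
            return number - run_length
    return None


# Hypothetical coded corpus, in interview order.
interviews = [
    {"isolation", "stigma", "family"},
    {"stigma", "work", "family"},
    {"isolation", "housing"},
    {"work", "stigma"},
    {"faith", "family"},
    {"housing", "stigma"},
    {"family"},
    {"isolation", "work"},
]

print(saturation_point(interviews, run_length=3))
print(saturation_point(interviews, run_length=1))
```

The two calls give different answers for the same data, which is the point: a methods section should name the rule and its threshold, because "saturation was reached" means different things under different rules. This rule tracks code saturation only; it says nothing about meaning or theoretical saturation.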
The honest caveat about saturation
Saturation has been criticized by serious qualitative methodologists (notably Braun and Clarke, 2021, for reflexive thematic analysis) as a concept that does not fit every qualitative method. The Braun-Clarke argument is that for interpretive work, the “new information” framing presupposes a realist epistemology that not every qualitative tradition shares: meaning is co-constructed, not waiting to be inventoried. Bernard, Wutich, and Ryan (and this course) take a pragmatic stance — saturation is a useful default standard for applied health research, and the empirical literature on it is helpful — but you should know that there are traditions in which the concept is contested. If your capstone uses a reflexive-thematic or constructivist-grounded-theory approach, you can defend a sample on grounds other than saturation.
2.5 What Saturation Gives You the Right To Claim
Saturation, when properly assessed and reported, gives you the right to claim that within the variation your sample captured, you have identified the major categories and configurations the phenomenon takes. It does not give you the right to claim that you have found all the categories that exist in the population. It does not give you the right to claim that you have measured the prevalence of any of the categories. And it does not exempt you from describing what variation your sample failed to capture.
For the HSCI 841 loneliness dataset, what saturation can and cannot do is illustrative. With 20 deliberately varied transcripts, you can defensibly claim that you have identified the major kinds of loneliness present in this sample's life-stage and identity range. You have probably saturated on common configurations like existential loneliness in later life (P11 Helen, P13 Margaret, P19 Rose, P20 Frank), situational loneliness following disruption (P03 Sarah following romantic dissolution, P16 Elena following job loss), and the loneliness of cultural-belonging disruption (P15 Amira's wahda, P18 Chen's straddling of Mandarin-speaking and English-speaking community ties). What you have not saturated on, with twenty transcripts, are the rarer configurations: the loneliness of incarcerated parents, the loneliness of unhoused adults, the loneliness of people in long-term institutional psychiatric care. Saturation can be claimed for the territory you mapped; it cannot be claimed for the territory you never went into.
Reflection
You are writing the methods section for your capstone. In one paragraph, draft a defensible saturation claim for the loneliness dataset: which kind of saturation are you claiming (theoretical, code, or meaning), what is the operational evidence, and what configurations are explicitly outside the scope of your saturation claim?
Question 1: Which of the following best operationalizes saturation in nonprobability qualitative sampling?
Question 2: Guest, Bunce, and Johnson (2006) found that, in a corpus of 60 women's-health interviews, what proportion of codes had been identified after the first 12 interviews?
Question 3: The Romney/Weller/Batchelder 4-to-6 rule is derived from work on:
The Six Nonprobability Sampling Strategies
Introduction and Overview
Bernard, Wutich, and Ryan organize nonprobability sampling into six strategies (Chapter 3, pp. 42–56). The strategies are not mutually exclusive — most real studies combine two or three — but each has its own logic, its own appropriate uses, and its own characteristic failure modes. This section walks through all six in turn, with the loneliness dataset as a concrete reference and with notes on when each strategy is the right choice.
Learning Objectives for Section 3
- Identify and contrast the six nonprobability sampling strategies in Bernard, Wutich, and Ryan.
- Recognize the loneliness dataset's sampling logic as purposive with quota elements.
- Distinguish snowball sampling from respondent-driven sampling and explain why RDS's inference machinery matters.
- Recognize when each strategy is the right tool for the analytic question.
The strategies at a glance:
- Quota sampling. Sets quotas for specific subgroups (e.g., 10 men and 10 women; 5 in each age bracket). Within quotas, recruitment is by convenience. Useful when you need to ensure representation of categories that matter for your question. Not equivalent to probability stratified sampling.
- Purposive sampling. Deliberate selection of participants who can speak to the question at hand. Maximum-variation (deliberately diverse), homogeneous (deliberately similar), critical-case (theoretically pivotal cases), and deviant-case (extreme exemplars) are common purposive strategies. The dominant strategy in qualitative health research.
- Convenience sampling. Recruits whoever is reachable: undergraduates on campus, patients in clinic, friends of the researcher. Cheap, fast, and almost always biased in ways that limit transferability. Sometimes the right choice for a pilot study; rarely the right choice for a final analysis.
- Network sampling (snowball and respondent-driven). Asks existing participants to refer others. Snowball sampling is straightforward but can be deeply biased. Respondent-driven sampling (RDS), developed by Heckathorn (1997), adds a dual-incentive structure and mathematical adjustment to approximate probability sampling in hidden populations such as people who inject drugs, sex workers, and undocumented migrants. Widely used in HIV epidemiology.
- Theoretical sampling. The defining sampling logic of grounded theory. Sampling is iterative: emerging theoretical concepts dictate who or what to sample next, with the goal of elaborating, contrasting, or testing concepts. Sampling and analysis proceed together. Not pre-specified at the start of the study.
3.1 Quota Sampling
Quota sampling is the deliberate construction of a sample to match pre-specified target cells across one or more dimensions of variation. You decide in advance that you want, say, five women and five men; or three participants in each of four age quartiles; or equal representation of immigrants and Canadian-born. You recruit until each cell is filled, accepting whoever in that cell happens to be available.
Quota sampling is among the most common nonprobability designs in applied research, and you will encounter it routinely in market research, polling, and the rapid-turnaround qualitative components of evaluation studies. Its strength is that the sample's coverage of the dimensions of interest is guaranteed by construction. Its weakness is that within each cell, selection is typically convenience-based (whoever showed up first, whoever the recruiter knew, whoever responded to the flyer), and the within-cell sample is therefore a convenience sample.
The HSCI 841 loneliness dataset uses quota elements. The interview guide's recruitment notes (following the logic of Bernard, Wutich & Ryan, 2017, Ch. 3; see also the dataset's Interview Guide document) specify variation targets across four age quartiles spanning 18–80+, across gender (women, men, gender-diverse), across living arrangement, and across major life-stage transitions (recent immigration, recent loss, recent retirement, recent caregiving role, recent relationship dissolution). The sample of 20 was assembled to hit those targets. It is not pure quota sampling, because the framework was richer than a fixed-cell design (we will discuss that richness in 3.2 below), but it was assembled with quota-like discipline about coverage.
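The mechanics of quota filling, and the reason the within-cell sample is still a convenience sample, can be sketched as follows. The function, the cell definition, and the arrivals list are all hypothetical; the only thing the procedure controls is how many people land in each cell, not which people.

```python
def fill_quotas(arrivals, quotas, cell_of):
    """Enroll people in arrival order until every quota cell is full.

    arrivals: people in the order they became available
    quotas:   dict mapping cell name -> number of places
    cell_of:  function returning the cell a person belongs to
    """
    enrolled = {cell: [] for cell in quotas}
    for person in arrivals:
        cell = cell_of(person)
        if cell in enrolled and len(enrolled[cell]) < quotas[cell]:
            enrolled[cell].append(person["id"])
        if all(len(enrolled[c]) == quotas[c] for c in quotas):
            break  # every cell is full; later arrivals are never considered
    return enrolled


# Hypothetical arrivals, listed in the order they answered a flyer.
arrivals = [
    {"id": "H01", "gender": "women"},
    {"id": "H02", "gender": "women"},
    {"id": "H03", "gender": "women"},
    {"id": "H04", "gender": "men"},
    {"id": "H05", "gender": "men"},
    {"id": "H06", "gender": "men"},
]

enrolled = fill_quotas(arrivals, {"women": 2, "men": 2}, lambda p: p["gender"])
print(enrolled)
```

Here H03 is turned away because the cell was already full and H06 is never reached. Who ends up inside each cell is decided entirely by arrival order, which is exactly the convenience element the methods section needs to acknowledge.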
3.2 Purposive / Judgment Sampling
Purposive sampling (sometimes called judgment sampling or purposeful sampling) is the deliberate selection of cases on the basis of theoretical or substantive criteria. The analyst chooses cases that are expected to illuminate the phenomenon, on the grounds that those cases are richer, more variable, more strategically located, or more theoretically informative than randomly drawn ones would be. The selection criteria are explicit; the choices are defensible; the logic is curatorial.
Patton (2015; see also Palinkas et al., 2015) catalogued more than a dozen sub-types of purposive sampling, of which the most commonly invoked in health research are:
- Maximum-variation sampling. Deliberately select cases that span the variation in the phenomenon. The loneliness dataset is an instance.
- Homogeneous sampling. Select cases that share key features so within-group patterns become visible.
- Extreme or deviant case sampling. Select unusual or boundary cases on the theory that they make analytically invisible features visible.
- Critical case sampling. Select cases on the theory that if a phenomenon shows up here, it will show up anywhere; if it does not show up here, it will not show up anywhere.
- Typical case sampling. Select cases that exemplify the modal pattern.
- Confirming/disconfirming case sampling. Select cases specifically to test or strain an emerging interpretation.
The loneliness dataset is best characterized as purposive with quota elements, leaning toward maximum-variation. The 20 transcripts were not drawn to be statistically representative of British Columbia, and they were not assembled by simply filling demographic cells. They were chosen to capture the variation that the literature suggests matters for loneliness: variation in age (P01 Maya, 22; P11 Helen, 78; P20 Frank, 82), in gender and gender-identity history (P12 Tyler, non-binary; P14 Kenji, late-life coming out), in immigration trajectory (P15 Amira, recent refugee; P18 Chen, decades-long bicultural negotiation), in caregiving role (P05 Linda, daughter-of-aging-parent; P07 Diana, partner-of-someone-with-dementia), in life-stage transition (P03 Sarah, post-romantic-dissolution; P16 Elena, post-job-loss; P19 Rose, late-life widowhood), and in identity configurations the literature under-describes (P14 Kenji's late-life coming out; P12 Tyler's non-binary identity in a small town). The dataset is engineered, not random; that engineering is its strength as a teaching dataset.
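A simple way to audit a maximum-variation design is to check the assembled sample against the variation targets and list what is missing. The sketch below is one possible form of such a check; the dimension names, target values, and case records are hypothetical placeholders and are not the actual attributes of the loneliness dataset's participants.

```python
def coverage_gaps(cases, targets):
    """For each dimension, list the target values no case in the sample covers."""
    gaps = {}
    for dimension, wanted in targets.items():
        present = {case[dimension] for case in cases if dimension in case}
        missing = [value for value in wanted if value not in present]
        if missing:
            gaps[dimension] = missing
    return gaps


# Hypothetical variation targets and a hypothetical three-case sample.
targets = {
    "age_band": ["18-34", "35-64", "65+"],
    "gender": ["woman", "man", "non-binary"],
    "transition": ["relationship dissolution", "retirement", "job loss",
                   "immigration", "caregiving"],
}
cases = [
    {"id": "H01", "age_band": "18-34", "gender": "woman",
     "transition": "relationship dissolution"},
    {"id": "H02", "age_band": "65+", "gender": "man",
     "transition": "retirement"},
    {"id": "H03", "age_band": "35-64", "gender": "non-binary",
     "transition": "job loss"},
]

gaps = coverage_gaps(cases, targets)
print(gaps)
```

The output is the raw material for the honest accounting of uncaptured variation: each missing value is either a recruitment target still to be met or a limit to be named in the write-up. The check cannot tell you whether the dimensions themselves were the right ones to vary, which remains a judgment the analyst has to defend.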
3.3 Convenience Sampling
Convenience sampling is the recruitment of whoever is available, accessible, or willing — with no purposive criterion other than availability. Examples include the “intercept” survey at a clinic entrance, the email blast to a departmental listserv, the friend-of-a-friend pilot interview, and the all-too-common “we recruited 18 first-year nursing students from a course taught by the second author.”
Bernard, Wutich, and Ryan (2017, pp. 48–49) are careful, not dismissive, about convenience sampling. There are defensible uses of convenience sampling: piloting an interview guide, testing the recording equipment, training a new interviewer, or building rapport in a community before formal recruitment begins. There are also indefensible uses: a published study that recruited only the readily available and then claims findings that generalize beyond them. The line between the two is what the methods section says. If a convenience sample is reported transparently as such, with limitations honestly acknowledged, it is a defensible piece of empirical work. If it is dressed up as something more representative than it is, it is not.
Many published qualitative health studies rely, in practice, on convenience samples whether they say so or not. A more honest field would acknowledge this and would more carefully describe what claims a convenience sample does and does not support.
3.4 Network Sampling: Snowball and Respondent-Driven Sampling
Network sampling uses the social networks of initial participants to recruit additional ones. The two main forms are snowball sampling (Goodman, 1961) and respondent-driven sampling (RDS), and the distinction between them is methodologically important.
Snowball sampling. The classical form: initial participants (“seeds”) are asked to refer others, who are asked to refer others, and so on. Snowball sampling is the standard tool for reaching populations that are hidden, stigmatized, hard to identify from sampling frames, or organized around relationships rather than addresses. Examples: people who use drugs, sex workers, undocumented migrants, LGBTQ+ adults in regions where outness is risky, members of religious or political minorities.
The strength of snowball sampling is access. The weakness is that the sample is shaped by the social networks of the initial seeds, and those networks are unlikely to be representative of the target population. A snowball sample of people who use drugs starting from one community-health-centre client will look very different from one starting from a university-affiliated harm-reduction researcher. Bernard, Wutich, and Ryan are clear that snowball samples are inappropriate for prevalence estimation and have to be reported as the network samples they are.
Respondent-driven sampling (RDS). Developed by Douglas Heckathorn (1997; Heckathorn 2002 for the inference machinery), RDS is a sophisticated extension of snowball sampling that adds dual incentives (participants are paid for their own participation and for the participation of recruits they bring in), limited coupons (each participant can recruit only a fixed number of others, typically three), and a tracking-and-weighting framework that allows population-level inference under specific assumptions about network structure.
The reason RDS matters — and the reason it gets discussed in a graduate qualitative methods course even though much RDS work is quantitative — is that it represents the most serious attempt in the methods literature to turn network sampling into something with calculable inferential properties. Under Heckathorn's assumptions (long recruitment chains, accurate self-reported network sizes, random recruitment within social networks), RDS estimates can be weighted to approximate population-level prevalence and association estimates with calculable error bounds. The assumptions are strong and have been criticized in subsequent methodological work (Goel and Salganik, 2010; Gile and Handcock, 2010), but RDS remains the most defensible network-sampling design for many populations of public-health interest.
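The weighting idea can be made concrete with a small sketch. One widely used estimator in the RDS literature, usually called RDS-II or the Volz-Heckathorn estimator, weights each participant by the inverse of their self-reported network size, on the reasoning that well-connected people are more likely to be reached by referral chains. This estimator is not described in the text above and is offered only as an illustration; the data frame `rds_demo` and all of its values are hypothetical.

```r
# Hypothetical RDS data: one row per recruited participant.
# `degree`  = self-reported network size
# `outcome` = 1 if the trait of interest is present, 0 otherwise
rds_demo <- data.frame(
  pid     = c("A", "B", "C", "D", "E", "F"),
  degree  = c(2, 4, 10, 5, 20, 4),
  outcome = c(1, 1, 0, 1, 0, 0)
)

# Inverse-degree weights: high-degree people are over-sampled by
# referral chains, so they are weighted down
rds_demo$weight <- 1 / rds_demo$degree

# Naive sample proportion (ignores network size)
naive_prop <- mean(rds_demo$outcome)

# RDS-II style weighted proportion (assumed estimator, see lead-in)
rds2_prop <- sum(rds_demo$weight * rds_demo$outcome) / sum(rds_demo$weight)

round(c(naive = naive_prop, rds2 = rds2_prop), 3)
```

This gives a point estimate only. Error bounds for RDS depend on the network-structure assumptions listed above and are better left to dedicated RDS software than to a hand-rolled calculation.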
The HSCI 841 loneliness dataset did not use snowball or respondent-driven sampling. The dataset's purposive-with-quota design was assembled directly through recruiters, not through participant-driven referral. This is important to record in your capstone's methods section: snowball and RDS designs come with specific analytic obligations, and a study that did not use them cannot claim the inferential properties they enable, but also has none of the network-dependence biases they introduce.
For your HSCI 841 capstone, draft a one-page sampling plan:
- Sampling strategy: Which of the five above are you using, and why?
- Sampling matrix: What categories matter for your question (age, role, geography, condition)? How many participants per cell?
- Recruitment: Through what channels will you reach each cell?
- Stopping rule: What is your operationalization of saturation, and what would tell you when you have reached it?
A defensible qualitative sample needs the same explicit planning a survey sample does. The strategy is different; the discipline is the same.
3.5 Theoretical Sampling (Glaserian)
Theoretical sampling is the iterative, emergent sampling strategy developed by Barney Glaser and Anselm Strauss (1967) as part of grounded theory. The procedure: you collect and analyse some data, develop a preliminary theory, identify what additional data would test or extend the theory, sample those additional data deliberately, re-analyse, and continue until theoretical saturation is reached. The sampling and the analysis are not separated: each round of sampling is shaped by what the previous round revealed.
Theoretical sampling differs from purposive sampling in a critical way that is often missed. Purposive sampling is a-priori: you decide before recruitment what kinds of variation you want, and you recruit to those targets. Theoretical sampling is emergent: you cannot say in advance what cases you will want, because the criteria are determined by the developing analysis. The first three participants in a theoretical sampling study might be selected for convenience; the next three might be selected because the analysis of the first three revealed a configuration that needs further exploration; the next three might be selected to test whether a hypothesized boundary condition holds.
The HSCI 841 loneliness dataset is not a theoretically sampled dataset. The 20 transcripts were assembled in a single recruitment phase, with the variation targets specified in advance. This matters because some qualitative-methods textbooks use “theoretical sampling” and “purposive sampling” almost interchangeably; Bernard, Wutich, and Ryan are clear that they are different things, and a methods section that calls a purposive sample a theoretically sampled one is making a category mistake.
Glaser vs. Strauss on theoretical sampling
Within grounded theory, the original co-authors split methodologically in the 1980s. Glaser kept theoretical sampling tightly tied to emergence from the data; Strauss (with Corbin) developed a more structured version that allowed more a-priori coding. Charmaz's constructivist grounded theory (which you will meet in Module 7) sits closer to the Glaserian original. For the purposes of this course, “theoretical sampling” refers to the Glaserian original: iterative, emergent, analysis-driven.
3.6 Key-Informant Sampling
Key informants are individuals selected because they are unusually knowledgeable, articulate, or strategically located with respect to the phenomenon of interest. In a study of clinic operations, the key informants might be the head nurse, the intake coordinator, and the medical director — people whose roles give them an overview no individual patient could provide. In a study of a religious community, the key informants might be elders, clergy, or longtime members. The selection criterion is access to information, not representativeness.
Bernard, whose treatment of key informants draws on cultural anthropology fieldwork, distinguishes two uses (Bernard, Wutich & Ryan, 2017, pp. 54–56). Key informants can replace broader sampling when the research question is about a system or institution and the key informants are the people who know it best: a study of how a province's overdose-response protocol gets implemented might rely largely on interviews with the small number of people who actually implement it. Key informants can also supplement broader sampling when the analytic question requires both lay perspectives and expert ones: a study of loneliness might interview 20 lay participants (the HSCI 841 dataset) and supplement with two or three key-informant interviews with clinicians, community-organisation directors, or older-adult-services coordinators who see loneliness across many people.
Key-informant interviews are typically deeper, longer, and more iterative than lay interviews. They are also more vulnerable to the informant's own framings being adopted by the analyst (the “going native” problem familiar from anthropology), and to the political dynamics around who gets named a key informant in a community.
3.7 Summary Table
| Strategy | Logic | Best for | Watch out for |
|---|---|---|---|
| Quota | Pre-specified target cells across dimensions of variation | Guaranteed coverage of known dimensions | Within-cell convenience selection |
| Purposive | Theoretically/substantively driven selection | Maximum-variation, extreme-case, critical-case designs | Criteria must be explicit and defensible |
| Convenience | Whoever is available | Piloting, training, equipment testing | Indefensible for substantive claims unless transparently acknowledged |
| Snowball | Participant-driven referral chains | Reaching hidden or stigmatized populations | Seed-network dependence; not for prevalence |
| Respondent-driven (RDS) | Dual-incentive, coupon-tracked snowball with inference machinery | Population estimates for hidden populations | Assumption-heavy; demanding to execute well |
| Theoretical | Iterative, emergent, analysis-driven (Glaserian) | Grounded-theory studies; theory development | Distinct from purposive — do not conflate |
| Key informant | Selection by unusual knowledge or strategic position | System/institution studies; expert supplement | “Going native”; community politics |
Reflection
Imagine you are designing a qualitative study of loneliness among recently arrived refugees in British Columbia. Which two or three of the six strategies would you combine, and why? Be specific: what does each strategy contribute that the others cannot?
Question 1: A study recruits 5 women and 5 men in each of four age quartiles, accepting whoever is available within each cell. The sampling strategy is best characterized as:
Question 2: The HSCI 841 loneliness dataset is best characterized as:
Question 3: Which is the key methodological distinction between snowball sampling and respondent-driven sampling (RDS)?
Defending a Qualitative Sample — and the Week 3 Capstone Milestone
Introduction and Overview
The first three sections gave you the conceptual machinery: probability vs. nonprobability, saturation, and the six strategies. This section turns operational. We work through what the methods section of a qualitative health paper actually has to say about sampling — what Bernard, Wutich, and Ryan call “defending the sample” — and we use that template to set up the Week 3 capstone milestone, in which you document the loneliness dataset's sampling logic in your own writing.
Learning Objectives for Section 4
- Identify the five things a defensible qualitative methods section owes a reader about sampling.
- Recognize the loneliness dataset as a worked example of each element.
- Document the sampling structure of a small dataset in R for an appendix figure.
- Produce the Week 3 capstone deliverable: a 600-word sampling memo and a one-page sampling matrix.
4.1 What the Methods Section Owes the Reader
A qualitative methods section that handles sampling well covers five elements. Each element has a corresponding sub-claim, and a paper that omits any one of them leaves the reader unable to evaluate the design.
(1) The sampling strategy, named and justified. Which of the six (or combination) did you use? On what theoretical or substantive grounds? It is not enough to say “a purposive sample was recruited.” You have to say which kind of purposive sampling (maximum-variation, extreme-case, critical-case), why that kind, and what the alternative strategies would have given or cost you. The HSCI 841 loneliness dataset would be described as purposive sampling with quota elements, organized around a maximum-variation logic across age, gender, life-stage, immigration status, caregiving role, and identity.
(2) The recruitment procedure. How did you find people? Where did you advertise? What was the screening process? Who was screened out and why? For the HSCI 841 dataset (synthetic though it is, for instructional purposes), the recruitment would be described as conducted through community-based recruiters working with settlement agencies, retirement communities, post-secondary student-health offices, and 2SLGBTQ+ community organisations, with screening for adult age, current BC residence, English-language interview capacity, and self-identified loneliness in the past 12 months.
(3) The variation captured. What configurations does the sample actually cover? The cleanest way to report this is a sampling matrix — a table that lists each participant and the values of each variation dimension. A reader looking at the matrix can immediately see what variation the study captured. The Week 3 capstone deliverable asks you to produce exactly such a matrix.
(4) The variation NOT captured. What configurations are explicitly outside the sample? This is the discipline that most published qualitative health papers most reliably skip, and it is the one that most distinguishes a defensible sample from one dressed up to look more comprehensive than it is. For the HSCI 841 dataset, the variation not captured includes (among others): people under 18, BC residents not interviewable in English, adults living outside BC, currently institutionalised adults (carceral, psychiatric, long-term hospital), unhoused adults, and adults experiencing acute crisis at the time of approach.
(5) The sample-size logic. Why this number of cases? On what evidence is the size defensible? The defensible answers, depending on design, are some combination of: code or meaning saturation reached at case n; cultural-consensus logic (Romney/Weller/Batchelder) for a culturally coherent group; the Guest et al. (2006) and Hennink & Kaiser (2022) empirical literature for an interview study with a reasonably focused question; the information-power framework of Malterud, Siersma, & Guassora (2016); or analytic and resource constraints honestly named.
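One way to make a code-saturation claim auditable is to track, interview by interview, how many codes appear for the first time. The sketch below does this in base R for a hypothetical coding log. The codes, the interview order, and the stopping rule (no new codes in three consecutive interviews) are illustrative assumptions, not a rule taken from Bernard, Wutich, and Ryan or from the saturation studies cited above.

```r
# Hypothetical coding log: codes applied in each interview, in interview order
codes_by_interview <- list(
  c("isolation", "shame", "family"),
  c("isolation", "work", "shame"),
  c("family", "migration"),
  c("isolation", "work"),
  c("shame", "migration"),
  c("family", "work"),
  c("isolation", "family")
)

# Count how many previously unseen codes each interview contributes
seen <- character(0)
new_codes <- integer(length(codes_by_interview))
for (i in seq_along(codes_by_interview)) {
  fresh <- setdiff(codes_by_interview[[i]], seen)  # codes not seen before
  new_codes[i] <- length(fresh)
  seen <- union(seen, fresh)
}

# Assumed stopping rule: the first interview that completes a run of
# `run_length` consecutive interviews with zero new codes
run_length <- 3
zero_run <- 0
stop_at <- NA_integer_
for (i in seq_along(new_codes)) {
  zero_run <- if (new_codes[i] == 0) zero_run + 1 else 0
  if (zero_run == run_length) {
    stop_at <- i
    break
  }
}

# Audit table for the methods section or appendix
data.frame(interview  = seq_along(new_codes),
           new_codes  = new_codes,
           cumulative = cumsum(new_codes))
stop_at
```

A count like this speaks to code saturation only. Meaning saturation turns on whether each code is understood in depth, which is an analytic judgment that a tally cannot make for you.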
4.2 What the Methods Section Does NOT Have To Do
A defensible qualitative methods section does not have to defend the sample on the grounds of probability-sample logic. You do not have to apologize for not being a population sample. You do not have to gesture at “future quantitative work with a larger n.” You do not have to invoke the language of generalizability when the inferential target was never the population in the first place.
What you do have to do is name the inferential target you are claiming. Bernard, Wutich, and Ryan call this transferability: the ability of the reader to assess whether the patterns you identified are likely to hold in other settings or populations the reader is interested in. Transferability is not the same as statistical generalizability, and the burden of judging it is partly on the reader. Your job is to give the reader enough information — through the methods section, the sampling matrix, the variation-captured and variation-not-captured statements — that the reader can make a defensible transferability judgment.
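The variation-not-captured statement is easier to write when empty cells are visible. A minimal base-R sketch, assuming a hypothetical five-row attribute table (the name `demo_attrs` and all of its values are invented for illustration and are not the loneliness dataset): declare the full set of levels you intended to cover, cross-tabulate two dimensions, and list the combinations with no participants.

```r
# Hypothetical mini attribute table (not the HSCI 841 data)
demo_attrs <- data.frame(
  pid        = c("X1", "X2", "X3", "X4", "X5"),
  gender     = c("woman", "man", "woman", "non-binary", "man"),
  life_stage = c("student", "working", "retired", "student", "retired")
)

# Declare every level you meant to cover, so that combinations with
# no participants still appear in the cross-tabulation
demo_attrs$gender <- factor(demo_attrs$gender,
                            levels = c("woman", "man", "non-binary"))
demo_attrs$life_stage <- factor(demo_attrs$life_stage,
                                levels = c("student", "working", "retired"))

coverage <- table(demo_attrs$gender, demo_attrs$life_stage,
                  dnn = c("gender", "life_stage"))
coverage

# Cells with zero participants: candidates for the "not captured" statement
empty_cells <- subset(as.data.frame(coverage), Freq == 0)
empty_cells
```

A table like this can only show gaps along dimensions you recorded. Groups excluded at screening never enter the attribute file, so they still have to be named in prose.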
4.3 Documenting a Sample in R
Most qualitative analyses spend more time on text than on R. Sampling is one of the exceptions: documenting the sample's structure is the kind of small-dataset reporting that R does well, and a one-page sampling matrix figure goes into your capstone appendix with very little work. The block below sketches the procedure for the loneliness dataset.
Assuming you have created a small participant_attributes.csv file with one row per participant and columns for pid, name, age, gender, life_stage, immigration_status, caregiving_role, and identity_notes, this block produces an age histogram, a gender-by-life-stage bar chart, and a printed sampling matrix for your capstone appendix.
library(tidyverse)

# Read the per-participant attribute file you built by hand from the transcripts
attrs <- read_csv("../term projects/HSCI_841/participant_attributes.csv")
glimpse(attrs)
# Should show 20 rows and the columns: pid, name, age, gender, life_stage,
# immigration_status, caregiving_role, identity_notes

# Quick age distribution
ggplot(attrs, aes(x = age)) +
  geom_histogram(binwidth = 10, fill = "#0B7B6B", colour = "white") +
  labs(title = "Age distribution of loneliness sample (n = 20)",
       x = "Age (years)", y = "Count") +
  theme_minimal()

# Variation across gender x life-stage (a 2-way summary of the sample's coverage)
attrs |>
  count(gender, life_stage) |>
  ggplot(aes(x = life_stage, y = n, fill = gender)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Sample coverage: gender x life-stage",
       x = NULL, y = "Number of participants") +
  theme_minimal()

# Sample matrix table (the kind that goes in your appendix)
attrs |>
  select(pid, age, gender, life_stage, immigration_status,
         caregiving_role, identity_notes) |>
  arrange(age) |>
  print(n = 20)
What success looks like: An age histogram, a gender-by-life-stage bar chart, and a 20-row sampling matrix printed to the console. Save the bar chart as a PNG (for example with ggsave()) and export the matrix as a table for your appendix (for example with write_csv()). This is the sampling figure your capstone will cite.
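The saving and exporting steps are not part of the block above. A minimal sketch of one way to do them, assuming ggplot2 is installed: the table `attrs_demo`, the file names, and the temporary output folder are hypothetical stand-ins, so substitute your own `attrs` object and project paths.

```r
library(ggplot2)

# Hypothetical stand-in for the participant attribute table
attrs_demo <- data.frame(
  pid        = c("X1", "X2", "X3", "X4"),
  age        = c(24, 47, 63, 80),
  gender     = c("woman", "man", "woman", "man"),
  life_stage = c("student", "working", "working", "retired")
)

# Count participants per gender x life-stage cell and build the bar chart
cell_counts <- as.data.frame(table(gender     = attrs_demo$gender,
                                   life_stage = attrs_demo$life_stage))
coverage_plot <- ggplot(cell_counts,
                        aes(x = life_stage, y = Freq, fill = gender)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Number of participants") +
  theme_minimal()

# Write both files to a temporary folder (replace with your project folder)
out_dir  <- tempdir()
png_path <- file.path(out_dir, "sample_coverage.png")
csv_path <- file.path(out_dir, "sampling_matrix.csv")

# Save the figure, then export the matrix sorted by age
ggsave(png_path, plot = coverage_plot, width = 7, height = 4, dpi = 300)
write.csv(attrs_demo[order(attrs_demo$age), ], csv_path, row.names = FALSE)
```

Storing the plot in an object before saving keeps the exported figure tied to a specific plot rather than to whichever plot was drawn last.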
4.4 The Week 3 Capstone Milestone
The Week 3 capstone milestone integrates everything in this lesson. The deliverables are two: a 600-word sampling memo and a one-page sampling matrix. Together they are the first piece of the eventual methods section of your capstone paper.
Reflection
Of the five elements a defensible qualitative methods section owes the reader on sampling (strategy named & justified; recruitment procedure; variation captured; variation NOT captured; sample-size logic), which is the one you are most worried about getting right for your own capstone, and why? Be specific about which loneliness-dataset feature makes this element hardest.
Question 1: Which of the following is NOT one of the five elements a defensible qualitative methods section owes the reader on sampling?
Question 2: What is Bernard, Wutich, and Ryan's concept of transferability in qualitative work?
Question 3: Which deliverable is required for the Week 3 capstone milestone?
Final Assessment
Bringing It All Together
Lesson 3 has given you the sampling vocabulary your capstone methods section will be written in. The probability/nonprobability distinction (Section 1) locates qualitative sampling in its own logic rather than as a deficient cousin of probability work. The saturation framework (Section 2) replaces the power calculation with a defensible operational standard for nonprobability sample size, with empirical anchors from Romney/Weller/Batchelder, Guest et al. (2006), and Hennink & Kaiser (2022). The six-strategy taxonomy (Section 3) gives you the named tools you will use and combine in your own designs, with particular attention to the distinction between purposive and theoretical sampling and to the inferential machinery of RDS. And the methods-section template (Section 4) gives you the five-element discipline a reader of your eventual capstone paper will hold the work to.
What you take away from this lesson directly enables Lessons 4 (Qualitative Data Collection) and 5 (Themes & Codebooks). A defensible sample is the precondition for defensible data collection and defensible analysis; without it, the rest of the methodological discipline of HSCI 841 has nothing to stand on.
Key Takeaways from Lesson 3
- There are two kinds of samples doing two kinds of jobs. Probability sampling estimates population magnitudes with calculable error. Nonprobability sampling characterizes a phenomenon through deliberate case selection. The two are complementary, not hierarchical.
- Nonprobability sample size is governed by saturation, not power. The operational question is when additional cases stop producing new information. Code saturation arrives earlier than meaning saturation; theoretical saturation requires iterative sampling.
- The empirical literature on saturation has converged. Romney/Weller/Batchelder show that 4–6 informants suffice for cultural-consensus tasks. Guest et al. (2006) found ~92% of codes by 12 interviews. Hennink & Kaiser (2022) reviewed 23 studies and report typical saturation between 9 and 17 interviews, varying with study design.
- Six nonprobability strategies cover the field. Quota, purposive, convenience, network (snowball and RDS), theoretical, and key-informant. Each has its own logic, appropriate use, and characteristic failure modes.
- Theoretical sampling is not the same as purposive sampling. Theoretical sampling is iterative and analysis-driven (Glaserian); purposive is a-priori and theory- or substance-driven. Conflating them is a common methods-section category mistake.
- The HSCI 841 loneliness dataset is purposive with quota elements, leaning maximum-variation. It is not snowball; it is not theoretical sampling. Its strength is variation captured; its limits are the variation not captured.
- A defensible methods section owes the reader five things on sampling: strategy named and justified; recruitment procedure; variation captured; variation NOT captured; sample-size logic.
Core Concepts Reviewed
Section 1: Probability vs. nonprobability sampling as different jobs (extensity vs. intensity); the curatorial logic of nonprobability designs; why “nonprobability is a bad probability sample” is the wrong reading; why the sample-size question has to be different.
Section 2: Saturation as informational redundancy; three flavours (theoretical, code, meaning); the Romney/Weller/Batchelder 4-to-6 rule and its cultural-consensus origin; Guest, Bunce & Johnson (2006) and the 12-interview threshold; Hennink & Kaiser's 2022 systematic review and the 9-to-17 range; what saturation gives you the right to claim.
Section 3: Quota sampling; purposive/judgment sampling and Patton's sub-types (max-variation, homogeneous, extreme-case, critical-case, typical-case, confirming/disconfirming); convenience sampling (defensible and indefensible uses); snowball sampling vs. respondent-driven sampling and Heckathorn's inference machinery; theoretical sampling (Glaserian) vs. purposive sampling; key-informant sampling (replacing vs. supplementing broader samples).
Section 4: The five elements of a defensible methods section on sampling; what the methods section does NOT have to do; transferability vs. statistical generalizability; documenting a sample in R (sampling matrix, gender-by-life-stage coverage figure); the Week 3 sampling-memo capstone milestone.
The final reflection below asks you to commit to a sampling self-discipline for your capstone. There is no single right answer; the goal is to leave the lesson with an articulated stance on what your sample can and cannot claim.
Final Reflection
In one paragraph, articulate the stance you intend to take on sampling in your capstone paper. Specifically: name the sampling strategy of the HSCI 841 loneliness dataset, identify the kind of saturation you will claim, and commit to one specific kind of variation that you will name as not captured in your methods section.
Question 1: What is the defining feature of a probability sample?
Question 2: Bernard, Wutich, and Ryan characterize the goal of nonprobability sampling as:
Question 3: Which empirical study is most associated with the “12-interview threshold” for saturation in interview research?
Question 4: The Romney/Weller/Batchelder 4-to-6 rule applies most directly to:
Question 5: Hennink and Kaiser's 2022 systematic review found that saturation in interview research is typically reached:
Question 6: Which of the following is NOT one of the six nonprobability sampling strategies in Bernard, Wutich, and Ryan's Chapter 3?
Question 7: The HSCI 841 loneliness dataset is best characterized as:
Question 8: The key methodological feature that distinguishes respondent-driven sampling (RDS) from ordinary snowball sampling is:
Question 9: The critical methodological difference between theoretical sampling and purposive sampling is:
Question 10: Hennink, Kaiser, and Marconi (2017) distinguished code saturation from meaning saturation. Which of the following best describes the finding?
Question 11: Which of the following is a defensible use of convenience sampling, according to Bernard, Wutich, and Ryan?
Question 12: Key-informant interviews can either replace or supplement broader sampling. An example of supplementing would be:
Question 13: Which of the following is NOT one of the five elements a defensible qualitative methods section owes the reader on sampling?
Question 14: The Week 3 capstone milestone deliverables are:
Question 15: Saturation, when properly assessed and reported, gives the analyst the right to claim:
Glossary — Sampling Concepts, Strategies & Key People
📚 Reference page — available throughout the lesson
This glossary collects the key concepts, sampling strategies, and people introduced in Lesson 3. Use it as a reference while you work through the material, or as a review before the final assessment.