HSCI 841 — Lesson 12

Computational Text Analysis, Cultural Domain Analysis & LLM-Assisted Coding

The Final Lesson — HSCI 841

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Apply keyword-in-context (KWIC), word-frequency, keyness, and collocation analysis to interview corpora using quanteda
  • Distinguish TF-IDF, keyness (chi-squared/log-likelihood), and raw frequency, and select the right measure for the analytic question
  • Conduct cultural domain analysis — free listing (with Smith's salience), pile sorts (with MDS), triad tests, and the Romney-Weller-Batchelder consensus model
  • Build a term co-occurrence network, compute centrality and community structure, and visualize it with igraph
  • Articulate the opportunities and risks of LLM-assisted qualitative coding — speed, scale, reproducibility, hallucination, prompt drift, bias amplification
  • Design a calibration-and-validation workflow for LLM coding using a hand-coded reference set and Krippendorff's alpha
  • Write a prompt that applies a project codebook reliably to a transcript, and audit the resulting codings
  • Disclose the use of LLM tools defensibly in a methods section
  • Submit the final HSCI 841 capstone paper in journal-article format, with codebook, audit trail, and positionality statement

This course was developed by Kiffer G. Card, PhD, as a companion to Bernard, H. R., Wutich, A., & Ryan, G. W. (2017). Analyzing Qualitative Data: Systematic Approaches (2nd ed.). SAGE. This lesson covers Chapters 17, 18, and 19, and extends to LLM-assisted coding — a methodology that postdates Bernard, Wutich, and Ryan’s 2017 edition and is the most rapidly changing topic in the field.

Section 1 of 5

KWIC, Word Counts, and Keyness Analysis

⏱ Estimated reading time: 35 minutes

Introduction and Overview

For the first eleven lessons of this course you have read transcripts line by line, hand-coded passages in Taguette, built codebooks, compared subgroups, written analytic memos, and worked your way through grounded theory, content analysis, schema analysis, narrative analysis, discourse analysis, and analytic induction. All of that work is what Bernard, Wutich, and Ryan call close reading. It is irreplaceable. But it does not scale. A 20-transcript loneliness corpus is at the upper edge of what a single analyst can read three or four times during a single graduate term. A 200-transcript corpus is not analyzable in that way. A 20,000-document corpus — the kind that increasingly arrives from social-media scraping, electronic health-record narrative fields, or large-scale open-text survey items — is unanalyzable in that way for any human team.

Computational text analysis is the family of techniques that lets you ask quantitative questions of a text corpus without first reducing the texts to codes (Grimmer, Roberts, & Stewart, 2022; Wikipedia contributors, n.d.-a). The questions are simple: which words appear most often? which words appear distinctively in subgroup A compared to subgroup B? which words travel together? where in the corpus does a given word appear, and in what immediate context? The answers are computed in seconds for corpora the size of yours and in minutes for corpora a thousand times the size. Chapter 17 of Bernard, Wutich, and Ryan covers the foundational techniques: keyword-in-context (KWIC), word frequencies, type-token ratios, TF-IDF, keyness, collocations, and n-grams. This first section of Lesson 12 walks you through each of them, applied to the loneliness corpus.

One framing point before we begin. Computational text analysis is sometimes presented as an alternative to close reading. It is not. Bernard, Wutich, and Ryan's stance — and the stance of this course — is that computational techniques are front-loaders for close reading. They surface candidates for the analyst to read, in context, and decide whether the pattern is real. A keyness analysis that identifies tired as significantly more frequent in older participants than in younger participants is not a finding. It is a pointer toward passages worth reading closely. The finding is what you say after you have read those passages and decided whether the word is doing analytically interesting work or simply registering a generational vocabulary tic.

Learning Objectives for Section 1

  • Apply keyword-in-context (KWIC) analysis to the loneliness corpus using quanteda::kwic().
  • Compute and interpret word frequencies, type-token ratios, and lexical diversity.
  • Distinguish raw frequency, TF-IDF, and keyness, and select the right measure for an analytic question.
  • Conduct a keyness analysis comparing older and younger participants in the loneliness corpus using quanteda.textstats::textstat_keyness().
  • Identify collocations (words that co-occur more often than chance) using quanteda.textstats::textstat_collocations().
  • Understand n-grams as multi-word features and when to use them.

1.1 KWIC — The Oldest Computational Text-Analysis Technique

Keyword in Context. The oldest computational text-analysis technique, dating to medieval concordances and computerized in the 1950s. Generates lists of every occurrence of a target word with surrounding context. Used to disambiguate polysemy and to see how a term is actually deployed in a corpus before interpreting frequency counts.

Term Frequency × Inverse Document Frequency. A weighting that boosts terms frequent in a particular document and penalizes terms common across the whole corpus. Surfaces what is distinctive about each document. The standard input to document classification and retrieval systems.

Keyness. Statistical comparison of word frequencies in a target corpus vs a reference corpus. 'Which words are unusually frequent in 2026 health policy documents compared to 2010?' Log-likelihood and chi-squared tests are common. Output: a ranked list of distinguishing terms.

Collocations: words that co-occur within a defined window more often than chance would predict ('mental + health', 'public + health'). N-grams: consecutive word sequences (2-grams, 3-grams). Both reveal multi-word concepts that single-word analysis misses.

The keyword-in-context concordance is the oldest digital text-analysis technique, predating personal computers. The original KWIC concordances were produced on mainframes in the 1950s and 1960s, most famously for biblical scholarship and for early lexicography (Bernard, Wutich & Ryan, 2017, Ch. 17). The idea is simple: pick a target word; for every occurrence of that word in the corpus, print the word plus a fixed number of words on either side. The output is a vertical list of one-line excerpts with the target word column-aligned in the middle. A human can scan a 200-line KWIC of a single word in two or three minutes and develop a sense of how the word is being used that no other technique offers as cheaply.

The reason KWIC remains analytically valuable in 2026 is that it does something that quantitative text statistics do not: it preserves local context. A word frequency table tells you that chair appears 47 times in the loneliness corpus; it does not tell you that 32 of those 47 are references to a chair where an absent person used to sit, that 11 are references to the participant's own seated immobility, and that 4 are incidental mentions of furniture. The KWIC reveals the distribution of senses in seconds. The numerical-frequency analyst who skipped the KWIC step might mistakenly write “chairs are an important physical motif in the loneliness corpus,” which is true but misses the more specific finding that empty chairs are an important physical motif.

The technical mechanics of KWIC are trivial. The interpretive demands are not. A good KWIC reading is one where you have looked at every line of the concordance, classified each occurrence into a small number of senses, and decided which sense is the analytically productive one. You should treat KWIC as a five-to-ten-minute exercise per target word, not as a one-line command whose output you skim.

RKWIC analysis of the loneliness corpus

This block assumes you have the loneliness_corpus and loneliness_tokens objects in memory from Lesson 5 (Section 4.1–4.2). If not, re-run those code blocks first.

library(tidyverse)
library(quanteda)
library(readtext)

# Reload corpus if needed (copy-paste from Lesson 5)
loneliness_rt <- readtext(
  "term projects/HSCI_841/transcripts/P*.txt",
  docvarsfrom = "filenames",
  docvarnames = c("participant_id", "pseudonym"),
  dvsep = "_"
)
loneliness_corpus <- corpus(loneliness_rt)
loneliness_tokens <- tokens(loneliness_corpus, remove_punct = TRUE) |>
  tokens_tolower()

# KWIC for "chair" across the 20 transcripts, 6 words of context on either side
chair_kwic <- kwic(loneliness_tokens, pattern = phrase("chair*"), window = 6)
print(chair_kwic, max_ndoc = 20)

# Convert to a tibble so you can write it out for close reading
chair_df <- as_tibble(chair_kwic)
write_csv(chair_df, "outputs/wk12_kwic_chair.csv")

# Contrastive KWIC: "alone" vs "lonely" -- different words, different meanings
alone_kwic  <- kwic(loneliness_tokens, pattern = "alone",  window = 5)
lonely_kwic <- kwic(loneliness_tokens, pattern = "lonely", window = 5)

# Quick descriptive comparison
cat("alone:  ", nrow(as_tibble(alone_kwic)),  "occurrences\n")
cat("lonely: ", nrow(as_tibble(lonely_kwic)), "occurrences\n")

# KWIC for multiple terms at once
candidate_terms <- c("chair", "empty", "hollow", "fading", "tired", "wahda")
multi_kwic <- kwic(loneliness_tokens, pattern = candidate_terms, window = 5)
head(multi_kwic, 20)

What to do with this output: Read the chair KWIC line by line and classify each occurrence into a sense (empty-chair-of-absent-person, participant's-own-chair-as-immobility-marker, incidental furniture mention). The numerical distribution of senses is part of your eventual findings. Repeat for the contrast between alone and lonely — Bernard, Wutich, and Ryan (and your transcripts) treat these as distinct, and a KWIC contrast surfaces the distinction concretely.

1.2 Word Frequencies, Type-Token Ratios, and Lexical Diversity

The simplest computational text statistic is the word frequency. After removing stopwords (the, and, of, but, was, are…), you compute how many times each remaining word appears in the corpus and rank them. The top 50 or 100 words are usually a mix of the obvious (loneliness, alone, people, feel) and the surprising. The surprising ones are the analytically valuable findings. In the loneliness corpus, the top 50 content words include chair, quiet, radio, wednesday, and fading; each of those repays a KWIC read and yields a candidate theme.

Word frequency analysis is also the foundation for two further statistics: the type-token ratio (TTR) and lexical diversity. A token is any occurrence of a word; a type is a distinct word. The phrase “the chair, the empty chair” contains 5 tokens and 3 types (the, chair, empty). The ratio of types to tokens is one measure of how lexically varied a text is. A high TTR means the speaker uses many different words; a low TTR means they recycle a small vocabulary. TTR is sensitive to document length (longer documents inevitably have lower TTRs because common words repeat), so for cross-document comparison you typically use a length-corrected measure such as the Moving-Average Type-Token Ratio (MATTR) or Mean Segmental TTR (MSTTR), implemented in quanteda.textstats::textstat_lexdiv().

In health research, lexical diversity has been used as a coarse proxy for cognitive function (lower diversity in dementia transcripts) and for emotional constriction (lower diversity in some depression measures). In your loneliness corpus, a comparison of lexical diversity across participants is a defensible analytic move — it would let you ask, for example, whether the participants with the most circumscribed social worlds also speak with the most circumscribed vocabularies. Bernard, Wutich, and Ryan would emphasize that any such finding requires close-reading confirmation; a numerical difference is a pointer, not a conclusion.

RWord frequencies, type-token ratios, lexical diversity
library(quanteda)
library(quanteda.textstats)

# Stopword removal (English stopwords) and build a document-feature matrix
loneliness_tokens_nostop <- tokens_remove(loneliness_tokens, stopwords("en"))
loneliness_dfm <- dfm(loneliness_tokens_nostop)

# Top 50 content words across the corpus
top50 <- topfeatures(loneliness_dfm, n = 50)
print(top50)

# Same thing in tidy form, with frequency and rank
freq_tbl <- textstat_frequency(loneliness_dfm, n = 100)
head(freq_tbl, 20)
write_csv(freq_tbl, "outputs/wk12_word_frequencies.csv")

# Type-token ratio per transcript (basic, length-sensitive)
ttr_basic <- textstat_lexdiv(loneliness_dfm, measure = "TTR")
print(ttr_basic)

# Better: length-corrected lexical diversity (MATTR and MSTTR)
ttr_mattr <- textstat_lexdiv(
  loneliness_tokens_nostop,
  measure = c("TTR", "MATTR", "MSTTR"),
  MATTR_window = 100,
  MSTTR_segment = 100
)
print(ttr_mattr)

# Visualize lexical diversity across participants
ttr_mattr |>
  as_tibble() |>
  ggplot(aes(reorder(document, MATTR), MATTR)) +
  geom_col(fill = "#0B7B6B") +
  coord_flip() +
  labs(x = "", y = "MATTR (length-corrected lexical diversity)",
       title = "Lexical diversity across the 20 loneliness transcripts") +
  theme_minimal()

What to look for: The top-50 list will surface the candidate words for KWIC analysis. The MATTR ranking will identify the participants with the most and least varied vocabularies — read those transcripts again, in light of their MATTR rank, and decide whether the numerical difference is registering something analytically meaningful (genuine constriction of social or emotional vocabulary) or merely a stylistic / dialectal difference.

1.3 TF-IDF — Term Frequency × Inverse Document Frequency

Raw word frequency tells you which words appear most often in the corpus. It does not tell you which words are distinctive of a particular document. The word loneliness appears many times in every transcript; it is not informative about which transcript is which. The word wahda appears in only one transcript (P15, Amira) and there several times; it is highly informative about that transcript.

Term Frequency × Inverse Document Frequency (TF-IDF) is the classical measure designed to surface this kind of document-distinctive vocabulary (Wikipedia contributors, n.d.-b). The idea is to weight each word's frequency in a document by how rarely it appears in the rest of the corpus. The term-frequency component rewards words that are common in the focal document; the inverse-document-frequency component penalizes words that are common in many other documents. The product is highest for words that are common in this document and rare elsewhere. TF-IDF was developed for information retrieval (which documents are most relevant to a search query?) in the 1970s and has become the workhorse statistic for document similarity, search ranking, and feature selection in supervised text classification.

In qualitative analysis, TF-IDF is most useful for surfacing what a single transcript is distinctively about. A TF-IDF ranking of P15 (Amira, recent refugee from Syria) is likely to surface wahda, Aleppo, family, before, and other terms that capture what is distinctive about Amira's account of loneliness compared to the other nineteen participants. Bernard, Wutich, and Ryan treat this as a tool for case characterization: which words best describe what makes this case different from the rest of the corpus?

RTF-IDF for case-distinctive vocabulary
library(quanteda)

# Compute TF-IDF weighted DFM
loneliness_tfidf <- dfm_tfidf(loneliness_dfm)

# Top 10 most-distinctive words per transcript
top_tfidf_per_doc <- apply(loneliness_tfidf, 1, function(x) {
  names(sort(x, decreasing = TRUE))[1:10]
})
print(top_tfidf_per_doc)

# Tidy version: long-format table of (doc, word, tfidf)
library(tidytext)
loneliness_long <- as_tibble(convert(loneliness_dfm, to = "data.frame")) |>
  pivot_longer(-doc_id, names_to = "word", values_to = "n") |>
  filter(n > 0)

loneliness_tfidf_tidy <- loneliness_long |>
  bind_tf_idf(word, doc_id, n)

# Top 5 distinctive words for each transcript
loneliness_tfidf_tidy |>
  group_by(doc_id) |>
  slice_max(tf_idf, n = 5, with_ties = FALSE) |>
  arrange(doc_id, desc(tf_idf)) |>
  print(n = 100)

What the output gives you: A vocabulary fingerprint for each participant. P15 will surface wahda and Syria-related terms; P11 (Helen) will surface fading, radio, marie, wednesday; P05 (Linda) will surface terms tied to her widowhood and her terrier. These fingerprints are useful both as descriptive case sketches in your findings section and as cross-checks on the candidate themes that emerged in earlier coding.

1.4 Keyness — Comparing Two Subcorpora

Keyness analysis is the comparative cousin of TF-IDF. Instead of asking which words distinguish a single document from the corpus, it asks which words distinguish a group of documents (a subcorpus) from another group of documents. The standard implementation uses either the chi-squared test or the log-likelihood ratio test on the 2×2 frequency table of (word X / not-word X) by (group A / group B), one word at a time. The output is a ranked list of words with their test statistic, p-value, and frequency in each subcorpus. The words at the top are the words that are statistically more distinctive of one subcorpus than the other.

Keyness is widely used in corpus linguistics (e.g., comparing British versus American English corpora) and increasingly in health research (comparing patient-experience text by diagnosis, by gender, by treatment arm in a trial). For your loneliness corpus, the most natural comparison is across age. The corpus contains four participants in their seventies and eighties (P05 Linda, P11 Helen, P17 Jacob, P20 Frank) and four in their twenties (P01 Maya, P10 Daniel, P12 Tyler, P19 Rose). A keyness analysis comparing these two subcorpora surfaces the vocabulary of late-life loneliness against the vocabulary of early-adult loneliness — and the contrast is consequential. The older subcorpus has more keywords like dead, quiet, visit, fading, radio, walker; the younger has more keywords like online, screen, followers, group chat, discord, scrolling.

The interpretive point is not that the words are themselves the finding. The interpretive point is that the structural worlds in which the two cohorts experience loneliness are different, and the keyness list is a vocabulary-level signature of that structural difference. The finding the keyness analysis points you toward — that the technology mediation of loneliness in young adulthood and the embodied immobility of loneliness in old age are different empirical objects deserving different policy responses — is the analytic claim that goes in your paper. The keyness list is the evidence trail.

RKeyness analysis: older vs younger participants
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

# Define the two subcorpora by participant ID
older <- c("P05", "P11", "P17", "P20")   # 70s and 80s
younger <- c("P01", "P10", "P12", "P19") # 20s

# Tag each document with a group variable
docvars(loneliness_dfm, "age_group") <- case_when(
  docvars(loneliness_dfm, "participant_id") %in% older   ~ "older",
  docvars(loneliness_dfm, "participant_id") %in% younger ~ "younger",
  TRUE                                                        ~ NA_character_
)

# Subset to only the documents in the two groups
keyness_dfm <- dfm_subset(loneliness_dfm, !is.na(age_group))

# Run the keyness test (log-likelihood ratio)
keyness_older <- textstat_keyness(
  keyness_dfm,
  target  = docvars(keyness_dfm, "age_group") == "older",
  measure = "lr"   # log-likelihood ratio; "chi2" is also available
)

head(keyness_older, 30)  # top 30 words distinctive of OLDER participants
tail(keyness_older, 30)  # top 30 words distinctive of YOUNGER participants

# Visualize: side-by-side keyness plot
textplot_keyness(keyness_older, n = 20,
  color = c("#0B7B6B", "#CC0033"),
  labelsize = 3.5
)

# Export for the methods section
write_csv(as_tibble(keyness_older), "outputs/wk12_keyness_older_v_younger.csv")

What to expect: The keyness plot will be two bars in opposite directions for each word — one bar showing the words most distinctive of older participants, the other showing the words most distinctive of younger participants. Two interpretive cautions: (1) keyness with small subcorpora (4 documents per group) is statistically fragile; treat individual word p-values as pointers, not as confirmed tests. (2) Read every keyword in its KWIC context before drawing any interpretive conclusion. The numerical contrast is the front end; the close read is the analysis.

1.5 Collocations — Words That Travel Together

A collocation is a pair (or larger sequence) of words that co-occurs more often than chance would predict. The standard test is again log-likelihood or chi-squared on the 2×2 contingency table of (word A present / absent) by (word B present / absent) within a sliding window of n words. High-scoring collocations tell you which word pairs are linguistically bonded in the corpus. The output for a loneliness corpus might include empty chair, quiet apartment, group chat, tired all the time, nobody there, last conversation, and similar pairings.

Collocations are useful for two analytic purposes. First, they surface candidate multi-word concepts that single-word frequency analysis misses. Empty chair as a collocation captures something neither empty nor chair alone does; it is the unit of meaning. Second, they surface candidate conventional metaphors — recurring word combinations that participants are drawing on a shared symbolic vocabulary to use. In the loneliness corpus, recurring phrases like fading at the edges (Helen), shrinks around you (also Helen), and empty space (multiple participants) are conventional metaphors in the Lakoff-Johnson sense; collocation analysis surfaces them efficiently.

RCollocations: empty chairs, quiet apartments, group chats
library(quanteda.textstats)

# Collocations on the stopword-removed tokens, looking for 2-3 word phrases
# size = 2 finds bigrams; size = 2:3 finds bigrams and trigrams
collocs <- textstat_collocations(
  loneliness_tokens_nostop,
  size = 2:3,
  min_count = 4
)

# Sort by log-likelihood (z statistic for collocation strength)
collocs |>
  as_tibble() |>
  arrange(desc(z)) |>
  head(40)

# What collocates with "alone" specifically? Slide a window over the corpus
alone_neighbors <- tokens_select(loneliness_tokens_nostop,
  pattern = "alone",
  window = 4,
  selection = "keep"
) |>
  dfm() |>
  topfeatures(n = 30)
print(alone_neighbors)

# What collocates with "lonely"?
lonely_neighbors <- tokens_select(loneliness_tokens_nostop,
  pattern = "lonely",
  window = 4,
  selection = "keep"
) |>
  dfm() |>
  topfeatures(n = 30)
print(lonely_neighbors)

# Compare the two neighbor sets -- which words appear with one but not the other?
setdiff(names(alone_neighbors),  names(lonely_neighbors))
setdiff(names(lonely_neighbors), names(alone_neighbors))

What the contrast reveals: The alone neighborhood will tend to surface neutral or even positive context words (home, quiet, peace, comfortable) because participants describe being alone in many ways, not all of them lonely. The lonely neighborhood will tend to surface affectively heavier neighbors (tired, hollow, fading, empty, dead, miss). The contrast operationalizes Bernard, Wutich, and Ryan's claim — and your participants' direct statements — that alone and lonely are distinct concepts.

1.6 N-grams

An n-gram is a sequence of n consecutive words. Unigrams (n=1) are single words; bigrams (n=2) are pairs; trigrams (n=3) are triples. The collocations above were a special case of n-grams: bigrams and trigrams whose components co-occur statistically more than chance. Plain n-gram frequency analysis — without the statistical-collocation filter — is also useful, especially when you want to find conventional fixed phrases (at the end of the day, most of the time, I don't know) or when you are setting up a more elaborate downstream model that consumes n-gram features (a topic model, a supervised classifier, a semantic network on bigrams).

The technical implementation in quanteda is one line: tokens_ngrams(tokens, n = 2) returns a tokens object where each “token” is an underscore-joined bigram. You can then feed it to dfm() and treat it like any other feature set. The cost is that the resulting feature space is much larger than the unigram space (typically 10–20× more features), which slows downstream computation; the benefit is that n-grams capture lexical structure that unigrams cannot.

1.7 What These Techniques Do and Do Not Tell You

+
Free listing & free recall
Tap to reveal
+
Code suggestion
Tap to reveal
+
Inter-coder benchmarking
Tap to reveal
+
Critical cautions
Tap to reveal

Computational text statistics are powerful diagnostics. They are not, by themselves, qualitative analysis. Bernard, Wutich, and Ryan are explicit on this: a frequency table, a keyness list, a TF-IDF ranking, and a collocation report are pointers. They surface candidates for close reading. The analytic move — deciding what the patterns mean, whether they support a theme, whether they tell you something about the phenomenon or merely about the vocabulary your participants happen to share — remains the analyst's work. Treat Section 1's tools as the first hour of a longer day. They tell you where to look. The looking is still your job.

Reflection

Of the techniques in this section — KWIC, frequency, TTR/MATTR, TF-IDF, keyness, collocations — which one would be most analytically useful for the specific research question your capstone paper is addressing? Why? Be specific about what it would surface and how that would feed into a close-reading move.

Model answerA strong response names a specific technique, names what it would surface, and names the close-reading move that follows. Examples: “My capstone compares the loneliness of recent immigrants with that of long-time residents. Keyness analysis is the right tool because it operationalizes the question directly — which words are distinctively used by the immigrant subcorpus — and the resulting keyword list (e.g., before, family, language, belong, possibly an Arabic loan-word like wahda) will point me to passages where the comparison is most concretely visible.” Or: “My capstone is about the embodied, sensory dimension of late-life loneliness. KWIC analysis of bodily-sense words (tired, cold, quiet, still, shrinks) is the right tool because it lets me sample every occurrence in context and decide which are genuine embodiment claims versus filler usage.” A weak answer names a technique without specifying what the surfaced output would be or what you would do with it.

Minimum 20 characters required.

✓ Reflection saved
Knowledge Check — Section 1

Question 1: Which of the following is the analytically most defensible use of a keyness analysis comparing older and younger participants in the loneliness corpus?

Keyness is a pointer for close reading, not a finding by itself. With small subcorpora, individual p-values are fragile, but the ranked list of candidate words is still useful as a starting point for close-reading interpretation.

Question 2: Why is raw word frequency a poor measure for surfacing the words most distinctive of a single transcript?

TF-IDF weights term frequency by inverse document frequency, so words common across the whole corpus are de-emphasized and document-distinctive vocabulary surfaces. Raw frequency does not do this and is therefore dominated by common words.

Question 3: KWIC analysis is most accurately described as:

KWIC is a concordance technique — for every occurrence of a target word, it shows the word in its immediate context. Its analytic value comes from making local context inspectable cheaply.
Section 2 of 5

Cultural Domain Analysis — Free Lists, Pile Sorts, Triads, Consensus

⏱ Estimated reading time: 35 minutes

Introduction and Overview

Cultural domain analysis is a family of techniques developed in cognitive anthropology in the 1950s through 1980s to study how people in a cultural group mentally organize a delimited domain of knowledge: kinds of illness, kinds of edible plants, kinds of kin, kinds of risk. The starting premise is that a cultural domain is a shared mental model. The techniques in this section — free listing, pile sorts, triad tests, and consensus analysis — are designed to measure the structure of that shared model and to estimate each participant's degree of competence: how closely their understanding of the domain tracks the group consensus.

For public-health audiences, cultural domain analysis is consequential because most health categories (kinds of risk, kinds of symptom, kinds of treatment, kinds of support) are domain-organized in exactly this way. A public-health communication campaign that treats “the public” as a single audience with a single mental model often fails because the model is in fact heterogeneous across groups. Cultural domain analysis measures the heterogeneity precisely and tells you which sub-populations share a model and which do not.

The intellectual lineage is short and important. The technique was systematized by Romney, Weller, and Batchelder in their 1986 paper Culture as Consensus: A Theory of Culture and Informant Accuracy, published in American Anthropologist. The Romney-Weller-Batchelder (RWB) consensus model treats culture statistically: there is a true cultural answer for each item in the domain, and participants approximate it with varying competence. The model estimates competence from inter-informant agreement (using a factor-analytic decomposition) and produces, for each item, a best-estimate cultural answer that is the competence-weighted average of all participants' answers. The model has been applied to hundreds of public-health questions, from kinds of risk for HIV transmission to the cultural domain of acceptable cooking fuels in low-income households.

Learning Objectives for Section 2

  • Conduct a free-listing exercise and compute Smith's salience.
  • Distinguish single-sort and successive-sort pile sorts, and analyze pile-sort data with multidimensional scaling (MDS).
  • Conduct a triad test and read its output.
  • State the Romney-Weller-Batchelder consensus model assumptions and interpret the eigenvalue ratio test for cultural agreement.
  • Apply cultural domain methods to a public-health question: which dimensions of social support, kinds of risk, or kinds of treatment do your participants share a model of?

2.1 Free Listing — The Foundational Elicitation Technique

Free listing — the foundationv

Ask 30+ informants to list all the items they can think of in a domain. Aggregate the lists. Two key outputs: (1) salience scores per item (frequency + rank), (2) the empirical map of the domain’s vocabulary. Most cultural-domain studies start here.

Pile sorts — revealing structurev

Give informants 20-40 items (typically from free lists) and ask them to sort into piles. Aggregate proximity matrices. Multidimensional scaling and hierarchical clustering reveal the implicit category structure. The result is a map of how the domain hangs together.

Triad tests — precision toolv

Present three items and ask which one doesn’t belong. Repeated across many triads, this elicits the underlying dimensions along which informants distinguish items. More cognitively demanding but more discriminating than pile sorts; useful for refining a known structure.

Consensus analysis (Romney-Weller-Batchelder)v

A statistical model that uses the pattern of agreement across informants to (a) estimate each informant’s competence in the domain, (b) estimate the ‘answer key’ that the group implicitly shares. The defensible alternative to majority voting in cultural surveys. Particularly important when cultural consensus itself is the research question.

Free listing is the simplest cultural domain technique and almost always the one to start with. You ask a participant a simple prompt: “List all the kinds of X you can think of.” You let them list freely, in their own order, for as long as they wish to continue. You record both the items and the order. You collect free lists from a sample of n participants (typically 20–40 is sufficient; less than 15 is usually too few for reliable salience estimates). You then analyze the combined list to identify (a) which items are mentioned by the most participants (frequency), (b) which items tend to be mentioned early (rank), and (c) which items are both common and early (Smith's salience).

Smith's salience is a single number per item, calculated as:

S = ( ∑ ((L − Rp + 1) / L) ) / N

where, for each item, you sum across the N participants the quantity (L − Rp + 1) / L, with L being the length of participant p's list and Rp being the rank of the item in that list (R = 1 for first-mentioned). The numerator gives full credit (= 1) to an item mentioned first on a participant's list, half credit to an item halfway down a list of length 2L, and so on. The sum is divided by N (the total number of participants). The resulting score ranges from 0 to 1; higher is more salient.

Items with the highest Smith's salience are the core items of the domain — the ones a randomly chosen member of the group is most likely to think of first when asked about the domain. Items with low salience are peripheral — idiosyncratic to one or two participants, or sub-domain-specific. In a free-listing exercise on “kinds of social support,” the high-salience items might be family, friends, partner; the low-salience items might be online community, religious community, therapist. The contrast tells you what the participants' default model of social support contains, and what they leave out.

🔎 Free-listing exercise (worked example using the loneliness corpus)

The loneliness interviews are semi-structured rather than free-list, but several questions in the interview guide elicit list-like responses (Domain 2, Q7: “Are there situations or times that reliably bring it on?”; Domain 4, Q11: “What helps?”). For the worked example below, treat each participant's response to Q11 (“What helps?”) as their free list of coping strategies. Extract the items mentioned and their order. Compute Smith's salience across the 20 participants. The high-salience items are the cultural core of coping with loneliness in this sample; the low-salience items are idiosyncratic.

A real free-listing study would use the prompt “List all the things that help you when you are lonely” and would record the listing directly. The retrospective extraction from semi-structured interviews is a defensible second-best.

RSmith's salience from a free-listing dataset

The AnthroTools package implements Smith's salience and several other cultural-domain statistics. If you do not have it installed, run install.packages("AnthroTools"); if the CRAN version is unavailable, the package is also distributed via GitHub (Jamieson-Lane & Purzycki).

library(tidyverse)
library(AnthroTools)

# Hand-extracted free-listing data from the "What helps?" question
# Format: one row per (participant, item, rank)
free_lists <- tribble(
  ~Subj,   ~Order,  ~CODE,
  "P01",    1,       "friends",
  "P01",    2,       "music",
  "P01",    3,       "running",
  "P05",    1,       "dog",
  "P05",    2,       "phone-pal",
  "P05",    3,       "church",
  "P11",    1,       "Better-at-Home",
  "P11",    2,       "phone-pal",
  "P11",    3,       "radio",
  "P15",    1,       "family-back-home",
  "P15",    2,       "language-class",
  "P15",    3,       "WhatsApp",
  "P20",    1,       "grandchildren",
  "P20",    2,       "dog"
  # ... and so on across all 20 participants
)

# Compute Smith's salience for every item
salience <- CalculateSalience(
  free_lists,
  Subj = "Subj", Order = "Order", CODE = "CODE",
  Salience = "Smith.Salience"
)
head(salience)

# Aggregate to one Smith's salience score per item
salience_summary <- SalienceByCode(
  salience,
  Subj = "Subj", CODE = "CODE",
  Salience = "Smith.Salience",
  dealWithDoubles = "MAX"
)
salience_summary |>
  as_tibble() |>
  arrange(desc(SmithsS)) |>
  head(20)

How to read the output: The SmithsS column is Smith's salience. The top items (highest salience) are the cultural core of the domain — the things most people think of and think of first. The bottom items are peripheral. For a public-health intervention designer, the cultural core tells you what most people will already know about; the periphery tells you what an information campaign would need to introduce.

2.2 Pile Sorts — Eliciting Implicit Structure

A pile sort is conducted as follows. You write each item of the domain on a card (or display them on a screen). You hand the deck to a participant and say: “Sort these into piles so that the things in each pile are similar to each other. Make as many or as few piles as you like.” You record the resulting partition: which items the participant placed together. You repeat across n participants and aggregate the pile sorts into a single similarity matrix: cell (i, j) of the matrix is the number of participants who put items i and j in the same pile.

The aggregate similarity matrix is then analyzed with multidimensional scaling (MDS) or hierarchical clustering to recover the implicit structure of the domain. MDS produces a 2-D map where items are points and the distances between them reflect their dissimilarity. Items that participants reliably grouped together appear close on the map; items that were almost never grouped together appear far apart. The resulting map is the visualization of the group's shared mental structure of the domain.

A successive pile sort is the same exercise repeated: after the first sort, you ask the participant to combine piles (“Which of these piles could be combined into a larger pile? Which combinations would still feel similar?”) until they reach a small number of super-piles, and you record the hierarchy. The result is a hierarchical clustering for each participant, which can also be aggregated. The successive pile sort is more informative than the single sort but slower; the single sort is the standard for most studies.

For a worked example consistent with the loneliness study: imagine 30 cards, each naming a kind of social relationship (mother, father, sibling, partner, best friend, ex-partner, coworker, neighbour, pet, online friend, religious-community member, therapist, GP, phone-pal, group-chat acquaintance, …). A pile-sort study with 25 BC residents would produce an MDS map of how these relationships cluster in our shared mental model. The clustering would tell you which relationships participants treat as functionally equivalent for the purpose of countering loneliness, which they treat as substitutes, and which they treat as categorically different. Such a study would be policy-actionable: if “pet” clusters with “close friend” in the shared model, then a pet-based intervention is closer to the cultural category of friendship than to the cultural category of distraction, with consequences for how the intervention is framed and evaluated.

RMDS on aggregate pile-sort data
library(tidyverse)

# Simulated pile-sort data: 5 items, 6 participants, sorted into 2 piles each
# Each row is a participant's partition: 1 and 2 are pile labels
piles <- tibble(
  participant = paste0("P", 1:6),
  mother      = c(1, 1, 1, 1, 1, 1),
  partner     = c(1, 1, 1, 1, 1, 2),
  best_friend = c(1, 1, 2, 1, 1, 1),
  pet         = c(2, 1, 1, 2, 1, 1),
  group_chat  = c(2, 2, 2, 2, 2, 2)
)

# Build co-membership similarity matrix:
# entry (i, j) = number of participants who put items i and j in the same pile
items <- setdiff(names(piles), "participant")
M <- matrix(0, length(items), length(items),
            dimnames = list(items, items))
for (p in 1:nrow(piles)) {
  for (i in items) for (j in items) {
    if (piles[[i]][p] == piles[[j]][p]) M[i, j] <- M[i, j] + 1
  }
}
print(M)

# Convert similarity to dissimilarity (max - similarity)
D <- max(M) - M
diag(D) <- 0

# Classical MDS to 2 dimensions
mds_fit <- cmdscale(as.dist(D), k = 2)

# Plot
tibble(item = rownames(mds_fit),
       x = mds_fit[, 1], y = mds_fit[, 2]) |>
  ggplot(aes(x, y, label = item)) +
  geom_point(size = 4, color = "#0B7B6B") +
  geom_text(aes(label = item), vjust = -1.2) +
  labs(title = "MDS of pile-sort co-membership: kinds of social relationship",
       x = "Dimension 1", y = "Dimension 2") +
  theme_minimal()

What success looks like: The 2-D plot will cluster items that participants consistently grouped together (mother, partner, best_friend on one side; group_chat alone on the other; pet floating between). The dimensions themselves are unlabeled by the algorithm; part of your analytic work is to inspect the plot and propose a labelling (e.g., dim 1 = “intimacy”, dim 2 = “embodiment”). The proposed dimensional labels are an interpretive claim that needs argument; they are not output of the algorithm.

2.3 Triad Tests

The triad test addresses a methodological problem with pile sorts: pile sorts assume participants can hold and partition an entire deck of items at once, which is cognitively heavy for large decks. The triad test breaks the problem into manageable pieces. You present three items at a time and ask: “Which of these three is most different from the other two?” (Some variants instead ask which two are most similar; the information is equivalent.) Across many triads, you accumulate enough pairwise-similarity data to reconstruct the same kind of similarity matrix that the pile sort produces, but with finer resolution.

The cost is the number of triads. For a domain of n items, there are C(n, 3) = n!/(3!(n−3)!) possible triads; for n = 15, that is 455 triads, which is too many for any participant to complete. Triad-test designers use balanced incomplete block designs (Borgatti's UCINET implementation has a wizard for this) that present each participant with a stratified subset of triads such that every pair of items appears in approximately equal numbers of triads across the sample. The standard analysis is again MDS on the aggregated similarity matrix.

Triad tests have largely been displaced in contemporary practice by pile sorts (which are faster) and by ratings (which are easier to administer remotely). They remain the gold standard for very small domains (n < 10) and for studies where the cognitive load of holding the full deck would compromise data quality (children, participants with cognitive impairment).

2.4 Consensus Analysis — The Romney-Weller-Batchelder Model

The pile sort and the triad test produce an aggregate similarity matrix for the group. The Romney-Weller-Batchelder consensus analysis goes one step further: it treats the participants themselves as items to be analyzed, and asks how much do they agree with each other? The output is two-fold: a per-participant competence score (how closely the participant's answers track the group consensus) and an estimated cultural answer for each item (the competence-weighted average of all participants' answers).

The model has three formal assumptions: (1) there is a single shared cultural truth for each item in the domain (one knowledge system, not several); (2) participants' answers are independent samples from their own competence-driven approximation to the truth; (3) each participant's competence is approximately constant across items. The first assumption is the most consequential and the most testable. If the data violate assumption 1 — if there are two or more distinct cultural sub-models in your sample — the consensus analysis will tell you so.

The diagnostic is the eigenvalue ratio test. Consensus analysis runs a factor analysis on the participant-by-participant agreement matrix. If there is a single shared culture, the first eigenvalue will be much larger than the second — the rule of thumb is that the first should be at least three times the second. If the first/second ratio is closer to 1, there is no single consensus; the sample contains multiple sub-cultures. In that case you should not report a single “cultural answer”; you should report separate analyses for each subgroup (and explain in your discussion why the consensus broke down).

In contemporary public health, consensus analysis has been used to study, among other things, lay theories of HIV transmission risk in different communities (where the eigenvalue ratio test repeatedly identifies sub-cultures), kinds of postpartum mood, kinds of acceptable harm-reduction supply, and the cultural domain of “a good death.” The technique's strength is that it gives a formal statistical test for whether your participants share a cultural model; its weakness is the strong single-culture assumption and the modest sensitivity of the eigenvalue test in small samples.

RConsensus analysis with the AnthroTools package
library(AnthroTools)

# Toy consensus-analysis dataset:
# Rows = participants (10), Columns = items in the domain (12),
# Values = the participant's answer to each item (here, 1 = yes, 0 = no)
set.seed(42)
# Simulate a "true" cultural answer for 12 items
true_answers <- sample(c(0, 1), 12, replace = TRUE)
# Simulate 10 participants whose competence determines how often they match the truth
competence <- runif(10, min = 0.5, max = 0.95)
answers <- sapply(1:10, function(i) {
  ifelse(runif(12) < competence[i], true_answers,
         1 - true_answers)
})
rownames(answers) <- paste0("item_", 1:12)
colnames(answers) <- paste0("P", 1:10)

# Run consensus analysis (the eigenvalue test is reported in the output)
consensus <- ConsensusPipeline(answers, numQ = 12)

# Key outputs:
# - consensus$Competence       per-participant competence scores
# - consensus$EigenRatio       ratio of 1st to 2nd eigenvalue (>= 3 supports single culture)
# - consensus$Answers          model-estimated "cultural answer" for each item
cat("Eigenvalue ratio:", round(consensus$EigenRatio, 2), "\n")
cat("Mean competence:", round(mean(consensus$Competence), 2), "\n")
cat("Cultural answers:\n"); print(consensus$Answers)

How to interpret: If EigenRatio is at least 3, your sample supports a single-culture model and the cultural-answer column is interpretable as the group's best estimate of the truth for each item. If the ratio is between 2 and 3, the support is marginal; treat the cultural answers with caution and explore subgroups. If the ratio is below 2, there is no single consensus; partition your sample (e.g., by age, by community) and run consensus separately on each subgroup, then compare.

2.5 When to Reach for Cultural Domain Analysis in a Health Study

Key insight - Augment, don't replace

The 2026 state of qualitative analysis: LLMs and computational text tools can augment the careful researcher but cannot replace them. They speed up scaffolding tasks (KWIC, code suggestion, summarization), allow new kinds of analysis (corpus-scale keyness comparison, semantic clustering), and lower the cost of methodological triangulation (human + computational coding). What they cannot do is warrant interpretive claims — that work still belongs to a human analyst who can be held accountable for it. The defensible 2026 workflow combines computational and manual methods explicitly, names each method’s contribution, and reports the human judgment that arbitrated between them.

Cultural domain analysis is the right tool when your research question is about a shared mental model: a structured set of categories that the participants treat as having a known list of members, a known set of relations among the members, and a known set of acceptable answers. It is the wrong tool when the phenomenon is fundamentally idiographic (each participant's experience is its own object) or when the “domain” is too open-ended to have a closed list of items.

For your loneliness capstone, cultural domain methods are not the dominant analytic approach (the interviews are too open-ended), but they could plausibly feature in a sub-analysis. The most natural application would be a free-listing study of kinds of social support or kinds of coping, possibly with an MDS on a derived pile-sort matrix. The result would be a structured map of the shared cultural model of how loneliness is countered, which would complement the close-reading account that anchors the rest of the paper.

2.6 Applications in Health Research

ACTIVITY Try it - Prompt an LLM as a code-suggestion engine

Take 3-5 short passages from your corpus that you have already coded by hand. Draft a prompt:

  • Role: 'You are a qualitative health research assistant helping me code interview transcripts about [topic]'.
  • Codebook: Provide your existing codes with one-line definitions.
  • Examples: Provide 2-3 already-coded passages.
  • Task: Provide one new passage and ask: 'Which codes apply, and to which phrase? Provide a one-sentence rationale per code.'

Then compare LLM output to your own coding. Where does it agree? Where does it diverge? Why?

This exercise reveals both LLM strengths (consistency, speed) and weaknesses (over-literal reading, missed irony, stereotyping). Both are useful diagnostic findings.

ApplicationTechniqueWhat it surfaces
Lay theories of HIV transmissionFree listing + consensusThe cultural core of perceived risks; identification of sub-cultures with discrepant models
Kinds of postpartum moodFree listing + pile sort + MDSThe lay nosology of mood states that screening tools must map onto
Acceptable harm-reduction suppliesPile sort + consensusWhich supplies participants treat as functionally equivalent for provincial policy
A “good death”Free listing + consensus by cohortGenerational and cultural variation in end-of-life values
Kinds of social supportFree listing + pile sort + MDSWhich relationships participants treat as substitutes for one another
Indigenous concepts of wellnessFree listing + community-based consensusConcepts not captured by Western biomedical frameworks

Reflection

Imagine your capstone has space for one cultural-domain-analysis sub-study. Which of the four techniques (free listing, pile sort, triad test, consensus analysis) would you propose, and what would the prompt be? What would a positive result look like? What would a negative result (e.g., low eigenvalue ratio) tell you?

Model answerA good answer names a domain, a technique, a prompt, and an interpretation rule. Example: “I would conduct a free-listing study with the prompt ‘List all the things that help when you are lonely’ across 30 participants stratified by age. Smith's salience would surface the cultural core of coping (likely family, friends, pets, activities). I would then run consensus analysis on a follow-up acceptability rating of each item across participants. A high eigenvalue ratio (≥ 3) would license the conclusion that participants share a single cultural model of coping; a low ratio (< 2) would license the more interesting conclusion that kinds of coping is not a unified domain in this population but instead splits across age strata, which would have direct implications for designing age-tailored loneliness interventions.” A weak answer names a technique without specifying the prompt or the interpretation criterion.

Minimum 20 characters required.

✓ Reflection saved
Knowledge Check — Section 2

Question 1: Smith's salience is calculated such that an item is most salient when it is:

Smith's S combines frequency (how many participants mention the item) and rank (how early they mention it). The most salient items are common AND mentioned early; the least salient are mentioned rarely AND late.

Question 2: The Romney-Weller-Batchelder consensus model's eigenvalue ratio test is used to assess:

The first/second eigenvalue ratio of the participant-by-participant agreement matrix is the diagnostic for the single-culture assumption. A ratio ≥ 3 supports single-culture; a low ratio indicates multiple sub-cultures or no consensus.

Question 3: A pile-sort study presents 25 cards to each of 30 participants and asks them to sort the cards into similarity piles. The aggregated similarity matrix is then analyzed with multidimensional scaling. The resulting 2-D map will best show:

MDS produces an unlabeled 2-D map where items participants grouped together are close and items they rarely grouped are far. The analyst must propose interpretations of the dimensions; the algorithm does not label them. The result is a visualization of the implicit (shared) structure.
Section 3 of 5

Semantic Network Analysis — Text as Network

⏱ Estimated reading time: 35 minutes

Introduction and Overview

Chapter 19 of Bernard, Wutich, and Ryan treats text as network. The idea is straightforward: take a corpus, identify the content words, and treat each word as a node. Connect two words with an edge whenever they co-occur within a defined window — in the same sentence, the same paragraph, or some sliding span of n tokens. The result is a weighted graph: nodes are words, edges are co-occurrences, edge weights are co-occurrence counts. Once you have the graph, the entire toolkit of network analysis becomes available: centrality measures, community detection, density, path length, visualizations.

Semantic network analysis is one of the most interpretively rich computational text techniques in Bernard, Wutich, and Ryan’s discussion because the network is a visualization of the corpus's conceptual structure. Words that participants talked about in the same breath cluster together in the network. Words that mediate between clusters — high-betweenness words — are the conceptual bridges of the corpus. Word communities — dense sub-graphs identified by modularity-based clustering algorithms — are candidate themes that no human coder identified explicitly but that the participants' language organized implicitly.

The technique has been used productively in studies of patient experience (semantic networks of pain descriptions), public communication (semantic networks of vaccine-hesitant tweets), policy text (semantic networks of climate-adaptation reports), and qualitative methods textbooks themselves (semantic networks of methods-section vocabularies across decades). For your loneliness corpus, a semantic network of the top 30–50 content words will visualize the conceptual scaffolding of how the participants collectively articulate the experience of loneliness, with the major thematic clusters (embodiment / coping / loss / connection) emerging as graph communities rather than as analyst-imposed codes.

Learning Objectives for Section 3

  • Build a term co-occurrence matrix using quanteda::fcm() with sentence, paragraph, or sliding-window definitions.
  • Convert the matrix to an igraph network and visualize it.
  • Compute degree, betweenness, closeness, and eigenvector centrality, and interpret each.
  • Run community detection (Louvain method) and read the resulting partitions as candidate themes.
  • Compute network-level statistics: density, mean degree, clustering coefficient.
  • Recognize the limits of semantic network analysis — what it surfaces and what it misses.

3.1 Building a Co-Occurrence Matrix

The starting point of a semantic network is the feature co-occurrence matrix (FCM). For a vocabulary of v words, the FCM is a v × v symmetric matrix; entry (i, j) is the number of times words i and j appeared in the same context unit. The context unit can be any of several things, and the choice matters:

  • Sentence-level co-occurrence: two words co-occur if they are in the same sentence. Tightest, most likely to reflect direct conceptual association. Produces sparser networks.
  • Paragraph-level co-occurrence: two words co-occur if they are in the same paragraph. Looser; reflects topical co-occurrence over a few sentences of conversation.
  • Document-level co-occurrence: two words co-occur if they appear anywhere in the same transcript. Loosest; suitable for small corpora where finer-grained co-occurrence would be too sparse.
  • Sliding-window co-occurrence: two words co-occur if they are within n tokens of each other. The sliding window is the standard in computational linguistics; n = 5 to 10 is typical for content-word co-occurrence.

For interview transcripts of the length typical of your loneliness corpus (2,000–4,000 tokens per transcript, 30–90 minute interviews), a sentence-level or 5-token-window FCM is the right starting point. Document-level co-occurrence is too coarse: every word would co-occur with every other word in a single transcript, and the resulting network would be uninformative.

RBuilding a feature co-occurrence matrix with quanteda
library(quanteda)

# Filter to the top 50 content words by frequency (keeps the network readable)
top50_words <- names(topfeatures(loneliness_dfm, n = 50))

# Subset the tokens to only those 50 words
top_tokens <- tokens_select(loneliness_tokens_nostop,
  pattern = top50_words,
  selection = "keep"
)

# Build sliding-window feature co-occurrence matrix, window = 5
fcm_window <- fcm(top_tokens, context = "window", window = 5, tri = FALSE)
dim(fcm_window)   # 50 x 50
topfeatures(fcm_window)   # which words have the highest total co-occurrence weight

# Alternative: document-level co-occurrence (too coarse for a network of 50 words)
fcm_doc <- fcm(top_tokens, context = "document", tri = FALSE)

# Quick visualization with quanteda.textplots::textplot_network
library(quanteda.textplots)
textplot_network(fcm_window,
  min_freq = 0.5,                                # show only the strongest edges
  edge_color = "#0B7B6B",
  edge_alpha = 0.6,
  vertex_labelfont = "sans",
  vertex_labelsize = 3.5
)

What to expect: The plot shows 50 word nodes connected by edges whose thickness reflects co-occurrence weight. The most central words (loneliness, alone, feel, people, time) sit in the middle; the more specific words (chair, radio, wednesday, fading, group, screen) sit on the periphery in clusters. textplot_network is a fast first look; for serious analysis, switch to igraph (next code block) to compute centrality measures and run community detection.

3.2 From Quanteda FCM to igraph Graph Object

The igraph package (Csardi & Nepusz, 2006) is the de-facto standard for network analysis in R. To use it on your semantic network, convert the quanteda FCM to an igraph object, which is a one-line operation. Once you have the igraph object, you have access to dozens of network statistics, multiple community-detection algorithms, and the full ggraph visualization ecosystem.

RConvert FCM to igraph and compute centrality
library(igraph)
library(quanteda)

# Convert the FCM to an igraph object
# quanteda has a built-in converter; the network is undirected and weighted
g <- as.igraph(fcm_window, min_freq = 1, omit_isolated = TRUE)
summary(g)
vcount(g)   # number of vertices (words)
ecount(g)   # number of edges

# Degree centrality: how many neighbors does each word have?
deg <- degree(g)
sort(deg, decreasing = TRUE)[1:15]

# Weighted degree (strength): sum of edge weights for each node
str <- strength(g)
sort(str, decreasing = TRUE)[1:15]

# Betweenness centrality: how often each word is on the shortest path between
# two other words. High-betweenness words are conceptual bridges.
btw <- betweenness(g, weights = 1 / E(g)$weight)
sort(btw, decreasing = TRUE)[1:15]

# Closeness centrality: how short the average path is from each word to all others.
# High-closeness words are "near the centre" of the conceptual graph.
cls <- closeness(g, weights = 1 / E(g)$weight)
sort(cls, decreasing = TRUE)[1:15]

# Eigenvector centrality: a node is central if it is connected to other central nodes.
# Reflects "prestige" in the network sense.
eig <- eigen_centrality(g)$vector
sort(eig, decreasing = TRUE)[1:15]

# Assemble all centralities in one table
centrality_tbl <- tibble(
  word        = V(g)$name,
  degree      = deg,
  strength    = str,
  betweenness = btw,
  closeness   = cls,
  eigenvector = eig
) |> arrange(desc(eigenvector))
print(centrality_tbl, n = 20)
write_csv(centrality_tbl, "outputs/wk12_semantic_centrality.csv")

How to interpret each centrality: Degree: a measure of breadth — how many different words this one connects to. Betweenness: a measure of bridging — this word sits on many shortest paths between other words; it is the conceptual hinge of the network. Closeness: a measure of accessibility — this word is on average close to everything else; it is in the conceptual middle. Eigenvector: a measure of prestige — this word is connected to other well-connected words; it is at the heart of the conceptual structure. In the loneliness corpus, expect loneliness, alone, feel, and people to score high on most measures; expect words like tired, quiet, radio, group, family to score high on degree but lower on betweenness (they are nodes of specific thematic clusters, not bridges between clusters).

3.3 Community Detection — Louvain and Beyond

A network community is a sub-set of nodes that are more densely connected to each other than to the rest of the graph. Community detection algorithms partition a graph into communities by optimizing some criterion — most commonly modularity, which measures how much denser the within-community edges are compared to what would be expected if edges were placed randomly. High modularity means a strong community structure; low modularity means the graph is more like a single dense blob.

The Louvain method (Blondel et al., 2008) is the most widely used modularity-optimization algorithm. It is fast (runs in seconds on networks of tens of thousands of nodes), deterministic up to node ordering, and produces interpretable communities of varying granularity. The output is a partition: each node is assigned to one community. Alternative algorithms (Walktrap, Infomap, label propagation) can be applied as cross-checks; in qualitative work, if the same broad communities emerge under two different algorithms, you can be more confident that the partition is robust.

For your loneliness semantic network, the Louvain method will likely identify three to five communities, which often map onto interpretable theme clusters: an “embodied / sensory” cluster (tired, body, sleep, quiet, fading), a “social / structural” cluster (family, friends, group, chat, work), a “coping / activity” cluster (dog, walk, music, radio, church), and perhaps a “meaning / time” cluster (life, years, dead, gone, missing). The communities are candidates for thematic labels, not labels themselves; the labelling is your interpretive work.

RCommunity detection with Louvain
library(igraph)

# Louvain community detection (on the undirected weighted graph)
set.seed(841)  # for reproducibility
comm_louvain <- cluster_louvain(g, weights = E(g)$weight)

# Inspect the partition
membership(comm_louvain)
sizes(comm_louvain)              # size of each community
modularity(comm_louvain)         # modularity score (higher = stronger community structure)
length(comm_louvain)             # number of communities

# Tidy the community assignment
community_tbl <- tibble(
  word      = V(g)$name,
  community = as.factor(membership(comm_louvain))
) |>
  arrange(community, word)
print(community_tbl, n = 50)

# Cross-check with another algorithm: Walktrap
comm_walktrap <- cluster_walktrap(g, weights = E(g)$weight, steps = 4)
modularity(comm_walktrap)
length(comm_walktrap)

# Cross-check agreement (Rand index): do the two algorithms agree?
compare(membership(comm_louvain), membership(comm_walktrap), method = "rand")

What success looks like: Modularity above ~0.3 indicates a meaningful community structure; above 0.5 is strong. Three to five communities is typical for a 30–50 node semantic network from a 20-transcript corpus. The Rand index near 1 means Louvain and Walktrap agree closely; near 0 means they disagree and you should be cautious about treating either partition as the authoritative grouping.

3.4 Visualizing the Semantic Network

A semantic-network figure is one of the most striking visualizations available for a qualitative paper. It looks scientific to a quantitative reader, it is genuinely informative to a qualitative reader, and it gives a journal reviewer something concrete to look at. The standard layout algorithms for force-directed graph drawing (Fruchterman-Reingold, Kamada-Kawai, ForceAtlas2) all aim to position nodes so that connected nodes are close and disconnected nodes are far, while avoiding overlap. Choose a layout, color nodes by their community, size nodes by their degree or eigenvector centrality, and label nodes with their words.

RForce-directed visualization with ggraph
library(ggraph)
library(igraph)
library(tidygraph)

# Wrap the igraph object into a tidygraph and attach community + centrality as node attributes
g_tidy <- as_tbl_graph(g) |>
  activate(nodes) |>
  mutate(
    community = as.factor(membership(comm_louvain)),
    degree    = centrality_degree(),
    btwn      = centrality_betweenness(weights = 1 / weight)
  )

# Force-directed layout (Fruchterman-Reingold) with community coloring
set.seed(841)
ggraph(g_tidy, layout = "fr") +
  geom_edge_link(aes(width = weight, alpha = weight),
                 edge_colour = "#888", show.legend = FALSE) +
  scale_edge_width(range = c(0.1, 1.5)) +
  scale_edge_alpha(range = c(0.1, 0.7)) +
  geom_node_point(aes(size = degree, colour = community)) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3.2) +
  scale_colour_brewer(palette = "Set2") +
  labs(title = "Semantic network of the loneliness corpus (top 50 content words)",
       subtitle = "Edges = sliding-window co-occurrence (window = 5); communities by Louvain",
       colour = "Community", size = "Degree") +
  theme_graph()

What the figure communicates: Each colored cluster is a candidate theme. Reading the cluster as a list of words and then returning to the transcripts to find the passages where those words actually co-occurred is how you turn a graph community into an analytic theme. Save the figure for the findings section of your paper.

3.5 Network-Level Statistics

Beyond per-node centralities and per-community partitions, a few network-level statistics characterize the corpus as a whole. Density is the fraction of all possible edges that are actually present; high density means the vocabulary is tightly interconnected. Average path length is the mean shortest-path distance between any two nodes; low average path length means concepts are conceptually close even when they are not directly linked. The clustering coefficient measures the extent to which neighbors of a node are also neighbors of each other; high clustering means the corpus has small densely-connected sub-graphs (theme clusters), low clustering means it is more like a single distributed web.

RNetwork-level statistics
edge_density(g)                 # proportion of possible edges present
mean_distance(g, weights = NA)   # average shortest path length (unweighted)
transitivity(g, type = "global")   # global clustering coefficient
diameter(g, weights = NA)         # longest of all shortest paths

# Compare to a random graph of the same size: is the loneliness network more clustered
# than chance?
set.seed(841)
g_random <- sample_gnm(vcount(g), ecount(g), directed = FALSE)
transitivity(g_random, type = "global")
mean_distance(g_random)

# A real semantic network should have substantially higher clustering than the random
# graph and shorter mean distance than the random graph -- the "small world" property

What the comparison tells you: If your loneliness network has higher clustering and similar mean distance compared to a random graph of the same density, it has the “small-world” property: locally dense, globally short. This is the expected shape for a semantic network of natural language because the underlying conceptual structure of human discourse is small-world. If your network is not small-world, something has probably gone wrong (often: the FCM window is too coarse and the network is fully connected).

3.6 What Semantic Network Analysis Shows You and What It Misses

Semantic networks visualize conceptual co-occurrence. They do not measure what participants meant by the co-occurring words. The word chair connected to the word empty in the network is a finding only if the chair-empty pairs in the corpus are doing the same analytic work; the network statistic does not tell you that. The interpretive work — reading the actual passages in which the words co-occur — remains the analyst's task, exactly as for the KWIC and keyness techniques in Section 1.

What semantic networks add over those simpler techniques is a synoptic view of the corpus's conceptual structure. The Louvain communities are a kind of unsupervised theme proposal: clusters of words that the corpus itself groups together. When those communities track the themes you developed through hand-coding, you have computational corroboration of your analytic structure. When they diverge from your hand-coded themes, the divergence is itself a finding: either your hand coding missed something the corpus is doing, or the corpus is doing something at the level of vocabulary co-occurrence that does not survive when you read it closely. A related class of fully probabilistic theme-discovery methods — latent Dirichlet allocation (Blei, Ng, & Jordan, 2003; Wikipedia contributors, n.d.-c) and its structural-topic-model extension (Roberts, Stewart, Tingley, et al., 2014) — treats each document as a mixture of latent topics and is widely used in the broader computational-text-as-data tradition (Wikipedia contributors, n.d.-d); the co-occurrence-network approach in this lesson is the more interpretively transparent cousin.

Reflection

Sketch (in words) what you predict the Louvain communities of your loneliness corpus's semantic network would look like. How many communities would there be? What kinds of words would cluster together? Then think about which of those predicted clusters would correspond to a theme you have already identified through close reading, and which would represent something the close reading missed.

Model answerA defensible prediction names three to five communities, each with a few representative words. Example: “I would expect (a) an embodied / sensory cluster: tired, body, sleep, quiet, fading, ache, slow; (b) a social-structural cluster: family, friends, group, kids, work, partner; (c) a coping / activity cluster: walk, dog, music, radio, church, book; (d) a temporal / loss cluster: years, dead, gone, miss, used-to, before. The first two cleanly map onto themes I have already coded. The fourth probably also does. The third may surface a sub-structure I had treated as a single theme — the difference between embodied solo coping (walk, dog, music) and mediated solo coping (radio, screen, online) might emerge as two sub-clusters within what I had hand-coded as a single 'coping' category. That would be a useful refinement of the codebook for the final paper.” The point is not to be right; the point is to make a specific, testable prediction that the data can then confirm or refine.

Minimum 20 characters required.

✓ Reflection saved
Knowledge Check — Section 3

Question 1: In a semantic network built from interview transcripts, an edge between two words represents:

Edges in a co-occurrence network represent the simple fact that the two words appeared in the same context unit. The context unit (sentence / paragraph / sliding window) is a design choice that affects how the network looks. Word embeddings and dependency parses are different network constructions, not the standard for the technique in Bernard, Wutich & Ryan, Ch 19.

Question 2: A word with high betweenness centrality in a semantic network is best interpreted as:

Betweenness measures how often a node is on the shortest path between pairs of other nodes. High-betweenness words are conceptual hinges: removing them would disconnect parts of the network. This is distinct from high-frequency, which is captured by degree or strength.

Question 3: The Louvain community-detection algorithm partitions a network by optimizing:

Louvain (and most modern community-detection algorithms) optimize modularity. High modularity means strong community structure; modularity above ~0.3 is meaningful, above 0.5 is strong.
Section 4 of 5

LLM-Assisted Qualitative Coding — Opportunities, Risks, Discipline

⏱ Estimated reading time: 45 minutes

Introduction and Overview

Bernard, Wutich, and Ryan's second edition was published in 2017. ChatGPT was released to the public in late 2022. Claude was released in 2023. Llama-class open-weight models capable of running locally on a laptop arrived in 2023–2024. The single most important technical development in qualitative methods since Bernard, Wutich, and Ryan’s second edition went to press — the arrival of large language models capable of reading a transcript and applying a codebook to it in seconds — is not covered in Bernard, Wutich, and Ryan (2017). This section is your introduction to it.

The methodological literature on LLM-assisted qualitative coding has grown rapidly since 2023. Gilardi, Alizadeh, & Kubli (2023) showed that ChatGPT outperformed crowdworkers on several annotation tasks at a fraction of the cost. Ziems et al. (2024) systematically benchmarked LLMs on computational social science tasks and concluded that they perform credibly on many classification and extraction tasks but unevenly on tasks requiring deep cultural knowledge. Bail (2024) has been the leading methodological skeptic, cataloguing failure modes that are not visible from accuracy benchmarks alone — hallucination, prompt drift, bias amplification, and what he calls “methodological invisibility” (researchers using LLMs and not disclosing it). Bender, Gebru, McMillan-Major, & Shmitchell (2021) provide the field's most-cited articulation of the ethical and epistemic risks of treating large language models as authoritative knowledge sources.

This section's purpose is to give you a disciplined working stance: LLMs are a tool, like Taguette or NVivo or quanteda; they require validation, disclosure, and human oversight; they are not a replacement for analytic judgment. We walk through what LLMs can and cannot do, what the risks are, what the methodological discipline looks like, and how to actually use an LLM to apply a codebook to a transcript with the validation that any defensible qualitative method requires.

Learning Objectives for Section 4

  • Define what an LLM is, in plain terms, at the level needed to use it responsibly as a research tool.
  • Identify the three main opportunities of LLM-assisted coding (speed, scale, reproducibility) and the five main risks (hallucination, prompt drift, opaque reasoning, bias amplification, methodological invisibility).
  • Specify a calibration-and-validation workflow for LLM coding using a hand-coded reference set and an agreement statistic.
  • Write a prompt that applies a codebook to a transcript reliably.
  • Audit LLM outputs against the source transcripts and detect hallucination.
  • Disclose LLM use defensibly in a methods section.
  • Submit the final capstone paper.

4.1 What an LLM Is, in Plain Terms

A large language model is a neural network with billions of parameters that has been trained on a very large text corpus to predict the next token (roughly, the next word-fragment) given a preceding context (Wikipedia contributors, n.d.-e). The training corpus typically includes a substantial portion of the open internet, books, academic papers, and code. After the initial “pre-training” step, contemporary models are further fine-tuned on human-feedback data so that they produce outputs that humans rate as helpful, honest, and safe. The resulting object is a function from text inputs to text outputs. You prompt it with text; it returns text. The contemporary architecture is a descendant of two earlier waves of work: distributional word-vector models such as word2vec (Mikolov, Chen, Corrado, & Dean, 2013) and GloVe (Pennington, Socher, & Manning, 2014) that learned dense vector representations of words from co-occurrence statistics (Wikipedia contributors, n.d.-f), and the transformer-based bidirectional encoders and decoders such as BERT (Devlin, Chang, Lee, & Toutanova, 2018) and GPT-3 (Brown et al., 2020) that scaled the same general approach by orders of magnitude.

For qualitative coding, the relevant capability is that contemporary LLMs (Claude 3.5/4, GPT-4/5, Llama 3+ and similar) can read a several-thousand-token transcript and a codebook of a few dozen codes, and produce a coded version of the transcript — either as a list of (passage, code) pairs or as an annotated copy of the original text. They do this in seconds. A 20-transcript codebook application that would take a careful human coder two weeks of focused work runs in under an hour.

What you should hold in your head as a working model: an LLM is a plausibility engine. It produces outputs that are plausible continuations of the prompt, given its training data. Sometimes “plausible” means “correct”: when the task is well-specified and falls within the model's training distribution, the output is often impressively accurate. Sometimes “plausible” means “fluent and confident but wrong”: the model invents a quote, attributes it to a participant, and presents it with the same confidence as a real quote. The latter failure mode — hallucination — is the central risk of using LLMs for qualitative coding, and Section 4.3 is devoted to it.

A note on terminology

“Generative AI”, “LLM”, “foundation model”, and “chatbot” are sometimes used interchangeably in the methodological literature but they mean different things. Generative AI is the broadest term, including image and audio generation as well as text. LLM is specifically a text-generation model trained on natural language. Foundation model is the most general term for a large pre-trained model adaptable to multiple downstream tasks. Chatbot is a user-facing application built on top of an LLM (ChatGPT, Claude.ai, Gemini). In a methods section you should be specific: name the model and version (e.g., “Claude 3.5 Sonnet, accessed via the Anthropic API in November 2025”), not the umbrella category.

4.2 The Three Opportunities

The case for LLM-assisted coding is built on three claims. None are absolute; each requires the disciplinary work that the rest of this section is about.

Speed. A codebook of 30 codes applied to a 20-transcript corpus by a human team takes weeks. The same application by an LLM takes minutes to hours, depending on whether you process one transcript at a time or run them in parallel. For a research program where the bottleneck has been the cost of human coder time, this is transformative. It enables analyses of corpora that were previously infeasible, and it enables iteration on the codebook with feedback turnaround in minutes rather than weeks.

Scale. Closely related but distinct. With human-only coding, the size of the corpus an academic team can analyze is bounded by the team's labor budget. A doctoral student plus a research assistant might analyze 30–60 transcripts in a year. The same team using an LLM as a coding assistant can scale to thousands of transcripts — with the caveats below. This unlocks new research designs (longitudinal corpora, multi-site studies, social-media analyses) that simply did not have a human-coding answer.

Reproducibility. An LLM applied with a fixed prompt to a fixed text, at fixed inference parameters (temperature, top-p, seed), produces approximately the same output every time. Two researchers running the same prompt against the same model and the same data should get nearly identical codings — nearly, not exactly, because some randomness remains in commercial APIs and because the underlying model is occasionally re-versioned. The reproducibility is, in this narrow sense, better than the reproducibility of two human coders, who never produce identical codings even with the best calibration. This is a real methodological advantage, but it cuts both ways: the model also reproduces its mistakes exactly. A systematically wrong human coder can be caught by a co-coder noticing the disagreement; a systematically wrong LLM has no internal co-coder.

4.3 The Five Risks

The five risks below are not arguments against using LLMs. They are arguments against using them carelessly. Each risk has a known mitigation; the disciplined LLM-assisted workflow in Section 4.4 builds on those mitigations.

Risk 1: Hallucination. The model produces fluent, confident output that is not supported by the source text. For coding tasks, this most often manifests as the LLM inventing a quote attributed to a participant, or applying a code to a passage that does not contain the content the code names. Hallucination is the central methodological risk. A coding output where 8 of 100 codes are well-applied to invented or misread passages will look as professional as a coding output where all 100 codes are correct, because LLMs produce uniformly fluent prose either way. The only mitigation is auditing: read a random sample of the LLM's codes back against the source transcripts.

Risk 2: Prompt drift. Two prompts that look almost identical can produce divergent outputs. The difference between “Apply the codebook below to the transcript” and “Identify which codes from the codebook apply to each passage in the transcript” is small to a human reader and material to the LLM. Prompt drift becomes a methodological risk when the prompt is being iteratively refined during a study without version control: the codings produced under prompt v2 are not directly comparable to the codings produced under prompt v1. The mitigation is to fix the prompt before doing the production run and to record both the prompt and the model version in the methods section.

Risk 3: Opaque reasoning. When a human coder applies a code, you can ask them why and they can give a defensible answer. When an LLM applies a code, the “reasoning” it can articulate is itself another generated text and may or may not reflect the actual computation that produced the code assignment. Asking the model “why did you apply code X here?” produces a plausible-sounding rationalization that you cannot independently verify. This is a real epistemic limit and you should be honest about it in the methods section: the LLM is a black box that produces outputs; the audit, not the model's self-explanation, is what licenses you to trust an output.

Risk 4: Bias amplification. LLMs reflect the biases in their training data. For most qualitative coding tasks, the relevant biases are the cultural and linguistic biases of contemporary English-language web text. The model may under-recognize concepts from non-Western or non-English knowledge traditions; it may over-recognize concepts that are heavily represented in the training distribution; it may apply codes with subtle valence shifts that map onto the dominant culture's framing of the phenomenon. For studies with participants whose worldviews are not well-represented in the training data (Indigenous communities, recent immigrants, queer / trans participants, religious minorities), this risk is particularly acute. The mitigation is a high-quality hand-coded calibration set drawn from the actual study population — not a benchmark from another study.

Risk 5: Methodological invisibility. A researcher uses an LLM to do some or all of the coding, does not disclose it, and presents the resulting analysis as if it were hand-coded. This is a research-integrity issue, not a technical one, but it is widespread. Several recently retracted papers used LLMs without disclosure. The mitigation is straightforward and is required by emerging journal policies: disclose every use of an LLM in the methods section, including the model name and version, the date of access, the prompt or prompts, and the validation procedure. Treat LLM use the way you treat any other analytic tool: name it, validate it, and report the validation.

The HSCI 841 stance, stated plainly

LLMs are admissible in this course and in your capstone, with three non-negotiable conditions. (1) You disclose every LLM use in the methods section, with model name, version, date, and prompt. (2) You validate every LLM-coded analysis against a hand-coded reference set drawn from your own data, and you report the validation statistic (Krippendorff's alpha or similar). (3) You audit a random sample of LLM outputs against the source transcripts for hallucination, and you report the audit. An analysis that does not meet all three conditions is not admissible. An analysis that meets all three is treated like any other defensible qualitative-methods choice.

The course's deeper position is that LLMs are neither a savior nor a threat. They are a tool that handles a specific kind of work (large-scale, well-specified, repetitive coding) better than humans, and a different kind of work (interpretive judgment, theoretical synthesis, ethical reasoning about cases) substantially worse than humans. Treating them as the right tool for the right tasks — with validation — is what disciplined practice looks like.

4.4 The Disciplined Workflow

The following workflow is the methodological default for LLM-assisted coding in this course. It is not the only defensible workflow, but it is the simplest one that satisfies the three non-negotiable conditions above.

  1. Hand-code a calibration set first. Take a stratified sample of 3–5 transcripts from your corpus and code them by hand, the way you have done in Modules 5, 7, and 8. This is your reference set. The codebook you use must be the same codebook you will apply with the LLM; the reference codings are what you will validate the LLM against.
  2. Write the prompt. The prompt has three components: (a) a role / task statement, (b) the full codebook with definitions, (c) instructions for the output format. See Section 4.5 for a worked example.
  3. Apply the LLM to the calibration set. Run the prompt against each transcript in the calibration set. Save the outputs.
  4. Compute agreement against the hand-coded reference. For each passage in the calibration set, compare the human code to the LLM code. Compute Krippendorff's alpha or Cohen's kappa across the set. If agreement is low (alpha < 0.6), iterate on the prompt. If agreement is acceptable (alpha ≥ 0.7), proceed. The thresholds are debated but 0.7 is the conventional acceptable-agreement floor for nominal data.
  5. Document the prompt-validation history. Record every prompt iteration in your audit trail along with the resulting agreement statistic. The final accepted prompt is the one you report.
  6. Apply the accepted prompt to the full corpus. Run the LLM on the remaining transcripts.
  7. Audit a random sample for hallucination. Take 30 randomly sampled (passage, code) pairs from the LLM outputs. For each, look up the passage in the source transcript and confirm that (a) the passage exists as quoted, (b) the assigned code is plausibly supported. Report the audit results in the methods section.
  8. Disclose. Methods section names the model and version, the prompt, the calibration sample size, the agreement statistic, and the audit result. An appendix carries the full prompt verbatim and the codebook.

4.5 A Sample Prompt That Applies a Codebook

The prompt below is a template that has been tested across several qualitative studies. Adapt the codebook section to your specific codebook; the surrounding scaffolding is what makes the output auditable.

PROMPTSample prompt: applying a loneliness codebook to a transcript

This is the prompt you would paste into Claude, ChatGPT, or any equivalent LLM interface. The pattern is platform-agnostic; you can also call the same prompt programmatically via the API (the httr2 or ellmer R packages, or any equivalent).

# --- ROLE AND TASK ---
You are a qualitative-research coding assistant. You will receive a codebook
and a transcript of an interview about loneliness. Your task is to apply the
codebook to the transcript, identifying every passage that matches one or more
codes.

# --- CODEBOOK ---
The codebook has 12 codes. Apply them strictly according to the definitions
below. Do not invent new codes. Do not modify the codes.

1. somatic-absence  -- Reference to a physical place or object whose meaning
   is the absence of a specific person (e.g., the empty chair, the unused
   side of the bed, the silent kitchen).

2. embodied-loneliness  -- Reference to a bodily sensation of loneliness:
   tiredness, ache, cold, weight, hollow, fading. Bodily as opposed to
   affective.

3. affective-loneliness  -- Reference to a feeling of loneliness in affective
   terms: sad, empty, low, isolated, longing.

4. social-shrinkage  -- Reference to the participant's social world becoming
   smaller over time, either through death, mobility loss, or other
   structural change.

5. mediated-connection  -- Reference to maintaining connection through
   technology, phone, or other mediation rather than in person (group chat,
   WhatsApp, phone-pal, video call, social media).

6. embodied-coping  -- Reference to a coping strategy that involves the body
   or an activity: walking, the dog, exercise, music, gardening.

7. relational-coping  -- Reference to a coping strategy involving another
   person or relationship: friends, family, partner, neighbor, professional.

8. loneliness-as-cost-of-love  -- Statement that loneliness is the price of
   having loved (typically older participants reflecting on widowhood or
   long-loss).

9. choosing-solitude  -- Statement that the participant has chosen to be
   alone and that the alone-ness is desired rather than imposed.

10. structural-cause  -- Attribution of loneliness to a structural cause:
    immigration, housing, work conditions, the pandemic, ageism.

11. internal-cause  -- Attribution of loneliness to an internal cause:
    personality, mental-health history, neurodivergence, self-criticism.

12. existential-meaning  -- Statement that the loneliness is meaning-making
    or meaning-giving: solitude as part of the human condition, loneliness
    as teacher.

# --- OUTPUT FORMAT ---
Return a JSON array. Each element of the array is an object with three keys:
  "passage":     the exact verbatim passage from the transcript, no edits
  "code":        the code from the codebook above, exact name
  "rationale":   one short sentence explaining why this code applies

Rules:
- Use only the codes above. If a passage matches no code, do not include it.
- A passage may receive more than one code. Return one JSON object per code
  per passage (so a doubly-coded passage produces two objects with the same
  "passage" but different "code" values).
- Do not summarize, paraphrase, or shorten passages. Quote verbatim.
- Do not invent passages. If you are uncertain whether a passage is verbatim,
  do not include it.
- Return only the JSON array. Do not include any other text.

# --- TRANSCRIPT ---
[Paste the transcript here, including the participant ID and metadata header]

Why this scaffolding is necessary: The verbatim-quote requirement and the JSON output format make the LLM's output mechanically auditable. You can take any output passage, search for it in the source transcript, and verify it exists. If it does not exist, the LLM has hallucinated — and you have caught it. Without these structural constraints, the audit step becomes much harder.

4.6 The Worked Example: 5 Loneliness Transcripts, Hand vs LLM

The following walkthrough mirrors what your final capstone work might include. The numbers are illustrative; they are typical of what students in earlier offerings of this course have observed when running the workflow end-to-end.

StageWhat you doTypical result
1. Hand code 5 transcripts P01 Maya, P05 Linda, P11 Helen, P15 Amira, P20 Frank. Apply the 12-code loneliness codebook in Taguette. ~120 passages coded total; ~140 (passage, code) pairs because some passages are double-coded.
2. Apply the LLM prompt to the same 5 transcripts Run the prompt in Section 4.5, copying each transcript in turn. Save the JSON outputs. ~130 passages identified; ~150 (passage, code) pairs.
3. Align passages between hand and LLM For each LLM passage, find the closest hand-coded passage (overlapping text). Build a paired data frame. ~85% of LLM passages match a hand-coded passage; ~15% are LLM-only (some legitimate, some hallucinated).
4. Compute Krippendorff's alpha Treat the human and LLM as two coders; compute alpha on the matched passages. First prompt: alpha = 0.62. Below threshold; iterate.
5. Iterate the prompt Reading the disagreement matrix: LLM over-applies structural-cause; under-applies somatic-absence. Tighten the definitions in the prompt; add 2 worked examples for each problematic code. Second prompt: alpha = 0.78. Acceptable. Lock the prompt.
6. Apply the locked prompt to the remaining 15 transcripts Run on P02–P04, P06–P10, P12–P14, P16–P19. ~430 (passage, code) pairs across the 15 transcripts.
7. Audit 30 random pairs for hallucination For each, verify the passage exists verbatim in the source. Typical: 28–30 are verbatim. 0–2 are paraphrased or invented. Report the rate.
8. Methods section paragraph Disclose: model + version + access date; prompt in appendix; alpha; audit result. One transparent paragraph that a reviewer can evaluate.
RCompute human-LLM agreement with Krippendorff's alpha
library(tidyverse)
library(irr)

# Read both coding sets (long format: doc, passage_id, code)
hand <- read_csv("outputs/wk12_hand_codings.csv")
llm  <- read_csv("outputs/wk12_llm_codings.csv")

# Join by passage_id (you have constructed this in step 3 above)
agreement <- hand |>
  rename(code_hand = code) |>
  left_join(rename(llm, code_llm = code), by = c("doc", "passage_id"))

# Encode codes as integers (irr requires numeric)
all_codes <- unique(c(agreement$code_hand, agreement$code_llm))
agreement <- agreement |>
  mutate(code_hand_int = match(code_hand, all_codes),
         code_llm_int  = match(code_llm,  all_codes))

# Build the 2-row matrix that irr expects (rows = coders, columns = passages)
codings_matrix <- rbind(
  hand = agreement$code_hand_int,
  llm  = agreement$code_llm_int
)

# Krippendorff's alpha (nominal codes)
kripp.alpha(codings_matrix, method = "nominal")

# Cohen's kappa for direct comparison
kappa2(t(codings_matrix), weight = "unweighted")

# Confusion matrix: which codes does the LLM systematically disagree on?
agreement |>
  count(code_hand, code_llm) |>
  pivot_wider(names_from = code_llm, values_from = n, values_fill = 0)

What to do with the confusion matrix: The diagonal counts are agreements; off-diagonal counts are disagreements. Read the rows and columns of the largest off-diagonal counts and you will identify which code pairs the LLM systematically confuses (e.g., embodied-loneliness vs affective-loneliness, or structural-cause vs internal-cause). Those confusions are exactly what to address in the next prompt iteration — tighten the definitions, add worked examples, and re-run the agreement check.

4.7 Auditing for Hallucination

The most important single audit you can run on LLM-coded data is the verbatim-quote check. The prompt in Section 4.5 instructed the model to return passages verbatim; the audit verifies that the model obeyed. The procedure is simple: take the LLM output, randomly sample 30 (passage, code) pairs, and for each one, search the source transcript for the passage. If the passage exists as quoted, the audit passes for that pair. If it does not exist, the model has hallucinated.

RHallucination audit: verbatim-quote verification
library(stringr)
library(tidyverse)

# Sample 30 random LLM-coded passages
set.seed(841)
audit_sample <- llm |> slice_sample(n = 30)

# For each row, load the source transcript and check whether the passage is verbatim
audit_sample <- audit_sample |>
  rowwise() |>
  mutate(
    transcript_text = paste(readLines(file.path(
      "term projects/HSCI_841/transcripts", paste0(doc, ".txt"))),
      collapse = "\n"),
    is_verbatim = str_detect(transcript_text,
                              fixed(str_squish(passage)))
  ) |>
  ungroup()

# Summary: hallucination rate
mean(audit_sample$is_verbatim)   # proportion verbatim (target: >= 0.95)
audit_sample |>
  filter(!is_verbatim) |>
  select(doc, code, passage)        # inspect the hallucinations directly

# Save the audit log for the appendix
write_csv(audit_sample, "outputs/wk12_hallucination_audit.csv")

What success looks like: A verbatim rate above 95% is typical for well-prompted contemporary LLMs on transcripts of the length and style you are working with. A rate below 80% means the prompt is failing to enforce the verbatim constraint and should be re-engineered; you should also consider whether the model version you are using is appropriate for the task. Report the verbatim rate in your methods section as the headline hallucination-audit statistic.

4.8 What an LLM Cannot Do

Closing this section honestly requires naming what the LLM workflow does not replace. The LLM does not develop the codebook (you do, in Modules 5 and 7). It does not interpret the findings (you do, in the discussion). It does not write the positionality statement (only you can). It does not decide which themes are central to the paper's argument (you decide). It is not a co-author. It is a coding assistant whose output you validate, audit, and take responsibility for. The methods section reports its role; the paper's interpretive claims are yours.

One specific limitation that is often missed: the LLM is poor at recognizing the absence of a code. Hand coders are trained to notice when a participant says something that conspicuously does not deploy a code that the rest of the corpus deploys frequently. That absent-pattern detection is part of why qualitative analysis is interpretive. LLMs do not currently do it well; they apply codes to passages but rarely flag passages-where-an-expected-code-would-go. This is a fundamental rather than passing limitation, and the appropriate response is to do the absence-checking yourself, by hand, after the LLM has done the routine application work.

4.9 The Disclosure Paragraph for Your Methods Section

Here is a template disclosure paragraph that satisfies the three non-negotiable conditions of Section 4.3. Adapt the numbers and the model name to your actual use.

Methods section disclosure (template)

“Codebook application across the 20-transcript corpus was assisted by Claude 3.5 Sonnet (Anthropic), accessed via the API in November 2025. The full prompt is provided in Appendix B. The prompt was calibrated against a hand-coded reference set of five transcripts (P01, P05, P11, P15, P20), which I coded in Taguette using the codebook reported in Appendix A. After two prompt-iteration cycles, Krippendorff's alpha between the human reference codings and the LLM codings on the calibration set reached 0.78 (n = 142 passages; nominal codes); both the iterations and the final agreement are documented in the audit trail (Appendix C). The locked prompt was then applied to the remaining 15 transcripts. A random sample of 30 LLM-coded (passage, code) pairs was audited against the source transcripts to detect hallucination; 29 of 30 passages were verbatim, yielding a verbatim rate of 96.7%. The one non-verbatim case is documented and discussed in Appendix C. All interpretive claims in the findings and discussion are mine; the LLM functioned as a coding assistant only.”

Reflection

You are about to write a methods section that discloses your use (or non-use) of an LLM in the capstone. Sketch the disclosure paragraph. If you used an LLM, what was the prompt-iteration history, what was the validation agreement statistic, and what was the hallucination audit result? If you did not use one, why not — was it a methodological choice, an access choice, or something else? Either answer is defensible if it is reflective.

Model answerA defensible used-LLM answer specifies the model and version, names a prompt-iteration history with at least two cycles, names the final agreement (e.g., Krippendorff's alpha around 0.75–0.80), and names the audit result (e.g., 95% verbatim). It also acknowledges what the LLM did not do: it did not develop the codebook, it did not interpret the themes, it did not write the discussion. A defensible did-not-use-LLM answer names a real reason: the small corpus size made hand coding feasible and analytically richer; the population (e.g., refugee participants whose worldview is poorly represented in LLM training data) made bias amplification an unacceptable risk; the time required to validate the LLM exceeded the time saved on a 20-transcript corpus. Both stances are admissible. What is not admissible is a stance that does not engage the question — quietly using an LLM and not disclosing, or refusing to use one without a stated reason. The capstone methods section needs you on the record.

Minimum 20 characters required.

✓ Reflection saved
Knowledge Check — Section 4

Question 1: The central methodological risk of LLM-assisted qualitative coding is:

Hallucination is the central risk because LLM outputs are uniformly fluent regardless of correctness. An 8%-hallucination output looks as professional as a 0%-hallucination output. The mitigation is auditing — reading a random sample of outputs against the source transcripts.

Question 2: The HSCI 841 stance on LLM use specifies three non-negotiable conditions:

The three conditions are disclosure, validation against a hand-coded reference, and a hallucination audit. The course is neither against LLM use nor uncritical of it; the position is that LLMs are a tool that requires the same kind of methodological discipline as any other.

Question 3: In the disciplined LLM-assisted workflow, the calibration step compares LLM codings to:

The reference set must come from the same study (the same codebook applied to the same kind of transcripts) because it is the only basis for testing whether the LLM has captured the codebook's specific operationalizations. Benchmarks from other studies cannot substitute.
Section 5 of 5

Final Assessment & Course Completion

⏱ Estimated time: 30 minutes

Bringing It All Together — And Closing the Course

Lesson 12 is the final lesson of HSCI 841. The five sections of this module — KWIC and keyness (Section 1), cultural domain analysis (Section 2), semantic network analysis (Section 3), LLM-assisted coding (Section 4), and this final assessment (Section 5) — complete the methodological arc that began in Lesson 1 with the operational definition of qualitative data analysis. The arc has carried you through theory and literature (Lesson 2), sampling (Lesson 3), data collection (Lesson 4), themes and codebooks (Lesson 5), analytic frameworks (Lesson 6), constant-comparative grounded theory (Lesson 7), content analysis (Lesson 8), schema and narrative analysis (Lesson 9), discourse analysis (Lesson 10), analytic induction and QCA (Lesson 11), and now into the computational and machine-assisted methods of this last lesson.

The final capstone paper is the synthesis. It is your demonstration that you can take a real qualitative dataset, ask a defensible research question, choose appropriate methods, apply them transparently, produce findings that are evidenced and interpretive, and write the whole thing up in journal-article format. The capstone is also a piece of qualitative scholarship that you should be proud to put your name on. If, two years from now, you are submitting a real qualitative public-health paper to a real journal, the moves you make in that paper should be moves you first made in your HSCI 841 capstone.

Key Takeaways from Lesson 12

  • Computational text analysis is a front-loader for close reading, not a replacement for it. KWIC, frequency, TF-IDF, keyness, collocations, and n-grams surface candidates; the close reading turns the candidates into findings.
  • KWIC is the oldest computational text technique and remains analytically valuable because it preserves local context, enabling rapid sense-disambiguation that pure numerical statistics cannot deliver.
  • TF-IDF surfaces document-distinctive vocabulary; keyness surfaces subcorpus-distinctive vocabulary. Both are versions of comparing focal vocabulary against a background.
  • Cultural domain analysis (Bernard, Wutich & Ryan Ch 18) measures shared mental models with free listing (Smith's salience), pile sorts (with MDS), triad tests, and consensus analysis (Romney-Weller-Batchelder).
  • The eigenvalue ratio test is the diagnostic for the single-culture assumption of consensus analysis: a ratio ≥ 3 supports a shared model; a low ratio indicates multiple sub-cultures.
  • Semantic network analysis treats text as a co-occurrence graph; centrality measures (degree, betweenness, closeness, eigenvector) and community detection (Louvain) reveal implicit conceptual structure.
  • LLM-assisted coding is admissible in this course under three non-negotiable conditions: disclosure, validation against a hand-coded reference set, and a hallucination audit.
  • The central risk of LLM coding is hallucination; the mitigation is a verbatim-quote constraint in the prompt plus a random-sample audit against the source transcripts.
  • The capstone is the course's exit deliverable: a 4,000–6,000 word journal-article paper with codebook, audit trail, positionality statement, and methods section that meets the systematic-transparent-replicable manifesto of Lesson 1.

Core Concepts Reviewed

Section 1: KWIC concordances; word frequencies; type-token ratio and lexical diversity (MATTR, MSTTR); TF-IDF; keyness (chi-squared and log-likelihood); collocations; n-grams; the principle that computational text statistics are pointers for close reading.

Section 2: Free listing and Smith's salience; pile sorts and successive pile sorts; multidimensional scaling (MDS); triad tests; the Romney-Weller-Batchelder consensus model; competence scores; the eigenvalue ratio test for the single-culture assumption; applications in public-health domains.

Section 3: Term co-occurrence matrices (FCM); context units (sentence, paragraph, window); the four centralities (degree, betweenness, closeness, eigenvector); community detection (Louvain, Walktrap, modularity); network-level statistics (density, mean path length, clustering coefficient); the small-world property of semantic networks.

Section 4: LLMs as plausibility engines; the three opportunities (speed, scale, reproducibility); the five risks (hallucination, prompt drift, opaque reasoning, bias amplification, methodological invisibility); the disciplined workflow (calibrate, prompt, iterate, lock, apply, audit, disclose); the prompt template; agreement statistics for human-LLM comparison; the hallucination audit; the disclosure paragraph.

The final reflection below is the last reflection of the course. It is forward-looking. It asks what you carry from this course into your next piece of qualitative research — the one not assigned by a syllabus.

Final Reflection — A Forward-Looking Closing

You are about to submit a journal-article paper that began as a 500-word positionality memo twelve weeks ago. Looking back across the term, name one method or stance from HSCI 841 that you can imagine actually using in your next piece of work — the one not assigned by a course. What is the method or stance, what kind of question or data would it be answering, and what do you need to be able to do it well that you did not know how to do twelve weeks ago?

Model answerA strong closing reflection names something specific and connects it back to the trained capability. Examples: “Constant-comparative grounded theory (Lesson 7) is the method I expect to actually use, on a planned study of why nurses leave acute-care wards. What I now know how to do that I did not before is develop a codebook in dialogue with the data rather than imposing one and writing memos that themselves become part of the analytic record.” Or: “The disciplined LLM-coding workflow (Lesson 12) is the method I expect to use most, because most of my future studies will involve corpora too large to hand-code. What I now know how to do is validate an LLM against a hand-coded reference set and audit for hallucination; what I did not know twelve weeks ago is that doing this is not just permitted but methodologically required.” Or: “The stance — transparency about positionality — is the durable thing. I now know how to write a methods section that declares who I am to the analysis, and that is something I will carry into every paper, qualitative or quantitative, for the rest of my career.” The point is to leave the course with a specific, articulable carry-forward. Vague answers (“I learned to be more reflective”) miss the opportunity that this last reflection is offering.

Minimum 30 characters required.

✓ Reflection saved
Final Assessment — Lesson 12: Computational Text Analysis, Cultural Domain Analysis & LLM-Assisted Coding (15 Questions)

Question 1: Which of the following best characterizes the analytic role of KWIC, frequency, TF-IDF, keyness, and collocations in a qualitative study?

Computational text statistics are pointers for close reading. The numerical output is the front edge of the analysis, not the finding. Bernard, Wutich, and Ryan are explicit on this and the course adopts the same stance.

Question 2: A keyness analysis comparing older and younger participants in the loneliness corpus identifies radio, fading, visit, and walker as words distinctive of the older subcorpus. Which of the following is the most defensible analytic next step?

Keyness words are pointers. The next step is to read the words in their KWIC contexts and develop interpretive claims supported by the passages, not by the test statistic alone.

Question 3: Smith's salience for an item in a free-listing study combines two pieces of information:

Smith's S = (∑((L − R + 1) / L)) / N, combining how many participants mention the item with how early they mention it. The highest-salience items are common AND early.

Question 4: The Romney-Weller-Batchelder consensus model's eigenvalue ratio test diagnoses:

The first/second eigenvalue ratio of the participant agreement matrix tests the single-culture assumption. A ratio ≥ 3 supports one shared model; lower ratios suggest sub-cultures, requiring stratified analysis.

Question 5: A pile-sort study produces an aggregate similarity matrix that is then analyzed with multidimensional scaling. The 2-D MDS map produces:

MDS produces an unlabeled map. The interpretation of the dimensions is the analyst's interpretive contribution and must be argued, not assumed from the algorithm.

Question 6: In a semantic network built from interview transcripts, an edge between two words represents:

Edges in a co-occurrence semantic network are simple co-occurrence within a context unit. The context unit is a design choice with real consequences for how the network looks.

Question 7: A word with high betweenness centrality in a semantic network is best interpreted as:

Betweenness measures how often a node lies on the shortest paths between pairs of other nodes. High betweenness identifies bridges between communities, not raw frequency.

Question 8: The Louvain community-detection algorithm partitions a network by optimizing:

Louvain optimizes modularity. Modularity > 0.3 indicates meaningful community structure; > 0.5 is strong.

Question 9: The central methodological risk of LLM-assisted qualitative coding is:

Hallucination is the central risk because LLM output is uniformly fluent regardless of correctness. The mitigation is a verbatim-quote constraint in the prompt and a random-sample audit against the source transcripts.

Question 10: The HSCI 841 stance on LLM-assisted coding specifies three non-negotiable conditions:

The three conditions are disclosure, validation, and audit. The course's position is that LLMs are admissible with discipline; they are not admissible without it.

Question 11: In the disciplined LLM-coding workflow, the calibration reference set should consist of:

The reference set must come from the same study because it tests whether the LLM has captured the codebook's specific operationalizations. Benchmarks from elsewhere cannot substitute.

Question 12: The Krippendorff's alpha (or Cohen's kappa) threshold conventionally accepted as the floor for adequate human-LLM agreement in qualitative coding is approximately:

0.7 is the conventional floor for acceptable agreement on nominal data; 0.8 is preferred. Below 0.6 is generally not adequate and requires further prompt iteration or methodological revision.

Question 13: A verbatim-quote audit on 30 randomly sampled LLM-coded passages from a 20-transcript corpus shows that 29 of 30 passages are verbatim. The most defensible report in the methods section is:

Disclosure is unconditional. Always report the audit numbers (even — especially — when they are good) and locate the failure cases in an appendix so reviewers can see exactly what failed.

Question 14: Which of the following is NOT one of the five risks of LLM-assisted coding identified in this lesson?

The five risks are hallucination, prompt drift, opaque reasoning, bias amplification, and methodological invisibility. Computational cost is a practical consideration but not one of the methodological risks identified.

Question 15: The HSCI 841 capstone paper is a:

The capstone is the journal-article paper. Length is 4,000–6,000 words for the main text plus unlimited appendices, including the codebook, audit-trail summary, and positionality statement.
✦ Complete the final reflection above before submitting

Congratulations — You Have Completed HSCI 841!

You have finished Lesson 12: Computational Text Analysis, Cultural Domain Analysis & LLM-Assisted Coding — the final lesson of the course.

Over twelve weeks you have built up an analytic vocabulary that runs from the operational definition of QDA (Lesson 1), through theory and literature (Lesson 2), sampling and data collection (Lessons 3–4), themes and codebooks (Lesson 5), conceptual models (Lesson 6), grounded theory (Lesson 7), content analysis (Lesson 8), schema and narrative analysis (Lesson 9), discourse analysis (Lesson 10), analytic induction and QCA (Lesson 11), and the computational and LLM-assisted methods of this final lesson. You have applied each of those toolkits to the loneliness capstone dataset. You now have an analytic kit you can carry into your next study.

What is next: submit the capstone paper. It is the work the course has been pointed at since Week 1. The Week 12 milestone in Section 4 of this module contains the full brief: 4,000–6,000 word journal-article-format paper, codebook in appendix, audit trail in appendix, positionality statement, references. If you used an LLM at any stage, the disclosure paragraph in Section 4.9 is your template. Submit as a PDF with the supplementary materials (Taguette export, R scripts, codebook .csv) attached.

Beyond the capstone — treat the course as the beginning of a longer engagement with qualitative research, not the end. Bernard, Wutich, and Ryan is a reference text; keep your copy. Re-read it as you encounter new methodological problems in your own work. The reference text is denser than any single course can make legible; you will get more out of it on the second pass than on the first.

Reference

Glossary — Key Terms, People & Concepts

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and computational tools introduced in Lesson 12. Use it as a reference while you work through the material, or as a review before the final assessment. Type in the search box to filter entries.

Computational Text Analysis
KWIC (Key Word In Context) A concordance technique that displays each occurrence of a target word with a window of surrounding context (typically 5–10 words on each side). KWIC makes local context inspectable cheaply and is the standard bridge between computational pattern-finding and close reading.
Word Frequency The raw count of how often each word (or stem) appears in a corpus, usually after tokenization, lowercasing, and stop-word removal. Useful as a starting descriptive but dominated by globally common words; rarely the final analytic output.
TF-IDF Term-Frequency × Inverse-Document-Frequency. A weighting that down-weights words that appear in many documents (so “the,” “and,” and topic-specific common words like “loneliness” get demoted) and elevates words that are distinctive to specific documents. Useful for surfacing document-distinctive vocabulary.
Keyness A statistic (typically log-likelihood or chi-squared) comparing word frequencies between a target subcorpus and a reference subcorpus. Words with high keyness are over-represented in the target. Keyness is a pointer to passages worth close reading, not a finding in itself.
Collocations Word pairs that appear together more often than chance would predict, identified with statistics like log-likelihood, MI (mutual information), or t-score. Collocations surface multi-word units (e.g., “social isolation,” “mental health”) that single-word frequency misses.
N-grams Contiguous sequences of n tokens. Bigrams (n=2) and trigrams (n=3) are the most common in qualitative analysis; they capture short phrases that single-word tokenization destroys.
Corpus / Subcorpus A corpus is the full body of texts under analysis (e.g., the 20 loneliness interview transcripts). A subcorpus is a defined subset (e.g., transcripts from participants aged 65+) used for comparative analyses such as keyness.
Document-Feature Matrix (DFM) The core data structure in quanteda: a sparse matrix with one row per document and one column per feature (typically a word or n-gram), cells containing counts. All downstream computations (frequencies, TF-IDF, keyness, network construction) are built from the DFM.
Cultural Domain Analysis
Cultural Domain Analysis (CDA) A family of structured-elicitation techniques (free listing, pile sorts, triad tests, rankings) for mapping the shared mental categories a group uses to organize a topic. Developed in cognitive anthropology and adopted in public health for codebook development and intervention design.
Free Listing & Smith's S A free list asks participants to list all the items in a domain (e.g., “all the things that make someone feel lonely”). Smith's S combines frequency (how many participants list an item) with rank (how early they list it) to produce a salience score: items that are common AND early have the highest salience.
Pile Sort Participants are given a deck of cards (one item per card) and asked to sort them into piles of similar items. Aggregated across participants, the co-pile frequencies form a similarity matrix that can be analyzed with MDS or clustering to reveal the implicit category structure of the domain.
Triad Tests Participants are shown sets of three items and asked which one is the “odd one out.” Repeated across triads, the choices generate an item-by-item similarity matrix usable for the same downstream analyses as pile sorts but in higher resolution.
Consensus Analysis A statistical procedure (Romney, Weller & Batchelder, 1986) that uses participant-by-participant agreement to estimate (a) whether the participants share a single cultural model and (b) each participant's cultural competence — how closely their answers match the group's shared model.
Cultural Competence In the consensus model, the proportion of the shared cultural answer key each participant has command of. Competence scores range from 0 to 1; high-competence participants are the de facto cultural experts on the domain.
Eigenvalue Ratio Test In consensus analysis, the ratio of the first to the second eigenvalue of the participant-by-participant agreement matrix. A ratio ≥ 3 supports the single-culture assumption (one shared model); a lower ratio indicates sub-cultures and the analysis should be stratified.
Multidimensional Scaling (MDS) A dimensionality-reduction technique that takes a similarity matrix (e.g., from a pile sort) and produces a low-dimensional spatial map — items that are similar are close in the map, dissimilar items are far apart. The analyst must propose substantive interpretations of the dimensions; the algorithm does not label them.
Semantic Network Analysis
Semantic Network A graph in which nodes are words (or concepts) and edges represent some form of association, most commonly co-occurrence within a context unit (sentence, paragraph, or sliding window) across a corpus. Used in Bernard, Wutich & Ryan, Ch 19 as a tool for visualizing the conceptual structure of a body of text.
Co-occurrence Matrix A symmetric word-by-word matrix in which cell (i, j) records how often word i and word j appear together within the chosen context unit. The co-occurrence matrix is the input to semantic-network construction.
Degree / Strength Centrality Degree is the number of edges a node has; strength is the sum of edge weights for a node. In a semantic network, high-degree words are the “hubs” that connect to many other words.
Betweenness Centrality A centrality measure capturing how often a node sits on the shortest path between other nodes. High-betweenness words are conceptual bridges: removing them would disconnect parts of the network. Distinct from raw frequency.
Community Detection Algorithms (Louvain, Leiden, walktrap, etc.) that partition a network into densely-connected clusters (communities). In semantic networks, communities surface thematic clusters of co-occurring words.
Louvain Algorithm A widely-used community-detection algorithm that optimizes modularity through a fast, iterative, agglomerative procedure. Implemented in R via igraph::cluster_louvain().
Modularity A quality score for a network partition: the extent to which within-community edges are denser than would be expected if edges were placed at random. Modularity above ~0.3 indicates meaningful community structure; above 0.5 is strong.
igraph The dominant R (and Python) library for network construction, manipulation, and analysis. Provides functions for centrality, community detection, visualization, and graph-theoretic computation. Used throughout Section 3 for the semantic-network workflow.
LLM-Assisted Coding
Large Language Model (LLM) A neural network (transformer architecture) trained on very large text corpora to predict the next token, then fine-tuned for instruction-following. In qualitative analysis, LLMs (GPT-4, Claude, Gemini, open-source models) can be prompted to apply codebooks, summarize transcripts, or surface candidate themes. Admissible with discipline; not admissible without.
Hallucination An LLM output that is fluent, confident, and wrong — invented quotes, misattributed claims, codes applied to passages that don't support them. The central methodological risk in LLM-assisted coding because hallucinated output is stylistically indistinguishable from correct output. Mitigated by verbatim-quote prompt constraints and random-sample audits.
Prompt Drift The phenomenon that LLM outputs change as the prompt is refined across iterations, often without the analyst noticing the cumulative effect. Mitigated by version-controlling prompts and reporting the final prompt verbatim in the methods section or appendix.
Opaque Reasoning LLMs do not produce inspectable rationales for their classifications. Even “chain-of-thought” outputs are post-hoc narrations, not the actual computation. Implication: an LLM coding cannot be defended on the grounds that the model “reasoned” correctly; it must be defended by validation against a hand-coded reference set.
Bias Amplification LLMs are trained on internet-scale text and inherit its demographic, linguistic, and ideological skews. When applied to qualitative data they may systematically under-code or over-code particular kinds of speech (e.g., AAVE, non-native English, working-class registers). Mitigated by stratified validation across participant subgroups.
Methodological Invisibility The temptation to use LLM assistance without disclosing it — a violation of the transparency commitment. The course's position is that all LLM use must be disclosed in the methods section, with the prompt, the model and version, the date of use, and the validation results.
Calibration / Reference Subset A hand-coded set of transcripts (typically 10–20% of the corpus, drawn from the same study) used to validate LLM coding output. The LLM is run on the same transcripts and the two codings are compared with an intercoder-agreement statistic. The reference set must come from the same study; benchmarks from other studies cannot substitute.
Intercoder Reliability (LLM) An agreement statistic (Cohen's κ, Krippendorff's α) computed between LLM coding and a hand-coded reference. Conventional floors: 0.7 acceptable, 0.8 preferred. Below 0.6 the LLM should not be used as-is; prompt or codebook must be revised.
LLM Disclosure A paragraph in the methods section that states (a) which model and version was used, (b) on what date(s), (c) the exact prompt, (d) the validation results against a hand-coded reference, (e) the hallucination audit results. Unconditional — required even when results are good.
Verbatim-Quote Constraint A prompt instruction requiring the LLM to return only verbatim text from the source transcript when quoting, and to flag any inability to do so. Reduces (but does not eliminate) hallucination of fabricated quotes; the audit checks whether the constraint operated.
Key People
A. Kimball Romney (1925–2016) Cognitive anthropologist; lead author of the foundational cultural-consensus paper (Romney, Weller & Batchelder, 1986, American Anthropologist). His work established that cultural shared-ness could be measured as a statistical property of participant agreement.
Susan C. Weller Medical anthropologist and co-author of the cultural-consensus model. Her later textbook, Systematic Data Collection (with A. Kimball Romney, 1988), is the standard reference for free listing, pile sorts, and triads.
William H. Batchelder (1940–2018) Mathematical psychologist who, with Romney and Weller, provided the formal statistical machinery of the consensus model. The Romney-Weller-Batchelder (RWB) procedure is named for the three.
Klaus Krippendorff (1932–2022) Communication scholar; author of Content Analysis: An Introduction to Its Methodology (multiple editions). Developed Krippendorff's α, the most general intercoder-agreement statistic and the recommended choice for LLM-vs-human comparisons.
James W. Pennebaker Social psychologist; developer of LIWC (Linguistic Inquiry and Word Count), a dictionary-based text-analysis tool that scores texts on linguistic and psychological dimensions. A canonical example of category-based computational text analysis and a useful contrast with embedding-based approaches.
Kenneth Benoit Political scientist (LSE); principal developer of the quanteda R package, the standard contemporary toolkit for computational text analysis. Section 1's workflow is built on quanteda.
Julia Silge & David Robinson Data scientists; authors of Text Mining with R (O'Reilly, 2017) and developers of the tidytext package. The tidytext approach is an alternative to quanteda built on tidyverse principles.
Gábor Csárdi & Tamás Nepusz Developers of the igraph library (R, Python, C). The dominant open-source toolkit for network analysis and the workhorse of Section 3's semantic-network workflow.
No matching entries. Try a different search term.