Content Analysis
Qualitative Research Methods & Analysis in Public Health
Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University
Learning objectives for this lesson:
- Trace content analysis from Lasswell's WWII propaganda studies through Berelson's 1952 codification to Krippendorff's contemporary synthesis
- Distinguish manifest from latent content and explain why most contemporary content analysis blends both
- Treat codes as variables — the operational move that turns content analysis into a hybrid quantitative/qualitative method
- Sample text systematically: choose sampling units, recording units, and context units that match the research question
- Develop a content-analytic coding scheme that is exhaustive, defensibly exclusive (or explicitly multi-coded), and reliably applied
- Compute Krippendorff's alpha as the standard reliability statistic and interpret its acceptable thresholds
- Test hypotheses on content-analytic data using chi-squared, distributional comparisons, and trend analysis
- Apply dictionary-based and computational content analysis to the loneliness corpus and complete the Week 8 capstone milestone
This course was developed by Kiffer G. Card, PhD, as a companion to Bernard, H. R., Wutich, A., & Ryan, G. W. (2017). Analyzing Qualitative Data: Systematic Approaches (2nd ed.). SAGE. Lesson 8 covers Chapter 11 (pp. 243–268).
What Content Analysis Is — History, Manifest/Latent Content, and Codes as Variables
Introduction and Overview
Content analysis is the oldest systematic method for analysing text in the social sciences, and it is also the bridge between the qualitative and quantitative traditions of this course. Where the thematic analysis you learned in Module 5 stops once you have a defensible set of themes, content analysis keeps going: it counts the themes, distributes the counts across subgroups, and tests whether the differences are larger than chance. The method is qualitative in its first move (identifying what the relevant categories are) and quantitative in its second (counting their occurrence). It is the technique you reach for when your research question is not just what kinds of loneliness do people describe but also how often, and does that distribution differ by age, gender, or caregiver status.
This section traces how content analysis became the workhorse it is today. We start with Harold Lasswell's WWII analysis of Axis propaganda, watch Bernard Berelson codify the method in 1952, follow Klaus Krippendorff's modernisation into the canonical 2018 monograph, and end with the contemporary synthesis offered by Bernard, Wutich, and Ryan (2017, Ch. 11). Along the way we will pull apart the distinction between manifest and latent content — the line that separates "what the text literally says" from "what it means" — and we will explain the single operational move that turns content analysis into the hybrid method it is: treating codes as variables. By the end of the section you will be able to say not just what content analysis is, but why it is methodologically different from the thematic work you have already done.
Learning Objectives for Section 1
- Locate content analysis historically: from Lasswell (1927, 1942) and Berelson (1952) to Krippendorff (2018) and the contemporary synthesis.
- Distinguish manifest from latent content and recognise that most contemporary content analysis blends both.
- Explain the operational move that makes content analysis a hybrid method: codes as variables.
- Identify the kinds of public-health research questions for which content analysis is the right tool.
1.1 A Short History — From Lasswell to Krippendorff
Harold Lasswell's wartime propaganda analysis at the Library of Congress operationalised content analysis around one question: who says what to whom in which channel with what effect. It was the first attempt to quantify symbolic content systematically, and many of today's content-analysis categories descend directly from his coding sheets.
Bernard Berelson's 1952 textbook defined content analysis as "the objective, systematic, and quantitative description of the manifest content of communication." This narrow definition dominated for two decades and is still the working definition in much applied communications research.
Klaus Krippendorff's textbook, first published in 1980, broadened the field to include latent meaning and explicitly engaged the reliability and validity questions that remain central. Krippendorff's alpha statistic, developed across multiple editions, is now the standard inter-coder reliability measure for content analysis.
Content analysis today spans manual coding, computational text analysis, supervised machine learning, and now LLM-assisted approaches. All trace back to the Lasswell-Berelson-Krippendorff lineage. The methods have changed; the underlying logic of turning text into countable, comparable data has not.
The modern history of content analysis begins with Harold D. Lasswell, the political scientist who in 1927 published Propaganda Technique in the World War, the first systematic effort to study communication content as a window onto political intent. Lasswell's question was not "what does this propaganda mean to its readers?" but "what categories of appeal does it use, and in what proportion?" The methodological commitment was to systematic counting of explicit features — mentions of leaders, mentions of enemies, deployment of symbols — rather than impressionistic close reading. The pay-off was that two analysts working independently could produce comparable numbers.
During the Second World War, Lasswell led the Experimental Division for the Study of War-Time Communications at the U.S. Library of Congress. The Division analysed German, Italian, and Japanese propaganda in volume that no individual reader could have made sense of impressionistically. By systematic coding of explicit features — the frequency with which Axis broadcasts mentioned specific Allied generals, the proportion of broadcast time devoted to economic versus military themes, the rise and fall of named enemies week by week — the Division produced intelligence assessments that helped guide both counter-propaganda efforts and broader policy decisions. The work was content analysis as inference about a producer: from the text, back to the propagandist's strategic state of mind. The method got its first major public-policy validation in this period (Lasswell, Lerner, & Pool, 1952).
The post-war codification belongs to Bernard Berelson. His Content Analysis in Communication Research (1952) gave the field its first textbook and its most-cited definition: content analysis is "a research technique for the objective, systematic, and quantitative description of the manifest content of communication" (Berelson, 1952, p. 18). Each of the four adjectives in that definition matters. Objective meant that the analysis was repeatable by other analysts. Systematic meant that the same procedure was applied across the entire sample. Quantitative meant counting, not impression. Manifest meant the literal surface of the text — what was said — rather than what the analyst imagined the author meant. Berelson's stance was, in effect, the high-modernist version of content analysis: positivist, quantitative, and resolutely uninterested in latent meaning.
The next major recodification came from Klaus Krippendorff, whose Content Analysis: An Introduction to Its Methodology (1980, 2004, 2013, and now in its 4th edition, 2018) brought the method into the contemporary social-science mainstream. Krippendorff's most important moves were two. First, he redefined content analysis as "a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use" (Krippendorff, 2018, p. 24) — broadening Berelson's "objective description" to "inference about context," which made room for latent content. Second, he gave the field its statistical centre of gravity by developing Krippendorff's alpha, the reliability statistic that has since become the methodological standard for content analysis (we will use it in Section 2). Krippendorff's text remains the field's most-cited methodological reference and the standard against which contemporary content-analytic studies are judged.
The synthesis you are reading in HSCI 841 comes from Bernard, Wutich, and Ryan (2017), who present content analysis as part of a continuum of text-analytic methods rather than as a standalone enterprise. Their stance is that the strict separation Berelson drew between qualitative and quantitative analysis was always more rhetorical than real, and that contemporary content analysis is best understood as a method that blends the two: qualitative judgement determines the coding scheme; quantitative procedures determine how the codes are distributed and whether the differences are reliable. This is the position that Hsieh and Shannon (2005) articulated influentially in the health-research literature when they distinguished conventional, directed, and summative content analysis — a typology that has since structured most qualitative-content-analytic work in nursing, health services research, and public health.
| Tradition | Approximate dates | Defining commitment | What you would publish |
|---|---|---|---|
| Lasswell propaganda analysis | 1927–1948 | Systematic counting of explicit features to infer producer intent | Frequency tables of symbols, names, and themes — usually classified documents |
| Berelson classical content analysis | 1952–1970s | Objective, systematic, quantitative description of manifest content | Tables of code frequencies, with explicit coding rules and percent-agreement reliability |
| Krippendorff contemporary content analysis | 1980–present | Replicable and valid inference from text to context; explicit room for latent content; rigorous reliability statistics | Code distributions, inferential tests, Krippendorff's alpha as the reliability statistic, explicit sampling logic |
| Qualitative content analysis (Hsieh & Shannon, 2005) | 2000s–present | Conventional / directed / summative variants; categories may emerge from data, be imposed from theory, or be derived from word counts | Coded extracts plus a frequency table; mixed-method publications in health journals |
| Bernard/Wutich/Ryan synthesis (this course) | 2017 | Content analysis as a hybrid method that explicitly turns codes into variables and analyses them with descriptive and inferential statistics | A coded corpus, a frequency-by-subgroup table, a defensible inferential test, and an interpretive narrative |
Why the history matters for your capstone
When you write your eventual capstone methods section, you will need to position your content-analytic work in this lineage. Most contemporary health-research applications cite Krippendorff (2018) for the reliability machinery, Hsieh & Shannon (2005) for the typology, and Bernard, Wutich, and Ryan (2017) or Schreier (2012) for the practical workflow. Knowing where each one sits in the genealogy lets you make defensible methodological choices — for example, whether to treat your codes as mutually exclusive (Berelson) or to allow multi-coding (Krippendorff/Schreier).
1.2 Manifest Versus Latent Content
Key insight - Manifest vs latent is a continuum, not a binary
The classical distinction holds that manifest content is what is literally on the page ('the word vaccination appears') and latent content is what is implied or interpreted ('this passage frames vaccination as a moral obligation'). In practice, every coding decision involves some interpretation — even counting the word 'vaccine' requires deciding whether 'vaccinated' counts. The more latent the code, the more important reliability evidence and explicit decision rules become. A defensible content analysis names where its codes sit on the manifest-to-latent continuum.
The single most-asked methodological question about content analysis is: are you coding what the text literally says, or what it means? Bernard, Wutich, and Ryan's answer — and the answer of the modern field — is that the choice is yours, but it must be explicit.
Manifest content is the literal, surface-level content of the text. If your code is "mentions of pets," a manifest coder counts every appearance of the words pet, dog, cat, Rufus, Marie's cat, parrot, and so on. The advantage is that two manifest coders working from a clear word-list will produce nearly identical counts; the disadvantage is that manifest content misses everything the participant is doing with the text other than literal naming.
Latent content is the underlying meaning that requires interpretation to surface. If your code is "loneliness framed as the cost of having loved," a latent coder reads a passage and decides whether the participant is articulating the idea, regardless of whether the specific words "cost" or "love" appear. Linda's account of Bill's empty chair (P05) is a latent expression of "loneliness-as-residue-of-marriage" — she does not use that phrase, but the meaning is unmistakable to a competent reader. Latent coding is more interpretively powerful but harder to do reliably.
Berelson's classical definition (1952) restricted content analysis to manifest content. Krippendorff and the subsequent generation rejected the restriction, on the grounds that inference about context — the contemporary purpose of content analysis — almost always requires reading for latent meaning. The modern compromise is that content analysis routinely codes both, but the analyst must declare which is which and must demonstrate that latent codes can be applied reliably.
| Feature | Manifest coding | Latent coding |
|---|---|---|
| Unit examined | Words, phrases, named entities | Passages, propositions, implicit meanings |
| Decision procedure | Match against a dictionary or word-list | Read for meaning; apply rule from codebook |
| Reliability difficulty | Lower — identical word lists give identical counts | Higher — requires training and an explicit codebook |
| Interpretive depth | Shallow but defensible | Deeper but contestable |
| Typical reliability target (Krippendorff's alpha) | α ≥ 0.80 readily achievable | α 0.67–0.80 acceptable for tentative inference; α ≥ 0.80 for definitive claims |
A worked example from the loneliness corpus
Consider the question: how often do participants describe loneliness in terms that involve a piece of household furniture?
Manifest version: Count every occurrence of the words chair, couch, sofa, bed, table across all 20 transcripts. Linda mentions "chair" 8 times in P05; Helen mentions "chair" 3 times in P11. Manifest count: straightforward, replicable, and largely uninformative on its own.
Latent version: Code any passage in which a piece of furniture stands in for an absent person or a lost role. Linda (P05) describes Bill's chair as the empty space he used to fill — that is a clear latent instance. Helen (P11) describes her armchair as the place where she does not have anyone to read with — also latent, but with a different propositional structure (her loss is structural, not bereavement-specific). Linda's mention of the kitchen table where she ate dinner with Bill — latent, same theme. The latent code furniture-as-trace-of-absent-other may apply to 9 transcripts (vs. 11 transcripts for the manifest word-list), but the latent code is doing meaningful analytic work while the manifest word-list is not.
In your capstone, you will frequently want both: the manifest word-list as a fast first pass, and the latent code as the substantive analytic move.
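The manifest first pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's prescribed tooling: the word-list, the simple plural rule, the function name `manifest_counts`, and the two toy transcripts are all assumptions made for the example, and the text is invented rather than quoted from the loneliness corpus.

```python
import re
from collections import Counter

# Hypothetical furniture word-list; the terms and the plural rule are assumptions.
FURNITURE_TERMS = ["chair", "couch", "sofa", "bed", "table"]

# \b anchors keep "bed" from matching inside "bedroom";
# the optional "s" admits simple plurals such as "chairs".
PATTERN = re.compile(r"\b(" + "|".join(FURNITURE_TERMS) + r")s?\b", re.IGNORECASE)


def manifest_counts(transcripts):
    """Return {transcript_id: Counter of matched terms} for a dict of raw texts."""
    counts = {}
    for transcript_id, text in transcripts.items():
        matches = (m.group(1).lower() for m in PATTERN.finditer(text))
        counts[transcript_id] = Counter(matches)
    return counts


# Invented placeholder text, not quotations from the course corpus.
toy_transcripts = {
    "T01": "His chair is still by the window. I dust the chair every week.",
    "T02": "I eat at the table alone now. The bedroom feels too big.",
}

counts = manifest_counts(toy_transcripts)
for transcript_id, counter in counts.items():
    print(transcript_id, dict(counter))
```

Even this small sketch forces decision rules into the open: as written, the pattern counts "chairs" but not "armchair" or "bedroom". Whether those should count is exactly the kind of choice a manifest codebook has to state.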
1.3 Codes as Variables — The Operational Move
Here is the single move that makes content analysis what it is. In thematic analysis (Module 5), a code is a label you attach to a passage; the code's job is to organise interpretation. In content analysis, the same code becomes a variable: a column in a data frame, with a value for every unit in the corpus. Once codes are variables, the entire apparatus of descriptive and inferential statistics is available.
The reframing is subtle but transformative. Consider the code loneliness-as-residue-of-marriage. In thematic analysis, the code is a finding: "this is one of the kinds of loneliness participants describe." In content analysis, the code becomes a variable that takes the value 1 for every transcript in which the theme appears and 0 elsewhere. Now you can ask: in what proportion of transcripts does the theme appear? Does that proportion differ between widowed and non-widowed participants? Is the difference larger than chance? Does the proportion grow over the course of the interview — that is, do participants need to talk for a while before they articulate the theme? Each of these is a statistical question that the codes-as-variables move makes available.
Bernard, Wutich, and Ryan are explicit that this move is what places content analysis in the hybrid position it occupies. It is not "qualitative analysis with numbers attached" (which is what Berelson thought it was), nor "a quantitative method applied to text" (which is what nineteenth-century philology was). It is a method in which qualitative judgement is required to define the variables, and quantitative procedures are used to analyse them. Both halves are necessary; neither is optional.
| Move | Thematic analysis (Module 5) | Content analysis (this module) |
|---|---|---|
| What a code is | A label organising interpretive material | A variable with a value for every unit |
| What you report | The code, with illustrative quotes | The code's frequency, distribution, and inferential tests |
| What you defend | That the code names something real in the data | That the code is reliably applied AND names something real in the data |
| How you defend it | By showing illustrative passages | By showing inter-coder agreement (Krippendorff's alpha) AND illustrative passages |
| What "more data" buys you | Greater confidence that you have saturated the themes | Statistical power to detect distributional differences |
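To make the codes-as-variables move concrete, the sketch below stores one code as a 0/1 column and compares its proportion across two subgroups. The participant IDs, the `widowed` flag, the column name `residue_of_marriage`, and every value are invented for illustration; real data would come from your own coded transcripts.

```python
# Toy codes-by-participants data: 1 = theme present in the transcript, 0 = absent.
# IDs, subgroup labels, and values are invented for illustration only.
coded = [
    {"id": "A", "widowed": True,  "residue_of_marriage": 1},
    {"id": "B", "widowed": True,  "residue_of_marriage": 1},
    {"id": "C", "widowed": True,  "residue_of_marriage": 1},
    {"id": "D", "widowed": True,  "residue_of_marriage": 0},
    {"id": "E", "widowed": False, "residue_of_marriage": 0},
    {"id": "F", "widowed": False, "residue_of_marriage": 1},
    {"id": "G", "widowed": False, "residue_of_marriage": 0},
    {"id": "H", "widowed": False, "residue_of_marriage": 0},
]


def proportion_by_group(rows, code, group_key):
    """Proportion of units with the code present, split by a grouping variable."""
    totals, hits = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + row[code]
    return {group: hits[group] / totals[group] for group in totals}


overall = sum(row["residue_of_marriage"] for row in coded) / len(coded)
by_widowhood = proportion_by_group(coded, "residue_of_marriage", "widowed")

print("overall proportion:", overall)
print("by widowhood:", by_widowhood)
```

Once the code lives in a column like this, the descriptive questions (what proportion, and for whom) become one-line computations, and the inferential question (is the difference larger than chance) becomes a test on the resulting table.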
1.4 When Content Analysis Is the Right Tool
Not every qualitative research question calls for content analysis. The method's distinctive payoff is in questions that involve distributional comparison: how often something appears, whether subgroups differ in their use of a code, whether something changes over time. Where the question is genuinely interpretive — what does this single passage mean, or how does this participant make sense of their experience — thematic analysis (Module 5), schema analysis (Module 9), or narrative analysis (also Module 9) will give you more analytic traction than content analysis will.
Concretely, you should reach for content analysis when one or more of the following are true:
- You have a comparison question. Caregivers versus non-caregivers; younger versus older; immigrant versus native-born; pre-pandemic versus post-pandemic. Content analysis is built for these.
- Your corpus is large enough that close reading alone would be impractical. Twenty transcripts is on the lower end — you can certainly do content analysis on a corpus this size, and you will in this module — but for corpora of 50, 200, or 5,000 documents, content analysis is one of the few options that scales.
- You need replicable counts that other researchers can audit. Where policymakers or journal reviewers will demand "how many" rather than "how rich," content analysis is the methodological answer.
- Your research question is longitudinal. Trends, shifts, before/after comparisons — content analysis is the standard tool for them in the qualitative literature.
- You are writing for a quantitative audience. Public-health journals routinely ask for the kind of distributional evidence that content analysis produces. Where thematic analysis would be rejected as too impressionistic, content analysis can be persuasive.
Reflection
Pick one of your candidate themes from the Week 5 codebook you built in Module 5. Now answer two questions about it: (1) Is the theme manifest, latent, or both? (2) If you turned it into a variable, what comparison or distribution would you most want to examine? Be specific — name the subgroups, the cases, or the dimension of variation you would test.
Question 1: Which historical figure is most associated with the WWII propaganda-analysis programme that established content analysis as a serious social-science method?
Question 2: The operational move that makes content analysis a hybrid quantitative/qualitative method is best described as:
Question 3: Which of the following best characterises the relationship between manifest and latent content in contemporary content analysis?
Designing a Content Analysis — Sampling Text, Recording Units, and Reliability
Introduction and Overview
Section 1 framed content analysis methodologically. This section is about the design decisions that determine whether your content analysis will be defensible. There are three of them, in order: what counts as the text to be analysed (sampling), what counts as the unit being coded (recording units and context units), and how the codes are applied with sufficient consistency that the resulting counts mean something (reliability). Get any one of these wrong and the analysis is suspect; get all three right and you have a piece of content-analytic work that a sceptical quantitative reviewer will accept.
Bernard, Wutich, and Ryan are unusually explicit on these matters because content analysis, more than any other qualitative method, can be done badly in ways that produce numbers that look authoritative but are not. The history of the field is littered with frequency tables built on inconsistent coding of poorly specified units sampled from non-representative corpora. The remedy is procedural: you commit, in writing, to a sampling rule, a unit-definition rule, a coding scheme, and a reliability target, before you begin coding. You then execute the procedure transparently, report what happened, and let the reader audit it.
Learning Objectives for Section 2
- Distinguish sampling units, recording units, and context units, and choose each defensibly for your research question.
- Build a content-analytic coding scheme that is exhaustive, defensibly exclusive (or explicitly multi-coded), and clearly defined.
- Distinguish a priori (deductive) from emergent (inductive) coding schemes, and recognise the typical hybrid in practice.
- Compute Krippendorff's alpha and interpret it against the field's reliability thresholds.
2.1 Sampling Text — The Three Kinds of Units
Krippendorff's most enduring methodological contribution after the alpha statistic is the distinction between three levels of unit in any content-analytic design:
- Sampling units — the chunks of text that are drawn from the population. In a corpus of 20 loneliness transcripts, each transcript is a sampling unit. In a corpus of newspaper coverage of the overdose crisis, each article is a sampling unit. In a Twitter dataset, each tweet is a sampling unit.
- Recording units — the unit you actually code. The recording unit is where the variable takes its value. Recording units may be the same as sampling units (you code each transcript as a whole) or smaller (you code each sentence, each paragraph, each turn-at-talk, or each named theme-bearing passage).
- Context units — the surrounding text the coder is permitted to consult when deciding how to code a recording unit. If your recording unit is a sentence and your context unit is the paragraph, the coder reads the sentence in the context of the paragraph but assigns the code to the sentence alone.
The trio matters because the choice of each one shapes both what the analysis can show and how reliable it will be. A study that uses the whole transcript as the recording unit produces a frequency-per-transcript table; a study that uses the sentence as the recording unit produces a frequency-per-sentence table. These two analyses can give different answers to the same surface question, and the difference is methodological, not substantive.
| Recording unit choice | What it lets you measure | What it makes harder | Typical use in loneliness corpus |
|---|---|---|---|
| The whole transcript | Presence/absence of each code per participant | Intensity, frequency within a transcript, longitudinal pattern within an interview | "Of the 20 participants, how many invoke the loneliness-as-residue-of-marriage code at any point?" |
| The paragraph or speaker turn | Within-transcript distribution; co-occurrence of codes | Word-level features; reliability is harder than transcript-level | "In how many turns of Linda's interview does she invoke the absent-Bill theme, and where in the interview do those turns cluster?" |
| The sentence | Fine-grained distribution; sequence; intensity by participant | Higher coding burden; lower reliability for latent codes | "What proportion of Linda's sentences contain spatial metaphors for absence?" |
| The word or phrase (dictionary-based) | Massively scalable counts | Latent meaning entirely; ambiguity resolution; context | "How often does each transcript contain the word 'chair', and does the count correlate with widowhood status?" |
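The point that recording-unit choice changes the answer can be shown with a small sketch. It applies the same manifest stand-in for a coding decision at transcript level and at sentence level. The naive sentence splitter, the term list, and the toy texts are assumptions for illustration; a real study would use a latent codebook and human coders rather than a word match.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def code_present(unit, terms):
    """Manifest stand-in for a coding decision: does any term appear in the unit?"""
    lowered = unit.lower()
    return any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms)


# Invented placeholder text, not quotations from the course corpus.
toy = {
    "T01": "His chair is empty. I still set the table for two. "
           "The chair stays where it was.",
    "T02": "I moved the sofa last year. Mostly I go out walking. "
           "My daughter calls on Sundays. We talk about the garden.",
}
terms = ["chair", "table", "sofa"]

# Recording unit = whole transcript: share of transcripts with the code present.
transcript_level = sum(code_present(t, terms) for t in toy.values()) / len(toy)

# Recording unit = sentence: share of sentences with the code present.
sentences = [s for t in toy.values() for s in split_sentences(t)]
sentence_level = sum(code_present(s, terms) for s in sentences) / len(sentences)

print(f"transcript-level proportion: {transcript_level:.2f}")
print(f"sentence-level proportion:   {sentence_level:.2f}")
```

Both transcripts contain the code, so the transcript-level proportion is 1.0, yet only four of the seven sentences do. Neither figure is wrong; they answer different questions, which is why the recording unit has to be declared before coding begins.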
Recommended for your capstone: transcript-level recording units with paragraph-level context
In a 20-transcript corpus, the most defensible choice is to code each transcript as either containing or not containing each code (a binary recording-unit-per-transcript design), with each paragraph as the context unit you read to make the decision. This gives you a clean codes × participants matrix that supports chi-squared comparisons across subgroups. The trade-off is that you cannot measure within-transcript intensity — but for a corpus this size, between-transcript distributional analysis is the analytically productive level.
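As a rough illustration of the subgroup comparison such a matrix supports, the sketch below computes a Pearson chi-squared statistic (without continuity correction) for a 2 × 2 table using only the standard library. The counts are invented, the function name `chi_squared_2x2` is hypothetical, and the code assumes every row and column total is non-zero. The p-value uses the fact that, with one degree of freedom, the upper-tail probability equals `erfc(sqrt(statistic / 2))`.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared (no continuity correction) for a 2x2 table of counts.

    table = [[a, b], [c, d]]: rows are subgroups, columns are code present/absent.
    Returns (statistic, p_value, expected_counts). Assumes non-zero margins.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected


# Invented counts for 20 transcripts:
# 8 widowed (code present in 6), 12 non-widowed (code present in 3).
observed = [[6, 2], [3, 9]]
statistic, p_value, expected = chi_squared_2x2(observed)
smallest_expected = min(min(row) for row in expected)

print(f"chi-squared = {statistic:.3f}, p = {p_value:.3f}")
print(f"smallest expected count = {smallest_expected:.1f}")
```

With only 20 transcripts, some expected counts will often fall below 5, as they do in this toy table. A common rule of thumb is to treat the chi-squared approximation with caution in that case and to consider an exact test instead.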
2.2 Developing a Coding Scheme
The design-to-analysis sequence runs through six steps:
- Sampling unit. What counts as a case: documents, paragraphs, speaker turns, or articles? The sampling unit is the denominator of your eventual frequencies, so decide it before coding.
- Recording unit. What counts as a single coded instance: a word, a clause, a sentence, or a paragraph? The recording unit determines how granular the codes are and how much data each code generates.
- Code sources. Codes come from three sources: (1) prior literature or theory (deductive codes), (2) pilot reading of the data (inductive codes), and (3) iterative refinement from the first 10–20% of the corpus. The final scheme almost always combines deductive and inductive codes.
- Pilot coding. Two or more coders apply the scheme to a pilot set (typically 10–15% of the corpus). Compare disagreements line by line. Refine code definitions, add inclusion and exclusion rules, and add exemplars. Re-pilot until reliability is acceptable.
- Full-corpus coding. Trained coders code the full corpus. Spot-check reliability on a held-out sample (typically 10%) and re-train if reliability drifts.
- Analysis. Produce frequency tables, cross-tabulations, chi-squared tests, dictionary expansions, and time trends. The countable output is the comparative advantage of content analysis over thematic analysis.
The coding scheme — the codebook, in the language of Module 5 — is the spine of any content analysis. In Module 5 you built one for thematic analysis; here you adapt it for the variable-treatment of content analysis. Three properties matter more in content analysis than they did in thematic analysis.
Exhaustive coverage and defensible exclusivity. Your coding scheme should cover the content you care about. For content analysis specifically, exhaustive coverage means that every recording unit can be assigned at least one code (or the explicit "not-coded" residual category). The classical Berelson position is that codes should also be mutually exclusive (each unit gets exactly one code), on the grounds that overlapping codes make the resulting frequencies hard to interpret. Krippendorff and the contemporary field reject the mutual-exclusivity requirement: a passage can simultaneously instantiate "loneliness-as-existential-fact" and "loneliness-coped-with-by-pet-companionship" and there is no good reason to force the coder to choose. The modern compromise is that multi-coded passages are permitted if the codebook says so and the multi-coding is applied consistently.
Clear operational definitions. Each code in your scheme needs a definition that another coder could apply without consulting you. The definition has three parts: a brief substantive description, an inclusion rule (what counts), and an exclusion rule (what doesn't count). Inclusion and exclusion rules are where reliability is won or lost. "Mentions of loneliness" is a definition without inclusion/exclusion rules. "Any statement in which the participant attributes loneliness to a specific event or relationship (inclusion); excluding generic statements about loneliness in the population or society at large (exclusion)" is a definition that can be applied reliably.
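One way to keep the three-part definition honest is to store each code as a small record and check that no part is blank. The sketch below is only an illustration of that idea: the class name `CodeDefinition`, its field names, and the code names are hypothetical, and the two entries paraphrase the contrast drawn in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeDefinition:
    """A codebook entry: substantive description plus inclusion and exclusion rules."""

    name: str
    description: str
    inclusion: str
    exclusion: str

    def missing_parts(self):
        """Return the names of the definition parts that are blank."""
        parts = {
            "description": self.description,
            "inclusion": self.inclusion,
            "exclusion": self.exclusion,
        }
        return [label for label, text in parts.items() if not text.strip()]


# A definition with no inclusion or exclusion rule (hypothetical code name).
vague = CodeDefinition(
    name="mentions_of_loneliness",
    description="Mentions of loneliness",
    inclusion="",
    exclusion="",
)

# A definition another coder could apply (hypothetical code name).
specific = CodeDefinition(
    name="attributed_loneliness",
    description="Loneliness attributed to a specific event or relationship",
    inclusion="Any statement in which the participant attributes loneliness "
              "to a specific event or relationship",
    exclusion="Generic statements about loneliness in the population "
              "or society at large",
)

for entry in (vague, specific):
    print(entry.name, "missing:", entry.missing_parts())
```

A check like this only catches empty fields. Whether the rules are clear enough to apply reliably is still settled by pilot coding and the reliability statistic.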
A priori versus emergent. Content-analytic codebooks come from one of two directions, or (usually) both. A priori codes come from the literature, your conceptual framework, or your interview guide; you decide before you read the data that you will code for, say, "stigma," "social-comparison," and "coping-by-substance-use" because the literature on loneliness says these are the right categories. Emergent codes come from the data — you read the corpus, notice what is there, and let the codebook grow to fit. Hsieh and Shannon (2005) call the first directed content analysis and the second conventional content analysis; their summative third type starts from word counts and works upward. Elo and Kyngäs (2008) and Mayring (2000) offer complementary process descriptions widely cited in nursing and European health-research traditions.
| Approach | Codebook source | Advantage | Risk |
|---|---|---|---|
| Conventional (inductive / emergent) | Codes emerge from close reading of the corpus | Sensitive to participants' own categories; surfaces unexpected themes | May reproduce the analyst's preconceptions; can be unsystematic |
| Directed (deductive / a priori) | Codes drawn from theory, prior literature, or instrument | Replicable; comparable across studies; testable against theory | May miss what the data are actually doing if the codebook is wrong |
| Summative | Codes start as word counts, then expand to latent meaning | Scales; obvious replicability; computational tractability | Manifest-only by default; risks counting words without coding meaning |
| Hybrid (most contemporary work) | A priori scaffold plus emergent expansion | Best of both; transparent about origin of each code | Requires explicit documentation of which codes came from where |
Recommended scaffold for the HSCI 841 content analysis
For your Week 8 milestone, the recommended workflow is hybrid: take 5–7 of the codes you already developed in your Module 5 Taguette codebook, declare them as the a priori scaffold, then apply them systematically across 8–12 transcripts. As you code, allow up to two emergent codes if you find a category that cannot be accommodated by the original 5–7. Document each emergent code with the same care as the a priori ones — brief description, inclusion rule, exclusion rule.
2.3 Inter-coder Reliability — Stricter Than Thematic Analysis
Reliability is the test of whether two coders, working independently with the same codebook, would assign the same codes to the same passages. In thematic analysis (Module 5), reliability matters but it is one consideration among several; in content analysis, reliability is load-bearing — the frequencies you report are only as good as the consistency with which the codes were applied. If two coders applying the same codebook produce different counts, the counts are noise.
The field's standard reliability statistic is Krippendorff's alpha (Krippendorff, 2004, 2018; Hayes & Krippendorff, 2007). Alpha has three properties that recommend it over the older Cohen's kappa: it accommodates any number of coders (not just two); it accommodates any level of measurement (nominal, ordinal, interval, ratio); and it handles missing data gracefully. The arithmetic compares the observed disagreement among coders to the disagreement expected by chance, and produces a coefficient that runs from 1 (perfect agreement) through 0 (chance) to negative values (worse than chance).
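The arithmetic can be written as alpha = 1 − D_o / D_e, where D_o is the observed disagreement and D_e is the disagreement expected by chance. The sketch below implements the nominal-data case from a coincidence matrix, using only the standard library. The function name `krippendorff_alpha_nominal`, the input layout (one list of coder values per recording unit, with `None` for a missing judgement), and the toy codes are assumptions for illustration. For ordinal or interval data, or for published work, an established statistics package is the safer choice.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes.

    units: list of lists; each inner list holds the values assigned to one
    recording unit by the coders, with None marking a missing judgement.
    """
    # Coincidence matrix: every ordered pair of values given to the same unit
    # by different coders, weighted by 1 / (number of values in that unit - 1).
    coincidences = Counter()
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit judged by fewer than two coders cannot be paired
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per category, and the total number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # For nominal data every mismatch counts equally (distance 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1 - observed / expected


# Invented binary codes from two coders on ten transcripts (one disagreement).
coder_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
coder_2 = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(coder_1, coder_2)])
print(f"alpha = {alpha:.3f}")
```

In this toy case the coders agree on nine of ten transcripts (90% raw agreement), yet alpha comes out at about 0.79, because agreement on the common "absent" value is partly what chance alone would produce. That gap is the reason the field reports alpha rather than percent agreement.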
| Krippendorff's alpha | Interpretation | Typical recommendation |
|---|---|---|
| α ≥ 0.80 | Strong agreement | Acceptable for definitive content-analytic claims (Krippendorff, 2018) |
| 0.67 ≤ α < 0.80 | Acceptable for tentative inference | Report; flag as tentative; discuss areas of disagreement |
| α < 0.67 | Inadequate | Revise the codebook and re-train; do not publish counts from this code |
The thresholds are convention, not law, and they vary slightly between sources. The 0.80 and 0.67 cut-offs come from Krippendorff's own writing and are the most commonly cited in the field. Some health-research applications adopt a stricter 0.70 floor and a 0.80 publication target. Whatever you adopt, declare it in your methods section before you compute it.
Reliability is computed on a subset of the corpus — typically 10–20% — that two coders code independently. The remaining 80–90% is then coded by one coder alone, with the assurance that the reliability of the system is documented. The reliability subset is usually drawn randomly from the corpus to ensure that the reliability estimate is generalisable.
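To make the arithmetic concrete, here is a minimal base-R sketch of nominal-level alpha, computed as 1 minus the ratio of observed to expected disagreement from a coincidence matrix. The function name `kripp_alpha_nominal` and the ten binary ratings are illustrative assumptions, not course materials. For real analyses a maintained implementation (for example `kripp.alpha()` in the `irr` package) is usually the safer choice.

```r
# Nominal Krippendorff's alpha from a coders x units matrix (NA = unit not coded).
# Illustrative sketch only; the function name is hypothetical.
kripp_alpha_nominal <- function(ratings) {
  values <- as.character(sort(unique(ratings[!is.na(ratings)])))
  coincidence <- matrix(0, length(values), length(values),
                        dimnames = list(values, values))
  for (u in seq_len(ncol(ratings))) {
    v <- as.character(ratings[!is.na(ratings[, u]), u])
    m <- length(v)
    if (m < 2) next  # a unit coded by fewer than two coders carries no information
    for (i in seq_len(m)) {
      for (j in seq_len(m)) {
        if (i != j) {
          # each ordered pair of values within a unit adds 1 / (m - 1)
          coincidence[v[i], v[j]] <- coincidence[v[i], v[j]] + 1 / (m - 1)
        }
      }
    }
  }
  n_c <- rowSums(coincidence)
  n <- sum(n_c)
  off_diagonal <- row(coincidence) != col(coincidence)
  observed_disagreement <- sum(coincidence[off_diagonal]) / n
  expected_disagreement <- (n^2 - sum(n_c^2)) / (n * (n - 1))
  1 - observed_disagreement / expected_disagreement
}

# Hypothetical ratings: two coders, ten passages, one binary code
ratings <- rbind(
  coder_a = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0),
  coder_b = c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0)
)
alpha <- kripp_alpha_nominal(ratings)
round(alpha, 3)  # 0.095
```

The two coders agree on 6 of 10 passages, yet alpha is only about 0.10, because most of that agreement is what chance alone would produce when the code is rare. This is why raw percent agreement is not an adequate substitute.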
What to do when reliability is low
A low alpha is not the end of the analysis; it is a diagnosis. The cause is almost always one of: (a) a code definition that is too vague (revise inclusion/exclusion rules); (b) a code definition that covers too much (split into sub-codes); (c) insufficient coder training (run additional training sessions on a separate set of practice passages); or (d) the code is genuinely contested in the corpus (which is itself a finding — report it). The workflow is iterative: train, code a subset, compute alpha, revise the codebook, re-train, re-code, re-compute. Two cycles is typical; three is not unusual.
2.4 The Workflow End-to-End
Take 5-10 short documents (e.g., recent news headlines about a public health topic). Develop a small coding scheme:
- Define 3-5 codes that capture features you care about (e.g., 'mentions vaccination', 'cites a Canadian source', 'uses risk framing').
- Write a one-sentence definition + a one-sentence inclusion rule for each code.
- Code the documents yourself once. Set the codes aside.
- The next day, re-code the same documents without looking at your original codes. Compute your own intra-coder reliability.
If you can’t agree with yourself, two independent coders will agree even less. Most content-analysis projects need 2-3 iterations before reliability is acceptable.
A complete content-analytic study, in Bernard, Wutich, and Ryan's framing, follows these steps in order:
- Formulate the research question — explicitly distributional or comparative.
- Define the population of texts — what corpus is the analysis drawn from?
- Sample the texts — if the population is small enough, the entire population may be sampled; otherwise, define a sampling frame and a sampling procedure.
- Decide on sampling, recording, and context units — declare them in writing.
- Develop the codebook — a priori, emergent, or hybrid; with operational definitions for every code.
- Train coders — on a practice subset that does not enter the final analysis.
- Compute reliability on a 10–20% subset — Krippendorff's alpha; revise codebook if needed.
- Apply the final codebook across the corpus — one coder for the remaining 80–90% is typical for a small-corpus study.
- Analyse the resulting frequency table — descriptive statistics, then inferential tests (Section 3).
- Report transparently — corpus, sampling, units, codebook (in an appendix), reliability, analysis, interpretation.
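The reliability step in this list depends on drawing the double-coded subset at random. A minimal sketch in base R follows; the participant IDs, the seed, and the 20% fraction are illustrative assumptions.

```r
# Draw a random 20% reliability subset from a 20-transcript corpus.
# IDs, seed, and fraction are hypothetical choices for illustration.
set.seed(841)  # fix the seed so the draw can be reported and reproduced
transcript_ids <- sprintf("P%02d", 1:20)
n_reliability <- round(0.20 * length(transcript_ids))

reliability_subset <- sort(sample(transcript_ids, n_reliability))  # coded by both coders
single_coder_set <- setdiff(transcript_ids, reliability_subset)    # coded by one coder

reliability_subset
length(single_coder_set)
```

Recording the seed in the methods section lets a reader verify that the subset was not hand-picked.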
Reflection
For your Week 8 capstone milestone, declare the design decisions you will make. Which 5–7 codes from your Module 5 codebook will you use? What is your recording unit (whole transcript / paragraph / sentence)? What is your context unit? What reliability target will you adopt, and how will you handle codes that do not meet it?
loneliness-as-residue-of-marriage, cultural-untranslatability-of-loneliness, coping-by-pet-companionship, identity-of-being-lonely-person, technology-as-double-edged, fading-at-the-edges, and shame-prevents-disclosure. My recording unit will be the whole transcript: each transcript is coded as containing or not containing each code, producing a 20 × 7 binary matrix. My context unit is the paragraph: if a passage is ambiguous, the coder reads the surrounding paragraph before deciding. I will adopt Krippendorff's α ≥ 0.70 as my acceptability threshold and α ≥ 0.80 as my publication target. Codes that fall between 0.70 and 0.80 will be reported but flagged as tentative; codes below 0.70 will be revised and re-coded or dropped." A weak answer either does not commit to specific codes or does not name a reliability rule.
Question 1: In Krippendorff's terminology, what is a recording unit?
Question 2: Which reliability statistic is the contemporary standard for content analysis, accommodating any number of coders and any level of measurement?
Question 3: What is the typical interpretation of a Krippendorff's alpha of 0.74 on a content-analytic code?
Analyzing Coded Text — Hypothesis Testing, Comparisons, and Dictionaries
Introduction and Overview
Once your coding is done and your reliability is acceptable, you have a coded matrix: rows are recording units (transcripts, paragraphs, or sentences), columns are codes, and the cell values are either binary indicators (1 if the code applied, 0 otherwise) or counts (the number of times the code applied to that unit). This matrix is the input to the inferential half of content analysis. This section walks through what you can do with it: descriptive frequency tables, cross-tabulation by subgroup, chi-squared tests for distributional differences, and trend analyses where the corpus is longitudinal. It then introduces dictionary-based content analysis — LIWC and sentiment dictionaries — and previews the computational content analysis that Module 12 will develop in depth.
The intellectual point of this section is that content analysis is a hypothesis-testing method when it wants to be. Whatever you tested in your HSCI 410 regression work, you can test on content-analytic codes: differences in proportions, associations, trends, interactions. The variables are codes rather than survey items, but the inferential machinery is the same. This is what Bernard, Wutich, and Ryan mean when they say content analysis is the bridge between the qualitative and quantitative traditions.
Learning Objectives for Section 3
- Compute and present a content-analytic frequency table at three levels: overall, by subgroup, and over time.
- Apply chi-squared tests of independence (and Fisher's exact for small cells) to codes × subgroup matrices.
- Interpret distributional differences in content-analytic data substantively, not only statistically.
- Recognise when dictionary-based content analysis (LIWC, sentiment dictionaries) is appropriate.
- Locate computational content analysis (Module 12) as the scalable extension of the dictionary approach.
3.1 Frequency Tables — The Starting Point
Every content-analytic study starts with descriptive frequencies. For your loneliness corpus, a typical first-pass frequency table looks like the one below: each of seven codes, the count of transcripts in which the code appears (out of 20), and the percentage. The codes here are illustrative; your own codebook will produce different numbers.
| Code | Transcripts (n / 20) | Percentage |
|---|---|---|
| loneliness-as-residue-of-marriage | 6 | 30% |
| cultural-untranslatability-of-loneliness | 4 | 20% |
| coping-by-pet-companionship | 9 | 45% |
| identity-of-being-lonely-person | 11 | 55% |
| technology-as-double-edged | 14 | 70% |
| fading-at-the-edges | 5 | 25% |
| shame-prevents-disclosure | 13 | 65% |
Already this table is doing analytic work. The most prevalent code — technology-as-double-edged — appears in 14 of 20 transcripts, suggesting that ambivalence about phones, social media, and video calls is a near-universal feature of contemporary loneliness narratives, regardless of who the participant is. The least prevalent — cultural-untranslatability — appears in 4 transcripts, and you can already predict who those four participants are: the ones who emigrated to Canada from elsewhere (Amira, Maya's mother, two others). The frequency table tells you both what is shared across the corpus and what is concentrated.
3.2 Cross-Tabulation by Subgroup — The Comparison Move
Content analysis's real value emerges when you cross-tabulate the codes by a participant-level variable: age band, gender, caregiving status, immigration status, life-stage. Here is the same code distribution disaggregated by caregiver status in our worked corpus (10 caregivers, 10 non-caregivers):
| Code | Caregivers (n=10) | Non-caregivers (n=10) | Total (n=20) |
|---|---|---|---|
| loneliness-as-residue-of-marriage | 2 (20%) | 4 (40%) | 6 (30%) |
| cultural-untranslatability-of-loneliness | 2 (20%) | 2 (20%) | 4 (20%) |
| coping-by-pet-companionship | 3 (30%) | 6 (60%) | 9 (45%) |
| identity-of-being-lonely-person | 6 (60%) | 5 (50%) | 11 (55%) |
| technology-as-double-edged | 7 (70%) | 7 (70%) | 14 (70%) |
| fading-at-the-edges | 1 (10%) | 4 (40%) | 5 (25%) |
| shame-prevents-disclosure | 9 (90%) | 4 (40%) | 13 (65%) |
Now the analytic action begins. The shame-prevents-disclosure code appears in 9 of 10 caregivers but only 4 of 10 non-caregivers — a striking difference that is the kind of finding content analysis is built to produce. The fading-at-the-edges code shows the reverse pattern: more common in non-caregivers (likely the older, more isolated participants like Helen) than in caregivers. The technology-as-double-edged code is invariant by subgroup, supporting the earlier hypothesis that this is a near-universal feature of contemporary loneliness regardless of life-stage. These patterns are descriptive; the next move is to ask whether they are larger than chance.
3.3 Chi-Squared Tests on Codes × Subgroup
The classical inferential test for a 2 × 2 (or larger) contingency table of categorical data is the chi-squared test of independence. Applied to a content-analytic frequency table, the chi-squared test asks: is the distribution of this code statistically different between the subgroups? The mechanics are familiar from HSCI 230 and 410; here we apply them to qualitative codes treated as variables.
Take the shame-prevents-disclosure code from the table above: 9 of 10 caregivers vs. 4 of 10 non-caregivers. The Pearson chi-squared test on this 2 × 2 table, without continuity correction, gives χ² = 5.49 with 1 degree of freedom, p = 0.019. The difference is statistically significant at conventional thresholds, although two of the four expected cell counts are 3.5, so the small-sample caution below applies. Practically (and this is the part the qualitative analyst contributes), the analysis suggests that caregiver participants in this corpus systematically describe shame as a barrier to disclosing their loneliness, in a way that non-caregivers do not. The substantive interpretation might be that caregiving identity carries a moral demand of unselfish endurance, and that admitting loneliness conflicts with that demand. The chi-squared test is the warrant for the claim; the qualitative reading is the claim.
When to use Fisher's exact instead
Chi-squared tests assume that expected cell counts are at least 5 (some sources say 1). In small qualitative corpora — especially those with rare codes — this assumption frequently fails. The standard remedy is Fisher's exact test, which gives an exact p-value regardless of cell size. In R, fisher.test() is a drop-in replacement for chisq.test(); for a 20-transcript corpus with codes appearing in 4–14 transcripts, Fisher's exact is almost always the more defensible choice. Report both if you wish, but lead with Fisher's.
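The sketch below runs the three candidate tests on the shame-prevents-disclosure counts from the Section 3.2 table, using only base R. The object names are illustrative; the counts are the ones reported above.

```r
# Rows: caregiver status; columns: code present / absent (counts from Section 3.2)
shame <- matrix(c(9, 1,
                  4, 6),
                nrow = 2, byrow = TRUE,
                dimnames = list(group = c("caregiver", "non-caregiver"),
                                code = c("present", "absent")))

# Expected counts under independence: (row total x column total) / n
expected <- outer(rowSums(shame), colSums(shame)) / sum(shame)
min(expected)  # 3.5, below the usual floor of 5

# suppressWarnings(): R warns that the approximation may be poor for small expected counts
pearson <- suppressWarnings(chisq.test(shame, correct = FALSE))  # uncorrected
yates <- suppressWarnings(chisq.test(shame))                     # R's 2 x 2 default
exact <- fisher.test(shame)

round(c(pearson = pearson$p.value,
        yates = yates$p.value,
        fisher = exact$p.value), 3)
# pearson 0.019, yates 0.061, fisher 0.057
```

On these counts the uncorrected chi-squared p-value falls below 0.05 while the continuity-corrected and exact p-values fall just above it. That sensitivity is a good reason to declare the test in advance and to report the effect size alongside the p-value.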
3.4 Trend Analysis — Codes Over Time
Content analysis is at its strongest in longitudinal applications. Where your corpus has a time dimension — coverage of a topic across years of newspaper articles, posts on a forum across the pandemic, or transcripts collected at multiple time points — the codes can be plotted over time and tested for trend. The classical example is Pool's (1952) analysis of editorial coverage of the Soviet Union across decades; the contemporary example is the proliferation of computational content analyses of social-media discourse before, during, and after specific events (elections, crises, vaccine roll-outs).
Your loneliness capstone is cross-sectional, so trend analysis is not the dominant move — but you may have a quasi-temporal variable. For example, you could ask whether participants who were interviewed earlier in the corpus (the first 10 transcripts) differ systematically from those interviewed later (the last 10) in any code, as a way of probing whether your codebook was over-fitted to early transcripts. Or you could examine whether participants who lived through the pandemic as adults (versus those who were teenagers at the time) describe loneliness differently. These are pseudo-trend analyses on a cross-sectional corpus, but the inferential logic is the same.
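For a corpus that does have a time dimension, one simple option is a chi-squared test for trend in proportions, available in base R as `prop.trend.test()`. The sketch below uses invented yearly counts purely to show the mechanics; the years and numbers are hypothetical, not drawn from any study.

```r
# Hypothetical data: articles carrying a code, out of articles sampled, per year
years <- 2019:2022
articles_with_code <- c(4, 7, 11, 15)
articles_sampled <- c(20, 20, 20, 20)

# Proportion of sampled articles carrying the code in each year
round(articles_with_code / articles_sampled, 2)

# Test for a linear trend in those proportions across equally spaced years
trend <- prop.trend.test(articles_with_code, articles_sampled,
                         score = seq_along(years))
trend$statistic
trend$p.value
```

The same call works for the pseudo-trend comparisons described above if the ordered groups are, for example, successive blocks of interviews, although with only two blocks a 2 × 2 Fisher's exact test is the more natural choice.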
3.5 Dictionary-Based Content Analysis
So far we have assumed that humans are applying the codes. Dictionary-based content analysis is the variant in which a pre-specified word list (a "dictionary") is applied to the corpus by a computer, and the resulting counts are treated as content-analytic variables. The approach is best understood as the manifest-content extreme of content analysis: it counts what is explicitly there, fast and at scale, at the cost of any latent-meaning sensitivity.
The best-known dictionary in the social sciences is the Linguistic Inquiry and Word Count (LIWC) dictionary, developed by James Pennebaker and colleagues from the early 1990s (Pennebaker, Boyd, Jordan, & Blackburn, 2015; Boyd, Ashokkumar, Seraj, & Pennebaker, 2022). LIWC is a curated set of about 90 categories — positive emotion, negative emotion, anxiety, sadness, body, health, family, social processes, cognitive processes, and so on — each defined by a list of words and word stems. Run LIWC over a transcript and it returns the percentage of words in each category. A transcript that scores high on "sadness" contains a high proportion of words like sad, lonely, grief, cry, miss, lost; one that scores high on "social processes" contains a high proportion of pronouns referring to other people.
LIWC has been used extensively in health-related text analysis: predicting depression from writing samples (Pennebaker, Mehl, & Niederhoffer, 2003), characterising trauma narratives, distinguishing the writing of patients on different medications, predicting suicide risk from social-media posts. It is the methodological ancestor of contemporary sentiment-analysis systems and remains in active use; LIWC-22 is the current version.
Sentiment dictionaries are the lighter-weight cousin of LIWC. The best-known are Bing Liu's Bing lexicon (about 6,800 words classified as positive or negative), the AFINN lexicon (a scored version, with words assigned valence from -5 to +5), and the NRC Word-Emotion Association Lexicon (which classifies words into Plutchik's eight basic emotions plus positive/negative). All three can be loaded through the get_sentiments() function in the R tidytext package (AFINN and NRC are downloaded via the companion textdata package) and can be applied to a transcript corpus in under 50 lines of code (you will do this in the Module 12 capstone).
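The core operation behind every dictionary method is the same: tokenise the text, then report the share of tokens that match a word list. The base-R sketch below shows that operation on two invented sentences. The six-word list simply reuses the example sadness words mentioned above and is not the actual LIWC category; the function names are hypothetical.

```r
# Illustrative word list (NOT the real LIWC "sadness" dictionary)
sadness_words <- c("sad", "lonely", "grief", "cry", "miss", "lost")

# Invented example passages standing in for transcripts
transcripts <- c(
  P01 = "I miss him. The house is quiet and I feel lonely most evenings.",
  P02 = "Work keeps me busy, and my friends call every week."
)

# Lowercase, split on anything that is not a letter or apostrophe, drop empties
tokenise <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[tokens != ""]
}

# Percentage of tokens that appear in the dictionary
dictionary_pct <- function(text, dictionary) {
  100 * mean(tokenise(text) %in% dictionary)
}

scores <- sapply(transcripts, dictionary_pct, dictionary = sadness_words)
round(scores, 1)  # P01 15.4, P02 0.0
```

The sketch matches whole words only. Real dictionaries also match word stems, so "missing" or "lonelier" would count there but not here.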
Dictionary-based vs. human-coded content analysis
The choice is not exclusive. Most rigorous contemporary content-analytic studies use both: dictionary-based methods for the manifest, scale-sensitive, replicable counts, and human-coded latent analysis for the meaning-sensitive, ambiguity-tolerant, theory-engaged interpretation. The two halves answer different questions and are reported alongside each other. The danger is treating dictionary output as the whole analysis — LIWC says a transcript scores 4.2% on "sadness," but it cannot tell you that the sadness is bereavement-specific, that it co-occurs with relief, or that the participant is describing the sadness ironically. Only human coding can.
3.6 Computational Content Analysis — A Preview of Module 12
Computational content analysis is the scalable extension of the dictionary approach, plus the addition of unsupervised methods that discover categories rather than counting pre-specified ones. The three families you will meet in Module 12 are: topic models (Latent Dirichlet Allocation and its successors, which discover thematic structure in large corpora without a codebook); word-embedding methods (which represent words as vectors in a high-dimensional space, enabling semantic similarity and analogy operations); and large-language-model-based coding (GPT-4-class systems applied as zero-shot or few-shot coders).
The methodological status of computational content analysis is still being worked out. The advantages are obvious: speed, scale, replicability of the computational pipeline. The disadvantages are real: opacity (LDA and LLMs cannot show their work the way a human coder's audit trail can), brittleness (a topic-model solution is sensitive to hyperparameter choices in ways the analyst rarely audits), and the recurring failure to validate computational outputs against human-coded gold standards. Bernard, Wutich, and Ryan's stance — and the stance of this course — is that computational content analysis is a powerful complement to human content analysis, not a replacement for it. Module 12 will give you the operational machinery; Module 8 (this module) is establishing the framework that machinery extends.
| Variant | What it counts | Scalability | Reliability | Latent-content sensitivity |
|---|---|---|---|---|
| Human-coded latent content analysis (this module) | Codebook-defined themes | ~50–200 documents per project | Krippendorff's alpha computed and reported | High |
| Human-coded manifest content analysis | Word or phrase occurrences | ~500–5,000 documents per project | Near-perfect | Low |
| Dictionary-based (LIWC, sentiment) | Pre-specified word lists | Millions of documents | Perfect within the dictionary | Very low |
| Topic models (LDA; Module 12) | Unsupervised thematic structure | Millions of documents | Depends on hyperparameter validation | Moderate but opaque |
| LLM-based coding (Module 12) | Anything specifiable in a prompt | Limited by API cost, not by labour | Variable; requires gold-standard validation | High but unverifiable |
Reflection
Consider the comparison in section 3.3: shame-prevents-disclosure appeared in 9 of 10 caregivers vs. 4 of 10 non-caregivers, χ² p = 0.019. Imagine a reviewer who is sceptical of qualitative work pushes back: "you have 20 transcripts — can you really make a chi-squared claim?" Write your response. What are the legitimate cautions, and what are the legitimate defences?
Question 1: A content-analytic study finds that a code appears in 9 of 10 caregivers and 4 of 10 non-caregivers. Which test is the most defensible inferential choice given the small expected cell counts?
Question 2: Which best characterises the relationship between LIWC and human-coded latent content analysis?
Question 3: Why is a chi-squared (or Fisher's exact) test on a content-analytic codes × subgroup matrix a legitimate inferential move, even when the sample is small and purposive?
Content-Analyzing the Loneliness Dataset — R Workflow and the Week 8 Capstone
Introduction and Overview
Sections 1–3 framed content analysis methodologically. This section turns operational. You will see, end to end, how to take a Taguette codebook (from your Module 5 work), apply it across the 20 loneliness transcripts, transform the resulting export into a codes × cases matrix in R, compute frequency tables and visualisations, run chi-squared and Fisher's exact tests on codes × subgroup, and conduct a keyness analysis with quanteda.textstats that surfaces which words differ most between two subcorpora.
The R workflow that follows assumes you completed the Module 5 milestone: you have an exported Taguette CSV with one row per (passage, code) pair, columns for the document filename, the tag (code), and the passage content. Module 5 covered 3–5 transcripts; Module 8 extends the codebook to 8–12 transcripts. The volume jump is intentional: you need enough cases for the frequency tables and chi-squared tests to be analytically informative, and 8–12 transcripts is the floor for a defensible content-analytic capstone milestone.
Learning Objectives for Section 4
- Load a Taguette export into R and reshape it into a codes × cases matrix.
- Compute frequency tables (overall, by subgroup) using dplyr.
- Visualise code frequencies by subgroup with a ggplot2 bar chart.
- Conduct chi-squared and Fisher's exact tests on codes × subgroup.
- Run a keyness analysis with quanteda.textstats::textstat_keyness() to compare subcorpora.
- Complete the Week 8 capstone deliverable.
4.1 Step 1 — Load the Taguette Export Into R
Your Taguette export is a CSV with one row per highlighted passage. Each row contains the document filename (which encodes the participant ID), the tag (your code), and the passage content. The first transformation in R is to read it into a tibble and add the participant-level variables (age, gender, caregiver status, etc.) needed for subgroup comparison.
Open RStudio. Working from the repository root, run:
library(tidyverse)
library(readtext)  # used in Step 6 to read the raw transcripts

# Read the Taguette export: one row per (passage, code) pair
codings <- read_csv("term projects/HSCI_841/taguette_export_week8.csv")
glimpse(codings)
# Typical columns: document, tag (= code), content (= passage text), start, end

# Pull participant_id and pseudonym out of the document filename
# (the patterns expect names of the form <participant_id>_<pseudonym>.txt)
codings <- codings |>
  mutate(
    participant_id = str_extract(document, "P[0-9]+"),
    pseudonym = str_extract(document, "(?<=_)[A-Za-z]+(?=\\.txt)")
  )

# Read the participant metadata table (created from transcript headers)
# Columns: participant_id, pseudonym, age, gender, life_stage, caregiver, immigrant
participants <- read_csv("term projects/HSCI_841/participant_metadata.csv")

# Merge metadata onto every coded row. pseudonym is dropped from the metadata
# side because codings already carries it; otherwise the join would produce
# duplicate pseudonym.x / pseudonym.y columns
codings <- codings |>
  left_join(select(participants, -pseudonym), by = "participant_id")
glimpse(codings)
What success looks like: The codings tibble has one row per highlighted passage, columns for the code (tag), the passage text (content), and the participant-level variables (age, gender, life_stage, caregiver, immigrant) needed for subgroup analysis.
4.2 Step 2 — Reshape Into a Codes × Cases Matrix
The codes-as-variables move is now a literal data-shape operation: convert the long-format export into a wide-format matrix where rows are participants and columns are codes. The cell values are typically binary (1 if the code applied to that participant at all, 0 otherwise) but can be counts (the number of times the code applied within the transcript).
# Binary indicator matrix: 1 if code appeared in transcript, 0 otherwise
# Note: a transcript with no coded passages has no rows in the export,
# so it will not appear in either matrix
code_matrix_binary <- codings |>
  distinct(participant_id, tag) |>
  mutate(present = 1) |>
  pivot_wider(
    names_from = tag,
    values_from = present,
    values_fill = 0
  )

# Count matrix: number of times each code appeared in each transcript
code_matrix_counts <- codings |>
  count(participant_id, tag) |>
  pivot_wider(
    names_from = tag,
    values_from = n,
    values_fill = 0
  )

# Add the participant-level variables for subgroup analysis
code_matrix_binary <- code_matrix_binary |>
  left_join(participants, by = "participant_id")
print(code_matrix_binary)
What this gives you: A 20 × (7 + k) tibble (or 8–12 × (7 + k) for the Week 8 milestone), where each row is a participant, each of the first 7 code columns is a binary or count variable, and the remaining columns are participant attributes (age, gender, life_stage, etc.). This is the analytic-ready frame for everything that follows.
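A quick way to sanity-check the reshaped matrices is to rebuild them independently with base R's `table()`. The sketch below does this on a small invented long-format table; the participant IDs and shortened tag names are hypothetical stand-ins for a real Taguette export.

```r
# Toy long-format codings: one row per (passage, code) pair (hypothetical data)
toy_codings <- data.frame(
  participant_id = c("P01", "P01", "P01", "P02", "P03", "P03"),
  tag = c("shame", "shame", "pets", "pets", "shame", "technology")
)

# Count matrix: how many passages per participant received each code
count_matrix <- unclass(table(toy_codings$participant_id, toy_codings$tag))

# Binary matrix: 1 if the code appears at all in the transcript, 0 otherwise
binary_matrix <- (count_matrix > 0) * 1

count_matrix
binary_matrix
colSums(binary_matrix)  # transcripts per code: pets 2, shame 2, technology 1
```

If the column sums from this cross-check disagree with the frequency table built from the pivoted tibble, the usual culprits are inconsistent tag spellings or participant IDs that failed to parse from the filenames.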
4.3 Step 3 — Frequency Tables With dplyr
The overall and by-subgroup frequency tables in Section 3 are produced in a few dplyr calls.
# The code columns are the distinct tags in the export
code_cols <- unique(codings$tag)

# Overall code frequency (n transcripts in which each code appeared)
code_freq_overall <- code_matrix_binary |>
  summarise(across(all_of(code_cols), sum)) |>
  pivot_longer(everything(), names_to = "code", values_to = "n_transcripts") |>
  mutate(pct = round(100 * n_transcripts / nrow(code_matrix_binary), 1)) |>
  arrange(desc(n_transcripts))
print(code_freq_overall)

# By caregiver status: number of transcripts with each code, per subgroup
code_freq_by_caregiver <- codings |>
  distinct(participant_id, tag, caregiver) |>
  count(tag, caregiver) |>
  pivot_wider(names_from = caregiver, values_from = n, values_fill = 0)
print(code_freq_by_caregiver)

# By age band
code_matrix_binary <- code_matrix_binary |>
  mutate(age_band = case_when(
    age < 30 ~ "18-29",
    age < 50 ~ "30-49",
    age < 65 ~ "50-64",
    TRUE ~ "65+"
  ))
# Sum every code column (not only those whose name starts with "loneliness-")
code_by_age <- code_matrix_binary |>
  group_by(age_band) |>
  summarise(across(all_of(code_cols), sum), n_in_band = n())
print(code_by_age)
Worked-example finding: If you replicate the Section 3 caregiver split on your own data, you will likely find that codes carrying a "moral demand" valence (shame, identity, role-protection) cluster in the caregiver subgroup, while codes carrying a "world-shrinkage" valence (fading, isolation, sensory loss) cluster in the older non-caregiver subgroup. The exact numbers will vary depending on which 8–12 transcripts you coded.
4.4 Step 4 — Visualise With ggplot2
A bar chart of code frequencies by subgroup is the standard published display for a content-analytic study. ggplot2 produces a publication-quality version in a few lines.
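library(ggplot2)

# Long-format frequency table for plotting
# Note: n_in_group counts everyone in the metadata table; if you coded only a
# subset of transcripts, filter participants to that subset first
plot_data <- codings |>
  distinct(participant_id, tag, caregiver) |>
  count(tag, caregiver) |>
  # keep a zero row when a code never appears in one subgroup
  complete(tag, caregiver, fill = list(n = 0)) |>
  left_join(
    participants |> count(caregiver, name = "n_in_group"),
    by = "caregiver"
  ) |>
  mutate(pct = 100 * n / n_in_group)

code_plot <- ggplot(plot_data, aes(x = reorder(tag, pct), y = pct, fill = caregiver)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("caregiver" = "#CC0033", "non-caregiver" = "#0B7B6B")) +
  labs(
    x = NULL,
    y = "% of subgroup with code present",
    fill = "Caregiver status",
    title = "Code prevalence by caregiver status (loneliness corpus, n=20)"
  ) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.y = element_blank())
print(code_plot)

# Make sure the output folder exists before saving
dir.create("figures", showWarnings = FALSE)
ggsave("figures/code_prevalence_by_caregiver.png", plot = code_plot,
       width = 8, height = 5, dpi = 300)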
What success looks like: A horizontal bar chart with codes on the y-axis (longest bars at the top), percentage on the x-axis, two bars per code (red for caregiver, teal for non-caregiver). Codes for which the two bars differ markedly are the ones worth subsequent chi-squared testing.
4.5 Step 5 — Chi-Squared and Fisher's Exact Tests
The inferential half of the analysis. For each code that showed a meaningful descriptive difference between subgroups in step 4, run the appropriate contingency-table test.
# For a single code: the shame-prevents-disclosure example from Section 3
# Build the 2 x 2 contingency table
shame_table <- table(
  caregiver = code_matrix_binary$caregiver,
  shame = code_matrix_binary$`shame-prevents-disclosure`
)
print(shame_table)

# Chi-squared with Yates' continuity correction (the default for 2 x 2)
chisq.test(shame_table)
# Uncorrected Pearson chi-squared, the version reported in Section 3.3
chisq.test(shame_table, correct = FALSE)

# Fisher's exact: the more defensible choice for small expected counts
fisher.test(shame_table)

# Loop across all codes to produce a table of p-values
code_names <- setdiff(colnames(code_matrix_binary),
                      c("participant_id", "age", "gender", "life_stage",
                        "caregiver", "immigrant", "pseudonym", "age_band"))
test_results <- map_dfr(code_names, function(cd) {
  # Fix the factor levels so every table is 2 x 2, even when a code is
  # present in all (or none) of the transcripts
  tbl <- table(
    factor(code_matrix_binary$caregiver, levels = c("caregiver", "non-caregiver")),
    factor(code_matrix_binary[[cd]], levels = c(0, 1))
  )
  ft <- fisher.test(tbl)
  tibble(
    code = cd,
    n_caregiver_pos = tbl["caregiver", "1"],
    n_noncaregiver_pos = tbl["non-caregiver", "1"],
    odds_ratio = unname(ft$estimate),
    p_value = ft$p.value
  )
}) |> arrange(p_value)
print(test_results)
What to report: The test_results tibble is what goes in your methods/findings tables. Lead with the codes that show the largest distributional differences (lowest p-values); report the odds ratio alongside the p-value because p-values from small samples are misleading on their own; flag any codes for which expected cell counts fall below 5 by noting that Fisher's exact was used.
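To decide which codes need the small-cell flag, the minimum expected cell count can be computed for each code's 2 × 2 table. The base-R sketch below does this for the seven codes in the Section 3.2 cross-tabulation; the counts are copied from that table, while the helper name `min_expected_count` is a hypothetical one introduced for illustration.

```r
# Transcripts with each code present, copied from the Section 3.2 table
with_code <- data.frame(
  code = c("loneliness-as-residue-of-marriage",
           "cultural-untranslatability-of-loneliness",
           "coping-by-pet-companionship",
           "identity-of-being-lonely-person",
           "technology-as-double-edged",
           "fading-at-the-edges",
           "shame-prevents-disclosure"),
  caregivers = c(2, 2, 3, 6, 7, 1, 9),
  non_caregivers = c(4, 2, 6, 5, 7, 4, 4)
)
group_size <- 10  # 10 caregivers and 10 non-caregivers

# Smallest expected count in the 2 x 2 table (group x code present/absent)
min_expected_count <- function(a, b, n_group) {
  observed <- matrix(c(a, n_group - a,
                       b, n_group - b), nrow = 2, byrow = TRUE)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  min(expected)
}

with_code$min_expected <- mapply(min_expected_count,
                                 with_code$caregivers,
                                 with_code$non_caregivers,
                                 MoreArgs = list(n_group = group_size))
with_code$use_fisher <- with_code$min_expected < 5
with_code[, c("code", "min_expected", "use_fisher")]
```

In this worked corpus every one of the seven codes has at least one expected count below 5, which is consistent with the advice to lead with Fisher's exact test in a 20-transcript study.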
4.6 Step 6 — Keyness Analysis With quanteda.textstats
Keyness is the manifest-content cousin of the chi-squared analysis above. It asks: which words appear disproportionately in one subcorpus compared to another? The classical implementation is the log-likelihood ratio test on word frequencies between two subcorpora; quanteda.textstats::textstat_keyness() is the R implementation.
For the loneliness dataset, the natural keyness contrast is one of the subgroup contrasts you have already been working with: caregiver vs. non-caregiver, immigrant vs. non-immigrant, or under-40 vs. over-65. The output is a ranked list of words, each with a chi-squared (or log-likelihood) statistic and a sign indicating which subcorpus the word over-occurs in.
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(readtext)  # readtext() reads the transcript files
library(ggplot2)   # ggsave() writes the figure

# Build the quanteda corpus from the transcripts (as in Module 5)
loneliness_rt <- readtext("term projects/HSCI_841/transcripts/P*.txt",
                          docvarsfrom = "filenames",
                          docvarnames = c("participant_id", "pseudonym"),
                          dvsep = "_")
loneliness_corpus <- corpus(loneliness_rt)

# Attach caregiver status as a document variable
docvars(loneliness_corpus, "caregiver") <- participants$caregiver[
  match(docvars(loneliness_corpus, "participant_id"),
        participants$participant_id)
]

# Tokenise, lowercase, remove stopwords
loneliness_tokens <- tokens(loneliness_corpus,
                            remove_punct = TRUE,
                            remove_numbers = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en"))

# Build the document-feature matrix and group by caregiver status
loneliness_dfm <- dfm(loneliness_tokens) |>
  dfm_group(groups = docvars(loneliness_corpus, "caregiver"))

# Keyness test: caregiver subcorpus as target, non-caregiver as reference
keyness <- textstat_keyness(loneliness_dfm,
                            target = "caregiver",
                            measure = "lr")  # log-likelihood ratio
print(head(keyness, 30))

# Plot top 20 keywords on each side; keep the plot object so that
# ggsave() writes this figure rather than an earlier one
keyness_plot <- textplot_keyness(keyness, n = 20)
print(keyness_plot)
dir.create("figures", showWarnings = FALSE)
ggsave("figures/keyness_caregiver_vs_noncaregiver.png", plot = keyness_plot,
       width = 8, height = 6, dpi = 300)
What to expect: Words like kids, mom, dad, husband, hospital, appointment, exhausted will load on the caregiver side; words like chair, walker, eyes, fading, alone, quiet are likely to load on the non-caregiver (older, more isolated) side. The keyness analysis is manifest content analysis at scale; combined with your latent codebook results, the two halves triangulate to support stronger claims than either alone.
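To see what a keyness score measures, the log-likelihood ratio (G²) for a single word can be computed by hand from four numbers: the word's count and the total token count in each subcorpus. The base-R sketch below uses invented counts, and the function name `keyness_g2` is hypothetical. Values from `textstat_keyness()` may differ slightly depending on its settings.

```r
# Signed log-likelihood ratio (G2) for one word across two subcorpora.
# Positive = over-represented in the target; negative = in the reference.
keyness_g2 <- function(target_count, target_total,
                       reference_count, reference_total) {
  observed <- matrix(c(target_count, target_total - target_count,
                       reference_count, reference_total - reference_count),
                     nrow = 2, byrow = TRUE)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # cells with an observed count of 0 contribute nothing to the sum
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  g2 <- 2 * sum(terms)
  direction <- if (target_count / target_total >=
                   reference_count / reference_total) 1 else -1
  direction * g2
}

# Hypothetical counts: a word used 30 times in 1,000 caregiver tokens
# and 10 times in 1,000 non-caregiver tokens
g2_over <- keyness_g2(30, 1000, 10, 1000)
g2_under <- keyness_g2(10, 1000, 30, 1000)
g2_equal <- keyness_g2(10, 1000, 10, 1000)
c(over = g2_over, under = g2_under, equal = g2_equal)
```

Because the statistic grows with corpus size as well as with the size of the difference, a large G² in a big corpus can reflect a small difference in rate. Reading the top keywords in context remains essential.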
4.7 The Week 8 Capstone Milestone
The Week 8 capstone milestone is a content-analytic frequency table from a coded subset, with at least one defensible quantitative comparison. You will deliver four artefacts: the coded transcripts (Taguette export), the frequency table (as an Excel or CSV file), the R script that produces the chi-squared / Fisher's exact comparison, and a 500-word interpretive memo that places the quantitative findings back into a qualitative reading.
Reflection
Before you run a single line of the R workflow, declare in writing what comparison you intend to test. Which subgroup contrast (caregiver/non-caregiver, immigrant/native-born, older/younger, women/men)? Which code do you predict will differ most? And what is the substantive reasoning behind your prediction?
Question 1: In the R workflow, what is the purpose of the pivot_wider() step on the Taguette export?
pivot_wider() reshapes it into the codes × cases matrix that the rest of the analysis requires: the data-shape operationalisation of the codes-as-variables move.
Question 2: The textstat_keyness() function in quanteda.textstats answers what question?
textstat_keyness() identifies the words that distinguish one subcorpus from another, with a log-likelihood or chi-squared statistic for each word. It is the manifest companion to the latent code-level analysis.
Question 3: The Week 8 capstone milestone requires:
Final Assessment
Bringing It All Together
Lesson 8 has done two related things. First, it has located content analysis historically — from Lasswell's WWII propaganda studies through Berelson's 1952 codification to Krippendorff's contemporary synthesis — and clarified what makes it methodologically distinctive. Content analysis is the qualitative-quantitative bridge: it begins with the same kind of qualitative judgement that drives thematic analysis (Module 5) but turns the resulting codes into variables and applies the inferential machinery of HSCI 230 and 410. The single operational move — treating codes as variables — is what lets a 20-transcript qualitative corpus answer questions that thematic analysis alone cannot: how often, how distributed, larger than chance.
Second, the lesson has given you the operational machinery to do content analysis on your own capstone data. The Module 5 codebook is the input; the Taguette export is the bridge; the R workflow turns long-format coding data into wide-format codes × cases matrices, computes descriptive and inferential statistics, and produces publication-quality visualisations. The Week 8 milestone — a coded subset, a frequency table, a chi-squared or Fisher's exact comparison, and a 500-word interpretive memo — is your first piece of hybrid quantitative-qualitative analysis in this course.
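As a compact recap of that pipeline, here is a minimal sketch on toy data. It assumes the dplyr and tidyr packages are installed; the column names (document, tag), the code labels, and the caregiver lookup table are hypothetical stand-ins rather than the actual Taguette export layout, so adjust them to match your own files.

```r
library(dplyr)
library(tidyr)

# Toy long-format coding data, one row per coded highlight.
# The column names (document, tag) are assumptions; match them to your export.
highlights <- data.frame(
  document = c("P01", "P01", "P02", "P03", "P03", "P04"),
  tag      = c("isolation", "burden", "burden", "isolation", "isolation", "burden")
)

# Hypothetical case-level attribute table; P05 has no coded highlights.
cases <- data.frame(
  document  = c("P01", "P02", "P03", "P04", "P05"),
  caregiver = c("yes", "yes", "no", "no", "no")
)

# Long to wide: one row per case, one 0/1 column per code.
wide <- highlights %>%
  distinct(document, tag) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = tag, values_from = present, values_fill = 0L)

# Join to the full case list so cases with no codes are kept as zeros.
codes_by_case <- cases %>%
  left_join(wide, by = "document") %>%
  mutate(across(-c(document, caregiver), ~ coalesce(.x, 0L)))

# Code-by-subgroup contingency table and Fisher's exact test.
burden_tab <- table(
  caregiver = codes_by_case$caregiver,
  burden    = codes_by_case$burden
)
fisher.test(burden_tab)
```

Joining back to the full case list matters: a transcript in which a code never appears produces no rows in the long-format data, and dropping it would silently inflate the code's apparent prevalence.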
What you take away from this lesson sets up Lesson 9 (Schema and Narrative Analysis), which moves in a different direction: not counting codes across many transcripts but tracing the cognitive structures inside individual transcripts. Lesson 10 (Discourse Analysis) deepens the attention to language and power. Lesson 11 (Analytic Induction, QCA, and Decision Models) returns to systematic comparison with a different toolkit. Lesson 12 (Computational Text and LLM Analysis) extends today's dictionary-based and keyness moves to the scale of millions of documents.
Key Takeaways from Lesson 8
- Content analysis has a 90-year history: Lasswell (1927, 1942) established it as a method for propaganda analysis; Berelson (1952) codified it as the systematic, quantitative description of manifest content; Krippendorff (1980–2018) modernised it to accommodate latent meaning, inference, and rigorous reliability statistics.
- Manifest content is what the text literally says; latent content is what it means. Most contemporary content analysis blends both, with the analyst declaring which is which and demonstrating reliability for the latent codes.
- The codes-as-variables move makes content analysis a hybrid method: qualitative judgement defines the codes; quantitative procedures analyse them. Both halves are necessary; neither is optional.
- Three unit-types structure a content-analytic design: sampling units (what is drawn from the population), recording units (where the variable takes its value), and context units (the surrounding text consulted during coding).
- Reliability is load-bearing: Krippendorff's alpha is the field standard; α ≥ 0.80 supports definitive claims, 0.67–0.80 is acceptable for tentative inference, and below 0.67 the codebook must be revised.
- Inferential tests on codes × subgroup matrices (chi-squared and Fisher's exact) are legitimate when the question is whether a within-corpus distributional difference is larger than chance, not when the goal is to generalise to a population.
- Dictionary-based content analysis (LIWC, sentiment dictionaries) and computational content analysis (Module 12) are scalable extensions of the same logic, valuable as complements to human coding rather than as replacements.
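To make the reliability takeaway concrete, the sketch below computes Krippendorff's alpha in base R for the simplest case only: one nominal code, two coders, no missing values. The function names and the toy coding vectors are hypothetical and written for illustration; for real reliability subsets (more coders, missing data, ordinal codes) a maintained package is the safer route.

```r
# Krippendorff's alpha for nominal codes, two coders, no missing values.
kripp_alpha_nominal <- function(coder1, coder2) {
  stopifnot(length(coder1) == length(coder2))
  cats <- sort(unique(c(coder1, coder2)))
  tab  <- unclass(table(factor(coder1, levels = cats),
                        factor(coder2, levels = cats)))
  # Coincidence matrix: each unit contributes both ordered value pairs.
  o   <- tab + t(tab)
  n_c <- rowSums(o)   # how often each category was used across both coders
  n   <- sum(o)       # total number of pairable values (2 x units)
  d_obs <- (n - sum(diag(o))) / n               # observed disagreement
  d_exp <- (n^2 - sum(n_c^2)) / (n * (n - 1))   # disagreement expected by chance
  1 - d_obs / d_exp
}

# Map alpha onto the thresholds stated in this lesson.
interpret_alpha <- function(alpha) {
  if (alpha >= 0.80) {
    "definitive claims"
  } else if (alpha >= 0.67) {
    "tentative inference"
  } else {
    "revise the codebook"
  }
}

# Toy data: ten segments, code applied (1) or not (0); one disagreement.
coder1 <- c(1, 1, 0, 0, 1, 0, 1, 1, 0, 0)
coder2 <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0)

alpha <- kripp_alpha_nominal(coder1, coder2)
alpha                   # 1 - 19/99, about 0.81
interpret_alpha(alpha)
```

Note that nine agreements out of ten is 90% raw agreement, yet alpha is only about 0.81, because alpha discounts the agreement two coders would reach by chance. If both coders use a single category throughout, the expected disagreement is zero and alpha is undefined.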
Core Concepts Reviewed
Section 1: The history of content analysis from Lasswell through Berelson to Krippendorff and the contemporary synthesis; the manifest/latent distinction and why contemporary content analysis blends both; the codes-as-variables operational move; when content analysis is the right tool.
Section 2: Sampling, recording, and context units; exhaustive, mutually exclusive, and multi-coded schemes; a priori vs. emergent codes; Krippendorff's alpha as the reliability standard with its 0.80 / 0.67 thresholds; the end-to-end workflow.
Section 3: Frequency tables overall and by subgroup; chi-squared and Fisher's exact tests on codes × subgroup contingency tables; trend analysis; dictionary-based content analysis (LIWC, sentiment); preview of computational content analysis.
Section 4: Loading Taguette exports into R; reshaping into codes × cases matrices with pivot_wider(); descriptive frequency tables with dplyr; ggplot2 visualisation; chi-squared / Fisher's exact in R; quanteda.textstats::textstat_keyness(); the Week 8 capstone milestone.
The final reflection below asks you to step out of method-mode and articulate where content analysis fits in your own methodological identity. There is no single right answer; the goal is to leave the lesson with a defensible stance on when this hybrid method belongs in your toolkit.
Final Reflection
Content analysis sits on the boundary between qualitative and quantitative research. Some methodological writers (especially in nursing and health-services research) treat it as a qualitative method that uses some counting; others (especially in communications and computational social science) treat it as a quantitative method applied to text. Where do you locate it — and where will you locate yourself, methodologically, after Module 8?
Question 1: Which figure is most associated with the development and naming of Krippendorff's alpha, the field-standard reliability statistic for content analysis?
Question 2: The single operational move that makes content analysis a hybrid quantitative/qualitative method is:
Question 3: Berelson's 1952 definition restricted content analysis to:
Question 4: In Krippendorff's terminology, the context unit is:
Question 5: A code shows Krippendorff's α = 0.74 in a reliability subset. What is the appropriate response?
Question 6: A code appears in 9 of 10 caregivers and 4 of 10 non-caregivers. Which test is the most defensible inferential choice?
Question 7: Which approach to building a content-analytic codebook starts with codes drawn from prior theory or an existing instrument and applies them deductively to the data?
Question 8: The Linguistic Inquiry and Word Count (LIWC) dictionary is best understood as an example of:
Question 9: The Bernard/Wutich/Ryan stance on the qualitative/quantitative boundary in content analysis is best summarised as:
Question 10: The R function textstat_keyness() in the quanteda.textstats package answers the question:
Question 11: A research question is "do younger participants and older participants describe loneliness differently?" Why is content analysis the right tool for this question?
Question 12: Which of the following is NOT one of the three unit-types Krippendorff distinguishes?
Question 13: The Week 8 capstone deliverable includes:
Question 14: In the R workflow, the long-format Taguette export is reshaped into a wide-format codes × cases matrix using:
pivot_wider() is the data-shape operation that operationalises the codes-as-variables move: codes become columns, cases become rows.
Question 15: Which of the following BEST describes the relationship between content analysis and the methods you will study in Module 12 (computational text and LLM analysis)?
Glossary — Key Terms, People & Methodological Stances
This glossary collects the key concepts, people, and methodological stances introduced in Lesson 8. Use it as a reference while you work through the material, or as a review before the final assessment.
quanteda.textstats::textstat_keyness().