Content Analysis

Qualitative Research Methods & Analysis in Public Health

Learning objectives for this lesson:

Trace content analysis from Lasswell's WWII propaganda studies through Berelson's 1952 codification to Krippendorff's contemporary synthesis
Distinguish manifest from latent content and explain why most contemporary content analysis blends both
Treat codes as variables, the operational move that turns content analysis into a hybrid quantitative/qualitative method
Sample text systematically: choose sampling units, recording units, and context units that match the research question
Develop a content-analytic coding scheme that is exhaustive, defensibly exclusive (or explicitly multi-coded), and reliably applied
Compute Krippendorff's alpha as the standard reliability statistic and interpret its acceptable thresholds
Test hypotheses on content-analytic data using chi-squared, distributional comparisons, and trend analysis
Apply dictionary-based and computational content analysis to the loneliness corpus and complete the capstone milestone

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University based on Bernard, H. R., Wutich, A., & Ryan, G. W. (2017). Analyzing Qualitative Data: Systematic Approaches (2nd ed.). SAGE. This lesson covers Chapter 11 (pp. 243–268).

Section 1 of 5

What Content Analysis Is: History, Manifest/Latent Content, and Codes as Variables

⏱ Estimated reading time: 30 minutes

What Content Analysis Is

History, manifest vs. latent content, and codes as variables.

Origins

Lasswell and the propaganda studies

Harold Lasswell's wartime work at the Library of Congress operationalised content analysis: counting symbols, tracing their distribution, and drawing inferences about communicative intent.

His framing question still organises the field:

Who says what, to whom, in which channel, with what effect?

Harold Lasswell (1902–1978), political scientist and communication theorist. Public domain, via Wikimedia Commons.

Codification to synthesis

Berelson (1952) to Krippendorff (2018)

Berelson, 1952

First textbook definition: objective, systematic, quantitative description of manifest content. Deliberately restricted to the literal surface.

Krippendorff, 1980–2018

Expanded to latent meaning; formalised reliability requirements; developed the alpha statistic. Now the canonical reference.

Bernard, Wutich, and Ryan (2017) position content analysis as part of a continuum of text-analytic methods, which is the framing this course uses.

The key distinction

Manifest vs. latent content

Contemporary content analysis blends both. The analyst declares which is which and demonstrates reliability for latent codes.

The defining move

Codes as variables

Thematic analysis

A code is a label attached to a passage. Its job is to organise interpretation for one study.

Content analysis

A code is a variable that takes a value (0/1, or a count) for every recording unit. It is a column in a data matrix.

Once codes are variables, frequency tables, cross-tabulations, and chi-squared tests follow naturally. Both the qualitative and quantitative halves are load-bearing.

Carry forward

When content analysis is the right choice

Your question asks how often, or whether subgroups differ in what they say.
The codes-as-variables move is what separates content analysis from thematic work.
Both qualitative and quantitative halves are necessary; neither is decoration.

A later section covers the design decisions that make a content analysis defensible.

Introduction and Overview

Content analysis is the oldest systematic method for analysing text in the social sciences, and it is also the bridge between the qualitative and quantitative traditions of this course. Where the thematic analysis you learned earlier in the course stops once you have a defensible set of themes, content analysis keeps going: it counts the themes, distributes the counts across subgroups, and tests whether the differences are larger than chance. The method is qualitative in its first move (identifying what the relevant categories are) and quantitative in its second (counting their occurrence). It is the technique you reach for when your research question asks both "what kinds of loneliness do people describe?" and "how often, and does that distribution differ by age, gender, or caregiver status?"

This section traces how content analysis became the workhorse it is today. We start with Harold Lasswell's WWII analysis of Axis propaganda, watch Bernard Berelson codify the method in 1952, follow Klaus Krippendorff's modernisation into the canonical 2018 monograph, and end with the contemporary synthesis offered by Bernard, Wutich, and Ryan (2017, Ch. 11). Along the way we will pull apart the distinction between manifest and latent content (the line that separates "what the text literally says" from "what it means"), and we will explain the single operational move that turns content analysis into the hybrid method it is: treating codes as variables. By the end of the section you will be able to say what content analysis is and why it is methodologically different from the thematic work you have already done.

Learning Objectives for this section

Locate content analysis historically: from Lasswell (1927, 1942) and Berelson (1952) to Krippendorff (2018) and the contemporary synthesis.
Distinguish manifest from latent content and recognise that most contemporary content analysis blends both.
Explain the operational move that makes content analysis a hybrid method: codes as variables.
Identify the kinds of public-health research questions for which content analysis is the right tool.

1.1 A Short History: From Lasswell to Krippendorff

Harold Lasswell’s wartime propaganda analysis at the Library of Congress operationalised content analysis around a single question: who says what, to whom, in which channel, with what effect? It was the first attempt to quantify symbolic content systematically, and many of today’s content-analysis categories descend directly from his coding sheets.

Bernard Berelson’s 1952 textbook defined content analysis as "the objective, systematic, and quantitative description of the manifest content of communication." This narrow definition dominated for two decades and is still the working definition in much applied communications research.

Klaus Krippendorff’s textbook (first edition 1980) broadened the field to include latent meaning and explicitly engaged the reliability and validity questions that remain central. Krippendorff’s alpha statistic, developed across multiple editions, is now the standard inter-coder reliability measure for content analysis.

Content analysis today spans manual coding, computational text analysis, supervised machine learning, and now LLM-assisted approaches. All trace back to the Lasswell-Berelson-Krippendorff lineage. The methods have changed; the underlying logic (turning text into countable, comparable data) has not.

The modern history of content analysis begins with Harold D. Lasswell, the political scientist who in 1927 published Propaganda Technique in the World War, the first systematic effort to study communication content as a window onto political intent. Lasswell's question was not "what does this propaganda mean to its readers?" but "what categories of appeal does it use, and in what proportion?" The methodological commitment was to systematic counting of explicit features (mentions of leaders, mentions of enemies, deployment of symbols) rather than impressionistic close reading. The pay-off was that two analysts working independently could produce comparable numbers.

During the Second World War, Lasswell led the Experimental Division for the Study of War-Time Communications at the U.S. Library of Congress. The Division analysed German, Italian, and Japanese propaganda in volumes that no individual reader could have made sense of impressionistically. By systematically coding explicit features (the frequency with which Axis broadcasts mentioned specific Allied generals, the proportion of broadcast time devoted to economic versus military themes, the rise and fall of named enemies week by week), the Division produced intelligence assessments that helped guide both counter-propaganda efforts and broader policy decisions. The work was content analysis as inference about a producer: from the text, back to the propagandist's strategic state of mind. The method got its first major public-policy validation in this period (Lasswell, Lerner, & Pool, 1952).

The post-war codification belongs to Bernard Berelson. His Content Analysis in Communication Research (1952) gave the field its first textbook and its most-cited definition: content analysis is "a research technique for the objective, systematic, and quantitative description of the manifest content of communication" (Berelson, 1952, p. 18). Each of the four adjectives in that definition matters. Objective meant that the analysis was repeatable by other analysts. Systematic meant that the same procedure was applied across the entire sample. Quantitative meant counting, not impression. Manifest meant the literal surface of the text (what was said) rather than what the analyst imagined the author meant. Berelson's stance was, in effect, the high-modernist version of content analysis: positivist, quantitative, and resolutely uninterested in latent meaning.

The next major recodification came from Klaus Krippendorff, whose Content Analysis: An Introduction to Its Methodology (1980, 2004, 2013, and now in its 4th edition, 2018) brought the method into the contemporary social-science mainstream. Krippendorff's most important moves were two. First, he redefined content analysis as "a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use" (Krippendorff, 2018, p. 24), broadening Berelson's "objective description" to "inference about context," which made room for latent content. Second, he gave the field its statistical centre of gravity by developing Krippendorff's alpha, the reliability statistic that has since become the methodological standard for content analysis (we will use it in a later section). Krippendorff's text remains the field's most-cited methodological reference and the standard against which contemporary content-analytic studies are judged.

The synthesis you are reading in this course comes from Bernard, Wutich, and Ryan (2017), who present content analysis as part of a continuum of text-analytic methods rather than as a standalone enterprise. Their stance is that the strict separation Berelson drew between qualitative and quantitative analysis was always more rhetorical than real, and that contemporary content analysis is best understood as a method that blends the two: qualitative judgement determines the coding scheme; quantitative procedures determine how the codes are distributed and whether the differences are reliable. This is the position that Hsieh and Shannon (2005) articulated influentially in the health-research literature when they distinguished conventional, directed, and summative content analysis, a typology that has since structured most qualitative-content-analytic work in nursing, health services research, and public health.

Tradition	Approximate dates	Defining commitment	What you would publish
Lasswell propaganda analysis	1927–1948	Systematic counting of explicit features to infer producer intent	Frequency tables of symbols, names, and themes, usually in classified documents
Berelson classical content analysis	1952–1970s	Objective, systematic, quantitative description of manifest content	Tables of code frequencies, with explicit coding rules and percent-agreement reliability
Krippendorff contemporary content analysis	1980–present	Replicable and valid inference from text to context; explicit room for latent content; rigorous reliability statistics	Code distributions, inferential tests, Krippendorff's alpha as the reliability statistic, explicit sampling logic
Qualitative content analysis (Hsieh & Shannon, 2005)	2000s–present	Conventional / directed / summative variants; categories may emerge from data, be imposed from theory, or be derived from word counts	Coded extracts plus a frequency table; mixed-method publications in health journals
Bernard/Wutich/Ryan synthesis (this course)	2017	Content analysis as a hybrid method that explicitly turns codes into variables and analyses them with descriptive and inferential statistics	A coded corpus, a frequency-by-subgroup table, a defensible inferential test, and an interpretive narrative

Why the history matters for your capstone

When you write your eventual capstone methods section, you will need to position your content-analytic work in this lineage. Most contemporary health-research applications cite Krippendorff (2018) for the reliability machinery, Hsieh & Shannon (2005) for the typology, and Bernard, Wutich, and Ryan (2017) or Schreier (2012) for the practical workflow. Knowing where each one sits in the genealogy lets you make defensible methodological choices: for example, whether to treat your codes as mutually exclusive (Berelson) or to allow multi-coding (Krippendorff/Schreier).

1.2 Manifest Versus Latent Content

Key insight - Manifest vs latent is a continuum, not a binary

The classical distinction holds that manifest content is what is literally on the page ('the word vaccination appears') and latent content is what is implied or interpreted ('this passage frames vaccination as a moral obligation'). In practice, every coding decision involves some interpretation: even counting the word 'vaccine' requires deciding whether 'vaccinated' counts. The more latent the code, the more important reliability evidence and explicit decision rules become. A defensible content analysis names where its codes sit on the manifest-to-latent continuum.

A frequently asked methodological question about content analysis is: are you coding what the text literally says, or what it means? Bernard, Wutich, and Ryan's answer, and the answer of the modern field, is that the choice is yours, but it must be explicit.

Manifest content is the literal, surface-level content of the text. If your code is "mentions of pets," a manifest coder counts every appearance of the words pet, dog, cat, Rufus, Marie's cat, parrot, and so on. The advantage is that two manifest coders working from a clear word-list will produce nearly identical counts; the disadvantage is that manifest content misses everything the participant is doing with the text other than literal naming.

Latent content is the underlying meaning that requires interpretation to surface. If your code is "loneliness framed as the cost of having loved," a latent coder reads a passage and decides whether the participant is articulating the idea, regardless of whether the specific words "cost" or "love" appear. Linda's account of Bill's empty chair (P05) is a latent expression of "loneliness-as-residue-of-marriage": she does not use that phrase, but the meaning is unmistakable to a competent reader. Latent coding captures more of the meaning but is harder to do reliably.

Berelson's classical definition (1952) restricted content analysis to manifest content. Krippendorff and the subsequent generation rejected the restriction, on the grounds that inference about context, the contemporary purpose of content analysis, almost always requires reading for latent meaning. The modern compromise is that content analysis routinely codes both, but the analyst must declare which is which and must demonstrate that latent codes can be applied reliably.

Feature	Manifest coding	Latent coding
Unit examined	Words, phrases, named entities	Passages, propositions, implicit meanings
Decision procedure	Match against a dictionary or word-list	Read for meaning; apply rule from codebook
Reliability difficulty	Lower: identical word lists give identical counts	Higher: requires training and an explicit codebook
Interpretive depth	Shallow but defensible	Deeper but contestable
Typical reliability target (Krippendorff's alpha)	α ≥ 0.80 readily achievable	α 0.67–0.80 acceptable for tentative inference; α ≥ 0.80 for definitive claims

A worked example from the loneliness corpus

Consider the question: how often do participants describe loneliness in terms that involve a piece of household furniture?

Manifest version: Count every occurrence of the words chair, couch, sofa, bed, table across all 20 transcripts. Linda mentions "chair" 8 times in P05; Helen mentions "chair" 3 times in P11. Manifest count: straightforward, replicable, and largely uninformative on its own.

Latent version: Code any passage in which a piece of furniture stands in for an absent person or a lost role. Linda (P05) describes Bill's chair as the empty space he used to fill; that is a clear latent instance. Helen (P11) describes her armchair as the place where she does not have anyone to read with; this is also latent, but with a different propositional structure (her loss is structural, not bereavement-specific). Linda's mention of the kitchen table where she ate dinner with Bill is latent too, and carries the same theme. The latent code furniture-as-trace-of-absent-other may apply to 9 transcripts (vs. 11 transcripts for the manifest word-list), but the latent code is doing meaningful analytic work while the manifest word-list is not.

In your capstone, you will frequently want both: the manifest word-list as a fast first pass, and the latent code as the substantive analytic move.
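The manifest first pass can be sketched in a few lines of Python. This is a minimal illustration, not the course's own tooling: the function name `manifest_counts`, the word-list, and the two excerpts are all invented for the example (they are not quotations from the corpus). Note that exact whole-word matching does not count "armchair" or "chairs" as "chair", which is exactly the kind of decision a manifest codebook has to make explicit.

```python
import re
from collections import Counter

# Hypothetical word-list for the furniture example; a real study would
# document every inclusion decision (plurals, compounds such as "armchair").
FURNITURE_WORDS = {"chair", "couch", "sofa", "bed", "table"}


def manifest_counts(text, word_list):
    """Count whole-word, case-insensitive matches for each word in word_list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token in word_list)
    return {word: counts.get(word, 0) for word in sorted(word_list)}


# Invented excerpts for illustration only.
transcripts = {
    "P05": "His chair is still by the window. I dust the chair every week.",
    "P11": "I read in my armchair, alone, then go to bed.",
}

for pid, text in transcripts.items():
    print(pid, manifest_counts(text, FURNITURE_WORDS))
```

In this sketch the second excerpt scores zero for "chair" even though a chair is plainly present, which shows why the manifest count is a first pass and not the analysis itself.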

1.3 Codes as Variables: The Operational Move

Here is the single move that makes content analysis what it is. In thematic analysis (an earlier module), a code is a label you attach to a passage; the code's job is to organise interpretation. In content analysis, the same code becomes a variable: a column in a data frame, with a value for every unit in the corpus. Once codes are variables, the entire apparatus of descriptive and inferential statistics is available.

The reframing is subtle but transformative. Consider the code loneliness-as-residue-of-marriage. In thematic analysis, the code is a finding: "this is one of the kinds of loneliness participants describe." In content analysis, the code becomes a variable that takes the value 1 for every transcript in which the theme appears and 0 elsewhere. Now you can ask: in what proportion of transcripts does the theme appear? Does that proportion differ between widowed and non-widowed participants? Is the difference larger than chance? Does the proportion grow over the course of the interview, that is, do participants need to talk for a while before they articulate the theme? Each of these is a statistical question that the codes-as-variables move makes available.

Bernard, Wutich, and Ryan are explicit that this move is what places content analysis in the hybrid position it occupies. It is not "qualitative analysis with numbers attached" (which is what Berelson thought it was), nor "a quantitative method applied to text" (which is what nineteenth-century philology was). It is a method in which qualitative judgement is required to define the variables, and quantitative procedures are used to analyse them. Both halves are necessary; neither is optional.

Move	Thematic analysis (an earlier module)	Content analysis (this module)
What a code is	A label organising interpretive material	A variable with a value for every unit
What you report	The code, with illustrative quotes	The code's frequency, distribution, and inferential tests
What you defend	That the code names something real in the data	That the code is reliably applied AND names something real in the data
How you defend it	By showing illustrative passages	By showing inter-coder agreement (Krippendorff's alpha) AND illustrative passages
What "more data" buys you	Greater confidence that you have saturated the themes	Statistical power to detect distributional differences
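To make the codes-as-variables move concrete, here is a minimal sketch in plain Python. Each row is one recording unit (a transcript) and the code is a 0/1 column. The identifiers, the `widowed` attribute, the helper `proportion`, and every value are invented for illustration; they are not the course data.

```python
# One row per recording unit; the code is a binary variable (1 = theme present).
# All rows are invented for illustration.
rows = [
    {"unit": "A", "widowed": True, "residue_of_marriage": 1},
    {"unit": "B", "widowed": True, "residue_of_marriage": 1},
    {"unit": "C", "widowed": True, "residue_of_marriage": 0},
    {"unit": "D", "widowed": False, "residue_of_marriage": 0},
    {"unit": "E", "widowed": False, "residue_of_marriage": 1},
    {"unit": "F", "widowed": False, "residue_of_marriage": 0},
]


def proportion(rows, code, group_key=None, group_value=None):
    """Share of units with code == 1, optionally within one subgroup."""
    subset = [r for r in rows if group_key is None or r[group_key] == group_value]
    if not subset:
        raise ValueError("no units in the requested subgroup")
    return sum(r[code] for r in subset) / len(subset)


overall = proportion(rows, "residue_of_marriage")
widowed = proportion(rows, "residue_of_marriage", "widowed", True)
not_widowed = proportion(rows, "residue_of_marriage", "widowed", False)

print(f"overall: {overall:.2f}")
print(f"widowed: {widowed:.2f}  not widowed: {not_widowed:.2f}")
```

The same label that organised interpretation in thematic analysis now answers "in what proportion of transcripts?" and "does the proportion differ by subgroup?" Whether a difference like this is larger than chance is a separate inferential question.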

1.4 When Content Analysis Is the Right Tool

Not every qualitative research question calls for content analysis. The method's distinctive payoff is in questions that involve distributional comparison: how often something appears, whether subgroups differ in their use of a code, whether something changes over time. Where the question is genuinely interpretive (what does this single passage mean, or how does this participant make sense of their experience), thematic analysis (an earlier module), schema analysis (a later module), or narrative analysis (also a later module) will give you more analytic traction than content analysis will.

Concretely, you should reach for content analysis when one or more of the following are true:

You have a comparison question. Caregivers versus non-caregivers; younger versus older; immigrant versus native-born; pre-pandemic versus post-pandemic. Content analysis is built for these.
Your corpus is large enough that close reading alone would be impractical. Twenty transcripts is on the lower end: you can certainly do content analysis on a corpus this size, and you will in this module. For corpora of 50, 200, or 5,000 documents, however, content analysis is one of the few options that scales.
You need replicable counts that other researchers can audit. Where policymakers or journal reviewers will demand "how many" rather than "how rich," content analysis is the methodological answer.
Your research question is longitudinal. For trends, shifts, and before/after comparisons, content analysis is the standard tool in the qualitative literature.
You are writing for a quantitative audience. Public-health journals routinely ask for the kind of distributional evidence that content analysis produces. Where thematic analysis would be rejected as too impressionistic, content analysis can be persuasive.

Reflection

Pick one of your candidate themes from the Week 5 codebook you built earlier in the course. Now answer two questions about it: (1) Is the theme manifest, latent, or both? (2) If you turned it into a variable, what comparison or distribution would you most want to examine? Be specific: name the subgroups, the cases, or the dimension of variation you would test.

Model answer: A strong response is specific to one named theme and gives a clear, testable comparison. Example: "My code loneliness-as-residue-of-marriage is predominantly latent: participants do not use the phrase, but the propositional content is clear in passages where they describe an absent partner. If I turned the code into a variable, I would expect the count to be much higher for the 6 widowed participants in the corpus than for the 4 single-and-never-partnered participants. The interesting comparison is less the obvious widow/non-widow contrast than whether the never-partnered participants articulate any functionally equivalent latent code (perhaps loneliness-as-residue-of-roads-not-taken) that fills the same analytic slot." A weak answer names a theme but cannot operationalise the variable or specify a comparison. That is the sign that the code has not yet been promoted to variable status.

Section 2 of 5

Designing a Content Analysis: Sampling Text, Recording Units, and Reliability

⏱ Estimated reading time: 30 minutes

Designing a Content Analysis

Sampling text, recording units, and reliability.

Three unit types

Sampling, recording, and context units

Sampling unit

The chunk drawn from the population. In the loneliness corpus: each interview transcript.

Recording unit

Where the variable takes its value. May be the whole transcript, a paragraph, a sentence, or a speaker turn.

Context unit

The surrounding text a coder may consult. Wider context supports latent coding; narrower context increases reliability.

The coding scheme

Three requirements

Exhaustive

Every recording unit can be assigned at least one code, or an explicit "none applies" category.

Defensibly exclusive

Code boundaries are clear enough to guide consistent decisions. Multi-coding is permitted, but declared.

Operationally defined

Each code has a description, an inclusion rule, and an exclusion rule. Another coder can apply it without asking you.

Reliability standard

Krippendorff's alpha

\[ \color{#0B7B6B}{\alpha} = 1 - \frac{\color{#C2410C}{D_o}}{\color{#6D28D9}{D_e}} \]

where \(\alpha\) is the reliability coefficient, \(D_o\) is the observed disagreement, and \(D_e\) is the disagreement expected by chance.

α ≥ 0.80: definitive claims
0.67 ≤ α < 0.80: tentative inference; flag in the write-up
α < 0.67: revise the codebook before proceeding
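As a rough sketch of how the formula and thresholds fit together, the following Python computes alpha for the simplest case only: two coders, nominal codes, and no missing values. It uses the coincidence-matrix form of \(D_o\) and \(D_e\). The function names and the ten coded units are invented for illustration; for a real study, use an established and tested implementation.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Alpha for two coders, nominal data, no missing values (sketch only)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must code the same non-empty set of units")
    # Coincidence matrix: each unit contributes the pair in both orders.
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    # Marginal totals per category, and the total number of paired values.
    totals = Counter()
    for (category, _other), count in coincidences.items():
        totals[category] += count
    n = sum(totals.values())
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("only one category was used; alpha is undefined")
    return 1 - observed / expected


def interpret(alpha):
    """Map alpha onto the three thresholds used in this lesson."""
    if alpha >= 0.80:
        return "definitive claims"
    if alpha >= 0.67:
        return "tentative inference; flag in write-up"
    return "revise codebook before proceeding"


# Invented codes for ten units: the coders disagree on one unit only.
coder_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
coder_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

alpha = krippendorff_alpha_nominal(coder_a, coder_b)
print(round(alpha, 3), "->", interpret(alpha))  # about 0.791
```

The two invented coders agree on 9 of 10 units, yet alpha lands just under 0.80 because the statistic discounts the agreement that would be expected by chance given how often each category is used.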

End-to-end

The workflow in order

Formulate a distributional / comparative research question
Define the text population and sample from it
Declare sampling, recording, and context units in writing
Develop the codebook (a priori + emergent)
Train coders on a practice subset
Compute reliability on 10–20% of corpus
Revise codebook if α < 0.67
Apply final scheme to full corpus
Analyse frequency matrix; run inferential tests
Report transparently (corpus, units, codebook, reliability)

Carry forward

What to take into the next section

Unit choices shape both what the analysis shows and how reliable it will be.
Krippendorff's alpha: 0.80 definitive, 0.67 tentative, below 0.67 revise.
A low alpha is a diagnosis of a codebook problem, not a verdict on the data.

A later section covers what to do with the coded matrix once reliability is acceptable.

Introduction and Overview

An earlier section framed content analysis methodologically. This section is about the design decisions that determine whether your content analysis will be defensible. There are three of them, in order: what counts as the text to be analysed (sampling), what counts as the unit being coded (recording units and context units), and how the codes are applied with sufficient consistency that the resulting counts mean something (reliability). Get any one of these wrong and the analysis is suspect; get all three right and you have a piece of content-analytic work that a sceptical quantitative reviewer will accept.

Bernard, Wutich, and Ryan are unusually explicit on these matters because content analysis, more than any other qualitative method, can be done badly in ways that produce numbers that look authoritative but are not. The history of the field is littered with frequency tables built on inconsistent coding of poorly specified units sampled from non-representative corpora. The remedy is procedural: you commit, in writing, to a sampling rule, a unit-definition rule, a coding scheme, and a reliability target, before you begin coding. You then execute the procedure transparently, report what happened, and let the reader audit it.

Learning Objectives for this section

Distinguish sampling units, recording units, and context units, and choose each defensibly for your research question.
Build a content-analytic coding scheme that is exhaustive, defensibly exclusive (or explicitly multi-coded), and clearly defined.
Distinguish a priori (deductive) from emergent (inductive) coding schemes, and recognise the typical hybrid in practice.
Compute Krippendorff's alpha and interpret it against the field's reliability thresholds.

2.1 Sampling Text: The Three Kinds of Units

Krippendorff's most enduring methodological contribution after the alpha statistic is the distinction between three levels of unit in any content-analytic design:

Sampling units: the chunks of text that are drawn from the population. In a corpus of 20 loneliness transcripts, each transcript is a sampling unit. In a corpus of newspaper coverage of the overdose crisis, each article is a sampling unit. In a Twitter dataset, each tweet is a sampling unit.
Recording units: the units you actually code. The recording unit is where the variable takes its value. Recording units may be the same as sampling units (you code each transcript as a whole) or smaller (you code each sentence, each paragraph, each turn-at-talk, or each named theme-bearing passage).
Context units: the surrounding text the coder is permitted to consult when deciding how to code a recording unit. If your recording unit is a sentence and your context unit is the paragraph, the coder reads the sentence in the context of the paragraph but assigns the code to the sentence alone.

The trio matters because the choice of each one shapes both what the analysis can show and how reliable it will be. A study that uses the whole transcript as the recording unit produces a frequency-per-transcript table; a study that uses the sentence as the recording unit produces a frequency-per-sentence table. These two analyses can give different answers to the same surface question, and the difference is methodological, not substantive.

Recording unit choice	What it lets you measure	What it makes harder	Typical use in loneliness corpus
The whole transcript	Presence/absence of each code per participant	Intensity, frequency within a transcript, longitudinal pattern within an interview	"Of the 20 participants, how many invoke the loneliness-as-residue-of-marriage code at any point?"
The paragraph or speaker turn	Within-transcript distribution; co-occurrence of codes	Word-level features; reliability is harder than transcript-level	"In how many turns of Linda's interview does she invoke the absent-Bill theme, and where in the interview do those turns cluster?"
The sentence	Fine-grained distribution; sequence; intensity by participant	Higher coding burden; lower reliability for latent codes	"What proportion of Linda's sentences contain spatial metaphors for absence?"
The word or phrase (dictionary-based)	Massively scalable counts	Latent meaning entirely; ambiguity resolution; context	"How often does each transcript contain the word 'chair', and does the count correlate with widowhood status?"
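A small sketch shows how the recording-unit choice changes the number you report. The same hypothetical sentence-level codes are summarised twice: once per transcript (presence/absence) and once per sentence. The transcript labels and all 0/1 values are invented for illustration.

```python
# Hypothetical sentence-level codes: 1 = the code applies to that sentence.
sentence_codes = {
    "T1": [0, 0, 1, 0, 0],
    "T2": [1, 1, 0, 1, 1],
    "T3": [0, 0, 0, 0],
}

# Transcript as recording unit: does the code appear anywhere in the transcript?
transcript_level = {t: int(any(codes)) for t, codes in sentence_codes.items()}
share_of_transcripts = sum(transcript_level.values()) / len(transcript_level)

# Sentence as recording unit: what share of all sentences carry the code?
all_sentences = [c for codes in sentence_codes.values() for c in codes]
share_of_sentences = sum(all_sentences) / len(all_sentences)

print(f"transcripts with the code: {share_of_transcripts:.0%}")
print(f"sentences with the code:   {share_of_sentences:.0%}")
```

T1 and T2 look identical at the transcript level even though the code appears once in one and four times in the other. Neither summary is wrong; they answer different questions, which is why the unit has to be declared before coding.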

Recommended for your capstone: transcript-level recording units with paragraph-level context

In a 20-transcript corpus, the most defensible choice is to code each transcript as either containing or not containing each code (a binary recording-unit-per-transcript design), with each paragraph as the context unit you read to make the decision. This gives you a clean codes × participants matrix that supports chi-squared comparisons across subgroups. The trade-off is that you cannot measure within-transcript intensity, but for a corpus this size, between-transcript distributional analysis is the analytically productive level.
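The chi-squared comparison that this design supports can be sketched with the standard library alone. The example below computes the Pearson statistic for a 2 × 2 table (subgroup by code present/absent) without a continuity correction; the function name and the counts are invented for illustration. With only 20 transcripts some expected counts will usually fall below 5, where an exact test is generally preferred, so treat this as a demonstration of the mechanics and not as a recommended analysis.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared for a 2x2 table of counts, no continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts. Rows: widowed, not widowed. Columns: code present, code absent.
table = [[5, 1], [3, 11]]
statistic, p_value = chi_squared_2x2(table)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.4f}")
```

Each cell of the table is simply a count taken from the codes × participants matrix, which is what makes the binary transcript-level design convenient for this kind of comparison.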

2.2 Developing a Coding Scheme

Identify sampling units

What counts as a case? Documents? Paragraphs? Speaker turns? Articles? The sampling unit is the denominator of your eventual frequencies. Decide it before coding.

Identify recording units

What counts as a single coded instance? A word? A sentence? A paragraph? A clause? The recording unit determines how granular the codes are and how much data each code generates.

Develop the coding scheme

Three sources: (1) prior literature/theory (deductive codes), (2) pilot reading of data (inductive codes), (3) iterative refinement from the first 10–20% of the corpus. The final scheme almost always combines deductive and inductive codes.

Train, pilot, refine

Two or more coders apply the scheme to a pilot set (typically 10–15% of the corpus). Compare disagreements line-by-line. Refine code definitions, add inclusion/exclusion rules, add exemplars. Re-pilot until reliability is acceptable.

Apply to the full corpus

Full-corpus coding by trained coders. Spot-check reliability on a held-out sample (typically 10%). Re-train if reliability drifts.

Aggregate and analyse

Frequency tables, cross-tabulations, chi-squared tests, dictionary expansions, time trends. The countable output is the comparative advantage of content analysis over thematic analysis.

The coding scheme (the codebook, in the language of an earlier module) is the spine of any content analysis. In an earlier module you built one for thematic analysis; here you adapt it for the variable-treatment of content analysis. Three properties matter more in content analysis than they did in thematic analysis.

Exhaustive coverage. Your coding scheme should cover the content you care about. For content analysis specifically, exhaustive coverage means that every recording unit can be assigned at least one code (or the explicit "not-coded" residual category). The classical Berelson position is that codes should also be mutually exclusive (each unit gets exactly one code), on the grounds that overlapping codes make the resulting frequencies hard to interpret. Krippendorff and the contemporary field reject the mutual-exclusivity requirement: a passage can simultaneously instantiate "loneliness-as-existential-fact" and "loneliness-coped-with-by-pet-companionship," and there is no good reason to force the coder to choose. The modern compromise is that multi-coded passages are permitted if the codebook says so and the multi-coding is applied consistently.

Clear operational definitions. Each code in your scheme needs a definition that another coder could apply without consulting you. The definition has three parts: a brief substantive description, an inclusion rule (what counts), and an exclusion rule (what doesn't count). Inclusion and exclusion rules are where reliability is won or lost. "Mentions of loneliness" is a definition without inclusion/exclusion rules. "Any statement in which the participant attributes loneliness to a specific event or relationship (inclusion); excluding generic statements about loneliness in the population or society at large (exclusion)" is a definition that can be applied reliably.
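One way to keep the three parts of each definition together is to store the codebook as a small table and check it for gaps before coders are trained. The sketch below is illustrative only: the column names, the code name "loneliness-attributed", and the wording of the second code's rules are assumptions made for the example, not part of the course codebook.

```r
# Hypothetical codebook structure: one row per code, three definitional parts
codebook <- data.frame(
  code = c("loneliness-attributed", "shame-prevents-disclosure"),
  description = c(
    "Participant attributes loneliness to a specific event or relationship",
    "Participant describes shame as a barrier to disclosing loneliness"
  ),
  inclusion = c(
    "Any statement attributing loneliness to a specific event or relationship",
    "Any statement giving shame or embarrassment as a reason for not disclosing"
  ),
  exclusion = c(
    "Generic statements about loneliness in the population or society at large",
    "Statements about privacy or reticence with no mention of shame"
  ),
  stringsAsFactors = FALSE
)

# Flag any code whose definition is missing one of the three parts
required <- c("description", "inclusion", "exclusion")
incomplete <- codebook$code[!complete.cases(codebook[required]) |
                              rowSums(codebook[required] == "") > 0]
print(incomplete)  # character(0) when every code is fully defined
```

A check like this does not make a definition good, but it does catch a code that reached the coders without an exclusion rule.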

A priori versus emergent. Content-analytic codebooks come from one of two directions, or (usually) both. A priori codes come from the literature, your conceptual framework, or your interview guide; you decide before you read the data that you will code for, say, "stigma," "social-comparison," and "coping-by-substance-use" because the literature on loneliness says these are the right categories. Emergent codes come from the data: you read the corpus, notice what is there, and let the codebook grow to fit. Hsieh and Shannon (2005) call the first directed content analysis and the second conventional content analysis; their third type, summative content analysis, starts from word counts and works upward. Elo and Kyngäs (2008) and Mayring (2000) offer complementary process descriptions widely cited in nursing and European health-research traditions.

Approach	Codebook source	Advantage	Risk
Conventional (inductive / emergent)	Codes emerge from close reading of the corpus	Sensitive to participants' own categories; surfaces unexpected themes	May reproduce the analyst's preconceptions; can be unsystematic
Directed (deductive / a priori)	Codes drawn from theory, prior literature, or instrument	Replicable; comparable across studies; testable against theory	May miss what the data are actually doing if the codebook is wrong
Summative	Codes start as word counts, then expand to latent meaning	Scales; obvious replicability; computational tractability	Manifest-only by default; risks counting words without coding meaning
Hybrid (most contemporary work)	A priori scaffold plus emergent expansion	Best of both; transparent about origin of each code	Requires explicit documentation of which codes came from where

2.3 Inter-coder Reliability: Stricter Than Thematic Analysis

Reliability is the test of whether two coders, working independently with the same codebook, would assign the same codes to the same passages. In thematic analysis (an earlier module), reliability matters but it is one consideration among several. In content analysis, reliability is load-bearing: the frequencies you report are only as good as the consistency with which the codes were applied. If two coders applying the same codebook produce different counts, the counts are noise.

The field's standard reliability statistic is Krippendorff's alpha (Krippendorff, 2004, 2018; Hayes & Krippendorff, 2007). Alpha has three properties that recommend it over the older Cohen's kappa: it accommodates any number of coders (where Cohen's kappa handles two); it accommodates any level of measurement (nominal, ordinal, interval, ratio); and it handles missing data gracefully. The arithmetic compares the observed disagreement among coders to the disagreement expected by chance, and produces a coefficient that runs from 1 (perfect agreement) through 0 (chance) to negative values (worse than chance). Put simply, alpha asks how much better than random guessing the coders did: when they never disagree the observed disagreement is zero and alpha is 1, and when they agree only as often as chance would predict the two disagreements are equal and alpha is 0.
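The arithmetic can be made concrete with a minimal sketch for the simplest case: nominal codes, complete data (no missing ratings), and any number of coders. The function name kripp_alpha_nominal and the two rating vectors are invented for illustration. For real projects a packaged implementation such as kripp.alpha() in the irr package, which also covers other measurement levels and missing data, is likely the safer choice.

```r
# Krippendorff's alpha for nominal data with no missing values.
# ratings: matrix with one row per coder and one column per coded unit.
kripp_alpha_nominal <- function(ratings) {
  m <- nrow(ratings)
  values <- as.character(sort(unique(as.vector(ratings))))
  coincidence <- matrix(0, length(values), length(values),
                        dimnames = list(values, values))
  # Each ordered pair of ratings from different coders on the same unit
  # adds 1 / (m - 1) to the coincidence matrix
  for (u in seq_len(ncol(ratings))) {
    for (i in seq_len(m)) {
      for (j in seq_len(m)) {
        if (i != j) {
          a <- as.character(ratings[i, u])
          b <- as.character(ratings[j, u])
          coincidence[a, b] <- coincidence[a, b] + 1 / (m - 1)
        }
      }
    }
  }
  n_c <- rowSums(coincidence)  # how often each value was used
  n <- sum(n_c)                # total number of pairable ratings
  d_observed <- (n - sum(diag(coincidence))) / n
  d_expected <- (n^2 - sum(n_c^2)) / (n * (n - 1))
  1 - d_observed / d_expected
}

# Illustrative data: two coders, ten units, one binary code
coder_a <- c(1, 1, 0, 0, 1, 0, 1, 1, 0, 0)
coder_b <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0)
ratings <- rbind(coder_a, coder_b)

simple_agreement <- mean(coder_a == coder_b)
alpha <- kripp_alpha_nominal(ratings)

simple_agreement  # 0.9
round(alpha, 3)   # 0.808
```

The two coders agree on 9 of 10 units, yet alpha is about 0.81 rather than 0.90, because some of that agreement would be expected by chance given how often each value was used.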

Krippendorff's alpha	Interpretation	Typical recommendation
α ≥ 0.80	Strong agreement	Acceptable for definitive content-analytic claims (Krippendorff, 2018)
0.67 ≤ α < 0.80	Acceptable for tentative inference	Report; flag as tentative; discuss areas of disagreement
α < 0.67	Inadequate	Revise the codebook and re-train; do not publish counts from this code

The thresholds are convention, not law, and they vary slightly between sources. The 0.80/0.67 dichotomy comes from Krippendorff's own writing and is the most commonly cited in the field. Some health-research applications adopt a stricter 0.70 floor and a 0.80 publication target. Whatever you adopt, declare it in your methods section before you compute it.

Reliability is computed on a subset of the corpus, typically 10–20%, that two coders code independently. The remaining 80–90% is then coded by one coder alone, with the assurance that the reliability of the system is documented. The reliability subset is usually drawn randomly from the corpus to ensure that the reliability estimate is generalisable.
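Drawing the reliability subset can be as simple as the sketch below. The unit IDs, the 15% fraction, and the seed are placeholders chosen for the example; the point is that the draw is random and reproducible, so it can be reported in the methods section.

```r
set.seed(841)  # arbitrary seed, recorded so the draw can be reproduced

# Hypothetical recording-unit IDs: 200 units to be coded
unit_ids <- sprintf("U%03d", 1:200)

# Draw 15% of the units at random for independent double-coding
n_reliability <- round(0.15 * length(unit_ids))
reliability_set <- sample(unit_ids, n_reliability)

# The remaining units are coded by a single coder
single_coded_set <- setdiff(unit_ids, reliability_set)

length(reliability_set)   # 30
length(single_coded_set)  # 170
```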

What to do when reliability is low

A low alpha is not the end of the analysis; it is a diagnosis. The cause is almost always one of: (a) a code definition that is too vague (revise inclusion/exclusion rules); (b) a code definition that covers too much (split into sub-codes); (c) insufficient coder training (run additional training sessions on a separate set of practice passages); or (d) a code that is genuinely contested in the corpus (which is itself a finding and should be reported). The workflow is iterative: train, code a subset, compute alpha, revise the codebook, re-train, re-code, re-compute. Two cycles is typical; three is not unusual.

2.4 The Workflow End-to-End

ACTIVITY Try it - Pilot a coding scheme

Take 5-10 short documents (e.g., recent news headlines about a public health topic). Develop a small coding scheme:

Define 3-5 codes that capture features you care about (e.g., 'mentions vaccination', 'cites a Canadian source', 'uses risk framing').
Write a one-sentence definition + a one-sentence inclusion rule for each code.
Code the documents yourself once. Set the codes aside.
The next day, re-code the same documents without looking at your original codes. Compute your own intra-coder reliability.

If you can’t agree with yourself, two independent coders will agree even less. Most content-analysis projects need 2-3 iterations before reliability is acceptable.

A complete content-analytic study, in Bernard, Wutich, and Ryan's framing, follows these steps in order:

Formulate the research question: explicitly distributional or comparative.
Define the population of texts: what corpus is the analysis drawn from?
Sample the texts: if the population is small enough, the entire population may be sampled; otherwise, define a sampling frame and a sampling procedure.
Decide on sampling, recording, and context units: declare them in writing.
Develop the codebook: a priori, emergent, or hybrid, with operational definitions for every code.
Train coders: use a practice subset that does not enter the final analysis.
Compute reliability on a 10–20% subset: Krippendorff's alpha; revise the codebook if needed.
Apply the final codebook across the corpus: one coder for the remaining 80–90% is typical for a small-corpus study.
Analyse the resulting frequency table: descriptive statistics, then inferential tests (covered in a later section).
Report transparently: corpus, sampling, units, codebook (in an appendix), reliability, analysis, interpretation.

The workflow is sequential, with one feedback loop: a low reliability coefficient sends the analyst back to revise the codebook before the scheme is applied to the full corpus.

Reflection

For your capstone milestone, declare the design decisions you will make. Which 5–7 codes from your earlier codebook will you use? What is your recording unit (whole transcript / paragraph / sentence)? What is your context unit? What reliability target will you adopt, and how will you handle codes that do not meet it?

Model answer: A strong answer names the codes, the unit choices, and the reliability rule. Example: "I will use seven codes from my earlier codebook: loneliness-as-residue-of-marriage, cultural-untranslatability-of-loneliness, coping-by-pet-companionship, identity-of-being-lonely-person, technology-as-double-edged, fading-at-the-edges, and shame-prevents-disclosure. My recording unit will be the whole transcript: each transcript is coded as containing or not containing each code, producing a 20 × 7 binary matrix. My context unit is the paragraph: if a passage is ambiguous, the coder reads the surrounding paragraph before deciding. I will adopt Krippendorff's α ≥ 0.70 as my acceptability threshold and α ≥ 0.80 as my publication target. Codes that fall between 0.70 and 0.80 will be reported but flagged as tentative; codes below 0.70 will be revised and re-coded or dropped." A weak answer either does not commit to specific codes or does not name a reliability rule.

Section 3 of 5

Analyzing Coded Text: Hypothesis Testing, Comparisons, and Dictionaries

⏱ Estimated reading time: 30 minutes

Analysing Coded Text

Frequency tables, cross-tabulations, chi-squared tests, and dictionary-based approaches.

The starting point

Frequency tables

The comparison move

Cross-tabulation by subgroup

Inferential tests

Chi-squared and Fisher's exact

Chi-squared test of independence

\[ \color{#0B7B6B}{\chi^2} = \sum \frac{(\color{#C2410C}{O} - \color{#6D28D9}{E})^2}{\color{#6D28D9}{E}} \]

where χ² is the chi-squared statistic, O is the observed cell count, and E is the expected cell count under independence.

Chi-squared

Use when all expected cell counts are at least 5. Returns χ², degrees of freedom, and p-value.

Fisher's exact

Use when any expected cell count is below 5, which is common in 20-transcript corpora. More conservative; preferred for small samples.

Automated approaches

Dictionary-based content analysis

A pre-specified word list is applied computationally. Outputs are frequency counts per category, treated as variables like any other code.

LIWC (Linguistic Inquiry and Word Count, Pennebaker et al.) classifies words into 80+ psychological and linguistic categories.

Common sentiment dictionaries: Bing Liu's Bing lexicon, AFINN, and the NRC Emotion Lexicon.

Best practice

Dictionaries for manifest, replicable counts. Human coding for latent meaning. Both together are stronger than either alone.

Carry forward

What to take into the next section

Frequency tables are the starting point; cross-tabulation by subgroup is where findings emerge.
Fisher's exact is usually the right test in 20-transcript corpora with small expected cells.
Dictionaries and human coding are complementary, not competing approaches.

A later section walks through the complete R workflow: Taguette export to coded matrix to published result.

Introduction and Overview

Once your coding is done and your reliability is acceptable, you have a coded matrix: rows are recording units (transcripts, paragraphs, or sentences), columns are codes, and the cell values are either binary indicators (1 if the code applied, 0 otherwise) or counts (the number of times the code applied to that unit). This matrix is the input to the inferential half of content analysis. This section walks through what you can do with it: descriptive frequency tables, cross-tabulation by subgroup, chi-squared tests for distributional differences, and trend analyses where the corpus is longitudinal. It then introduces dictionary-based content analysis (LIWC and sentiment dictionaries) and previews the computational content analysis that a later module will develop in depth.

The intellectual point of this section is that content analysis is a hypothesis-testing method when it wants to be. Whatever you tested in your earlier regression work, you can test on content-analytic codes: differences in proportions, associations, trends, interactions. The variables are codes rather than survey items, but the inferential machinery is the same. This is what Bernard, Wutich, and Ryan mean when they say content analysis is the bridge between the qualitative and quantitative traditions.

Learning Objectives for this section

Compute and present a content-analytic frequency table at three levels: overall, by subgroup, and over time.
Apply chi-squared tests of independence (and Fisher's exact for small cells) to codes × subgroup matrices.
Interpret distributional differences in content-analytic data substantively as well as statistically.
Recognise when dictionary-based content analysis (LIWC, sentiment dictionaries) is appropriate.
Locate computational content analysis (a later module) as the scalable extension of the dictionary approach.

3.1 Frequency Tables: The Starting Point

Every content-analytic study starts with descriptive frequencies. For your loneliness corpus, a typical first-pass frequency table looks like the one below: each of seven codes, the count of transcripts in which the code appears (out of 20), and the percentage. The codes here are illustrative; your own codebook will produce different numbers.

Code	Transcripts (n / 20)	Percentage
loneliness-as-residue-of-marriage	6	30%
cultural-untranslatability-of-loneliness	4	20%
coping-by-pet-companionship	9	45%
identity-of-being-lonely-person	11	55%
technology-as-double-edged	14	70%
fading-at-the-edges	5	25%
shame-prevents-disclosure	13	65%

Already this table is doing analytic work. The most prevalent code, technology-as-double-edged, appears in 14 of 20 transcripts, suggesting that ambivalence about phones, social media, and video calls is a near-universal feature of contemporary loneliness narratives, regardless of who the participant is. The least prevalent, cultural-untranslatability, appears in 4 transcripts, and you can already predict who those four participants are: the ones who emigrated to Canada from elsewhere (Amira, Maya's mother, two others). The frequency table tells you both what is shared across the corpus and what is concentrated.

3.2 Cross-Tabulation by Subgroup: The Comparison Move

Content analysis's real value emerges when you cross-tabulate the codes by a participant-level variable: age band, gender, caregiving status, immigration status, life-stage. Here is the same code distribution disaggregated by caregiving status in our worked corpus (10 caregivers, 10 non-caregivers):

Code	Caregivers (n=10)	Non-caregivers (n=10)	Total (n=20)
loneliness-as-residue-of-marriage	2 (20%)	4 (40%)	6 (30%)
cultural-untranslatability-of-loneliness	2 (20%)	2 (20%)	4 (20%)
coping-by-pet-companionship	3 (30%)	6 (60%)	9 (45%)
identity-of-being-lonely-person	6 (60%)	5 (50%)	11 (55%)
technology-as-double-edged	7 (70%)	7 (70%)	14 (70%)
fading-at-the-edges	1 (10%)	4 (40%)	5 (25%)
shame-prevents-disclosure	9 (90%)	4 (40%)	13 (65%)

Now the analytic action begins. The shame-prevents-disclosure code appears in 9 of 10 caregivers but only 4 of 10 non-caregivers, a striking difference and exactly the kind of finding content analysis is built to produce. The fading-at-the-edges code shows the reverse pattern: it is more common in non-caregivers (likely the older, more isolated participants like Helen) than in caregivers. The technology-as-double-edged code is invariant by subgroup (7 of 10 in each), supporting the earlier hypothesis that this is a near-universal feature of contemporary loneliness regardless of caregiving status. These patterns are descriptive; the next move is to ask whether they are larger than chance.

3.3 Chi-Squared Tests on Codes × Subgroup

The classical inferential test for a 2 × 2 (or larger) contingency table of categorical data is the chi-squared test of independence. Applied to a content-analytic frequency table, the chi-squared test asks: is the distribution of this code statistically different between the subgroups? The mechanics are familiar from earlier courses; here we apply them to qualitative codes treated as variables.

Take the shame-prevents-disclosure code from the table above: 9 of 10 caregivers vs. 4 of 10 non-caregivers. The chi-squared test on this 2 × 2 table, computed without a continuity correction, gives χ² = 5.49 with 1 degree of freedom, p = 0.019. The difference is statistically significant at conventional thresholds. Practically (and this is the part the qualitative analyst contributes), the analysis suggests that caregiver participants in this corpus systematically describe shame as a barrier to disclosing their loneliness, in a way that non-caregivers do not. The substantive interpretation might be that caregiving identity carries a moral demand of unselfish endurance, and that admitting loneliness conflicts with that demand. The chi-squared test is the warrant for the claim; the qualitative reading is the claim.
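The arithmetic behind that result can be reproduced by hand in a few lines of base R. This is a minimal sketch using the counts from the worked example; the object names are arbitrary.

```r
# 2 x 2 table from the worked example: rows = subgroup, columns = code present?
observed <- matrix(
  c(9, 1,
    4, 6),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("caregiver", "non_caregiver"),
                  shame = c("present", "absent"))
)

# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(expected)    # 6.5 (present) and 3.5 (absent) in each row
round(chi_sq, 2)   # 5.49
round(p_value, 3)  # 0.019
```

Note that two of the four expected counts are 3.5, below the usual threshold of 5, which is why the choice of test matters for this table.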

When to use Fisher's exact instead

Chi-squared tests assume that expected cell counts are at least 5 (some sources say 1). In small qualitative corpora, especially those with rare codes, this assumption frequently fails. The standard remedy is Fisher's exact test, which gives an exact p-value regardless of cell size. In R, fisher.test() is a drop-in replacement for chisq.test(); for a 20-transcript corpus with codes appearing in 4–14 transcripts, Fisher's exact is almost always the more defensible choice. The shame example above shows why this matters: its expected counts are only 3.5 in two of the four cells, so although the uncorrected chi-squared gives p = 0.019, Fisher's exact returns p of about 0.06. Under the test you should trust here, that caregiver difference is better reported as suggestive than as conclusively significant. Report both if you wish, but lead with Fisher's.
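A short sketch makes the comparison concrete, using the counts from the shame example. One detail worth knowing: for 2 × 2 tables chisq.test() applies the Yates continuity correction by default, so the uncorrected result has to be requested with correct = FALSE. The suppressWarnings() wrapper is only there to hide R's warning about small expected counts.

```r
observed <- matrix(
  c(9, 1,
    4, 6),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("caregiver", "non_caregiver"),
                  shame = c("present", "absent"))
)

# Uncorrected chi-squared test
chi_uncorrected <- suppressWarnings(chisq.test(observed, correct = FALSE))

# R's default for 2 x 2 tables: Yates continuity correction
chi_yates <- suppressWarnings(chisq.test(observed))

# Fisher's exact test (two-sided)
fisher <- fisher.test(observed)

round(chi_uncorrected$p.value, 3)  # 0.019
round(chi_yates$p.value, 3)        # 0.061
round(fisher$p.value, 3)           # 0.057

# Effect size to report alongside the p-value: the sample odds ratio
sample_odds_ratio <- (9 * 6) / (1 * 4)
sample_odds_ratio                  # 13.5

# fisher.test() also returns a conditional maximum-likelihood odds ratio,
# which generally differs somewhat from the simple sample odds ratio
fisher$estimate
```

The corrected chi-squared and Fisher's exact test tell the same story: the difference is large, but with 20 transcripts it sits just above the 0.05 cutoff.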

3.4 Trend Analysis: Codes Over Time

Content analysis is at its strongest in longitudinal applications. Where your corpus has a time dimension (coverage of a topic across years of newspaper articles, posts on a forum across the pandemic, or transcripts collected at multiple time points), the codes can be plotted over time and tested for trend. The classical example is Pool's (1952) analysis of editorial coverage of the Soviet Union across decades; the contemporary example is the proliferation of computational content analyses of social-media discourse before, during, and after specific events (elections, crises, vaccine roll-outs).

Your loneliness capstone is cross-sectional, so trend analysis is not the dominant move, but you may have a quasi-temporal variable. For example, you could ask whether participants who were interviewed earlier in the corpus (the first 10 transcripts) differ systematically from those interviewed later (the last 10) in any code, as a way of probing whether your codebook was over-fitted to early transcripts. Or you could examine whether participants who lived through the pandemic as adults (versus those who were teenagers at the time) describe loneliness differently. These are pseudo-trend analyses on a cross-sectional corpus, but the inferential logic is the same.

3.5 Dictionary-Based Content Analysis

So far we have assumed that humans are applying the codes. Dictionary-based content analysis is the variant in which a pre-specified word list (a "dictionary") is applied to the corpus by a computer, and the resulting counts are treated as content-analytic variables. The approach is best understood as the manifest-content extreme of content analysis: it counts what is explicitly there, fast and at scale, at the cost of any latent-meaning sensitivity.

The best-known dictionary in the social sciences is the Linguistic Inquiry and Word Count (LIWC) dictionary, developed by James Pennebaker and colleagues from the early 1990s (Pennebaker, Boyd, Jordan, & Blackburn, 2015; Boyd, Ashokkumar, Seraj, & Pennebaker, 2022). LIWC is a curated set of about 90 categories (positive emotion, negative emotion, anxiety, sadness, body, health, family, social processes, cognitive processes, and so on), each defined by a list of words and word stems. Run LIWC over a transcript and it returns the percentage of words in each category. A transcript that scores high on "sadness" contains a high proportion of words like sad, lonely, grief, cry, miss, lost; one that scores high on "social processes" contains a high proportion of pronouns referring to other people.

LIWC has been used extensively in health-related text analysis: predicting depression from writing samples (Pennebaker, Mehl, & Niederhoffer, 2003), characterising trauma narratives, distinguishing the writing of patients on different medications, predicting suicide risk from social-media posts. It is the methodological ancestor of contemporary sentiment-analysis systems and remains in active use; LIWC-22 is the current version.

Sentiment dictionaries are the lighter-weight cousin of LIWC. The best-known are Bing Liu's Bing lexicon (about 6,800 words classified as positive or negative), the AFINN lexicon (a scored version, with words assigned valence from -5 to +5), and the NRC Word-Emotion Association Lexicon (which classifies words into Plutchik's eight basic emotions plus positive/negative). All three are available in the R tidytext package and can be applied to a transcript corpus in under 50 lines of code (you will do this in a later module).
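The mechanics of a dictionary pass are simple enough to sketch in base R. The two-category dictionary, the two example transcripts, and the helper names below are all invented for illustration; they are not LIWC categories or any published lexicon, and a real analysis would use a validated dictionary.

```r
# Hypothetical mini-dictionary: two categories, a handful of words each
dictionary <- list(
  sadness = c("sad", "lonely", "grief", "cry", "miss", "lost"),
  social  = c("friend", "family", "we", "they", "together")
)

# Invented example transcripts, named by participant ID
transcripts <- c(
  P01 = "I miss my family and I feel lonely when they are away.",
  P02 = "We walk together every morning and my friend makes me laugh."
)

# Lower-case the text and split it into word tokens
tokenise <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[tokens != ""]
}

# Percentage of a transcript's tokens that fall in each category
score_transcript <- function(text, dictionary) {
  tokens <- tokenise(text)
  sapply(dictionary, function(words) 100 * mean(tokens %in% words))
}

# Rows = transcripts, columns = dictionary categories
scores <- t(sapply(transcripts, score_transcript, dictionary = dictionary))
print(round(scores, 1))
```

Each column of scores is a content-analytic variable like any other code, which is also the limitation: the pass counts "miss" whether the participant misses a spouse or missed a bus.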

Dictionary-based vs. human-coded content analysis

The choice is not exclusive. Most rigorous contemporary content-analytic studies use both: dictionary-based methods for the manifest, scale-sensitive, replicable counts, and human-coded latent analysis for the meaning-sensitive, ambiguity-tolerant, theory-engaged interpretation. The two halves answer different questions and are reported alongside each other. The danger is treating dictionary output as the whole analysis. LIWC can say that a transcript scores 4.2% on "sadness," but it cannot tell you that the sadness is bereavement-specific, that it co-occurs with relief, or that the participant is describing the sadness ironically. Only human coding can.

3.6 Computational Content Analysis: A Preview of a Later Module

Computational content analysis is the scalable extension of the dictionary approach, plus the addition of unsupervised methods that discover categories rather than counting pre-specified ones. The three families you will meet later in the course are: topic models (Latent Dirichlet Allocation and its successors, which discover thematic structure in large corpora without a codebook); word-embedding methods (which represent words as vectors in a high-dimensional space, enabling semantic similarity and analogy operations); and large-language-model-based coding (GPT-4-class systems applied as zero-shot or few-shot coders).

The methodological status of computational content analysis is still being worked out. The advantages are obvious: speed, scale, replicability of the computational pipeline. The disadvantages are real: opacity (LDA and LLMs cannot show their work the way a human coder's audit trail can), brittleness (a topic-model solution is sensitive to hyperparameter choices in ways the analyst rarely audits), and the recurring failure to validate computational outputs against human-coded gold standards. Bernard, Wutich, and Ryan's stance, and the stance of this course, is that computational content analysis is a useful complement to human content analysis, not a replacement for it. A later module will give you the operational machinery; this module is establishing the framework that machinery extends.

Variant	What it counts	Scalability	Reliability	Latent-content sensitivity
Human-coded latent content analysis (this module)	Codebook-defined themes	~50–200 documents per project	Krippendorff's alpha computed and reported	High
Human-coded manifest content analysis	Word or phrase occurrences	~500–5,000 documents per project	Near-perfect	Low
Dictionary-based (LIWC, sentiment)	Pre-specified word lists	Millions of documents	Perfect within the dictionary	Very low
Topic models (LDA; a later module)	Unsupervised thematic structure	Millions of documents	Depends on hyperparameter validation	Moderate but opaque
LLM-based coding (a later module)	Anything specifiable in a prompt	Limited by API cost, not by labour	Variable; requires gold-standard validation	High but unverifiable

Reflection

Consider the comparison in section 3.3: shame-prevents-disclosure appeared in 9 of 10 caregivers vs. 4 of 10 non-caregivers, χ² p = 0.019. Imagine a reviewer who is sceptical of qualitative work pushes back: "you have 20 transcripts, can you really make a chi-squared claim?" Write your response. What are the legitimate cautions, and what are the legitimate defences?

Model answer: A strong response acknowledges the legitimate cautions and then articulates the defensible position. Cautions: 20 transcripts is small for chi-squared; Fisher's exact would be the more defensible test given the small expected cell counts; and the sample is purposive, not probabilistic, so the inferential claim cannot generalise to the British Columbian population. It generalises only to the comparison within this corpus. Defences: the test is doing exactly the work it should be doing. It answers the question "is the difference between caregivers and non-caregivers in this corpus larger than what we would expect by chance alone, given the corpus size?" That is a legitimate question even when the sample is small and non-probabilistic. The chi-squared test is not the warrant for a population-level prevalence claim (that is the territory of an earlier course); it is the warrant for a within-corpus distributional claim that triangulates with the qualitative reading. A defensible methods section will say: "Within this purposive corpus of 20 transcripts, the difference in code prevalence between caregivers and non-caregivers exceeded what would be expected by chance (chi-squared p = 0.019; the more conservative Fisher's exact test gives p of about 0.06, just short of the conventional 0.05 cutoff). Given the corpus size the finding is best reported as suggestive and as warranting further investigation in a larger study, not as a claim that generalises."

Section 4 of 5

Content-Analyzing the Loneliness Dataset: R Workflow and the Week 8 Capstone

⏱ Estimated reading time: 45 minutes

The R Workflow and Week 8 Capstone

From Taguette export to coded matrix, frequency tables, visualisation, and inferential tests.

Step 1

Load the Taguette export into R

Input

A CSV from Taguette: one row per highlighted passage, columns for document filename, code (tag), and passage text.

After the join

A coding tibble with participant-level variables (age, gender, caregiver, immigrant, life_stage) attached to each passage row. Ready for reshaping.

Step 2

Reshape into codes × cases matrix

Steps 3 & 4

Frequency tables and visualisation

dplyr summary

Group by subgroup variable; count how many participants in each group have each code; compute percentages. One tibble per comparison.

ggplot2 bar chart

Horizontal bars; codes sorted longest-to-shortest; two bars per code (e.g., caregiver vs. non-caregiver). Codes where bars diverge most are the primary findings.

Steps 5 & 6

Inferential tests and keyness analysis

Step 5: Contingency tests

One test per code that showed a meaningful descriptive difference. Report odds ratio alongside p-value. Fisher's exact for small expected cells.

Step 6: Keyness

textstat_keyness() in quanteda.textstats. Which words appear disproportionately in one subcorpus? Log-likelihood ratio; complement to code-level chi-squared.

Carry forward

The capstone milestone

Four artefacts

Taguette export; codebook with operational definitions; reliability report (Krippendorff's alpha); R output with frequency table, bar chart, and at least one inferential test.

Write first

Before running R, declare in writing: which subgroup contrast, which codes, what you expect. That declaration anchors the methods section.

A later section is the final assessment, where you consolidate and check your understanding.

Introduction and Overview

Earlier sections framed content analysis methodologically. This section turns operational. You will see, end to end, how to take a Taguette codebook (from your earlier work), apply it across the 20 loneliness transcripts, transform the resulting export into a codes × cases matrix in R, compute frequency tables and visualisations, run chi-squared and Fisher's exact tests on codes × subgroup, and conduct a keyness analysis with quanteda.textstats that surfaces which words differ most between two subcorpora.

The R workflow that follows assumes you completed an earlier module milestone: you have an exported Taguette CSV with one row per (passage, code) pair, columns for the document filename, the tag (code), and the passage content. An earlier module covered 3–5 transcripts; this module extends the codebook to 8–12 transcripts. The volume jump is intentional: you need enough cases for the frequency tables and chi-squared tests to be analytically informative, and 8–12 transcripts is the floor for a defensible content-analytic capstone milestone.

Learning Objectives for this section

Load a Taguette export into R and reshape it into a codes × cases matrix.
Compute frequency tables (overall, by subgroup) using dplyr.
Visualise code frequencies by subgroup with a ggplot2 bar chart.
Conduct chi-squared and Fisher's exact tests on codes × subgroup.
Run a keyness analysis with quanteda.textstats::textstat_keyness() to compare subcorpora.
Complete the capstone deliverable.

4.1 Step 1: Load the Taguette Export Into R

Your Taguette export is a CSV with one row per highlighted passage. Each row contains the document filename (which encodes the participant ID), the tag (your code), and the passage content. The first transformation in R is to read it into a tibble and add the participant-level variables (age, gender, caregiver status, etc.) needed for subgroup comparison.

R: Load Taguette export and merge participant metadata

Open RStudio. Working from the repository root, run:

library(tidyverse)

# Read the Taguette export, one row per (passage, code) pair
codings <- read_csv("term projects/HSCI_841/taguette_export_week8.csv")
glimpse(codings)
# Typical columns: document, tag (= code), content (= passage text), start, end

# Pull participant_id and pseudonym out of the document filename
codings <- codings |>
  mutate(
    participant_id = str_extract(document, "P[0-9]+"),
    pseudonym      = str_extract(document, "(?<=_)[A-Za-z]+(?=\\.txt)")
  )

# Read the participant metadata table (created from transcript headers)
# Columns: participant_id, pseudonym, age, gender, life_stage, caregiver, immigrant
participants <- read_csv("term projects/HSCI_841/participant_metadata.csv")

# Merge metadata onto every coded row. pseudonym is dropped from the metadata
# first because codings already has it; otherwise the join would create
# duplicate pseudonym.x / pseudonym.y columns
codings <- codings |>
  left_join(select(participants, -pseudonym), by = "participant_id")

glimpse(codings)

What success looks like: The codings tibble has one row per highlighted passage, columns for the code (tag), the passage text (content), and the participant-level variables (age, gender, life_stage, caregiver, immigrant) needed for subgroup analysis.

4.2 Step 2: Reshape Into a Codes × Cases Matrix

The codes-as-variables move is now a literal data-shape operation: convert the long-format export into a wide-format matrix where rows are participants and columns are codes. The cell values are typically binary (1 if the code applied to that participant at all, 0 otherwise) but can be counts (the number of times the code applied within the transcript).

R: Build the codes × cases matrix

# Binary indicator matrix: 1 if code appeared in transcript, 0 otherwise
code_matrix_binary <- codings |>
  distinct(participant_id, tag) |>
  mutate(present = 1) |>
  pivot_wider(
    names_from  = tag,
    values_from = present,
    values_fill = 0
  )

# Count matrix: number of times each code appeared in each transcript
code_matrix_counts <- codings |>
  count(participant_id, tag) |>
  pivot_wider(
    names_from  = tag,
    values_from = n,
    values_fill = 0
  )

# Add the participant-level variables for subgroup analysis
code_matrix_binary <- code_matrix_binary |> left_join(participants, by = "participant_id")

print(code_matrix_binary)

What this gives you: A 20 × (7 + k) tibble (or 8–12 × (7 + k) for the milestone), where each row is a participant, k columns are binary code indicators (one per code), and the remaining 7 columns are participant_id plus the six participant attributes (pseudonym, age, gender, life_stage, caregiver, immigrant). This is the analysis-ready frame for everything that follows.
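One case the reshape does not handle: a participant with no coded passages never appears in the long-format export, so they are missing from the matrix and from every denominator computed from it. A minimal sketch of one way to keep them, assuming toy data and hypothetical code names (code-a, code-b), is to start from the participant table and fill the missing indicators with 0.

```r
library(dplyr)
library(tidyr)

# Toy data: P03 was interviewed but has no coded passages
toy_codings <- tibble(
  participant_id = c("P01", "P01", "P02"),
  tag            = c("code-a", "code-b", "code-a")
)
toy_participants <- tibble(
  participant_id = c("P01", "P02", "P03"),
  caregiver      = c("caregiver", "non-caregiver", "caregiver")
)

# Same reshape as above, on the toy data
toy_matrix <- toy_codings |>
  distinct(participant_id, tag) |>
  mutate(present = 1) |>
  pivot_wider(names_from = tag, values_from = present, values_fill = 0)

# Start from the participant table so nobody is dropped,
# then replace the NA indicators of uncoded participants with 0
toy_code_cols <- setdiff(names(toy_matrix), "participant_id")
toy_full <- toy_participants |>
  left_join(toy_matrix, by = "participant_id") |>
  mutate(across(all_of(toy_code_cols), \(x) replace_na(x, 0)))
print(toy_full)
```

The column order differs from code_matrix_binary (attributes come first), which does not matter for the later steps because they select columns by name.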

4.3 Step 3: Frequency Tables With `dplyr`

The overall and by-subgroup frequency tables in an earlier section are produced in a few dplyr calls.

R: Compute frequency tables, overall and by subgroup

# Code columns = every column that did not come from the participant table
code_cols <- setdiff(names(code_matrix_binary), c(names(participants), "age_band"))

# Overall code frequency (n transcripts in which each code appeared)
code_freq_overall <- code_matrix_binary |>
  summarise(across(all_of(code_cols), sum)) |>
  pivot_longer(everything(), names_to = "code", values_to = "n_transcripts") |>
  mutate(pct = round(100 * n_transcripts / nrow(code_matrix_binary), 1)) |>
  arrange(desc(n_transcripts))
print(code_freq_overall)

# By caregiver status: n transcripts with each code present, one column per subgroup
code_freq_by_caregiver <- codings |>
  distinct(participant_id, tag, caregiver) |>
  count(tag, caregiver) |>
  pivot_wider(names_from = caregiver, values_from = n, values_fill = 0)
print(code_freq_by_caregiver)

# By age band (a missing age stays NA instead of falling into "65+")
code_matrix_binary <- code_matrix_binary |>
  mutate(age_band = case_when(
    is.na(age) ~ NA_character_,
    age < 30   ~ "18-29",
    age < 50   ~ "30-49",
    age < 65   ~ "50-64",
    TRUE       ~ "65+"
  ))

code_by_age <- code_matrix_binary |>
  group_by(age_band) |>
  summarise(across(all_of(code_cols), sum), n_in_band = n())
print(code_by_age)

Worked-example finding: If you replicate the earlier caregiver split on your own data, you will likely find that codes carrying a "moral demand" valence (shame, identity, role-protection) cluster in the caregiver subgroup, while codes carrying a "world-shrinkage" valence (fading, isolation, sensory loss) cluster in the older non-caregiver subgroup. The exact numbers will vary depending on which 8–12 transcripts you coded.

4.4 Step 4: Visualise With `ggplot2`

A bar chart of code frequencies by subgroup is the standard published display for a content-analytic study. ggplot2 produces a publication-quality version in a few lines.

R: Bar chart of code frequencies by caregiver status

library(ggplot2)

# Long-format frequency table for plotting
plot_data <- codings |>
  distinct(participant_id, tag, caregiver) |>
  count(tag, caregiver) |>
  left_join(
    participants |> count(caregiver, name = "n_in_group"),
    by = "caregiver"
  ) |>
  mutate(pct = 100 * n / n_in_group)

ggplot(plot_data, aes(x = reorder(tag, pct), y = pct, fill = caregiver)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("caregiver" = "#CC0033", "non-caregiver" = "#0B7B6B")) +
  labs(
    x = NULL,
    y = "% of subgroup with code present",
    fill = "Caregiver status",
    title = "Code prevalence by caregiver status (loneliness corpus, n=20)"
  ) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.y = element_blank())

# Create the output folder if it does not exist, then save the most recent plot
dir.create("figures", showWarnings = FALSE)
ggsave("figures/code_prevalence_by_caregiver.png", width = 8, height = 5, dpi = 300)

What success looks like: A horizontal bar chart with codes on the y-axis (longest bars at the top), percentage on the x-axis, two bars per code (red for caregiver, teal for non-caregiver). Codes for which the two bars differ markedly are the ones worth subsequent chi-squared testing.

4.5 Step 5: Chi-Squared and Fisher's Exact Tests

The inferential half of the analysis. For each code that showed a meaningful descriptive difference between subgroups in step 4, run the appropriate contingency-table test.

R: Chi-squared and Fisher's exact tests on codes × subgroup

# For a single code, the shame-prevents-disclosure example from an earlier section
# Build the 2 x 2 contingency table
shame_table <- table(
  caregiver = code_matrix_binary$caregiver,
  shame     = code_matrix_binary$`shame-prevents-disclosure`
)
print(shame_table)

# Chi-squared with Yates' continuity correction (default for 2 x 2)
chisq.test(shame_table)

# Fisher's exact, the more defensible choice for small expected counts
fisher.test(shame_table)

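# Loop across all codes to produce a table of Fisher's exact results
code_names <- setdiff(colnames(code_matrix_binary),
                       c("participant_id", "age", "gender", "life_stage",
                         "caregiver", "immigrant", "pseudonym", "age_band"))

test_results <- map_dfr(code_names, function(cd) {
  # Fix the levels so the table stays 2 x 2 even when a code is present in every transcript
  tbl <- table(code_matrix_binary$caregiver,
               factor(code_matrix_binary[[cd]], levels = c(0, 1)))
  ft  <- fisher.test(tbl)
  tibble(
    code     = cd,
    n_caregiver_pos    = tbl["caregiver", "1"],
    n_noncaregiver_pos = tbl["non-caregiver", "1"],
    # Rows are ordered caregiver, non-caregiver and columns 0, 1, so this is the odds
    # of the code being present for non-caregivers relative to caregivers
    odds_ratio = unname(ft$estimate),
    p_value    = ft$p.value
  )
}) |> arrange(p_value)

print(test_results)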

What to report: The test_results tibble is what goes in your methods/findings tables. Lead with the codes that show the largest distributional differences (lowest p-values); report the odds ratio alongside the p-value because p-values from small samples are misleading on their own; flag any codes for which expected cell counts fall below 5 by noting that Fisher's exact was used.
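The reporting advice above refers to expected cell counts and to a table of many p-values, neither of which the loop computes directly. The sketch below, on a made-up 2 × 2 table (not course data), shows how expected counts can be read off chisq.test() and how p.adjust() could be applied to a set of p-values. The choice of the Holm method is an assumption for illustration, not a course requirement.

```r
# Made-up 2 x 2 table: rows = subgroup, columns = code absent (0) / present (1)
toy_table <- matrix(c(3, 8, 7, 2), nrow = 2,
                    dimnames = list(caregiver = c("caregiver", "non-caregiver"),
                                    code      = c("0", "1")))

# Expected counts under independence: (row total * column total) / grand total
expected <- suppressWarnings(chisq.test(toy_table)$expected)
print(expected)

# TRUE if any expected count is below 5, i.e. Fisher's exact is the safer test
any_small <- any(expected < 5)
print(any_small)

# Adjusting a set of p-values for multiple comparisons (illustrative values)
toy_p <- c(0.004, 0.03, 0.20, 0.65)
adjusted_p <- p.adjust(toy_p, method = "holm")
print(adjusted_p)
```

In your own script, toy_p would be test_results$p_value, and the adjusted values could sit in a column next to the raw ones.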

4.6 Step 6: Keyness Analysis With `quanteda.textstats`

Keyness is the manifest-content cousin of the chi-squared analysis above. It asks: which words appear disproportionately in one subcorpus compared to another? The classical implementation is the log-likelihood ratio test on word frequencies between two subcorpora; quanteda.textstats::textstat_keyness() is the R implementation.

For the loneliness dataset, the natural keyness contrast is one of the subgroup contrasts you have already been working with: caregiver vs. non-caregiver, immigrant vs. non-immigrant, or under-40 vs. over-65. The output is a ranked list of words, each with a chi-squared (or log-likelihood) statistic and a sign indicating which subcorpus the word over-occurs in.
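To make the log-likelihood statistic concrete, the sketch below computes G² by hand for a single word from made-up counts. It uses the standard formula G² = 2 Σ O ln(O / E) over the four cells of a word × subcorpus table. textstat_keyness() may apply corrections or conventions that this bare version omits, so treat it as an illustration of the idea rather than a reproduction of the package's output.

```r
# Made-up counts: one word vs. all other tokens, in two subcorpora of 10,000 tokens each
word_target <- 40;  other_target <- 9960   # target subcorpus
word_ref    <- 10;  other_ref    <- 9990   # reference subcorpus

observed <- matrix(c(word_target, word_ref, other_target, other_ref), nrow = 2,
                   dimnames = list(subcorpus = c("target", "reference"),
                                   token     = c("word", "other")))

# Expected counts if the word were equally frequent in both subcorpora
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Log-likelihood ratio statistic: G2 = 2 * sum(O * log(O / E))
# (a cell with an observed count of 0 would need special handling; none occurs here)
g2 <- 2 * sum(observed * log(observed / expected))

# Sign convention: positive when the word over-occurs in the target subcorpus
signed_g2 <- if (observed["target", "word"] > expected["target", "word"]) g2 else -g2
print(signed_g2)

# Reference p-value from the chi-squared distribution with 1 degree of freedom
p_value <- pchisq(g2, df = 1, lower.tail = FALSE)
print(p_value)
```

A large positive value marks a word that is characteristic of the target subcorpus; a large negative value marks one that is characteristic of the reference subcorpus.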

R: Keyness analysis: which words distinguish caregiver from non-caregiver transcripts?

library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)

# Build the quanteda corpus from the transcripts (as earlier in the course)
loneliness_rt <- readtext("term projects/HSCI_841/transcripts/P*.txt",
                          docvarsfrom = "filenames",
                          docvarnames = c("participant_id", "pseudonym"),
                          dvsep = "_")
loneliness_corpus <- corpus(loneliness_rt)

# Attach caregiver status as a document variable
docvars(loneliness_corpus, "caregiver") <- participants$caregiver[
  match(docvars(loneliness_corpus, "participant_id"),
        participants$participant_id)
]

# Tokenise, lowercase, remove stopwords
loneliness_tokens <- tokens(loneliness_corpus,
                            remove_punct   = TRUE,
                            remove_numbers = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en"))

# Build the document-feature matrix and group by caregiver status
loneliness_dfm <- dfm(loneliness_tokens) |>
  dfm_group(groups = docvars(loneliness_corpus, "caregiver"))

# Keyness test: caregiver subcorpus as target, non-caregiver as reference
keyness <- textstat_keyness(loneliness_dfm,
                            target = "caregiver",
                            measure = "lr")  # log-likelihood ratio
print(head(keyness, 30))

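# Plot top 20 keywords on each side, then save that plot explicitly
keyness_plot <- textplot_keyness(keyness, n = 20)
print(keyness_plot)
dir.create("figures", showWarnings = FALSE)
ggplot2::ggsave("figures/keyness_caregiver_vs_noncaregiver.png", plot = keyness_plot,
                width = 8, height = 6, dpi = 300)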

What to expect: Words like kids, mom, dad, husband, hospital, appointment, exhausted are likely to load on the caregiver side; words like chair, walker, eyes, fading, alone, quiet are likely to load on the non-caregiver (older, more isolated) side. Your own list will depend on which transcripts are in the corpus. The keyness analysis is manifest content analysis at scale; combined with your latent codebook results, the two halves triangulate to support stronger claims than either alone.

4.7 The Capstone Milestone

The capstone milestone is a content-analytic frequency table from a coded subset, with at least one defensible quantitative comparison. You will deliver four artefacts: the coded transcripts (Taguette export), the frequency table (as an Excel or CSV file), the R script that produces the chi-squared / Fisher's exact comparison, and a 500-word interpretive memo that places the quantitative findings back into a qualitative reading.

Reflection

Before you run a single line of the R workflow, declare in writing what comparison you intend to test. Which subgroup contrast (caregiver/non-caregiver, immigrant/native-born, older/younger, women/men)? Which code do you predict will differ most? And what is the substantive reasoning behind your prediction?

Model answer: A strong response makes the prediction before running the analysis, carrying the pre-registration logic of an earlier course into this one. Example: "I will test the contrast between participants who self-identify as caregivers (n=10) and those who do not (n=10). I predict that the code shame-prevents-disclosure will appear more frequently in caregiver transcripts, because the caregiving identity carries a moral demand of unselfish endurance that conflicts with admitting loneliness. I also predict that fading-at-the-edges (Helen's code) will appear more frequently in non-caregivers, because non-caregivers in this corpus skew older and more socially isolated, making the world-shrinkage experience more prominent. I expect the technology code not to differ by caregiver status." Writing the prediction down first protects you from p-hacking. It also gives you a way to discuss, in the memo, any findings that contradict your expectation; that discussion is often the most analytically valuable section.


Reference

Glossary: Key Terms, People & Methodological Stances

📚 Reference page: available throughout the lesson

This glossary collects the key concepts, people, and methodological stances introduced in this lesson. Use it as a reference while you work through the material, or as a review before the final assessment.

Core Concepts

Content Analysis: A research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use (Krippendorff, 2018). In this course, content analysis is treated as the hybrid method that bridges qualitative and quantitative analysis by treating codes as variables.

Manifest Content: The literal, surface-level content of a text: the words and phrases that are explicitly present. Manifest coding is fast and reliable but limited in interpretive depth.

Latent Content: The underlying meaning of a text that requires interpretation to surface. Latent coding is interpretively richer than manifest coding but harder to apply reliably; modern content analysis explicitly accommodates both.

Codes as Variables: The operational move that defines content analysis. A code, once defined qualitatively, becomes a column in a data frame with a value for every recording unit. This unlocks descriptive and inferential statistical analysis of qualitative coding.

Sampling Unit: The chunk of text drawn from the population (for example, each interview transcript, newspaper article, or tweet). Choice of sampling unit follows from the research question and the population definition.

Recording Unit: The unit that is actually coded, where the variable takes its value. May be the whole transcript, the paragraph, the sentence, or the word/phrase. The choice shapes both what the analysis can measure and how reliable it will be.

Context Unit: The surrounding text the coder is permitted to consult when deciding how to code a recording unit. A larger context unit increases interpretive richness; a smaller one increases reliability.

Coding Scheme (Codebook): The list of codes with operational definitions, inclusion rules, and exclusion rules that govern application. Must be exhaustive over the content of interest; codes may be mutually exclusive (classical Berelson) or allow multi-coding (modern compromise).

A Priori (Deductive) Codes: Codes derived from theory, prior literature, or an existing instrument before data analysis begins. Hsieh and Shannon (2005) call this directed content analysis.

Emergent (Inductive) Codes: Codes that arise from close reading of the corpus. Hsieh and Shannon call this conventional content analysis. Most contemporary studies use a hybrid: a priori scaffold plus emergent expansion.

Dictionary-Based Content Analysis: Manifest content analysis at scale, in which a pre-specified word list is applied by a computer to a corpus. LIWC and sentiment dictionaries are the best-known examples. Fast and replicable; blind to latent meaning, ambiguity, and irony.

LIWC (Linguistic Inquiry and Word Count): A dictionary system developed by James Pennebaker and colleagues (current version LIWC-22) that classifies words into approximately 90 categories (positive emotion, negative emotion, sadness, body, family, social processes, etc.). Widely used in psychology, health communication, and computational social science.

Keyness: A manifest-content analysis that identifies words appearing disproportionately in one subcorpus compared to another, with a log-likelihood or chi-squared statistic for each word. Implemented in R as quanteda.textstats::textstat_keyness().

Reliability & Inference

Krippendorff's Alpha: The field-standard reliability statistic for content analysis (Krippendorff, 2004, 2018). Accommodates any number of coders, any level of measurement, and missing data. Acceptable thresholds: α ≥ 0.80 for definitive claims; 0.67 ≤ α < 0.80 for tentative inference; below 0.67 the codebook should be revised.

Cohen's Kappa: An older reliability statistic (Cohen, 1960; benchmarks from Landis & Koch, 1977) restricted to two coders and nominal data. Still reported in some literatures but largely superseded by Krippendorff's alpha for content-analytic work.

Chi-Squared Test of Independence: The classical inferential test for whether the distribution of a categorical variable differs between groups. In content analysis, applied to codes × subgroup contingency tables. Assumes expected cell counts ≥ 5; use Fisher's exact instead when this fails.

Fisher's Exact Test: An exact inferential test for contingency tables that gives accurate p-values regardless of cell size. The recommended replacement for chi-squared when expected cell counts may fall below the standard threshold, which is common in small qualitative corpora.
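As a worked illustration of the alpha entry above, the following sketch computes Krippendorff's alpha from its coincidence-matrix definition for the simplest case only: nominal codes, two coders, no missing values. The function name kripp_alpha_nominal and the toy codings are hypothetical. For real analyses with more coders, other measurement levels, or missing data, a packaged implementation (for example kripp.alpha() in the irr package) is the safer route.

```r
# Krippendorff's alpha for nominal data, two coders, no missing values
kripp_alpha_nominal <- function(coder1, coder2) {
  stopifnot(length(coder1) == length(coder2))
  cats <- sort(unique(c(coder1, coder2)))
  f1 <- factor(coder1, levels = cats)
  f2 <- factor(coder2, levels = cats)

  # Coincidence matrix: each unit contributes both ordered pairs of its two values
  o   <- table(f1, f2) + table(f2, f1)
  n_c <- rowSums(o)   # how often each category was used, across both coders
  n   <- sum(n_c)     # total number of codings (2 x number of units)

  off_diag <- row(o) != col(o)
  d_obs <- sum(o[off_diag]) / n                            # observed disagreement
  d_exp <- sum(outer(n_c, n_c)[off_diag]) / (n * (n - 1))  # disagreement expected by chance
  1 - d_obs / d_exp   # NaN if only one category was ever used
}

# Toy binary codings of 10 transcripts by two coders (1 = code present)
coder_a <- c(1, 1, 0, 0, 1, 0, 1, 1, 0, 0)
coder_b <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1)

raw_agreement <- mean(coder_a == coder_b)
alpha <- kripp_alpha_nominal(coder_a, coder_b)
print(c(raw_agreement = raw_agreement, alpha = alpha))
```

In this toy case the coders agree on 8 of 10 transcripts, yet alpha is 0.62, below the 0.67 threshold. That gap between raw agreement and alpha is why percent agreement alone is not reported as reliability.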

Methodological Variants

Conventional Content Analysis: Hsieh and Shannon's (2005) name for inductive content analysis: codes emerge from close reading. Sensitive to participants' own categories but at risk of analyst preconceptions.

Directed Content Analysis: Hsieh and Shannon's name for deductive content analysis: codes drawn from theory, prior literature, or an existing instrument. Replicable and comparable but at risk of missing what the data are actually doing.

Summative Content Analysis: Hsieh and Shannon's third type: starts with word counts and expands to latent interpretation. Bridges dictionary-based and human-coded approaches.

Qualitative Content Analysis: An umbrella term used in nursing and health-services research for content-analytic work that emphasises interpretive depth over inferential machinery. Often refers to Hsieh & Shannon's (2005) typology or Schreier's (2012) framework; see also Elo & Kyngäs (2008) and Mayring (2000).

Computational Content Analysis: The scalable extension of content analysis using topic models (LDA), word embeddings, or large-language-model-based coding. Developed later in the course. Valuable as a complement to human coding; requires validation against gold-standard human codes.

Key People

Harold D. Lasswell (1902–1978): Political scientist whose Propaganda Technique in the World War (1927) and WWII-era work at the U.S. Library of Congress Experimental Division established content analysis as a systematic social-science method. Famous formulation: "Who says what, to whom, in which channel, with what effect?"

Bernard Berelson (1912–1979): Author of Content Analysis in Communication Research (1952), the field's first textbook. His four-adjective definition (objective, systematic, quantitative, manifest) structured the method for two decades. He restricted content analysis to manifest content; this restriction was later relaxed by Krippendorff and the contemporary field.

Klaus Krippendorff (1932–2022): Communications scholar whose Content Analysis: An Introduction to Its Methodology (1980, 2004, 2013, 2018) is the field's most-cited methodological text. Broadened the method to accommodate latent content and developed Krippendorff's alpha, the field-standard reliability statistic.

James W. Pennebaker (b. 1950): Social psychologist who developed LIWC (Linguistic Inquiry and Word Count) and pioneered dictionary-based analysis of personal-narrative text. His work linking writing style to psychological state has been influential in health communication and digital-mental-health research.

Hsiu-Fang Hsieh & Sarah E. Shannon: Authors of the highly cited Three Approaches to Qualitative Content Analysis (2005) in Qualitative Health Research. Their typology (conventional, directed, summative) has structured most qualitative content-analytic work in nursing and health-services research for the past two decades.

H. Russell Bernard, Amber Wutich, Gery W. Ryan: Authors of Analyzing Qualitative Data: Systematic Approaches (2nd ed., 2017). Their Chapter 11 (pp. 243–268) is the source for this lesson's content. They present content analysis as the hybrid method bridging the qualitative and quantitative traditions.


HSCI 841 – Lesson 8
