HSCI 841 — Lesson 5

Finding Themes & Building Codebooks

Qualitative Research Methods & Analysis in Public Health

Kiffer G. Card, PhD, Faculty of Health Sciences, Simon Fraser University

Learning objectives for this lesson:

  • Distinguish themes from codes, categories, and concepts — the terminology that introductory texts use loosely and that Bernard, Wutich, and Ryan tighten up
  • Apply the twelve Ryan & Bernard (2003) techniques for finding themes to a small corpus of loneliness transcripts
  • Differentiate inductive, deductive, and hybrid coding strategies and justify which is appropriate for a given research question
  • Build a structured codebook with code names, brief definitions, full definitions, inclusion criteria, exclusion criteria, and positive/negative exemplars
  • Explain coding mechanics — hierarchical codes, multiple codes per passage, axial coding, and the use of in vivo codes
  • Compute and interpret percent agreement, Cohen's kappa, and Krippendorff's alpha, and identify when intercoder reliability is the wrong measure
  • Operate the Taguette + R workflow: upload, code, export, and analyze coded extracts
  • Complete the Week 5 capstone milestone: a preliminary codebook tested on 3–5 transcripts, with a one-page memo on what coding revealed

This course was developed by Kiffer G. Card, PhD, as a companion to Bernard, H. R., Wutich, A., & Ryan, G. W. (2017). Analyzing Qualitative Data: Systematic Approaches (2nd ed.). SAGE. Lesson 5 covers Chapters 5 and 6 (pp. 101–160).

Section 1 of 5

What Themes Are — and Twelve Techniques for Finding Them

⏱ Estimated reading time: 35 minutes

Introduction and Overview

Lessons 1 through 4 gave you the upstream apparatus of a qualitative project: an operational definition of QDA, a research question, a sampling logic, and a data-collection procedure. You now arrive at the central act of the analysis. You have transcripts. You have read them. You sense that something is going on across them. You suspect there are patterns. The question of this lesson is: how do you find those patterns systematically, and how do you record what you find in a form another analyst could follow?

Two activities sit at the heart of analytic work on text: finding themes and building a codebook. These twin moves are the connective tissue of every major qualitative analytic tradition — from thematic analysis (Braun & Clarke, 2006; Braun & Clarke, 2019) through qualitative content analysis (Hsieh & Shannon, 2005) to the systematic coding manuals used in applied health research (Saldaña, 2021). They are tightly coupled but not the same. Finding themes is the discovery phase — the search for what recurs, what surprises, what is missing, what coheres. Building a codebook is the codification phase — turning what you found into operational rules that you (and other analysts) can apply consistently across the rest of the corpus. This lesson works through both, drawing on Chapters 5 and 6 of Bernard, Wutich, and Ryan (2017).

Learning Objectives for Section 1

  • Define theme, code, category, and concept and explain how the terms relate.
  • Recognize that themes are found by an analyst, not discovered in nature.
  • Identify all twelve Ryan and Bernard (2003) techniques for finding themes and recognize when each is most useful.
  • Apply at least four of the twelve techniques to passages from the loneliness dataset.

1.1 Themes, Codes, Categories, Concepts — A Precise Vocabulary

A theme is a recurrent meaning or pattern across a body of qualitative data. Themes are interpretive; they exist at a higher level of abstraction than what is literally in the text. 'Stigma as barrier to disclosure' is a theme; 'said the word stigma' is not.

A code is a label applied to a segment of data. Codes can be descriptive (close to what was said) or interpretive (analyst's reading). Codes are the granular units that, when aggregated, support claims about themes.

A category is a higher-level grouping of related codes. The hierarchy is typically: data segment → code → category → theme. Some traditions blur category and theme; the distinction is most useful in framework analysis and applied policy work.

A concept is a portable theoretical idea that can travel beyond the original study. 'Allostatic load,' 'cultural safety,' 'social capital' are concepts. Theme is local; concept is general. Grounded theory aims at concepts; thematic analysis is content to stop at themes.

The terms theme, code, category, and concept get used interchangeably in much of the published qualitative literature, including in papers that are otherwise methodologically careful. Bernard, Wutich, and Ryan (2017, Ch. 5) are precise where most writers are not. Adopting their vocabulary will save you grief when you write your methods section and will save the reader confusion about what you actually did.

A theme is a recurring abstract idea you identify in the data. It is the analyst's product. “Loneliness as the cost of love” is a theme. “Spatial metaphor for absence” is a theme. Themes are typically expressed as short phrases rather than single words. They sit at a higher level of abstraction than the individual statements that supply evidence for them.

A code is the operational label you attach to a passage of text when you encounter an instance of a theme (or sub-theme). Codes are the working units of analysis: they are what you actually mark up in Taguette or NVivo. A single theme may be supported by several codes; a single passage may receive multiple codes. The code is the marker; the theme is what the markers, when assembled, are about.

A category is a grouping of related codes. In a hierarchical codebook, categories are the parent nodes and codes are the children. “Coping strategies” is a category that might contain codes for “phoning a confidant,” “watching comfort television,” “going to a coffee shop to be around people,” and “cooking food from home.” The category organizes; the codes do the marking.

A concept is the most abstract of the four. A concept is a theoretically meaningful idea that may organize many themes. Liminality is a concept. Embodiment is a concept. Structural exclusion is a concept. Concepts are typically borrowed from theoretical literatures and used to organize themes into something a discipline can argue about.

Term | Level of abstraction | Example from the loneliness dataset
Code | Lowest (operational marker) | chair-absent-spouse (applied to Linda's “Bill's chair” passage and similar)
Theme | Mid (recurring idea) | Spatial objects standing in for absent people
Category | Mid (organizing bucket) | Material traces of relationship loss (contains codes for chair, side of bed, photograph, kitchen, mobility aids)
Concept | Highest (theoretical) | Embodied memory; material culture of grief

Themes are found, not discovered

Bernard, Wutich, and Ryan are emphatic that themes do not emerge from data the way fossils emerge from rock — a point Braun and Clarke (2019) make forcefully in their reflexive reframing of thematic analysis. The analyst notices them, names them, decides which to keep, and decides where the boundaries are. The phrasing “themes emerged from the data” — ubiquitous in published qualitative papers — obscures the analytic work and is one of Bernard, Wutich, and Ryan's pet peeves. A more honest phrasing is “we identified the following themes through inductive coding” or “the themes below were developed iteratively as we read across the 20 transcripts.”

1.2 Ryan and Bernard's Twelve Techniques for Finding Themes

Word-level techniques (1, 6, 10)

Repetitions, linguistic connectors, word lists/KWIC. Look for words and phrases that recur, that link causal claims ('because', 'so that'), or that cluster around your concepts. Computationally tractable; useful as a first pass before manual coding.

Conceptual techniques (2, 3, 8)

Emic categories, metaphors and analogies, theory-related material. Look for the participant's own vocabulary ('feeling left behind'), figurative language ('drowning in the system'), and segments that engage existing theory. These techniques find themes that matter to participants or to the field.

Comparison techniques (4, 5, 11)

Transitions, similarities and differences, co-occurrence. Look at moments of change in narrative, contrasts across cases, and codes that appear together. These produce the relational themes — ones describing how concepts move, contrast, or combine.

Structural techniques (7, 9, 12)

Missing data (what people don't say), cutting-and-sorting (pile sort), metacoding. Notice what is absent (silences are themes too), physically rearrange coded segments to discover groupings, and code the codes themselves to surface higher-order patterns. The most labour-intensive techniques but often the most revealing.

In a now-classic paper, Gery Ryan and H. Russell Bernard (2003) catalogued twelve techniques qualitative researchers use to find themes. The paper became foundational because it disaggregated what had previously been described as “immersion” or “reading deeply” into a set of specifiable operations. Bernard, Wutich, and Ryan (2017, Ch. 5) reproduce and update the list. Each technique answers a slightly different question about the data, and most projects use several in combination. Below we work through all twelve, with examples drawn from the loneliness dataset.

ACTIVITY Try it: Code a paragraph two ways

Take a short paragraph from one of your transcripts (or sample text). Code it twice:

  1. Descriptive coding: one code per phrase, close to the surface meaning. Aim for 5-8 codes in the paragraph.
  2. Interpretive coding: a small number of codes for the whole paragraph, capturing what you think the passage is doing. Aim for 1-3 higher-level codes.

Compare the two coding passes. The descriptive layer organizes the corpus; the interpretive layer carries the argument. Most defensible thematic analyses move iteratively between them.

Technique 1: Repetitions

The most obvious technique. What words, phrases, and ideas recur across transcripts? If multiple participants reach for the same word to describe something, that word is doing analytic work. In the loneliness corpus, the word “chair” appears in at least eight transcripts as a stand-in for an absent person: Linda's “Bill's chair,” Frank's chair, Helen's mention of the chair her brother used to sit in when he visited, and others. The word “tired” recurs across caregiver and bereaved participants in a way that goes beyond ordinary fatigue: it appears to mark a specific exhaustion of grieving-as-work. Repetition is what nearly all theme-finding starts with, and what every other technique builds on.
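
A minimal sketch of a first-pass repetition check, written in plain Python for illustration only (the course's own workflow uses R in Section 4). It counts how many transcripts each content word appears in, rather than how often it appears overall, because the question here is whether different participants reach for the same word. The data are short excerpts from the three passages quoted in Section 1.3; the stopword list and the function names are assumptions made for this example.

```python
import re
from collections import Counter

# Short excerpts from the three passages quoted in Section 1.3.
transcripts = {
    "Maya": "it feels like being hungry. Like, in my chest. It's a physical thing.",
    "Linda": "I don't know that I'd say it has a physical feeling exactly. "
             "It's more like a weight. A weight that I carry around.",
    "Helen": "Like fading at the edges. When you do not speak for days, "
             "you become less real to yourself.",
}

# A tiny illustrative stopword list (an assumption; real projects use a fuller one).
STOPWORDS = {"a", "i", "it", "it's", "like", "that", "the", "you"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def transcript_frequency(transcripts, stopwords=STOPWORDS):
    """Count how many transcripts each content word appears in."""
    counts = Counter()
    for text in transcripts.values():
        # set() so a word counts once per transcript, however often it is repeated
        counts.update(set(tokenize(text)) - stopwords)
    return counts

freq = transcript_frequency(transcripts)
shared = sorted(word for word, n in freq.items() if n >= 2)
print(shared)  # ['physical']
```

A word that clears the threshold is only a candidate. Reading each occurrence in context is still what decides whether the repetition carries a shared meaning.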

Technique 2: Indigenous Typologies and Categories (Emic Terms)

What categories do participants themselves use? When a participant reaches for a non-English word, a slang term, or a phrase that operates as a category in their world, the analyst should pay attention. In the loneliness corpus, Amira uses the Arabic word wahda to name something that the English category of “loneliness” cannot fully hold — a loneliness specific to having been the sole survivor of a particular life. Aarav uses ekantam and ekakitatvam as paired Sanskrit-Hindi terms that distinguish chosen solitude from involuntary aloneness. Marcus speaks of “code-switching” loneliness — an experience he names that has no neat one-word English equivalent. These emic categories are gifts; they often become themes that organize an entire section of your eventual paper.

Technique 3: Metaphors and Analogies

People describe abstract experiences (especially feelings) by reaching for concrete images that map onto them. Cataloguing the metaphors a corpus uses is a powerful theme-finding move. The loneliness corpus is dense with spatial metaphors of absence and erosion: Maya feels “hollow” in her chest; Sarah describes loneliness as her “witness-less hours”; Helen describes it as “fading at the edges”; Frank uses imagery of a slow disappearance; Maya again talks about feeling she could “disappear and nobody would notice”; Linda talks about “walking around with that absence.” Read together, these metaphors converge: loneliness is repeatedly figured as a thinning or vanishing of the self. That convergence is a theme that you would never have seen if you had read the metaphors one at a time.

Technique 4: Transitions

What does the participant move to right after they say what they say? Transitions are turn-taking shifts and topic changes. They tell you what feels related in the participant's mind. In Linda's transcript, every passage about Bill's chair is followed by a passage about Rufus the dog: “I haven't moved it... the dog is what keeps me up in the morning.” The transition from the empty chair to the dog suggests an analytic linkage you might otherwise have missed: the dog is functioning as the affective replacement for the person whose absence the chair marks. In Maya's transcript, the loneliness topic transitions repeatedly to her phone (phone, food, TV), signalling that her coping repertoire is digitally mediated.

Technique 5: Similarities and Differences (Constant Comparison)

Read two passages side by side. What is the same? What is different? This is the constant-comparative move that Glaser and Strauss made foundational to grounded theory and that you will meet again in Lesson 7. As a theme-finding technique, it is most useful when you have already identified candidate themes and want to test whether they hold up across subgroups. In the loneliness corpus, the bereaved-spouse loneliness of Linda (age 67, widow of three years) and that of Frank (age 81, widower of one year) share most features but differ on duration of mourning and the role of children: Linda's adult sons are present (David in Toronto, Michael in Calgary), while Frank's are estranged. The comparison sharpens the theme rather than dissolving it.

Technique 6: Linguistic Connectors

Words like “because,” “since,” “as a result,” “therefore,” “that's why,” and “so” are causal connectives. They mark places where the participant is explaining a relationship between events or states. Searching for them is a fast way to find passages where causal accounts of loneliness appear. Maya's transcript contains: “I just — I felt like everyone's life is still happening and I'm just here. Like I left and life closed up where I used to be.” The connective “like” here is not strictly causal, but it is doing relational work. Linda's transcript: “If you love deeply for a long time, you will, eventually, be lonely deeply for a long time. The two are connected.” That explicit linkage of love and loneliness is a participant's own causal model, surfaced by attention to a connective.
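
The connective search described above can be sketched in a few lines of Python (used here for illustration only; the course's own workflow uses R). The connective list is the one given in the paragraph. The example sentences are excerpts quoted elsewhere in this lesson, and the function name and the naive sentence splitter are assumptions made for this sketch.

```python
import re

# Causal connectives listed in the paragraph above.
CONNECTIVES = ["because", "since", "as a result", "therefore", "that's why", "so"]

# Longest phrases first, so multi-word connectives are tried before short ones.
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(CONNECTIVES, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def find_connectives(speaker, text):
    """Return (speaker, connective, sentence) for every connective found."""
    hits = []
    # Naive sentence split: break after ., ?, or ! followed by whitespace.
    for sentence in re.split(r"(?<=[.?!])\s+", text.strip()):
        for match in _pattern.finditer(sentence):
            hits.append((speaker, match.group(1).lower(), sentence))
    return hits

# Excerpts quoted elsewhere in this lesson (Sections 2.4 and 1.3).
passages = {
    "Helen": "I have this walker because my hip gave out.",
    "Linda": "So it's that. It's like half of my life is just not there anymore, "
             "and I'm walking around with that absence.",
}

for speaker, text in passages.items():
    for hit in find_connectives(speaker, text):
        print(hit)
```

Every hit is a passage to read, not a finding. Words such as “so” and “since” also have non-causal uses (“so tired,” “since Tuesday”), so the list of matches needs manual screening.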

Technique 7: Missing Data — What People Don't Say

What the corpus does not contain is as analytically informative as what it does. If you asked every participant about coping and three avoided answering, that pattern of avoidance is itself data. In the loneliness corpus, several participants conspicuously do not use the word “lonely” about themselves until very late in the interview, even though the interview is explicitly about loneliness. Marcus repeatedly substitutes “disconnected” or “invisible.” Older men in the corpus avoid the word in a way younger women do not. That gendered pattern of avoidance is a finding. Missing data is also useful at the level of what the interview guide asked but participants deflected: questions about professional help, in particular, are routinely deflected.

Technique 8: Theory-Related Material

Bernard, Wutich, and Ryan call this “looking through a theoretical lens.” If you bring a specific theory to the data — Cacioppo and Patrick's loneliness-versus-aloneness distinction, Weiss's social-emotional loneliness typology, structural/situational/existential models — you can deliberately scan the corpus for passages that confirm, complicate, or contradict the theory. This is more deductive than the previous seven techniques (which start from the data) and we will return to it when we distinguish inductive from deductive coding in Section 2. As an example: Cacioppo and Patrick (2008) argue that loneliness and being alone are dissociable states. The loneliness corpus is full of explicit articulations of that distinction (Maya: “loneliness is different than being alone”; Helen: “I have lived alone all my adult life… the loneliness I have now is different from the solitude I had at 50”). The theoretical lens both helps you see the pattern and gives you a way to write about it.

Technique 9: Cutting and Sorting (The Pile-Sort)

A physical, tactile method. Print out striking quotes from your corpus, one per index card. Spread them out on a table. Group cards that feel related. Move cards around as your sense of groupings evolves. Give each group a name. This technique externalizes the analytic work and uses spatial cognition to find patterns that screen-based reading misses. It is especially useful when you have 20+ candidate themes and need to consolidate them into a manageable set. Bernard, Wutich, and Ryan recommend actual cutting-and-sorting (literal scissors) at least once per project; the tactile experience is what makes it work. We will do a digital version of this in the Section 4 workflow.

Technique 10: Word Lists and KWIC (Keyword-in-Context)

Generate a frequency list of all words in the corpus and inspect the high-frequency content words (after removing stopwords like “the” and “and”). For words that catch your eye, generate a KWIC concordance: every occurrence of the word with a few words of context on either side. KWIC concordances are how you check whether participants are using a word the same way. If “chair” appears 23 times across the corpus, are all 23 instances chairs-as-stand-ins-for-people, or are some just literal pieces of furniture? The KWIC concordance answers that quickly. This is also the natural bridge to computational text analysis (Module 12); we will use quanteda in Section 4 of this lesson to do a small word-list and KWIC exploration.
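
To make the idea of a concordance concrete, here is a minimal KWIC sketch in plain Python. It is a stand-in for illustration, not the quanteda workflow used in Section 4. The function name `kwic` and its `window` parameter are assumptions of this sketch. The first text is an excerpt from Linda's passage quoted in Section 2.4; the second is an invented sentence, included only to show a literal use of the word.

```python
import re

def kwic(text, keyword, window=4):
    """Return (left context, keyword, right context) for each occurrence."""
    # Word tokens; apostrophes and hyphens are kept inside a token.
    tokens = re.findall(r"[\w'-]+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

corpus = {
    # Excerpt from Linda's passage quoted in Section 2.4.
    "Linda": "Bill sat in that chair every evening for thirty-some years. "
             "And it's still there. I haven't moved it.",
    # Invented sentence, added only to show a literal use of the word.
    "invented": "We bought a new chair for the kitchen last spring.",
}

for speaker, text in corpus.items():
    for left, word, right in kwic(text, "chair", window=3):
        print(f"{speaker:>8} | {left:>15} [{word}] {right}")
```

Lining the occurrences up this way is what lets you separate the chair that stands in for a person from the chair that is just furniture.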

Technique 11: Co-occurrence

Which codes (or words) appear together more often than chance would predict? Co-occurrence is the first move toward axial coding (covered in Section 3) and toward concept-mapping. If your code chair-absent-spouse co-occurs in nearly every transcript with codes for pet-as-companion or volunteer-coping, that co-occurrence is a finding. Co-occurrence is also how you build the network displays you will meet in Module 12. In R, co-occurrence is straightforward once your coded extracts are in a long-format data frame.
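
As a language-neutral illustration of what a co-occurrence count involves, the sketch below uses Python rather than the R workflow of Section 4. It assumes coded extracts in long format, one (transcript, code) row per coded passage. The rows and transcript IDs are hypothetical; the code names are taken from the paragraph above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical long-format coded extracts: one (transcript_id, code) row
# per coded passage. The IDs T1-T3 are invented for this example.
coded_extracts = [
    ("T1", "chair-absent-spouse"),
    ("T1", "pet-as-companion"),
    ("T1", "chair-absent-spouse"),
    ("T2", "chair-absent-spouse"),
    ("T2", "pet-as-companion"),
    ("T2", "volunteer-coping"),
    ("T3", "volunteer-coping"),
]

def cooccurrence(rows):
    """Count, for each pair of codes, the number of transcripts containing both."""
    codes_by_transcript = {}
    for transcript_id, code in rows:
        codes_by_transcript.setdefault(transcript_id, set()).add(code)
    pair_counts = Counter()
    for codes in codes_by_transcript.values():
        # sorted() gives each unordered pair a single canonical key
        pair_counts.update(combinations(sorted(codes), 2))
    return pair_counts

pairs = cooccurrence(coded_extracts)
for (code_a, code_b), n in pairs.most_common():
    print(f"{code_a} + {code_b}: {n} transcript(s)")
```

These are raw counts at the transcript level. Judging whether a pair appears together more often than chance would predict requires comparing the counts against how common each code is on its own, which this sketch does not attempt.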

Key insight: A theme is a claim, not a topic

Beginners label themes with single nouns: Stigma. Access. Identity. These are topics. A theme is a defensible analytic claim about a pattern. Better theme labels are short statements: “Stigma is managed strategically, not passively endured” or “Access is described less in terms of geography than in terms of trust.” Themes-as-claims are easier to evidence, easier to dispute, and easier to write up. Themes-as-topics are easier to fall in love with and harder to defend.

Technique 12: Metacoding (Codes About Codes)

Once you have a working set of codes, you can step up a level and label the codes themselves. Are some of your codes affective (about feelings) and others behavioural (about coping actions)? Are some emic (using participants' own words) and others etic (using your analyst-imposed terminology)? Sorting your codes by type is metacoding, and it often reveals structural patterns in your codebook that you would not have seen by staring at the codes one at a time. In the loneliness corpus, a useful metacoding move is to sort all codes into four buckets: experiential (what loneliness feels like), causal (what triggers it), responsive (what people do about it), and interpretive (what people think it means). The four-bucket frame becomes a candidate structure for the findings section of your eventual paper.
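
The four-bucket sort can be recorded as a simple lookup from code to metacode. The Python sketch below is illustrative only (the course's own workflow uses R). The bucket names come from the paragraph above, while the code names and their assignments are hypothetical, since those judgments belong to the analyst.

```python
# The four metacode buckets named in the paragraph above.
BUCKETS = ("experiential", "causal", "responsive", "interpretive")

# Hypothetical code names and bucket assignments, invented for this example.
metacodes = {
    "hollow-chest": "experiential",
    "fading-self": "experiential",
    "death-of-spouse": "causal",
    "phoning-a-confidant": "responsive",
    "pet-as-companion": "responsive",
    "loneliness-as-cost-of-love": "interpretive",
}

def group_by_metacode(metacodes):
    """Group code names under their metacode, rejecting unknown buckets."""
    grouped = {bucket: [] for bucket in BUCKETS}
    for code, bucket in sorted(metacodes.items()):
        if bucket not in grouped:
            raise ValueError(f"Unknown metacode {bucket!r} for code {code!r}")
        grouped[bucket].append(code)
    return grouped

grouped = group_by_metacode(metacodes)
for bucket, codes in grouped.items():
    print(f"{bucket}: {', '.join(codes)}")
```

Seeing the codebook laid out by bucket makes imbalances visible, for example many experiential codes and very few causal ones.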

You do not need to use all twelve

The twelve techniques are a toolbox, not a checklist. A defensible project will use three or four of them, deliberately, and report which ones it used. Repetitions + indigenous categories + metaphors + missing data is a typical opening combination. The point of having twelve in your awareness is that when one technique is not turning anything up, you have eleven others to try. Your eventual methods section should name which techniques you used and why.

1.3 Working an Example Through Four Techniques

Take three passages from the loneliness corpus and run them through four of the techniques to see how theme-finding actually works.

Passage 1 — Maya (P01, age 22, undergraduate)

“It's — okay, this is going to sound dramatic, but it feels like being hungry. Like, in my chest. It's a physical thing. I get this feeling, especially at night, where my chest just feels — hollow isn't the right word, it's like an ache.”

Passage 2 — Linda (P05, age 67, recent widow)

“I don't know that I'd say it has a physical feeling exactly. It's more like a weight. A weight that I carry around. I sleep on one side of the bed still — the right side, my side — and the left side is undisturbed for three years. So it's that. It's like half of my life is just not there anymore, and I'm walking around with that absence.”

Passage 3 — Helen (P11, age 78, never married)

“It feels like — fading. Like fading at the edges. When you do not speak for days, you become less real to yourself. Your voice sounds strange when you do speak.”

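Technique 1 (Repetitions): All three passages reach for embodied descriptions of loneliness. The word “physical” appears in both Maya's and Linda's passages; “chest” (Maya), “weight” (Linda), and “fading” (Helen) are each repeated within a single passage; and Helen's “voice” and “less real” stay in the same bodily register. Loneliness is a bodily experience for these participants, not just a cognitive one.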

Technique 3 (Metaphors): Three different metaphors, all converging on the same image. Maya: hunger, hollow, ache (interior absence). Linda: weight, half a life not there, walking around with absence (carrying something missing). Helen: fading, less real, voice strange (thinning of the self). The metaphors are not the same, but the analytic abstraction over them is: loneliness as the somatic register of an absence.

Technique 5 (Similarities and Differences): The three passages share embodiment but differ on what is absent. For Maya it is a future not yet built; for Linda it is a specific deceased person; for Helen it is the simple act of social contact. A theme that holds across the differences: the body registers absence as presence (you feel something missing as something there).

Technique 7 (Missing Data): Notice what none of the three say. None invokes psychiatric vocabulary. None mentions a therapist or a medication. None reaches for the word “depression” even though the descriptions are compatible with depressive symptomatology. The absence of clinical framing is itself a finding: these participants are describing loneliness as a normal embodied condition, not as a disorder. That has implications for how an intervention should be framed.

Four techniques applied to three passages have already given you the beginnings of a theme: loneliness as embodied absence, narrated outside clinical vocabularies. That theme can now become a code (or several related codes) in your codebook.

Reflection

Pick any one of the twelve techniques you found most or least intuitive. Briefly explain why, and describe a passage from a transcript you have read (or could read) where the technique would or would not work well. The point is not to defend the technique — it is to articulate, for yourself, when it would and would not earn its keep.

Model answer: A strong answer is concrete and self-aware. Example: “Missing data (Technique 7) feels least intuitive because my quantitative training taught me to analyze what is present, not what is absent. But Helen's transcript, where she talks for 42 minutes about loneliness without ever using the word 'depression', shows the technique earning its keep: her refusal of the clinical frame is itself a finding about how older adults narrate their distress. The technique works when the absence is structured (everyone you would expect to mention X does not), and works poorly when absences could plausibly be coincidental or due to the interview guide not asking.” Another strong answer might pick metaphors (Technique 3): “Most intuitive because metaphors are emotionally salient; least useful when participants are speaking literally about logistics, where the metaphor density drops and other techniques (repetitions, linguistic connectors) do more work.”

Knowledge Check — Section 1

Question 1: According to Bernard, Wutich, and Ryan, what is the difference between a theme and a code?

Themes are higher-abstraction recurring ideas; codes are the operational markers analysts attach to passages. A single theme is usually supported by several codes, and a single passage may receive multiple codes.

Question 2: Which technique would best help you discover that older men in the loneliness corpus systematically avoid the word “lonely” itself?

A structured absence — participants you would expect to use a word who do not — is a missing-data finding. Ryan and Bernard's Technique 7 is the right tool, and missing data is often the most analytically productive of the twelve.

Question 3: Amira describes her loneliness using the Arabic word wahda, which she says holds a meaning that the English “loneliness” cannot. Which technique is most directly engaged?

Indigenous typologies (Technique 2) means the categories participants themselves use, often in their own language. Wahda is an emic term that may become a theme; Aarav's ekantam and Marcus's “code-switching loneliness” are similar examples in the corpus.
Section 2 of 5

Inductive, Deductive & Hybrid Coding — and Codebook Architecture

⏱ Estimated reading time: 30 minutes

Introduction and Overview

Theme-finding (Section 1) is the discovery phase. Coding is what you do once you have themes. Coding turns themes into operational rules and applies them across the corpus consistently — the bridge between recognition and analysis that Boyatzis (1998) and Braun and Clarke (2006) describe as the heart of thematic analytic rigor. The two big design questions for coding are: where do the codes come from (inductive, deductive, or hybrid), and what does a defensible codebook look like (the architecture). This section addresses both.

Learning Objectives for Section 2

  • Distinguish inductive (data-up), deductive (theory-down), and hybrid coding strategies.
  • Match each strategy to the kind of research question it best serves.
  • Build a codebook entry containing all seven required elements (name, brief definition, full definition, inclusion criteria, exclusion criteria, positive example, negative example).
  • Recognize that the codebook is a living document — revisable with audit-trail documentation.

2.1 Inductive Coding (Data-Up)

In inductive coding, you start with the data and let the codes develop from what you find. You read transcripts, mark passages that strike you, label the marks with provisional codes, and refine the labels as you read more. After three or four transcripts, you have a working set of perhaps 30–50 codes; you consolidate, merge, and rename until you have a coherent codebook of 8–15 codes you can apply across the remaining transcripts.

Inductive coding is the default for exploratory studies, for under-described phenomena, and for projects in the grounded-theory tradition (which we will meet in Lesson 7). Its virtue is that it stays close to the participants' own categories and avoids forcing the data into pre-existing analyst frames. Its risk is that, without theoretical anchoring, it can produce codebooks that are descriptive but not analytically interesting — what Charmaz (2014) warns against as “coding too close to the data.”

For most of the loneliness capstone work in this course, inductive coding is the appropriate starting point. The dataset is rich, the phenomenon is contested, and the participants speak in distinct vocabularies. Starting from their language is the right move.

2.2 Deductive Coding (Theory-Down)

In deductive coding, you bring a codebook to the data. The codes come from theory, from prior literature, or from a stakeholder framework (e.g., the WHO determinants of health, the CFIR implementation framework, the Cacioppo-and-Patrick loneliness model). You read each transcript and tag passages that instantiate the pre-specified codes. Codes that are not present in the data are recorded as absent.

Deductive coding is the right move when you are testing or extending an existing framework, when you are working in a confirmatory mode, or when your study is part of a multi-site collaboration that needs a shared coding scheme. Its virtue is that it produces results comparable across studies. Its risk is that you may miss things the data are saying that the framework was not designed to see.

An applied example: if you were coding the loneliness transcripts deductively using Weiss's (1973) social-emotional loneliness typology, you would have two pre-specified codes — social loneliness (deficits in a social network) and emotional loneliness (absence of a close attachment figure) — and you would tag each passage as one, the other, both, or neither. You would learn the distribution of the two types across the corpus and would have framework-comparable results. You would also miss the embodiment theme we developed in Section 1, because Weiss's framework does not contain it.
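
The tagging-and-tallying step in this example can be sketched as follows, in Python for illustration only (the course's own workflow uses R). The two code names follow Weiss's typology as described above; the passage IDs and the coding decisions are hypothetical.

```python
from collections import Counter

# The two pre-specified codes from Weiss's typology, as described above.
WEISS_CODES = {"social", "emotional"}

def classify(tags):
    """Reduce a passage's tags to one of: social, emotional, both, neither."""
    tags = set(tags)
    unknown = tags - WEISS_CODES
    if unknown:
        raise ValueError(f"Not in the deductive codebook: {sorted(unknown)}")
    if tags == WEISS_CODES:
        return "both"
    if not tags:
        return "neither"
    return next(iter(tags))  # exactly one tag remains at this point

# Hypothetical coding decisions: (passage_id, tags applied by the analyst).
coded_passages = [
    ("T1-03", ["emotional"]),
    ("T1-07", ["social", "emotional"]),
    ("T2-01", []),
    ("T2-04", ["social"]),
    ("T3-02", ["emotional"]),
]

distribution = Counter(classify(tags) for _, tags in coded_passages)
print(dict(distribution))
```

The `ValueError` branch makes the limitation of a purely deductive pass visible: a tag the framework does not contain has nowhere to go unless the codebook itself is revised.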

2.3 Hybrid Coding (the Practical Default)

Most contemporary qualitative health research uses a hybrid approach. You begin with a small set of deductive codes drawn from theory or prior literature (a provisional codebook), apply them to the first few transcripts, and add new codes inductively as you read. The codebook grows from the bottom up while keeping its theoretical anchor.

The Fereday and Muir-Cochrane (2006) hybrid framework is a widely cited operationalization of this approach, and Hsieh and Shannon's (2005) “directed” qualitative content analysis is a closely related variant. Bernard, Wutich, and Ryan (Ch. 6) endorse hybrid coding as the practical default for applied health research because it preserves the inductive openness that makes qualitative work valuable while keeping the deductive anchoring that makes it interpretable by quantitatively trained reviewers.

The capstone milestone at the end of this lesson assumes a hybrid strategy: you will start with two or three theoretically motivated codes (perhaps experiential loneliness, causal accounts, and coping strategies, derived from the interview guide structure) and let the rest develop inductively from the first three to five transcripts you code.

Strategy | Codes come from | Best fit | Risk
Inductive | The data | Exploration; under-described phenomena; grounded theory | Descriptive without theoretical purchase; long codebooks
Deductive | Theory or framework | Confirmatory work; multi-site studies; testing established models | Misses what the framework was not built to see
Hybrid | Both, in sequence | Most applied health research; this course's default | Requires explicit documentation of when codes were added or revised

2.4 The Anatomy of a Codebook Entry

A codebook is a structured document. Each entry describes one code with enough specificity that another analyst could apply it consistently (MacQueen, McLellan, Kay, & Milstein, 1998; DeCuir-Gunby, Marshall, & McCulloch, 2011). Bernard, Wutich, and Ryan (Ch. 6) recommend seven elements per entry. All seven matter; omitting any of them is a common reason intercoder reliability later turns out to be low.

  1. Code name. Short, mnemonic, unique. Use hyphens or underscores, not spaces. Example: chair-absent-spouse.
  2. Brief definition. One sentence. Example: “A physical object (chair, side of bed, photograph) that the participant marks as standing in for the absence of a deceased or departed partner.”
  3. Full definition. A paragraph. When to apply the code; what range of cases it covers; how it relates to neighbouring codes. The full definition is what an analyst reads when in doubt.
  4. Inclusion criteria. Bullet-pointed. What features must be present for the code to apply. Example: the participant explicitly names a physical object; the object is associated with an absent person; the participant gives the object affective weight.
  5. Exclusion criteria. Bullet-pointed. What looks similar but does not count. Example: mention of a physical object without affective weight (“the chair in the corner”) does not count; mention of an absent person without an associated object does not count.
  6. Positive example. A direct quote from the corpus that clearly fits. Example: Linda P05: “Bill sat in that chair every evening for thirty-some years. And it's still there. I haven't moved it. I haven't sat in it. I haven't given it away. It's just there. And every evening I look at it and it's empty.”
  7. Negative example. A near-miss quote that does not fit. Example: Helen P11: “I have this walker because my hip gave out.” A physical object is named, but it is not associated with an absent person.

The eighth element: the memo

Bernard, Wutich, and Ryan recommend a seven-element structure. Many experienced researchers add an eighth: a memo space attached to each code, where the analyst records how the code evolved, why edge cases were resolved a particular way, and what the code's relationship to neighbouring codes turned out to be. Memos are the audit trail for the codebook itself. We strongly recommend including a memo column in your capstone codebook. It is what your eventual methods section will be written from.

2.5 A Worked Codebook Entry

Here is a complete codebook entry for a code drawn from the loneliness corpus. We will return to this entry in Section 4 when we walk through the Taguette workflow.

Codebook entry — somatic-absence

Brief definition: Participant describes loneliness as a bodily sensation that registers the absence of someone or something.

Full definition: This code applies to passages where the participant locates loneliness in the body (chest, weight, fatigue, voice, hunger, ache, hollowness, fading) and the embodied sensation is described as a registering of an absence rather than as a free-standing physical symptom. The code is distinct from fatigue-grief (which is exhaustion specifically tied to grief work) and from illness-talk (which is the description of medical symptoms). When in doubt, apply somatic-absence if the participant uses a bodily metaphor and explicitly or implicitly links it to the absence of a person, role, or life-phase.

Inclusion criteria:

  • The passage contains a bodily reference (chest, weight, hunger, ache, fading, voice, body, physical, real).
  • The bodily reference is figurative or interpretive (not a literal medical complaint).
  • The bodily reference is tied (explicitly or by clear inference) to the absence of a person, role, or part of life.

Exclusion criteria:

  • Literal medical symptoms or complaints (apply illness-talk instead).
  • Mental-state descriptions without a body reference (apply affective-loneliness).
  • Bodily references not tied to absence (e.g., “I was tired from work”).

Positive example: Linda P05: “It's more like a weight. A weight that I carry around… I'm walking around with that absence.”

Negative example: Helen P11: “My hip gave out two years ago.” (Literal medical complaint — apply illness-talk.)

Memo: Added on iteration 2 after noticing the convergence of Maya's “hollow”/“ache,” Linda's “weight,” and Helen's “fading.” Distinguished from affective-loneliness after a near-miss in Sarah's transcript (“witness-less hours” was tagged both ways; resolved by requiring a body reference for somatic-absence).
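
One way to keep an entry like this machine-checkable is to store it as a named list in R. The sketch below is illustrative only: the field names (code_name, brief_definition, and so on) and the helper missing_fields() are assumptions made for this example, not a format that Taguette or Bernard, Wutich, and Ryan prescribe, and the definitions are abbreviated.

```r
# Field names are hypothetical; they mirror the seven elements plus the memo.
required_fields <- c(
  "code_name", "brief_definition", "full_definition",
  "inclusion", "exclusion", "positive_example", "negative_example", "memo"
)

somatic_absence <- list(
  code_name        = "somatic-absence",
  brief_definition = "Participant describes loneliness as a bodily sensation that registers the absence of someone or something.",
  full_definition  = "Loneliness located in the body and described as registering an absence (abbreviated here).",
  inclusion        = c(
    "Passage contains a bodily reference",
    "Reference is figurative or interpretive",
    "Reference is tied to the absence of a person, role, or part of life"
  ),
  exclusion        = c(
    "Literal medical symptoms (apply illness-talk)",
    "Mental-state descriptions without a body reference (apply affective-loneliness)",
    "Bodily references not tied to absence"
  ),
  positive_example = "Linda P05: \"It's more like a weight.\"",
  negative_example = "Helen P11: \"My hip gave out two years ago.\"",
  memo             = "Added on iteration 2."
)

# Return the names of any required fields that are absent or empty
missing_fields <- function(entry) {
  present <- vapply(
    required_fields,
    function(f) !is.null(entry[[f]]) && all(nzchar(entry[[f]])),
    logical(1)
  )
  required_fields[!present]
}

missing_fields(somatic_absence)   # character(0) when the entry is complete
```

Running a check like this over every entry before sharing the codebook with a second coder catches entries that are missing an exclusion rule or a negative example.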

2.6 The Codebook as a Living Document

Bernard, Wutich, and Ryan are clear that a codebook is not built once and frozen. You will revise it. The standard expectation is that revisions are documented in an audit trail that records: (a) the date of revision, (b) which codes changed, (c) what they changed to, and (d) the justification. When a codebook is revised midway through a project, the earlier transcripts must be re-coded under the new scheme — not the old one — or the codebook becomes inconsistent across the corpus.
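
The four audit-trail items map naturally onto a small data frame. The following is a minimal sketch in base R; the column names, the log_revision() helper, and the date are all made up for illustration.

```r
# Column names are hypothetical; they follow items (a) to (d) above.
audit_trail <- data.frame(
  date          = as.Date(character()),
  code          = character(),
  change        = character(),
  justification = character(),
  stringsAsFactors = FALSE
)

# Append one revision record and return the updated trail
log_revision <- function(trail, date, code, change, justification) {
  rbind(
    trail,
    data.frame(
      date = as.Date(date), code = code, change = change,
      justification = justification, stringsAsFactors = FALSE
    )
  )
}

audit_trail <- log_revision(
  audit_trail,
  date = "2025-02-03",   # made-up date for illustration
  code = "somatic-absence",
  change = "Inclusion criteria now require a body reference",
  justification = "Near-miss in Sarah's transcript tagged both ways"
)

# Codes revised so far; transcripts coded before these dates need re-coding
unique(audit_trail$code)
```

Keeping the trail as data rather than free text makes it easy to list which transcripts were coded before a given revision.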

The practical implication is that you should not commit to your final codebook until you have read most of the transcripts you intend to code. The Week 5 capstone milestone asks you to develop a preliminary codebook on 3–5 transcripts; the codebook you submit at Week 5 is not the codebook you will submit at Week 12. The Week 12 codebook will be the result of the Week 5 codebook revised at least twice on the basis of what the rest of the corpus reveals.

Reflection

Look ahead to your own capstone work. Will you adopt a primarily inductive, deductive, or hybrid coding strategy? Name two of the codes you anticipate ending up with, and indicate whether each one comes from theory (deductive) or from the data (inductive). There is no wrong answer; the point is to commit to a strategy you can defend in your methods section.

Model answer: A strong answer is specific and self-aware. Example: “I will use a hybrid strategy. My two anchor codes will be deductive: loneliness-vs-aloneness, drawn from Cacioppo and Patrick's distinction, and structural-versus-existential, drawn from the public-health loneliness typology I reviewed in Week 2. From there I will let the rest develop inductively from the first three transcripts I code. I anticipate ending up with codes like somatic-absence (inductive, from the convergence of Maya's, Linda's, and Helen's embodied metaphors) and coping-pet (inductive, from Linda's Rufus and Maya's neighbour's cat).” The point is to articulate a strategy that you can defend, and to recognize that 'pure inductive' and 'pure deductive' are rare in applied work; most defensible projects are hybrid.

Knowledge Check — Section 2

Question 1: Which coding strategy is the practical default in applied health research?

Hybrid coding (Fereday and Muir-Cochrane 2006; endorsed by Bernard, Wutich, and Ryan Ch. 6) is the practical default because it preserves inductive openness while keeping theoretical anchoring.

Question 2: Which of the following is NOT one of the seven required elements of a codebook entry?

The seven required elements are code name, brief definition, full definition, inclusion criteria, exclusion criteria, positive example, and negative example. Statistical effect sizes are not a feature of qualitative codebooks.

Question 3: A codebook is revised midway through coding. What do Bernard, Wutich, and Ryan require for the project to remain methodologically defensible?

Codebook revision is normal and expected, but it must be documented, and earlier transcripts must be re-coded under the new scheme so the codebook is applied consistently across the corpus.
Section 3 of 5

Coding Mechanics & Intercoder Reliability

⏱ Estimated reading time: 35 minutes

Introduction and Overview

Sections 1 and 2 covered the conceptual side of coding: what themes are, where codes come from, what a codebook looks like. This section addresses two operational matters that determine whether your coding is methodologically defensible. The first is coding mechanics: how passages get marked up, how codes relate to each other, how the analyst handles complications. The second is intercoder reliability: when a second analyst applies your codebook to the same passages, how much agreement should you expect, how do you measure it, and what does the resulting number mean?

Learning Objectives for Section 3

  • Apply the four basic coding mechanics: hierarchical codes, multiple codes per passage, axial coding, and in vivo codes.
  • Compute percent agreement, Cohen's kappa, and Krippendorff's alpha for a pair of coders.
  • Interpret the magnitude of kappa using Landis and Koch (1977) thresholds.
  • Identify when intercoder reliability is the wrong measure to seek.

3.1 Hierarchical Codes (Parent and Child)

A codebook is rarely flat. Codes nest under broader codes, which nest under categories, which sit under the codebook root. The hierarchy is what makes a large codebook navigable and what allows you to aggregate findings at different levels. In the loneliness corpus, a plausible partial hierarchy looks like this:

LONELINESS_EXPERIENCE/
   somatic-absence
   affective-loneliness
   social-invisibility
   temporal-fading
CAUSAL_ACCOUNTS/
   bereavement-onset
   migration-onset
   life-stage-transition
   structural-isolation
COPING_STRATEGIES/
   coping-pet
   coping-volunteer
   coping-phone-confidant
   coping-comfort-media
   coping-cooking-home-food
INTERPRETIVE_FRAMES/
   loneliness-as-cost-of-love
   loneliness-as-failure
   loneliness-as-rebuilding
   loneliness-as-invisible-to-society

The four parent categories at the top are the metacoding bins from Section 1 (Technique 12). Each child code can be applied independently to passages. When you report your findings, you can report at the category level (“coping strategies appeared in all 20 transcripts”) or at the code level (“coping-pet appeared in 6 of 20 transcripts”), depending on what your argument needs.
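
Reporting at either level amounts to counting transcripts against the child code or against its parent. The sketch below uses a made-up table of codings and a hypothetical parent_of lookup covering only a few of the codes above.

```r
# Parent lookup for a few child codes from the hierarchy above
parent_of <- c(
  "somatic-absence"      = "LONELINESS_EXPERIENCE",
  "affective-loneliness" = "LONELINESS_EXPERIENCE",
  "bereavement-onset"    = "CAUSAL_ACCOUNTS",
  "coping-pet"           = "COPING_STRATEGIES",
  "coping-volunteer"     = "COPING_STRATEGIES"
)

# Made-up codings: one row per (transcript, code) application
toy_codings <- data.frame(
  document = c("P01", "P01", "P05", "P05", "P05", "P11", "P11"),
  tag      = c("somatic-absence", "coping-pet", "somatic-absence",
               "coping-pet", "bereavement-onset", "coping-volunteer", "coping-pet"),
  stringsAsFactors = FALSE
)
toy_codings$parent <- unname(parent_of[toy_codings$tag])

# Code level: number of transcripts in which each child code appears
code_level <- table(unique(toy_codings[, c("document", "tag")])$tag)

# Category level: number of transcripts in which any child of a parent appears
category_level <- table(unique(toy_codings[, c("document", "parent")])$parent)

code_level
category_level
```

Note that the category count is not the sum of its children's counts: P11 carries two coping codes but counts once toward COPING_STRATEGIES.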

3.2 Multiple Codes Per Passage

A single passage may instantiate more than one code. This is normal and expected. Linda's chair passage simultaneously instantiates chair-absent-spouse (a specific code), somatic-absence (the absence as carried), and loneliness-as-cost-of-love (the interpretive frame she later articulates). All three codes are applied to overlapping or identical text spans. Taguette and the major QDA packages handle multi-coding natively.

The implication for analysis is that when you later count code occurrences, you are counting passage-code pairs, not unique passages. A transcript with 80 unique coded passages might have 130 code applications because many passages got two or three codes. Both numbers are meaningful; report whichever supports your argument and be clear about which you are reporting.
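
The difference between the two counts is easy to see on a toy table. This sketch assumes a long-format data frame with a hypothetical passage_id column; the values are invented.

```r
# Made-up long-format codings: one row per (passage, code) pair
toy_codings <- data.frame(
  passage_id = c(1, 1, 1, 2, 3, 3, 4),
  tag = c("chair-absent-spouse", "somatic-absence", "loneliness-as-cost-of-love",
          "coping-pet", "somatic-absence", "affective-loneliness", "coping-pet"),
  stringsAsFactors = FALSE
)

n_applications    <- nrow(toy_codings)                       # passage-code pairs
n_passages        <- length(unique(toy_codings$passage_id))  # unique coded passages
codes_per_passage <- table(toy_codings$passage_id)           # codes received by each passage

n_applications   # 7
n_passages       # 4
```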

3.3 Axial Coding (Relationships Between Codes)

The term axial coding comes from Strauss and Corbin's (1990) grounded-theory tradition; for an exhaustive catalog of coding methods including in vivo and axial styles, see Saldaña (2021). It refers to a second pass over the data, after initial coding, in which the analyst attends to the relationships between codes rather than to the codes themselves. Axial-coding questions include: which codes co-occur? Which codes appear in sequence (and in which order)? Which codes seem to be causes of which others, in participants' own accounts? Which codes are mutually exclusive in practice?

Axial coding is what turns a flat codebook into a model. In the loneliness corpus, axial coding might reveal that bereavement-onset codes are nearly always followed in the same transcript by coping-pet codes; that migration-onset codes co-occur with coping-cooking-home-food codes; that loneliness-as-cost-of-love appears only in transcripts that also contain somatic-absence. These relationships are not in the codes themselves; they are in the pattern of co-occurrence. Section 4 will show how to compute co-occurrence in R.

3.4 In Vivo Codes (Using Participants' Exact Words)

An in vivo code is a code whose name is taken verbatim from a participant's speech. The convention is to set in vivo codes in quotation marks in the codebook to mark their origin. In the loneliness corpus, defensible in vivo codes include: “wahda” (Amira's word for a refugee-specific loneliness), “witness-less hours” (Sarah's phrase for the loneliness of being unobserved), “the cost of love” (Linda's interpretation), “code-switching loneliness” (Marcus's articulation of the loneliness of moving between cultural registers), and “fading at the edges” (Helen's metaphor).

In vivo codes do two analytic things at once. First, they keep the participant's voice in the codebook, which protects against analyst over-abstraction. Second, they give the eventual paper memorable language — reviewers and readers remember “wahda” in a way they do not remember refugee-specific-loneliness. Most well-written qualitative findings sections have at least three or four section headers built from in vivo codes. We will use Amira's “wahda” as a section header in the worked findings exercise of Lesson 11.

3.5 Intercoder Reliability — Why It Matters

Once you have a codebook, the central question is: can someone else apply it the way you intended? Intercoder reliability is the operational answer to that question. Two (or more) analysts independently code the same passages using the same codebook; you compute the agreement; you decide whether the agreement is good enough.

Bernard, Wutich, and Ryan (Ch. 6) frame intercoder reliability as the most common operational standard for the third of the three methodological commitments (replicability, from Lesson 1). It is not the only test of replicability, but it is the one most often expected by quantitatively trained reviewers in public-health journals. Methodologically, it serves three purposes: it forces the codebook to be specific enough to be applied consistently; it reveals which codes are unclear and need refinement; and it gives you a defensible number to report in your methods section.

3.6 Percent Agreement (Simple but Flawed)

The most intuitive measure of agreement: count the passages on which two coders agree, divide by the total number of passages coded, multiply by 100. If coders agree on 85 of 100 passages, percent agreement is 85%.

The problem with percent agreement is that it does not adjust for the agreement you would expect by chance. If two coders are using a codebook with only two codes (apply / do not apply), and one of the codes is used 90% of the time, two random coders would agree about 82% of the time by accident. An 85% percent-agreement score in that setting reflects only a few percentage points of real agreement above chance. The chance-adjusted measures below are designed to fix this problem.
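
The 82% figure follows directly from the two usage rates. A minimal worked calculation, assuming both coders assign codes independently at the same 90/10 rates:

```r
p_common <- 0.90          # share of passages assigned the more common code
p_rare   <- 1 - p_common

# Two coders picking codes independently at these rates agree when both pick
# the common code or both pick the rare code
p_chance <- p_common^2 + p_rare^2
p_chance                  # 0.82

# An observed agreement of 85% is therefore only 3 points above chance
p_observed <- 0.85
p_observed - p_chance     # 0.03
```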

3.7 Cohen's Kappa (Two Coders, Nominal Categories)

Cohen's kappa (Cohen, 1960) is the chance-corrected agreement measure used most often in qualitative health research. The formula is:

κ = (po − pe) / (1 − pe)

Where po is the observed proportion of agreement (the percent agreement, expressed as a decimal) and pe is the proportion of agreement expected by chance, given the marginal distributions of each coder's codings. Kappa ranges from −1 (perfect disagreement) through 0 (chance-level agreement) to +1 (perfect agreement).
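
To see how po and pe come out of raw codings, here is a hand computation in base R. The ten codings are made up, and cohen_kappa() is a hypothetical helper written for this illustration rather than a function from any package.

```r
# Made-up codings of 10 passages by two coders (1 = code applies, 0 = does not)
coder_a <- c(1, 1, 0, 0, 1, 0, 1, 1, 0, 0)
coder_b <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)

cohen_kappa <- function(a, b) {
  categories <- sort(unique(c(a, b)))
  # Observed agreement: share of passages coded identically
  p_o <- mean(a == b)
  # Expected agreement: product of the two coders' marginal proportions,
  # summed over categories
  prop_a <- table(factor(a, levels = categories)) / length(a)
  prop_b <- table(factor(b, levels = categories)) / length(b)
  p_e <- sum(prop_a * prop_b)
  (p_o - p_e) / (1 - p_e)
}

cohen_kappa(coder_a, coder_b)   # 0.6
```

Here the coders agree on 8 of 10 passages (po = 0.80) and each applies the code half the time (pe = 0.50), so kappa is 0.30 / 0.50 = 0.60.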

Landis and Koch (1977) proposed interpretive thresholds for kappa that have become the field standard. They are guidance, not law — Bernard, Wutich, and Ryan are clear that the appropriate threshold depends on the stakes of the coding and the nature of the categories.

Kappa value | Landis & Koch (1977) interpretation
< 0.00 | Poor (worse than chance)
0.00–0.20 | Slight
0.21–0.40 | Fair
0.41–0.60 | Moderate
0.61–0.80 | Substantial
0.81–1.00 | Almost perfect

The convention in applied health research is that kappa above 0.60 (the substantial band) is a defensible threshold for publication, and kappa above 0.80 (almost perfect) is excellent. If your kappa is 0.60 or below on a given code, the standard response is to refine the codebook entry for that code (usually by tightening the inclusion or exclusion criteria) and re-code the disputed passages.
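
When you have a kappa for each code, labelling the values programmatically avoids misreading a band boundary. The helper below, landis_koch(), is a hypothetical name for a small base-R sketch of the table above.

```r
landis_koch <- function(kappa) {
  stopifnot(is.numeric(kappa), all(kappa >= -1 & kappa <= 1))
  labels <- c("Slight", "Fair", "Moderate", "Substantial", "Almost perfect")
  # Bands are closed on the right, matching the table: 0.21-0.40, 0.41-0.60, ...
  band <- cut(kappa, breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1),
              labels = labels, include.lowest = TRUE)
  out <- as.character(band)
  out[kappa < 0] <- "Poor"   # negative values fall outside the cut() breaks
  out
}

landis_koch(c(0.52, 0.31, 0.72))
# "Moderate" "Fair" "Substantial"
```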

3.8 Krippendorff's Alpha (the Preferred Measure)

Cohen's kappa has three limitations: it handles only two coders, it does not gracefully handle missing data, and it assumes the codes are nominal categories (mutually exclusive and unordered). For projects with more than two coders, with intermittent missingness, or with codes that are ordered (e.g., severity levels), Cohen's kappa is the wrong tool.

Krippendorff's alpha (Krippendorff, 2018) generalizes the kappa idea to handle all three limitations. It accommodates any number of coders, any pattern of missing codings, and any level of measurement (nominal, ordinal, interval, ratio). It is the measure most contemporary methodologists recommend, including Bernard, Wutich, and Ryan.

The formula is more complex than kappa's. Alpha is defined as 1 − Do/De, where Do is the observed disagreement and De is the disagreement expected by chance, both computed from a coincidence matrix, but you will rarely compute it by hand. The R package irr contains kripp.alpha(), which we will use in Section 4. Interpretively, Krippendorff's own benchmarks are stricter than the kappa conventions: α ≥ 0.80 is the customary standard for reliable conclusions, and values between 0.667 (two-thirds) and 0.80 support only tentative conclusions. Below 0.667, the codebook is not yet reliable and needs revision.
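
For the simplest case (two coders, nominal codes, no missing data) the coincidence-matrix logic can be traced by hand. The base-R sketch below is for intuition only and does not replace kripp.alpha(); the variable names are arbitrary, and the twelve codings reuse the values of the simulated example in Section 4.5.

```r
# Twelve made-up codings by two coders
coder_a <- c(1, 1, 2, 1, 3, 4, 2, 1, 3, 2, 4, 1)
coder_b <- c(1, 1, 2, 3, 3, 4, 2, 1, 1, 2, 4, 1)

categories <- sort(unique(c(coder_a, coder_b)))
k <- length(categories)

# Cross-tabulate the coders, then symmetrize: each passage contributes
# the pair (a, b) and the pair (b, a) to the coincidence matrix
cross <- matrix(
  table(factor(coder_a, levels = categories), factor(coder_b, levels = categories)),
  nrow = k
)
coincidence <- cross + t(cross)

n_c <- rowSums(coincidence)   # how often each category was used, pooled over coders
n   <- sum(n_c)               # total number of codings (2 x 12 = 24)

# Nominal data: every off-diagonal pairing counts as one full disagreement
d_observed <- (n - sum(diag(coincidence))) / n
d_expected <- (n^2 - sum(n_c^2)) / (n * (n - 1))

alpha <- 1 - d_observed / d_expected
round(alpha, 3)   # 0.775
```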

Which measure should you use?

For your capstone, you have two coders (you and a peer partner you will work with in Week 5). If your codes are nominal and you have no missing data, Cohen's kappa is acceptable and is what most reviewers expect. If your codes are ordered (e.g., severity of loneliness on a 1–3 scale), or if you have three coders, or if there is missingness, use Krippendorff's alpha. Bernard, Wutich, and Ryan recommend defaulting to alpha because it generalizes: if you can compute alpha, you can report it for any future project. Kappa nonetheless remains the most commonly reported statistic in published qualitative health research.

3.9 When Intercoder Reliability Is the Wrong Measure

Not all qualitative work asks for intercoder reliability. Bernard, Wutich, and Ryan are explicit about this and so are many methodologists writing in the interpretivist tradition. In two situations, the reliability framing is actually misleading.

The first is interpretivist or constructivist work. In Charmaz-style constructivist grounded theory (Charmaz, 2014), the analyst's interpretation is understood to be partly constitutive of what the data mean. Two competent analysts may legitimately reach different but defensible interpretations of the same passages. The standard of evaluation is not identity (the kappa standard) but coherence — whether each interpretation is internally consistent, evidence-based, and methodologically transparent. For this kind of work, the equivalent quality check is investigator triangulation (multiple analysts compare interpretations and document where they differ and why) rather than a kappa score.

The second is narrative and discourse-analytic work (Lessons 9 and 10). Narrative analysis attends to the structure of a single telling; discourse analysis attends to the rhetorical work an utterance performs. Neither tradition typically treats codings as nominal categories applied independently to passages, so a kappa is not the right object. The methods sections of credible papers in these traditions usually report on member-checking, on theoretical sensitivity, and on the writing trail rather than on kappa.

For your capstone — which adopts Bernard, Wutich, and Ryan's pragmatist-positivist stance and uses inductive-deductive hybrid coding — intercoder reliability is the right measure. We will compute it in Section 4. But you should be aware that there are defensible qualitative methodologies for which it is the wrong measure, so that you do not later read an interpretivist paper and mistake the absence of a kappa for a methodological failure.

3.10 The QDA Software Landscape

Before we move into the workflow in Section 4, a quick orientation to the software landscape. There are four commercial options and a growing free-and-open-source ecosystem. None of them does anything you cannot do by hand on a small corpus; what they buy you is speed, consistency, and the ability to scale.

Tool | Type | Strengths | Limitations
NVivo (Lumivero) | Commercial, desktop | Industry standard; excellent multi-coder workflow; rich visualization | Expensive licence; closed format; steep learning curve
ATLAS.ti | Commercial, desktop and cloud | Strong network views; good multimedia support | Expensive licence; some workflow quirks
MAXQDA | Commercial, desktop | Mixed-methods friendly; good visualizations | Expensive; smaller user base than NVivo
Dedoose | Commercial, browser-based, subscription | Cloud collaboration; lower monthly cost | Subscription model means access ends when you stop paying
Taguette (this course's pick) | Free, open-source, browser or local | Free; open data format (SQLite + CSV/HTML export); transferable beyond the course; runs locally | Fewer visualizations than commercial tools; smaller feature set

We chose Taguette for HSCI 841 because it is free (every student in the world can use it), it is open-source (your project is not held hostage by a licence), and its export format is standard (CSV and HTML, which any analysis tool can read). The features it lacks — advanced visualizations, sophisticated network views — you will build in R, which is also free and open-source. The combination is a complete, transferable workflow.

Reflection

Imagine you compute Cohen's kappa for your capstone codebook with a peer partner and the result is κ = 0.52 (moderate) overall, with one specific code (somatic-absence) scoring κ = 0.31 (fair). What is your next step? Be concrete: what would you do to bring the kappa up?

Model answer: The standard response when a specific code has poor agreement is to revise that code's codebook entry, not to abandon the coding altogether. Concrete steps: (1) pull every passage on which you and your partner disagreed about somatic-absence; (2) read them side by side and articulate why you each coded as you did; (3) revise the inclusion and exclusion criteria so the disagreements would have been resolvable from the codebook text alone (e.g., specify that the bodily reference must be figurative not literal, or that a clear absence-link must be present); (4) re-code the disputed passages under the revised entry; (5) re-compute kappa on the same passages and on a fresh set. The overall kappa of 0.52 is also addressable: typically two or three codes are dragging the average down, and fixing them brings the overall up. Document all revisions in the audit trail. Do not silently change codings to inflate kappa; that defeats the entire purpose of intercoder reliability.

Knowledge Check — Section 3

Question 1: Why is percent agreement an inadequate measure of intercoder reliability on its own?

Percent agreement does not adjust for chance. When one code is much more common, the expected agreement by chance is high, and the apparent agreement is inflated. Cohen's kappa and Krippendorff's alpha correct for chance.

Question 2: Using Landis and Koch (1977) thresholds, a Cohen's kappa of 0.72 is interpreted as:

The Landis and Koch thresholds: 0.61–0.80 is substantial; 0.81–1.00 is almost perfect. A kappa of 0.72 falls in the substantial range and is generally considered acceptable for publication in applied health research.

Question 3: Which is the primary advantage of Krippendorff's alpha over Cohen's kappa?

Krippendorff's alpha generalizes kappa to handle multiple coders, missing data, and various levels of measurement. It is the measure most contemporary methodologists recommend, including Bernard, Wutich, and Ryan.
Section 4 of 5

The R + Taguette Workflow on the Loneliness Dataset — and the Week 5 Capstone

⏱ Estimated reading time: 40 minutes

Introduction and Overview

The first three sections of this lesson laid out the conceptual apparatus: themes versus codes, the twelve theme-finding techniques, inductive versus deductive coding, codebook architecture, coding mechanics, intercoder reliability. This section turns operational. You will see, end to end, how to find candidate themes in the loneliness corpus using R, how to build a codebook for them, how to apply the codebook in Taguette, how to export the coded extracts back into R for analysis, and how to compute Krippendorff's alpha on a small two-coder reliability check. The section ends with the Week 5 capstone milestone.

Learning Objectives for Section 4

  • Import a corpus of plain-text transcripts into R with readtext.
  • Use quanteda to compute word frequencies and KWIC concordances as theme-finding aids.
  • Run a complete Taguette project: upload, code, export.
  • Re-import coded extracts into R for co-occurrence and code-frequency analysis.
  • Compute Krippendorff's alpha with irr::kripp.alpha() from a two-coder agreement matrix.
  • Plan and complete the Week 5 capstone deliverable.

4.1 Step 1 — Import the Corpus into R

The loneliness transcripts live in ../term projects/HSCI_841/transcripts/ as plain-text files (P01_Maya.txt through P20_Frank.txt). Each transcript opens with a metadata header (participant ID, age, gender, occupation, etc.) followed by the interview proper. We will read all 20 into a single R object using the readtext package.

R: Import the loneliness transcripts as a quanteda corpus

Open RStudio. Set your working directory to the repository root. Then run:

# Load the stack you installed in Lesson 1
library(tidyverse)
library(quanteda)
library(readtext)

# Point at the transcripts folder
transcript_dir <- "term projects/HSCI_841/transcripts"

# Read all 20 transcripts at once into a readtext data frame
loneliness_rt <- readtext(
  file.path(transcript_dir, "P*.txt"),
  docvarsfrom = "filenames",
  docvarnames = c("participant_id", "pseudonym"),
  dvsep = "_"
)

# Convert to a quanteda corpus (this is the object the rest of the workflow uses)
loneliness_corpus <- corpus(loneliness_rt)

# Sanity checks
summary(loneliness_corpus, n = 5)  # first five transcripts: tokens, types, sentences
ndoc(loneliness_corpus)         # number of documents — should be 20
docvars(loneliness_corpus)      # participant_id and pseudonym columns

What success looks like: ndoc() returns 20. summary() shows tokens-per-document ranging roughly from 1,500 to 4,000. docvars() returns a data frame with one row per transcript and columns for participant_id and pseudonym.

4.2 Step 2 — Repetitions and KWIC as Theme-Finding Aids

Theme-finding technique 1 (Repetitions) and technique 10 (Word lists and KWIC) are computationally tractable. Once your transcripts are in a quanteda corpus, you can compute word frequencies in seconds and pull keyword-in-context concordances for any word that catches your eye. The result is not the end of theme-finding; it is the front edge of it. The patterns the computer surfaces are then read closely by you, in their original transcripts, to decide whether they support a theme.

R: Word frequencies and KWIC for theme discovery

Continuing from the previous code block:

# textstat_frequency() lives in the quanteda.textstats package (quanteda 3.0 and later);
# run install.packages("quanteda.textstats") first if it is not installed
library(quanteda.textstats)

# Tokenize: split each transcript into words, drop punctuation and numbers, lowercase
loneliness_tokens_all <- tokens(
  loneliness_corpus,
  remove_punct = TRUE,
  remove_numbers = TRUE
) |> tokens_tolower()

# Remove stopwords (the, and, of, etc.); what's left is content words
loneliness_tokens <- tokens_remove(loneliness_tokens_all, stopwords("en"))

# Build a document-feature matrix and compute global word frequencies
loneliness_dfm <- dfm(loneliness_tokens)
freq <- textstat_frequency(loneliness_dfm, n = 40)
print(freq)

# KWIC: look at every occurrence of "chair" with 5 tokens on either side
# (stopwords are already removed, so the context shows content words only)
chair_kwic <- kwic(loneliness_tokens, pattern = "chair*", window = 5)
print(chair_kwic)

# Repeat for words that emerged as candidate themes in Section 1
kwic(loneliness_tokens, pattern = "tired", window = 5)
kwic(loneliness_tokens, pattern = "hollow", window = 5)
kwic(loneliness_tokens, pattern = "fading", window = 5)

# Technique 6: linguistic connectors. "because" is itself an English stopword,
# so search the unfiltered tokens or the pattern will never match
kwic(loneliness_tokens_all, pattern = c("because", "since"), window = 6)

What to look for: The frequency table will surface obvious words (loneliness, alone, people, feel) and a few less obvious ones that turn into candidate themes. The chair* KWIC will show every chair-mention with context, letting you confirm that the chair-as-stand-in-for-absent-person reading is supported across transcripts (not just Linda's). The linguistic-connector KWIC will pull every causal account out of the corpus, ready to be read together.

4.3 Step 3 — Set Up the Taguette Project

R surfaces patterns; Taguette is where you mark passages with codes. The two work together: you use R to find what to look for, and Taguette to record where you found it. The workflow below assumes you set up Taguette in Lesson 1; if not, return to Section 4.5 of Lesson 1 first.

🔎 Hands-on: Build the Taguette project
  1. Open your HSCI 841 Loneliness Capstone Taguette project (or create it if you have not).
  2. Upload the 3–5 transcripts you intend to code for the Week 5 milestone. Recommended starter set: P01 Maya, P05 Linda, P11 Helen, P15 Amira, P20 Frank. The variation across these five is deliberate (age, gender, life-stage, immigration, life-circumstance) and gives you the widest analytic surface for codebook development.
  3. Create the parent codebook structure as a small set of broad codes: experiential, causal, coping, interpretive (the four metacoding bins from Section 1, Technique 12).
  4. As you read transcript 1, highlight passages and tag them with provisional codes, drawing on the candidate themes you identified in R and on whatever else strikes you. Expect to create 20–40 provisional codes on this first transcript.
  5. After transcript 1, consolidate. Merge codes that turned out to be the same thing; rename codes whose names did not turn out to be right; sort codes under the four parent categories.
  6. Repeat for transcripts 2–5, expecting the codebook to grow more slowly each time (the curve flattens; this is theoretical saturation in miniature, which we will revisit in Lesson 7).
  7. Once you have coded all 3–5 transcripts, export the codebook (Project → Codebook → Export) and the coded extracts (Project → Highlights → Export as CSV).

The Taguette export is the bridge back to R: the CSV contains one row per passage-code pair (a passage tagged with two codes appears twice), with columns for document, code, and the passage text.
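
Step 6 expects the codebook to grow more slowly with each transcript. Once you have an export, that flattening can be checked with a few lines of base R. The sketch below runs on a made-up data frame; the column names document and tag are assumptions about the export and should be checked against your own file.

```r
# Made-up export: one row per (passage, code) pair, in the order transcripts were coded
toy_codings <- data.frame(
  document = c("P01", "P01", "P01", "P05", "P05", "P05", "P11", "P11"),
  tag      = c("somatic-absence", "coping-pet", "bereavement-onset",
               "somatic-absence", "chair-absent-spouse", "coping-pet",
               "somatic-absence", "temporal-fading"),
  stringsAsFactors = FALSE
)

coding_order <- unique(toy_codings$document)
seen <- character(0)
new_codes <- integer(length(coding_order))
for (i in seq_along(coding_order)) {
  tags_here <- unique(toy_codings$tag[toy_codings$document == coding_order[i]])
  new_codes[i] <- length(setdiff(tags_here, seen))   # codes first used in this transcript
  seen <- union(seen, tags_here)
}

data.frame(document = coding_order, new_codes = new_codes,
           codebook_size = cumsum(new_codes))
```

In this toy case the first transcript contributes three new codes and the next two contribute one each, which is the shape of curve the step describes.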

4.4 Step 4 — Re-import Coded Extracts into R for Analysis

Taguette's CSV export gives you a long-format data frame: one row per (passage, code) pair. Multi-coded passages appear in multiple rows. This is the format you want for almost any quantitative analysis of your qualitative coding — code frequencies, co-occurrence, code-by-participant matrices, comparison across subgroups.

R: Analyze the coded extracts
# Read the Taguette export
codings <- read_csv("term projects/HSCI_841/taguette_export_week5.csv")

# What the export contains (columns vary slightly by Taguette version)
glimpse(codings)
# Typical columns: tag (= code), content (= passage), document (= transcript filename)

# Code frequencies: how many times each code was applied across the corpus
code_freq <- codings |>
  count(tag, sort = TRUE)
print(code_freq)

# Code frequencies by participant: which codes appeared in which transcripts
code_by_participant <- codings |>
  count(document, tag) |>
  pivot_wider(names_from = tag, values_from = n, values_fill = 0)
print(code_by_participant)

# Simple co-occurrence: how often each pair of codes appears in the same transcript
# (transcript-level co-occurrence, not passage-level — passage-level is also computable)
co_occurrence <- codings |>
  distinct(document, tag) |>
  inner_join(distinct(codings, document, tag), by = "document") |>
  filter(tag.x < tag.y) |>
  count(tag.x, tag.y, sort = TRUE)
print(head(co_occurrence, 20))

What the output gives you: The code_freq table tells you which codes carried the most weight. The code_by_participant wide table is the basis of any subgroup comparison you might do in Lesson 7. The co_occurrence table is the starting point for axial coding and for the network displays of Module 12.
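
The block above counts co-occurrence at the transcript level. A passage-level version is sketched below on made-up data; it assumes the export carries a passage identifier (called id here, which is an assumption to verify with glimpse() on your own export).

```r
# Made-up export with a passage identifier
toy_codings <- data.frame(
  id  = c(1, 1, 1, 2, 3, 3, 4),
  tag = c("chair-absent-spouse", "somatic-absence", "loneliness-as-cost-of-love",
          "coping-pet", "somatic-absence", "loneliness-as-cost-of-love", "coping-pet"),
  stringsAsFactors = FALSE
)

# Passage-by-code incidence matrix (1 = code applied to that passage)
incidence <- unclass(table(toy_codings$id, toy_codings$tag) > 0) * 1

# crossprod() gives code-by-code counts of shared passages;
# the diagonal holds each code's own passage count
co_passage <- crossprod(incidence)
co_passage["somatic-absence", "loneliness-as-cost-of-love"]   # 2
```

Passage-level counts are the stricter test: two codes that share a transcript may never share a passage.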

4.5 Step 5 — Compute Krippendorff's Alpha for a Two-Coder Reliability Check

For the intercoder reliability portion of your Week 5 work, you and a peer partner will independently code one shared transcript using the same provisional codebook. You then compute Krippendorff's alpha (or Cohen's kappa) on the resulting codings. The R code below assumes you have produced a wide-format matrix where rows are passages and columns are coders, with the cell value being the code each coder assigned.
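One way such a matrix might be assembled is sketched here in base R. The sketch assumes that both coders worked from the same agreed list of passages, that each passage received exactly one code from each coder, and that the passages can be matched on a shared key. The `passage_id` column, the object names, and the values are hypothetical; highlight ids from two separate Taguette projects will generally not line up on their own.

```r
# Hypothetical single-coded extracts from two coders (values invented).
# Assumption: both coders used the same agreed list of passages, identified
# by a shared `passage_id`, and gave each passage exactly one code.
coder_a <- data.frame(
  passage_id = 1:4,
  tag = c("somatic-absence", "coping-pet", "somatic-absence", "affective-loneliness"),
  stringsAsFactors = FALSE
)
coder_b <- data.frame(
  passage_id = 1:4,
  tag = c("somatic-absence", "coping-pet", "affective-loneliness", "affective-loneliness"),
  stringsAsFactors = FALSE
)

# Line the two coders up passage by passage
both <- merge(coder_a, coder_b, by = "passage_id", suffixes = c("_a", "_b"))

# One shared set of levels, so the same code maps to the same integer for both coders
code_levels <- sort(union(both$tag_a, both$tag_b))
pair_matrix <- cbind(
  coder_a = as.integer(factor(both$tag_a, levels = code_levels)),
  coder_b = as.integer(factor(both$tag_b, levels = code_levels))
)

print(code_levels)   # which integer stands for which code
print(pair_matrix)   # rows = passages, columns = coders
```

Sharing the factor levels across coders is the important design choice: building a separate factor per coder could assign different integers to the same code and silently corrupt the reliability statistic.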

R: Krippendorff's alpha with the irr package
library(irr)

# Simulated example: 12 passages, 2 coders, codes encoded as integers
# In practice you will build this matrix from your own and your partner's Taguette exports
# Rows = passages; Columns = coders
codings_matrix <- matrix(c(
  # Coder A    Coder B
       1,         1,   # Passage 1: both said somatic-absence
       1,         1,   # Passage 2: both said somatic-absence
       2,         2,   # Passage 3: both said coping-pet
       1,         3,   # Passage 4: A said somatic-absence, B said affective-loneliness  ← disagree
       3,         3,   # Passage 5: both said affective-loneliness
       4,         4,   # Passage 6: both said loneliness-as-cost-of-love
       2,         2,   # Passage 7
       1,         1,   # Passage 8
       3,         1,   # Passage 9: disagree
       2,         2,   # Passage 10
       4,         4,   # Passage 11
       1,         1    # Passage 12
), nrow = 12, byrow = TRUE)

# Krippendorff's alpha (nominal codes)
kripp.alpha(t(codings_matrix), method = "nominal")

# Compare to Cohen's kappa (two coders, nominal)
kappa2(codings_matrix, weight = "unweighted")

# And simple percent agreement (for reference — note how it overstates agreement)
agree(codings_matrix)

What to expect: On this simulated 12-passage example with 2 disagreements, percent agreement is about 83% (10 of 12), but Cohen's kappa (about 0.76) and Krippendorff's alpha (about 0.77) are both lower because they correct for chance. Report alpha (or kappa) in your eventual methods section; report percent agreement only as a descriptive supplement, never as the primary number.
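To see where the chance correction comes from, the three statistics can also be computed by hand in base R. This is an illustrative sketch for the two-coder, nominal, no-missing-data case only; the variable names are my own, and irr remains the tool to use in practice.

```r
# Same 12 simulated passages as in the irr example (Coder A, Coder B)
coder_a <- c(1, 1, 2, 1, 3, 4, 2, 1, 3, 2, 4, 1)
coder_b <- c(1, 1, 2, 3, 3, 4, 2, 1, 1, 2, 4, 1)
n <- length(coder_a)
codes <- sort(unique(c(coder_a, coder_b)))

# Observed agreement: share of passages given the same code
p_o <- mean(coder_a == coder_b)

# Cohen's kappa: chance agreement from each coder's own marginal proportions
prop_a <- table(factor(coder_a, levels = codes)) / n
prop_b <- table(factor(coder_b, levels = codes)) / n
p_e <- sum(prop_a * prop_b)
kappa_hat <- (p_o - p_e) / (1 - p_e)

# Krippendorff's alpha (nominal): expected disagreement comes from the
# pooled 2n code assignments rather than from each coder's marginals
pooled <- table(factor(c(coder_a, coder_b), levels = codes))
N <- sum(pooled)
d_o <- 1 - p_o
d_e <- 1 - sum(pooled * (pooled - 1)) / (N * (N - 1))
alpha_hat <- 1 - d_o / d_e

round(c(percent_agreement = p_o, kappa = kappa_hat, alpha = alpha_hat), 3)
# percent_agreement 0.833, kappa 0.765, alpha 0.775
```

The two coders agree on 10 of 12 passages, but roughly 29% agreement would be expected from their marginal code frequencies alone, which is what pulls kappa and alpha below the raw percentage.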

4.6 The Iterative Cycle

The five steps above describe one pass through the workflow. In practice you will iterate. After computing reliability on a first transcript, you will revise the codebook for codes that scored poorly, re-code the disputed passages, and only then move on to the remaining transcripts. The audit trail (Section 2.6) records each iteration. By the end of Week 12, your capstone will have gone through three or four iterations of this cycle, each documented, each producing a better codebook than the one before.
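A simple way to locate the disputed passages for that re-coding step is to cross-tabulate the two coders. The sketch below reuses the simulated 12-passage data from Section 4.5, with the integer-to-code mapping taken from the comments in that example; the object names are illustrative.

```r
# Simulated 12-passage example (integer codes 1 to 4 as in Section 4.5)
code_names <- c("somatic-absence", "coping-pet",
                "affective-loneliness", "loneliness-as-cost-of-love")
coder_a <- c(1, 1, 2, 1, 3, 4, 2, 1, 3, 2, 4, 1)
coder_b <- c(1, 1, 2, 3, 3, 4, 2, 1, 1, 2, 4, 1)

a <- factor(coder_a, levels = 1:4, labels = code_names)
b <- factor(coder_b, levels = 1:4, labels = code_names)

# Cross-tabulate the two coders: off-diagonal cells are disagreements
confusion <- table(coder_a = a, coder_b = b)
print(confusion)

# List the passages to revisit once the codebook entries are revised
disputed <- data.frame(passage = which(a != b),
                       coder_a = a[a != b],
                       coder_b = b[a != b])
print(disputed)
```

Here both disagreements (passages 4 and 9) fall between somatic-absence and affective-loneliness, which points to the boundary between those two codebook entries as the place to revise. A table like this can be pasted directly into the audit trail entry for the revision.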

4.7 The Week 5 Capstone Milestone

The Week 5 milestone is the first piece of analytic work in the capstone arc that produces a real deliverable from the loneliness data. The positionality memo (Week 1), the research question (Week 2), the sampling rationale (Week 3), and the data-collection critique (Week 4) were design pieces. From Week 5 onward, you are doing analysis on transcripts.

Reflection

Imagine you have just finished coding three transcripts using a 10-code hybrid codebook. You compute Krippendorff's alpha with your peer partner and you get α = 0.71 overall. What do you do next, and why? Be specific.

Model answer: An α of 0.71 is above the often-cited 0.667 minimum but below the 0.80 threshold for confident publication-level claims. The right next step is to break the overall alpha down by code, identify the two or three codes dragging the average down, and revise their codebook entries (typically by tightening inclusion and exclusion criteria). Re-code the disputed passages under the revised entries, recompute alpha on that subset, and document the revision in the audit trail. Do not silently change codings or remove disagreements to inflate the number; that defeats the entire purpose. If after revision the overall alpha is still in the 0.70–0.80 range, the codebook is acceptable for the next coding round; you would commit to refining further in subsequent iterations rather than waiting indefinitely for α = 0.80. The key insight is that intercoder reliability is a diagnostic tool, not a gatekeeping test: it tells you which codes need work, not whether to give up.
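Breaking the overall alpha down by code can be sketched as follows: for each code, recode both coders' decisions to "this code" versus "anything else" and recompute alpha on that binary variable. The helper `alpha_two_coders()` is a hand-rolled illustration for two coders with nominal codes and no missing data, not a function from irr, and the data are the simulated 12 passages from Section 4.5.

```r
# Nominal Krippendorff's alpha for two coders with no missing data
# (illustrative helper; irr::kripp.alpha() is the general-purpose tool)
alpha_two_coders <- function(x, y) {
  pooled <- table(c(x, y))
  N <- sum(pooled)
  d_o <- mean(x != y)                                    # observed disagreement
  d_e <- 1 - sum(pooled * (pooled - 1)) / (N * (N - 1))  # expected disagreement
  if (d_e == 0) return(NA_real_)  # code never varies, so alpha is undefined
  1 - d_o / d_e
}

code_names <- c("somatic-absence", "coping-pet",
                "affective-loneliness", "loneliness-as-cost-of-love")
coder_a <- c(1, 1, 2, 1, 3, 4, 2, 1, 3, 2, 4, 1)
coder_b <- c(1, 1, 2, 3, 3, 4, 2, 1, 1, 2, 4, 1)

# For each code: did each coder apply it to each passage, yes or no?
per_code_alpha <- sapply(seq_along(code_names), function(k) {
  alpha_two_coders(coder_a == k, coder_b == k)
})
names(per_code_alpha) <- code_names

round(sort(per_code_alpha), 3)
```

On this toy data the two codes involved in the disagreements score well below the others, so they would be the entries to tighten first. With only a dozen passages the per-code estimates are unstable, so treat them as pointers for revision rather than as numbers to report.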

Knowledge Check — Section 4

Question 1: In the R workflow, which package reads the plain-text transcripts into a data frame with one row per transcript?

readtext reads a directory of plain-text files (or DOCX, PDF, etc.) and returns a data frame with one row per document, ready to convert to a quanteda corpus.

Question 2: The kwic() function in quanteda is a computational implementation of which theme-finding technique?

kwic() is the computational form of Ryan and Bernard's Technique 10. Every occurrence of a target word is returned with a window of surrounding context, letting you check whether participants are using the word consistently.

Question 3: Which R function from the irr package computes Krippendorff's alpha?

kripp.alpha() from the irr package computes Krippendorff's alpha for any number of coders and any level of measurement (nominal, ordinal, interval, ratio). kappa2() computes Cohen's kappa for two coders; agree() computes simple percent agreement; icc() is the intraclass correlation for interval-level reliability.
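As a small illustration of the "any number of coders" point, the sketch below runs kripp.alpha() on three hypothetical coders, two of whom each skipped one passage. The data are invented for illustration.

```r
library(irr)

# Hypothetical data: 3 coders x 6 passages, integer codes, NA = passage not coded.
# kripp.alpha() expects coders in rows and passages in columns.
three_coders <- rbind(
  coder_a = c(1, 2, 3, 1, 2, NA),
  coder_b = c(1, 2, 3, 1, 3, 2),
  coder_c = c(1, 2, NA, 1, 2, 2)
)

result <- kripp.alpha(three_coders, method = "nominal")
print(result)
result$value   # the alpha estimate itself
```

Cohen's kappa has no direct equivalent for this layout: kappa2() accepts exactly two coders, so a third coder or a skipped passage is a practical reason to prefer alpha.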
Section 5 of 5

Final Assessment

⏱ Estimated time: 25 minutes

Bringing It All Together

Lesson 5 is the heaviest analytic lesson in the first half of HSCI 841. It is the lesson that converts the conceptual apparatus of the first four lessons (definitions, design, sampling, collection) into operational coding moves. The twelve theme-finding techniques (Section 1), the inductive/deductive/hybrid distinction (Section 2), the seven-element codebook entry (Section 2), the four coding mechanics (Section 3), and the intercoder reliability statistics (Section 3) are tools you will use, in some combination, in every qualitative project for the rest of your career.

The Week 5 capstone milestone is the first piece of real analytic work on the loneliness dataset. The preliminary codebook you build this week will be revised in Lesson 6 (when we add a conceptual model), expanded in Lesson 7 (when we add constant-comparative analysis across subgroups), and stress-tested in Lessons 8–11 (when we run content, narrative, discourse, and analytic-induction passes on the same coded extracts). The codebook is the spine of your capstone paper, and Lesson 5 is where it is born.

Key Takeaways from Lesson 5

  • Themes are found, not discovered. The analyst notices, names, and bounds them. Replace “themes emerged” with “we identified the following themes” in your writing.
  • Theme, code, category, and concept are different levels of abstraction. Code is the operational marker; theme is the recurring idea; category is the organizing bucket; concept is the theoretical idea.
  • Ryan and Bernard (2003) catalogue twelve techniques for finding themes: repetitions, indigenous typologies, metaphors, transitions, similarities/differences, linguistic connectors, missing data, theory-related material, cutting and sorting, word lists and KWIC, co-occurrence, and metacoding.
  • Coding strategies come in three flavours: inductive (data-up), deductive (theory-down), hybrid (the practical default in applied health research).
  • A codebook entry has seven required elements: name, brief definition, full definition, inclusion criteria, exclusion criteria, positive example, negative example. A memo column is a strongly recommended eighth.
  • Coding mechanics include hierarchical codes, multiple codes per passage, axial coding (relationships between codes), and in vivo codes (using participants' exact words).
  • Intercoder reliability: percent agreement is intuitive but uncorrected for chance; Cohen's kappa is the field-standard chance-corrected measure for two coders with nominal codes; Krippendorff's alpha generalizes to any number of coders, any level of measurement, and missing data.
  • Reliability is the wrong measure for interpretivist and narrative work, where the standard is coherence among defensible interpretations rather than identity of codings.
  • The Taguette + R workflow is end-to-end transferable: use R for theme discovery, Taguette for coding, R for code-frequency and reliability analysis.

Core Concepts Reviewed

Section 1: Theme vs code vs category vs concept; the twelve Ryan and Bernard theme-finding techniques (repetitions, indigenous typologies, metaphors, transitions, similarities/differences, linguistic connectors, missing data, theory-related material, cutting and sorting, word lists and KWIC, co-occurrence, metacoding); worked examples from the loneliness corpus (chair, wahda, fading, somatic-absence).

Section 2: Inductive vs deductive vs hybrid coding; the seven-element codebook entry (name, brief def, full def, inclusion, exclusion, +example, −example); the codebook as a living document with audit trail.

Section 3: Hierarchical codes; multiple codes per passage; axial coding; in vivo codes; percent agreement; Cohen's kappa with Landis and Koch thresholds; Krippendorff's alpha; when reliability is the wrong measure; the QDA software landscape (NVivo, ATLAS.ti, MAXQDA, Dedoose, Taguette).

Section 4: The end-to-end R + Taguette workflow: import with readtext, theme discovery with quanteda (word frequencies, KWIC), Taguette coding, export and re-import, Krippendorff's alpha with irr::kripp.alpha(). The Week 5 capstone milestone.

The final reflection below asks you to take stock of one specific decision you will make in your own coding work. Concrete, specific, defensible — that is the standard the rest of the capstone will be evaluated against.

Final Reflection

In one paragraph, articulate your provisional coding plan for the Week 5 capstone deliverable. Name the transcripts you will code first, the coding strategy you will adopt (inductive, deductive, hybrid), three theme-finding techniques you will use, two anchor codes you anticipate, and how you will handle intercoder reliability with your peer partner. This is not a contract; it is a working plan you can defend in your methods memo.

Model answer: A strong answer is specific on all five points. Example: “I will code P01 (Maya), P05 (Linda), P11 (Helen), P15 (Amira), and P20 (Frank) first, as the deliberate-variation starter set. My strategy is hybrid: I will start with two deductive anchor codes (loneliness-vs-aloneness from Cacioppo and Patrick, and a structural-versus-existential typology code from the public-health literature I reviewed in Week 2), and let the rest develop inductively. The three theme-finding techniques I will use are repetitions (because R surfaces these quickly), metaphors (because the loneliness corpus is metaphor-dense), and missing data (to surface what participants avoid). My two anticipated inductive codes are somatic-absence (from Maya/Linda/Helen embodied metaphors) and coping-pet (from Linda's Rufus and Maya's neighbour's cat). For intercoder reliability, my partner and I will independently code P11 (Helen) using my provisional codebook; I will compute Krippendorff's alpha with irr::kripp.alpha() and target alpha ≥ 0.70 for the Week 5 submission, with the understanding that codes scoring < 0.60 will be revised before Lesson 6.” The point is to commit to a plan that is concrete enough to execute and to defend, with the recognition that it will be revised.

Final Assessment — Lesson 5: Finding Themes & Building Codebooks (15 Questions)

Question 1: Bernard, Wutich, and Ryan's preferred phrasing in writing up qualitative findings is:

Themes are found by analysts, not discovered in nature. The active phrasing (“we identified”) makes the analytic work visible — one of Bernard, Wutich, and Ryan's transparency commitments.

Question 2: How many techniques for finding themes do Ryan and Bernard (2003) catalogue?

Ryan and Bernard (2003) catalogue twelve techniques: repetitions, indigenous typologies, metaphors, transitions, similarities/differences, linguistic connectors, missing data, theory-related material, cutting and sorting, word lists and KWIC, co-occurrence, and metacoding.

Question 3: Amira's use of the Arabic word wahda to name a loneliness specific to refugee experience is an example of which theme-finding technique?

A non-English term used by the participant as a category that the analyst's first-language vocabulary cannot fully hold is an emic category — Ryan and Bernard's Technique 2.

Question 4: What is the difference between a code and a category?

In a hierarchical codebook, categories are parent nodes that group related codes (the children). “Coping strategies” is a category; coping-pet, coping-phone-confidant, and coping-volunteer are codes within it.

Question 5: A researcher begins with two codes drawn from a published theoretical framework, applies them to the first transcripts, and lets new codes emerge inductively as they read. Which coding strategy is this?

Starting with a small deductive anchor and letting inductive codes grow from the data is the hybrid (Fereday and Muir-Cochrane 2006) strategy. It is the practical default for applied health research and the recommended approach for the HSCI 841 capstone.

Question 6: Which of the following is NOT one of the seven required elements of a codebook entry?

The seven elements are code name, brief definition, full definition, inclusion criteria, exclusion criteria, positive example, and negative example. P-values are not features of qualitative codebooks.

Question 7: An in vivo code is:

In vivo codes use participants' own words. “wahda” (Amira) and “witness-less hours” (Sarah) are in vivo codes. They keep the participant's voice in the codebook and give the eventual paper memorable language.

Question 8: What does axial coding attend to?

Axial coding (Strauss and Corbin 1990) is a second pass over the data that attends to relationships among codes rather than to the codes themselves. It turns a flat codebook into the beginnings of a model.

Question 9: Why is simple percent agreement an inadequate measure of intercoder reliability on its own?

Percent agreement does not correct for chance, which inflates apparent agreement, especially when one code is much more common than others. Cohen's kappa and Krippendorff's alpha correct for chance.

Question 10: Using Landis and Koch (1977) thresholds, a Cohen's kappa of 0.45 is interpreted as:

Landis and Koch (1977): 0.41–0.60 is moderate. A kappa of 0.45 is moderate but below the substantial threshold (0.61) typically expected for publication-level claims. Most reviewers would ask for codebook refinement.

Question 11: Krippendorff's alpha has at least three advantages over Cohen's kappa. Which of the following is NOT one of them?

Alpha generalizes kappa on three dimensions: number of coders, missing data, and level of measurement. Alpha does not systematically produce higher values than kappa on the same data — on simple two-coder nominal data they are usually very close.

Question 12: Intercoder reliability is the wrong measure for which kind of qualitative work?

In interpretivist and narrative traditions, the analyst is understood as partly constitutive of what the data mean; the relevant standard is coherence (multiple defensible interpretations, transparently documented), not identity of codings. Kappa or alpha is the wrong object for this kind of work.

Question 13: Which package and function in R compute Krippendorff's alpha?

irr::kripp.alpha() is the standard R function for Krippendorff's alpha. The same package's kappa2() computes Cohen's kappa and agree() computes simple percent agreement.

Question 14: Why did HSCI 841 choose Taguette over NVivo or ATLAS.ti?

Taguette was chosen because it is free (every student can use it), open-source (the project is not held hostage by a commercial licence), and uses standard export formats. Features it lacks (advanced visualizations, network views) are built in R, which is also free and open-source.

Question 15: What is the Week 5 capstone deliverable?

Week 5 produces a preliminary codebook tested on 3–5 transcripts, with a reliability check and a memo. The codebook is preliminary because it will be revised in Lessons 6 and 7; the final codebook is delivered with the capstone paper in Week 12.

Congratulations!

You have successfully completed Lesson 5: Finding Themes & Building Codebooks.

You can now name the difference between themes, codes, categories, and concepts; apply the twelve Ryan and Bernard theme-finding techniques; build a codebook entry with all seven required elements; compute Cohen's kappa and Krippendorff's alpha; and operate the R + Taguette workflow end-to-end. The Week 5 preliminary codebook is the first piece of real analytic work on the loneliness corpus.

Next up — Lesson 6: Analysis Frameworks & Conceptual Models, which takes the preliminary codebook you built this week and asks you to turn it into a model that explains the patterns you found, not just one that catalogues them.

Reference

Glossary — Themes, Codes, Codebooks & Reliability

📚 Reference page — available throughout the lesson

This glossary collects the key concepts, people, and methodological terms introduced in Lesson 5. Use it as a reference while you work through the material, or as a review before the final assessment.

Core Concepts: Themes & Codes
Theme: A recurring abstract idea identified by the analyst across the corpus. Themes are higher-abstraction than codes; they are the analyst's product, not a feature of the data itself. Bernard, Wutich, and Ryan insist that themes are found, not discovered.
Code: An operational label attached to a passage of text. Codes are the working markers analysts apply in Taguette, NVivo, or by hand. A single theme may be supported by several codes; a single passage may receive multiple codes.
Category: A grouping of related codes. In a hierarchical codebook, categories are parent nodes and codes are children. “Coping strategies” is a category that might contain codes for coping-pet, coping-phone-confidant, and so on.
Concept: A theoretically meaningful idea, often borrowed from a disciplinary literature, that may organize many themes. Liminality, embodiment, and structural exclusion are concepts. Concepts sit above themes in the abstraction hierarchy.
In Vivo Code: A code whose name is taken verbatim from a participant's speech, conventionally set in quotation marks in the codebook to mark its origin. Examples from the loneliness corpus: “wahda” (Amira), “witness-less hours” (Sarah), “fading at the edges” (Helen).
Emic vs. Etic: Emic = the insider's account, in the participant's own categories. Etic = the outside analyst's framework, in theory-driven categories. Theme-finding Technique 2 (indigenous typologies) specifically surfaces emic categories.
Ryan & Bernard's Twelve Theme-Finding Techniques
Repetitions (Technique 1): Words, phrases, or ideas that recur across transcripts. The starting point for nearly all theme-finding. Example: “chair” recurring in 8+ loneliness transcripts.
Indigenous Typologies / Emic Categories (Technique 2): Categories the participant uses in their own language. Example: Amira's wahda, Aarav's ekantam, Marcus's “code-switching loneliness.”
Metaphors and Analogies (Technique 3): Concrete images that map onto abstract experiences. Especially productive in talk about feelings. Example: Maya's “hollow,” Linda's “weight,” Helen's “fading.”
Transitions (Technique 4): Turn-taking shifts and topic changes. Attend to what the participant moves to immediately after a given statement; the move reveals what feels related in the participant's mind.
Similarities and Differences (Technique 5): Side-by-side reading of two passages: what is the same, what is different. The foundation of grounded theory's constant-comparative method (Lesson 7).
Linguistic Connectors (Technique 6): Words like because, since, as a result, therefore that mark causal accounts. Searching for connectives is a fast way to find participants' own causal models.
Missing Data (Technique 7): What the corpus does not contain. Structured absences (participants who would be expected to mention X who do not) are findings, often the most analytically productive. Example: older men in the loneliness corpus avoid the word “lonely.”
Theory-Related Material (Technique 8): Reading the data through a specific theoretical lens, looking for passages that confirm, complicate, or contradict the theory. The most deductive of the twelve techniques.
Cutting and Sorting (Technique 9): Physical pile-sort of quotes printed on index cards. Externalizes the analytic work and uses spatial cognition. Bernard, Wutich, and Ryan recommend doing this literally, with scissors, at least once per project.
Word Lists and KWIC (Technique 10): Frequency lists of content words plus keyword-in-context concordances showing every occurrence of a target word with surrounding text. Bridge to computational text analysis (Module 12). Implemented in quanteda.textstats::textstat_frequency() (part of quanteda itself before version 3.0) and quanteda::kwic().
Co-occurrence (Technique 11): Which codes (or words) appear together more often than chance would predict. First move toward axial coding and toward network displays of code structure.
Metacoding (Technique 12): Labelling the codes themselves: sorting them by type (affective vs behavioural, emic vs etic, experiential vs causal vs responsive vs interpretive). Reveals structural patterns in the codebook.
Coding Strategy & Codebook
Inductive Coding: Data-up coding: codes develop from what the analyst finds in the transcripts. Default for exploratory work and grounded theory. Stays close to participants' categories.
Deductive Coding: Theory-down coding: codes come from a pre-specified framework. Used for confirmatory work and multi-site studies. Produces framework-comparable results but may miss what the framework was not built to see.
Hybrid Coding: Small deductive anchor plus inductive growth. The practical default in applied health research (Fereday and Muir-Cochrane 2006; endorsed by Bernard, Wutich, and Ryan Ch. 6). The recommended strategy for the HSCI 841 capstone.
Codebook: A structured document describing every code in a project. Each entry has seven required elements: code name, brief definition, full definition, inclusion criteria, exclusion criteria, positive example, negative example. A memo column is a strongly recommended eighth.
Audit Trail: A documented record of all analytic decisions, including codebook revisions: date, code changed, what it changed to, justification. Required for replicability; Bernard, Wutich, and Ryan insist that revised codebooks be applied retroactively to already-coded transcripts.
Axial Coding: A second pass over coded data that attends to relationships among codes (co-occurrence, sequence, causality) rather than to the codes themselves. Term from Strauss and Corbin (1990). Turns a flat codebook into the beginnings of a model.
Hierarchical Codes: Codebook structure in which codes nest under broader categories. Allows aggregation at different levels (category-level findings vs. specific code-level findings).
Intercoder Reliability
Intercoder Reliability: The degree to which two or more analysts independently apply the same codebook to the same passages and arrive at the same codings. Operational standard for the replicability commitment in applied health research.
Percent Agreement: Proportion of passages on which two coders agree. Intuitive but does not adjust for chance agreement, so it inflates estimates when one code is much more common than others. Use only as a descriptive supplement to a chance-corrected measure.
Cohen's Kappa (κ): Chance-corrected agreement measure for two coders on nominal codes (Cohen 1960). Formula: κ = (po − pe) / (1 − pe). Ranges from −1 (perfect disagreement) through 0 (chance) to +1 (perfect agreement). Field-standard measure in applied health qualitative research.
Landis & Koch (1977) Thresholds: Interpretive thresholds for Cohen's kappa: <0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. Guidance, not law; the appropriate threshold depends on the stakes and the categories.
Krippendorff's Alpha (α): Generalization of kappa that handles any number of coders, missing data, and any level of measurement (nominal, ordinal, interval, ratio). The measure most contemporary methodologists recommend, including Bernard, Wutich, and Ryan. Computed in R with irr::kripp.alpha(). Conventionally ≥ 0.667 is the minimum, ≥ 0.80 is acceptable for confident claims.
Investigator Triangulation: Alternative quality check used in interpretivist work: multiple analysts compare interpretations and document where they differ and why, accepting coherence rather than identity as the standard. Appropriate when intercoder reliability is the wrong measure.
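For quick reference, the Landis and Koch bands listed above can be turned into a small lookup helper. This is a sketch only: the function name is my own, and assigning values that fall between the published bands (for example 0.205) to the lower band is an assumption, since the original bands are stated to two decimals.

```r
# Map kappa values to the Landis and Koch (1977) verbal labels.
# Intervals are closed on the left: [0, 0.21) slight, [0.21, 0.41) fair, and so on.
landis_koch_label <- function(kappa_value) {
  as.character(cut(
    kappa_value,
    breaks = c(-Inf, 0, 0.21, 0.41, 0.61, 0.81, Inf),
    labels = c("poor", "slight", "fair", "moderate", "substantial", "almost perfect"),
    right = FALSE
  ))
}

landis_koch_label(c(-0.10, 0.45, 0.76, 0.90))
# "poor" "moderate" "substantial" "almost perfect"
```

The labels are a reading aid, not a decision rule; as the entry above notes, the appropriate threshold depends on the stakes and the categories.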
QDA Software & Tooling
Taguette: Free, open-source qualitative coding application. Browser-based or local. Exports to CSV and HTML. This course's pick because it is free, transferable, and uses open data formats.
NVivo: Commercial QDA package (Lumivero). Industry standard in many applied health settings. Strong multi-coder workflow and visualization, but expensive licence and closed file format.
ATLAS.ti: Commercial QDA package. Strong network views; good multimedia support. Expensive licence.
MAXQDA: Commercial QDA package. Mixed-methods friendly; good visualizations. Smaller user base than NVivo.
Dedoose: Commercial, browser-based, subscription-billed QDA platform. Good for cloud collaboration but access ends when subscription ends.
quanteda: R package for industrial-strength text analysis. Core functions used in this lesson: corpus(), tokens(), dfm(), kwic(), and textstat_frequency() (which lives in the companion quanteda.textstats package since quanteda 3.0).
readtext: R package for reading text corpora (plain text, DOCX, PDF, etc.) into a data frame with one row per document. Standard entry point for any text-analysis workflow.
irr (R package): R package for intercoder reliability. Key functions: agree() (percent agreement), kappa2() (Cohen's kappa for two coders), kripp.alpha() (Krippendorff's alpha for any number of coders).
Key People
Gery W. Ryan & H. Russell Bernard: Authors of the foundational 2003 paper “Techniques to identify themes” (Field Methods, 15(1), 85–109), which catalogued the twelve theme-finding techniques. Bernard is a foundational figure in cultural anthropology and research methods; Ryan works in applied anthropology and health research at the RAND Corporation.
Jacob Cohen (1923–1998): Psychologist and statistician whose 1960 paper “A coefficient of agreement for nominal scales” introduced kappa. The same Cohen who gave us Cohen's d, statistical power analysis, and the “Cohen tradition” in applied statistics.
Klaus Krippendorff: Communications methodologist who developed Krippendorff's alpha as a generalization of agreement measures across number of coders, missingness, and level of measurement. Author of Content Analysis: An Introduction to Its Methodology (4th ed., 2018), a key reference for Module 8.
J. Richard Landis & Gary G. Koch: Biostatisticians whose 1977 paper “The measurement of observer agreement for categorical data” (Biometrics, 33, 159–174) proposed the kappa interpretive thresholds (slight, fair, moderate, substantial, almost perfect) that have become the field standard.
Kathy Charmaz (1939–2020): Medical sociologist who developed constructivist grounded theory, which we will meet in detail in Lesson 7. Her warning against “coding too close to the data” (Charmaz 2014) is part of why inductive coding needs theoretical anchoring.
Anselm Strauss & Juliet Corbin: Sociologists whose Basics of Qualitative Research (1990 and later editions) introduced axial coding as a stage in grounded theory analysis. The axial-coding term is one of their durable contributions.
Jennifer Fereday & Eimear Muir-Cochrane: Nursing researchers whose 2006 International Journal of Qualitative Methods paper formalized hybrid inductive-deductive thematic analysis. The paper is widely cited and is the operationalization endorsed by Bernard, Wutich, and Ryan.