# Lesson 10 — Controlled Studies (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,210 words • ~28.2 min audio*

---

**Sarah:** Welcome back to Office Hours. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 10, Controlled Studies. This is a real moment, because it's the first experimental-design lesson in the entire sequence. Up until now, every study design we've talked about has been observational.

**Sarah:** Yeah, let me underline how big a shift this is. The earlier lessons walked through the observational toolkit. Cross-sectional studies. Cohort studies. Case-control studies. Ecological studies. Hybrid variants. And every one of them shares the same essential feature. The investigator measures exposure as it occurs naturally, and then has to defend the comparison against confounding.

**Kiffer:** In a controlled study, the investigator does something different. They deliberately assign who gets the exposure. And that one move, assigning rather than observing, changes the inferential game completely.

**Sarah:** Quick definition. A randomized controlled trial, abbreviated R-C-T, is a planned experiment in which the investigator deliberately allocates participants to one or more interventions and then follows them to see what outcomes occur. Randomized just means the allocation is decided by a random process, like a computer-generated random number sequence, not by participant choice or clinical judgment.

**Kiffer:** And that random allocation is the engine of the whole design. Because assignment is determined by chance and not by anything about the participant, randomization produces groups that are, on average, comparable on both measured and unmeasured factors at baseline. That's the property epidemiologists call exchangeability. No observational design can fully match it.

**Sarah:** Let's slow down on exchangeability for a moment, because it's the concept that does all the work here.

**Kiffer:** Exchangeability means the groups are interchangeable. If you swapped the labels on the two arms, you'd expect the same results, because each arm started with the same underlying risk of the outcome. The only thing distinguishing them is the assignment to the intervention. Income, age, education, genetics, comorbidities, things you measured and things you'd never even think to measure, are all balanced on average across arms because chance is doing the assigning.

**Sarah:** And that's what observational designs cannot reproduce. In an observational study, smokers and non-smokers differ on dozens of characteristics other than smoking. Statistical adjustment helps with the confounders you measured. It does nothing about the ones you didn't.

**Kiffer:** Whereas randomization handles measured confounders, unmeasured confounders, and confounders the investigator never even imagined. By design rather than by analysis.

**Sarah:** Some terminology before we go further. The lesson uses R-C-T as a broad term for any planned experiment evaluating products or procedures outside the laboratory. You'll also see clinical trial, which usually implies a therapeutic setting, and field trial, which implies a general population setting. The factor under investigation is the intervention. The effect of interest is the outcome. People participating are subjects or participants.

**Sarah:** And one more piece of context up front, because students sometimes ask why we even need this lesson. The randomized trial is, at present, the unchallenged source of the highest standard of evidence used to guide clinical decision-making. That's a paraphrase of Lavori and Kelsey from the methodological literature. When regulators decide whether to approve a new drug, when guideline committees decide whether to recommend a new procedure, the trial sits at or near the top of the evidence pyramid.

**Kiffer:** Although a single trial is rarely sufficient to settle a question. And there are persistent worries that some trials lack relevance to real-world practice because the setting is too tightly controlled. But the principle stands. When you can run a trial, the trial generally gives you the cleanest causal evidence you'll get.

**Kiffer:** Before we get into design, there are two infrastructure pieces worth flagging. Trial registration and reporting standards.

**Sarah:** Trial registration first. As of 2005, the International Committee of Medical Journal Editors, abbreviated I-C-M-J-E, required investigators to register their trials before participant enrollment as a precondition for publishing in member journals. The International Committee of Medical Journal Editors is a working group of editors of major medical journals like the New England Journal of Medicine, JAMA, the Lancet, and the BMJ.

**Kiffer:** And the reason registration matters is publication bias. If a trial only shows up in print when it gets a positive result, the published evidence is skewed toward favorable findings. Registration creates a record of every trial that was started. The infrastructure that supports this is the World Health Organization International Clinical Trials Registry Platform, often called the I-C-T-R-P. The World Health Organization, the W-H-O, is the United Nations agency for global health, based in Geneva. The I-C-T-R-P aggregates data from primary trial registries around the world.

**Sarah:** Although registration remains effectively voluntary in many jurisdictions. Journals can refuse to publish unregistered trials, but no global regulator forces every trial to register.

**Kiffer:** Then the reporting standard. The Consolidated Standards of Reporting Trials, abbreviated CONSORT. It's a set of guidelines telling investigators what to include when they write up a trial. First version 1996, revised in 2001, and again in 2010. We'll come back to its twenty-five-item checklist near the end of the lesson.

**Sarah:** Now the phases of clinical research. The vocabulary comes from drug development specifically, but it gives a useful map of how a new intervention moves from idea to standard practice.

**Kiffer:** Before Phase 0, there are pre-clinical studies, done in vitro, meaning in cell culture, and in vivo, meaning in animals. Then come the human phases.

**Sarah:** Phase 0 is first-in-human. Very small studies, typically ten to fifteen participants. Microdosing is common. The goal is mostly pharmacokinetic. You're learning whether the drug behaves in a person the way it did in mice.

**Kiffer:** Phase 1 is safety. Typically twenty to one hundred participants, often healthy volunteers, although in oncology the participants may be patients. The goals are to establish safe dose ranges, identify dose-limiting toxicities, and refine the pharmacokinetic profile.

**Sarah:** Phase 2 is the first efficacy signal. Typically one hundred to three hundred participants who actually have the target condition. This is where you get a preliminary read on whether the drug works.

**Kiffer:** Phase 3 is the pivotal efficacy trial. Hundreds to thousands of patients, often across multiple sites and countries. This is the rigorous randomized comparison against placebo or standard of care, and it's the data package that supports regulatory approval.

**Sarah:** Phase 4 is post-marketing surveillance. After approval, often tens of thousands of participants in real-world conditions. Phase 4 catches rare adverse effects that smaller trials missed and identifies safety signals in subgroups, like pregnant people, who were excluded from earlier phases.

**Kiffer:** Now trial setup. Once you've decided to run a trial, the first decisions lock in everything that follows. The objectives, the comparator, the population, and the intervention specification.

**Sarah:** Objectives must be stated clearly and succinctly. A good objective describes the intervention, the allocation design, and the primary outcome or outcomes. Each trial should have a limited number of objectives plus a small number of secondary outcomes. Adding objectives complicates the protocol, jeopardizes compliance, and sacrifices power.

**Kiffer:** Then the choice of comparator. Most trials are two-arm trials. The comparator might be a placebo, no treatment, the usual treatment, the standard of care, or a different dose. The choice is one of the most consequential decisions in the design.

**Sarah:** Placebos are ideal when there is no established alternative. A placebo is a product indistinguishable from the active intervention, given to the comparison group. Where possible, a placebo is preferred to no treatment, because it controls for the psychological effects of being in a study.

**Kiffer:** But when an effective standard treatment already exists, withholding it from the comparison arm may be unethical. So the standard of care serves as comparator, and the trial often takes the form of a non-inferiority trial. A non-inferiority trial aims to show that the new intervention is no worse than the existing standard by more than a clinically unimportant margin, usually denoted by the Greek letter delta.

**Sarah:** Choosing delta is one of the most contentious decisions in trial design. Set delta too narrow and you can never declare a new treatment non-inferior. Set delta too wide and you risk approving a treatment substantially worse than the standard. Delta has to capture what physicians and patients actually consider clinically tolerable.
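
A minimal sketch of how delta enters the decision. It assumes a binary failure outcome, where higher is worse, and a Wald normal-approximation interval for the risk difference. It also assumes a two-sided 95 percent interval, which amounts to a one-sided 2.5 percent test. The function names and trial counts are hypothetical. They were chosen only to show how the same data can pass a wide margin and fail a narrow one.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(fail_new, n_new, fail_std, n_std, level=0.95):
    """Wald confidence interval for p_new - p_std (failure risks)."""
    p_new, p_std = fail_new / n_new, fail_std / n_std
    diff = p_new - p_std
    se = sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff - z * se, diff + z * se


def non_inferior(fail_new, n_new, fail_std, n_std, delta):
    """Non-inferior only if the upper limit of the excess failure risk is below delta.

    Higher failure risk is worse, so 'no worse by more than delta' means the
    whole interval for p_new - p_std must lie below +delta.
    """
    _, upper = risk_difference_ci(fail_new, n_new, fail_std, n_std)
    return upper < delta


# Hypothetical trial: 10% vs 9% failure, 1,000 participants per arm.
print(risk_difference_ci(100, 1000, 90, 1000))
print(non_inferior(100, 1000, 90, 1000, delta=0.05))  # → True
print(non_inferior(100, 1000, 90, 1000, delta=0.02))  # → False
```

The observed excess in failures is 1 percentage point, and the upper confidence limit lands near 3.6 points. So the verdict flips depending on whether delta is 5 points or 2, which is exactly why the choice of margin is so contested.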

**Kiffer:** Let me walk through a worked example. Andriole and colleagues in 2012 randomized roughly thirty-eight thousand men aged fifty-five to seventy-four to a screening intervention and another thirty-eight thousand to usual care across ten screening centers in the United States. Enrollment ran from 1993 to 2001. Men in the intervention arm were offered annual prostate-specific antigen tests for six years and digital rectal examinations for four years. Follow-up extended through 2009 or thirteen years from trial entry. The primary analysis compared prostate cancer-specific mortality between the two arms.

**Sarah:** And the comparator was usual care, which sometimes included opportunistic screening. That pragmatic choice makes the results applicable to the real United States health system. But it complicates interpretation of the true effect of organized screening, because some men in the usual care arm were getting screened anyway, just outside a structured program.

**Kiffer:** Then participants. Three nested populations have to be distinguished in any trial. Target, source, and study group.

**Sarah:** The target population is the population to which you want results to apply. If your trial is about a hypertension drug for adults, the target might be all adults with hypertension in Canada. The source population is the subset that is actually eligible and reachable, like adults with hypertension attending three British Columbia clinics during the recruitment window. And the study group is the smaller subset that meets eligibility criteria and consents to participate.

**Kiffer:** Three nested circles. Target on the outside. Source in the middle. Study group inside. How well the volunteers represent the source and target has to be considered when extrapolating results.

**Sarah:** Then eligibility criteria. There's a fundamental trade-off baked in. A narrow set of criteria yields a more homogeneous study group. That increases statistical power, because participants tend to respond more similarly. But it reduces generalizability.

**Kiffer:** A broad set does the opposite. It increases the participant pool, can reveal subgroup variation, and improves generalizability. But it introduces background variability that may hurt power.

**Sarah:** And the textbook recommendation is to use criteria that reflect the breadth of subjects who would receive the intervention in real-world practice. If your drug will eventually be used in adults aged eighteen to eighty, your eligibility should mirror that. If you study only healthy thirty-year-olds, your trial says nothing about who will actually take the drug.

**Kiffer:** Last piece of Section 1. Specifying the intervention. The nature of the intervention and how it is administered must be defined with enough detail that another investigator could replicate the trial. Not paraphrase it. Replicate it.

**Sarah:** For a drug, that means the molecule, the formulation, the dose, the schedule, the duration, the route of administration. For a surgical technique, the specific maneuvers and instrumentation. For a behavioral program, the curriculum, the trainer characteristics, the dose of intervention measured in sessions or hours. Reproducibility starts with specification.

**Kiffer:** And there's a useful contrast between fixed and flexible interventions. A fixed intervention leaves no room for adjustment. The protocol pins down every dose and every visit. Fixed protocols are appropriate for new products in Phase 3 trials, where you want to evaluate the intervention exactly as designed.

**Sarah:** A more flexible protocol is appropriate when the product has been in use long enough that some clinical judgment has accumulated. Like, the dose can be adjusted within a range based on the patient's response. Pragmatic trials often use flexible protocols because they're trying to mimic how the intervention would actually be used.

**Kiffer:** And a quick illustrative example. Groeneveld and colleagues in 2003 ran a sequential trial of creatine in patients with amyotrophic lateral sclerosis, abbreviated A-L-S, recruited from neuromuscular outpatient clinics in Utrecht and Amsterdam. The two interventions were creatine monohydrate and a matching placebo, designated A and B. An independent physician, masked to which letter was which, instructed the research pharmacist which medication to dispense. The careful specification of that masking-and-allocation chain, from masked physician through pharmacist to patient, is what protects the integrity of the trial.

**Sarah:** That brings us to Section 2. Central design choices. Allocation, outcome measurement, sample size, and blinding. Each is a place where a poorly run trial can lose its inferential advantage.

**Kiffer:** Allocation first. Six common methods. Random allocation does not mean haphazard allocation. A formal process has to be used. A computer-based random number generator. A sealed-envelope randomization scheme. Or, in low-resource settings, even a coin toss, with the result documented.

**Sarah:** Method one. Simple randomization. Each participant is randomized independently, like flipping a coin for each enrollee. With large samples it produces well-balanced groups. With smaller trials it can produce noticeable imbalance just by chance.
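
A quick simulation makes the small-trial point concrete. This is a sketch assuming a 1:1 coin-flip scheme. The helper names and the "arms differ by at least 20 percent of the sample" threshold are arbitrary choices for illustration.

```python
import random


def simple_randomize(n, seed=None):
    """Assign each of n participants to arm 'A' or 'B' by an independent fair coin flip."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n)]


def share_badly_imbalanced(n, threshold=0.2, n_sims=2000, seed=1):
    """Fraction of simulated trials whose arm sizes differ by at least threshold * n."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_sims):
        n_a = simple_randomize(n, seed=rng.randrange(2**32)).count("A")
        if abs(n_a - (n - n_a)) >= threshold * n:
            bad += 1
    return bad / n_sims


print(simple_randomize(10, seed=3))
print(share_badly_imbalanced(20))   # small trial
print(share_badly_imbalanced(500))  # larger trial
```

With these settings, roughly half of the simulated 20-person trials end up split 12 to 8 or worse. A 500-person trial almost never hits the equivalent 300 to 200 split.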

**Kiffer:** Method two. Stratified randomization. The population is divided into strata based on important prognostic factors. Within each stratum, participants are randomly assigned, usually in small permuted blocks so the arms stay close to equal as enrollment proceeds. Stratifying by sex and age band, for example, keeps the arms balanced within each cell, which improves power.
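
Here is a sketch of stratified randomization with permuted blocks. It assumes 1:1 allocation and a fixed block size of four. The stratum labels, their sizes, and the function names are made up for illustration.

```python
import random


def permuted_block_sequence(n, block_size=4, seed=None):
    """1:1 allocation list built from randomly shuffled blocks with equal A/B counts."""
    if block_size % 2:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n]


def stratified_schedule(strata_sizes, block_size=4, seed=0):
    """Generate an independent permuted-block list for every stratum."""
    rng = random.Random(seed)
    return {
        stratum: permuted_block_sequence(size, block_size, seed=rng.randrange(2**32))
        for stratum, size in strata_sizes.items()
    }


# Hypothetical strata (sex x age band) with the number expected in each.
strata = {("F", "<50"): 10, ("F", "50+"): 8, ("M", "<50"): 12, ("M", "50+"): 6}
for stratum, seq in stratified_schedule(strata).items():
    print(stratum, "".join(seq), "A:", seq.count("A"), "B:", seq.count("B"))
```

Within each stratum the arms can differ by at most half a block, here two participants. Running simple randomization inside each cell would not guarantee that. In practice, block sizes are often varied at random so that enrolling staff can't predict the last assignments in a block.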

**Sarah:** Method three. Cross-over. Each participant receives both interventions in sequence, with random ordering. So a person gets drug A for four weeks, then a wash-out period, then drug B for four weeks, with the order of A and B randomized. Each subject serves as their own control. The design only works when the condition is stable and the intervention's effect is reversible and short-lived. Mortality is the obvious counter-example: a participant who dies during the first period can't cross over to the second.
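
The "own control" idea can be sketched in a few lines. This assumes a two-period AB/BA design and uses made-up blood-pressure readings. The naive estimator below deliberately ignores period and carry-over effects, which a real cross-over analysis has to examine.

```python
import random
from statistics import mean


def assign_sequences(participant_ids, seed=None):
    """Randomly split participants between the AB and BA treatment orders."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {pid: ("AB" if i < half else "BA") for i, pid in enumerate(ids)}


def within_subject_effect(outcomes):
    """Average of each person's (outcome on A) minus (outcome on B).

    Each participant is their own control. This naive estimate ignores period
    and carry-over effects.
    """
    return mean(o["A"] - o["B"] for o in outcomes.values())


print(assign_sequences(["p1", "p2", "p3", "p4"], seed=11))
# Hypothetical systolic blood pressure (mmHg) at the end of each period.
bp = {"p1": {"A": 130, "B": 138}, "p2": {"A": 125, "B": 131}, "p3": {"A": 142, "B": 146}}
print(within_subject_effect(bp))  # → -6
```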

**Kiffer:** Method four. Factorial design. Two or more interventions are evaluated simultaneously. In a two-by-two factorial trial, participants are randomized to one of four combinations of two interventions. The CAESAR trial used a two-by-two-by-two factorial in women undergoing first Caesarean section. Three independent factors. Single-layer versus double-layer uterine closure. Closure versus non-closure of the peritoneum. Liberal versus restricted use of a subsheath drain. Three factors, two levels each, eight combinations, all in one trial.
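
Enumerating the arms of a factorial design is mechanical. This sketch lists the eight CAESAR combinations and assigns each participant independently at random. It does not reproduce the trial's actual randomization procedure, which the lesson doesn't describe. The factor keys are hypothetical labels.

```python
import itertools
import random

# The three CAESAR comparisons as the lesson describes them (labels paraphrased).
FACTORS = {
    "uterine_closure": ("single-layer", "double-layer"),
    "peritoneum": ("closure", "non-closure"),
    "subsheath_drain": ("liberal", "restricted"),
}

# One level from each factor gives 2 x 2 x 2 = 8 treatment combinations.
ARMS = list(itertools.product(*FACTORS.values()))


def randomize_factorial(n, seed=None):
    """Assign each participant, independently, to one of the eight combinations."""
    rng = random.Random(seed)
    return [dict(zip(FACTORS, rng.choice(ARMS))) for _ in range(n)]


print(len(ARMS))  # → 8
for person in randomize_factorial(3, seed=5):
    print(person)
```

Every participant lands on one level of every factor, so all of them contribute to all three comparisons. That efficiency holds cleanly only if the factors don't interact.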

**Sarah:** Method five. Cluster randomization. Groups, rather than individuals, are randomized to arms. The groups might be schools, clinics, neighborhoods, or villages. Cluster randomization is appropriate when the intervention is naturally delivered at a group level, or when individual randomization would produce contamination between participants in the same setting. Dalum and colleagues in 2012, for example, randomized twenty-two continuation schools in Denmark by coin toss to deliver a smoking cessation intervention or to act as control. If you'd randomized students within a school instead, intervention and control students in the same hallway would have influenced each other.
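
A sketch of cluster-level allocation, using 22 hypothetical school IDs. Dalum and colleagues used a coin toss. This version instead swaps in a shuffle-and-split, so that exactly half the clusters land in each arm. The key point is that students are never randomized individually; they inherit their school's arm.

```python
import random


def randomize_clusters(cluster_ids, seed=None):
    """Allocate whole clusters 1:1 to 'intervention' or 'control'."""
    rng = random.Random(seed)
    ids = list(cluster_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {cid: ("intervention" if i < half else "control") for i, cid in enumerate(ids)}


def arm_for_student(student_school, allocation):
    """A student's arm comes from their school; there is no individual randomization."""
    return allocation[student_school]


schools = [f"school_{k:02d}" for k in range(1, 23)]  # 22 hypothetical school IDs
allocation = randomize_clusters(schools, seed=2012)
print(allocation)
```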

**Kiffer:** Method six. Split-plot design. A hybrid where one factor is randomized at the cluster level and another at the individual level within clusters. Watson and colleagues in 2008 ran a split-plot trial across United Kingdom general practices. Physicians in ninety-one practices were randomized to additional training in shoulder injection or none. Within the practices, two hundred fifteen patients were then randomized to a corticosteroid or a lignocaine injection. The design answers two questions at once. Does training help? Does corticosteroid outperform lignocaine?
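
The two levels of a split-plot allocation can be sketched as nested randomization. It assumes simple coin-flip assignment at both levels, and the practice and patient identifiers are hypothetical. Whole practices get the training arm, and individual patients get the injection arm.

```python
import random


def split_plot_allocation(patients_by_practice, seed=None):
    """Two-level randomization for a split-plot trial (sketch).

    Level 1: each practice is randomized to 'training' or 'no training'.
    Level 2: within every practice, each patient is randomized to an injection.
    """
    rng = random.Random(seed)
    allocation = {}
    for practice, patients in patients_by_practice.items():
        practice_arm = rng.choice(["training", "no training"])
        for patient in patients:
            injection = rng.choice(["corticosteroid", "lignocaine"])
            allocation[patient] = (practice, practice_arm, injection)
    return allocation


patients = {"practice_1": ["p1", "p2", "p3"], "practice_2": ["p4", "p5"]}
for patient, arms in split_plot_allocation(patients, seed=8).items():
    print(patient, arms)
```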

**Sarah:** And there's a multicenter extension on top of these. If an adequate sample is not available at one site, the trial is run across multiple centers. Within-center and between-center variance has to be accounted for. Multicenter trials enhance generalizability and create opportunities to detect interaction effects across sites.

**Kiffer:** Now once you've chosen an allocation method, the next concern is allocation concealment. And students sometimes confuse this with blinding, but they're different.

**Sarah:** Allocation concealment is about the moment of assignment. It asks whether the person enrolling participants can know in advance which arm a given participant will be assigned to. If they can know in advance, they may, consciously or not, manipulate enrollment, like bringing in healthier participants when they expect the next slot to be the active arm.

**Kiffer:** Blinding is about what happens after the assignment is made. Once the allocation is fixed, are the participant, the clinician, the outcome assessor, and the analyst kept unaware of which arm the participant is in?

**Sarah:** Adequate methods of concealment include central randomization, where the enrolling clinician contacts a central office that holds the allocation list and reveals the assignment only after the participant is enrolled. Sequentially numbered, sealed, opaque envelopes, where the allocation sits inside an envelope that can't be seen through and is opened only after enrollment. And pharmacy-controlled allocation, where the pharmacy holds the list and dispenses the assigned product.

**Kiffer:** Inadequate methods include open random number tables that the enroller can see in advance, and predictable alternation. Anything that lets the enroller anticipate the next assignment is inadequate.
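
The logic of central randomization can be sketched as a small service. The service generates and holds the list, and an assignment is revealed only after an enrollment is recorded. The class is hypothetical. Python's leading underscore is only a naming convention, so a real system would enforce the secrecy through access control on a separate server.

```python
import random


class CentralRandomizer:
    """Toy central randomization office (hypothetical sketch).

    The allocation list is generated and held here. An enrolling site can only
    obtain an assignment by first registering an enrolled participant, so it
    cannot see upcoming allocations in advance.
    """

    def __init__(self, n, seed=None):
        rng = random.Random(seed)
        self._sequence = [rng.choice("AB") for _ in range(n)]  # never exposed
        self._log = []  # audit trail of (participant_id, arm)

    def enroll(self, participant_id):
        """Record the enrollment, then reveal that participant's assignment."""
        if any(pid == participant_id for pid, _ in self._log):
            raise ValueError(f"{participant_id} is already enrolled")
        if len(self._log) >= len(self._sequence):
            raise RuntimeError("allocation list exhausted")
        arm = self._sequence[len(self._log)]
        self._log.append((participant_id, arm))
        return arm


office = CentralRandomizer(3, seed=1)
print(office.enroll("P001"))
```

Refusing a repeat enrollment matters here. Without that check, a site could "re-enroll" someone to fish for a different arm.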

**Sarah:** Now blinding, also called masking. There are three levels worth distinguishing.

**Kiffer:** Single-blind. Only the participant is unaware of which arm they were assigned to. This addresses what's called performance bias. Participants who know they're getting the active treatment may report symptoms differently, may be more compliant, may have a placebo response tied to expectation. Single-blinding equalizes those effects.

**Sarah:** Double-blind adds the clinician and, in many trials, the outcome assessor. If a clinician doesn't know which arm a patient is in, they can't subconsciously deliver more intensive co-interventions to the active arm. If the outcome assessor doesn't know, they can't subconsciously rate symptoms more favorably. The first problem is another form of performance bias. The second is usually called detection bias, or ascertainment bias.

**Kiffer:** Triple-blind extends masking to the data analyst. The analyst sees data labeled A and B without knowing which letter is the intervention. That protects subtle analytic decisions, like how to define subgroups, how to handle outliers, and which model specifications to try, from being unconsciously steered by knowing which arm got the new treatment.

**Sarah:** And these are nested. Triple-blinding includes double-blinding, which includes single-blinding. Each layer prevents an additional category of bias and is harder to achieve in practice.

**Kiffer:** Now the empirical evidence on why this matters. Schulz and colleagues in 1995 compared trials with and without adequate allocation concealment. Trials with inadequate or unclear concealment exaggerated treatment effects, inflating odds ratios by roughly thirty to forty percent on average compared to trials with adequate concealment.

**Sarah:** And Wood and colleagues in 2008 looked at the impact of blinding on subjective outcomes. Lack of blinding inflated effect estimates on subjective outcomes by roughly fifteen to twenty-five percent. Put inadequate concealment and lack of blinding together, and the combined inflation could plausibly reach fifty percent.

**Kiffer:** A big chunk of the apparent effect, gone, just because of methodological choices. So when reviewers and meta-analysts pay close attention to concealment and blinding, it's not pedantry. They're correcting for substantial, measurable bias.

**Sarah:** Then outcome measurement. A controlled trial should be limited to one or two primary outcomes and a small number, typically one to three, of secondary outcomes. Too many outcomes create a multiple-comparisons problem and inflate the false-positive rate.

**Kiffer:** Outcomes can be on three scales. Dichotomous, like cured or not cured. Continuous, like blood pressure or a quality-of-life score. And time-to-event, where what matters is when the event happens, not just whether it happens.

**Sarah:** Outcomes should ideally be assessed using validated instruments, meaning measurement tools whose reliability and validity have been established in prior research. And every trial needs to measure both efficacy outcomes and safety outcomes. A trial that only reports efficacy is incomplete.

**Kiffer:** Then sample size. Inputs are the Type 1 error rate, often denoted alpha, conventionally zero point zero five. The power, conventionally zero point eight zero. The expected effect size. And the variability of the outcome. Alpha is the probability of falsely declaring an effect when none exists. Power is the probability of correctly detecting an effect when one truly exists.
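
The inputs combine in the standard normal-approximation formula for comparing two means. It assumes equal arms, a common standard deviation, and a two-sided test. The example effect and SD, a 5 mmHg difference with SD 10 mmHg, are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_means(delta, sd, alpha=0.05, power=0.80):
    """Participants per arm to detect a mean difference `delta` (two-sided test).

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})**2 * sd**2 / delta**2
    """
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * sd**2 / delta**2
    return ceil(n)


print(n_per_arm_means(delta=5, sd=10))    # → 63
print(n_per_arm_means(delta=2.5, sd=10))  # → 252
```

Halving the expected effect roughly quadruples the required sample. That is why an optimistic effect-size guess is the most common way a trial ends up underpowered.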

**Sarah:** Special case worth flagging. Sample size for cluster randomized trials. The analysis has to account for within-cluster similarity, because people within the same school, the same village, the same clinic, tend to be more similar to each other than to people in other clusters. That similarity reduces the effective sample size.

**Kiffer:** The amount of within-cluster similarity is summarized by the intracluster correlation coefficient, abbreviated as I-C-C, sometimes denoted by the Greek letter rho. The I-C-C is on a zero-to-one scale. Health behaviors typically show I-C-C values between about zero point zero one and zero point one zero.

**Sarah:** And the inflation factor is called the design effect. The design effect equals one plus the quantity m minus one times the intracluster correlation coefficient, where m is the average cluster size.

**Kiffer:** Concrete numbers help here. If your I-C-C is zero point zero two and your average cluster size is fifty-one students per school, the design effect is one plus fifty times zero point zero two, which is two. Twice the sample size of an individually randomized trial. If the I-C-C is zero point zero five with cluster size of forty-one, the design effect is one plus forty times zero point zero five, which is three. Three times the sample size.
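
The design-effect arithmetic above, written as a small helper. The function names are illustrative. `cluster_trial_n` simply multiplies an individually randomized sample size by the design effect and rounds up.

```python
from math import ceil


def design_effect(m, icc):
    """DE = 1 + (m - 1) * ICC, where m is the average cluster size."""
    return 1 + (m - 1) * icc


def cluster_trial_n(n_individual, m, icc):
    """Inflate an individually randomized sample size to allow for clustering."""
    return ceil(n_individual * design_effect(m, icc))


print(design_effect(51, 0.02))  # → 2.0
print(design_effect(41, 0.05))  # → 3.0
print(cluster_trial_n(126, 51, 0.02))
```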

**Sarah:** Even small intracluster correlations produce substantial inflation. A useful rule is that adding more individuals to existing clusters yields diminishing returns once the per-cluster size exceeds one over the I-C-C. Adding more clusters is often more efficient than adding more individuals per cluster.
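
The diminishing-returns rule follows from the effective sample size. That is the number of independent observations k clusters of size m are worth: k times m, divided by the design effect. This sketch assumes equal cluster sizes and an ICC of 0.02, and the school counts are hypothetical.

```python
def effective_sample_size(k, m, icc):
    """Effective number of independent observations for k clusters of size m."""
    return k * m / (1 + (m - 1) * icc)


icc = 0.02
# Piling more students into the same 20 schools...
for m in (10, 50, 100, 500, 5000):
    print(f"20 schools x {m:>4} students: n_eff = {effective_sample_size(20, m, icc):7.1f}")
# ...versus recruiting more schools of a modest size.
print(f"40 schools x   50 students: n_eff = {effective_sample_size(40, 50, icc):7.1f}")
```

As m grows, the effective sample size approaches k / ICC. So at an ICC of 0.02, 20 schools can never supply more than 1,000 effective observations, even with 100,000 students. Forty schools of 50 students, 2,000 students in total, already edge past that ceiling.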

**Kiffer:** That takes us to Section 3. Trial conduct, analysis, and reporting.

**Sarah:** Trial conduct first. The practical work is dominated by recruitment, retention, and adherence monitoring. The follow-up period must be long enough to capture the outcomes of interest. Some loss is inevitable through dropout or non-compliance.

**Kiffer:** Strategies include regular communication with participants through reminders and updates. Capturing data on dropouts through routine administrative databases when participants consent. And verifying compliance directly, through interviews or biological samples such as drug or metabolite levels in blood, or indirectly, through pill counts and packaging returns.

**Sarah:** Plus ethical oversight. Most trials of any size are monitored by an independent data and safety monitoring board, abbreviated as a D-S-M-B. The data and safety monitoring board reviews accumulating safety and efficacy data on a pre-specified schedule and has the authority to recommend stopping the trial if there's clear evidence of harm, clear evidence of benefit, or futility, meaning the trial is unlikely to detect a meaningful effect even if completed.

**Kiffer:** And the board is independent of the investigators precisely because the investigators have a stake in the trial continuing. An independent board has no such stake.

**Sarah:** Now analysis. The most important distinction is intention-to-treat versus per-protocol.

**Kiffer:** Intention-to-treat, abbreviated I-T-T, analyzes all subjects in the arm to which they were originally randomized, regardless of whether they actually received the intervention or complied with the regimen. So if a participant was randomized to the drug arm but never took the drug, they're still analyzed as part of the drug arm.

**Sarah:** That can sound counterintuitive. If someone didn't take the drug, why count them as a drug-arm participant? Because intention-to-treat preserves the benefits of randomization. The randomization made the arms comparable at baseline. The moment you start moving people between analytical groups based on what happened after randomization, you're using post-randomization information, which can re-introduce selection effects.

**Kiffer:** Per-protocol analysis is the alternative. Only subjects who complied with and completed the study as specified are analyzed. Per-protocol gives you an estimate of the effect under ideal compliance. The trouble is that non-compliance is rarely random. Non-compliers are often sicker, have more side effects, and have less social support. Excluding them creates a selection process that breaks the randomization.

**Sarah:** There's a beautiful canonical example here. The Coronary Drug Project, run in the 1960s and 1970s, was a placebo-controlled trial of lipid-lowering drugs in men with prior heart attacks. Once the trial was over, the investigators looked at the placebo arm alone and asked whether placebo adherers, the men who reliably took their assigned placebo pills, did better than placebo non-adherers.

**Kiffer:** And the result was striking. The placebo adherers had five-year mortality of about fifteen percent, compared with about twenty-eight percent among the placebo non-adherers. Same arm. Same placebo. Same trial. The only difference was who took their pills.

**Sarah:** Which means the gap had nothing to do with the placebo. People who adhere to medication regimens tend to be healthier in other ways. Steadier social support. More regular sleep. More predictable eating. They show up to medical appointments.

**Kiffer:** That phenomenon is called the healthy adherer effect. And it's why per-protocol analyses are dangerous. If you compare adherers to non-adherers, you're not really comparing the drug to placebo. You're comparing healthy adherers to less-healthy non-adherers.
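
A toy simulation shows how the healthy adherer effect arises with no active drug at all. Every probability below is invented for illustration. The model assumes a single hidden "frailty" trait that lowers adherence and raises the risk of death, while the placebo itself does nothing.

```python
import random


def healthy_adherer_demo(n=20_000, seed=42):
    """Simulate a placebo arm where frailty drives both adherence and death.

    All probabilities are invented. Placebo has no effect, so any mortality
    gap between adherers and non-adherers is pure confounding by frailty.
    """
    rng = random.Random(seed)
    deaths = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        frail = rng.random() < 0.5
        adheres = rng.random() < (0.5 if frail else 0.9)
        dies = rng.random() < (0.3 if frail else 0.1)
        counts[adheres] += 1
        deaths[adheres] += dies  # bool counts as 0 or 1
    return deaths[True] / counts[True], deaths[False] / counts[False]


adherer_mortality, nonadherer_mortality = healthy_adherer_demo()
print(f"adherers {adherer_mortality:.3f} vs non-adherers {nonadherer_mortality:.3f}")
```

With these made-up inputs, the adherers' mortality comes out well below the non-adherers', at roughly 0.17 versus 0.27. A per-protocol comparison would wrongly credit that gap to the treatment.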

**Sarah:** So intention-to-treat is the recommended primary analysis. Per-protocol can be reported as a secondary or sensitivity analysis but should not replace I-T-T. And whichever is primary, the number of subjects in each group, and whether they complied, must be reported transparently.

**Kiffer:** One more analytic concern worth flagging. The multiple comparisons problem comes up in trials from three sources. Multiple outcomes. Multiple subgroups. And periodic interim analyses. Each additional test inflates the family-wise error rate, meaning the chance of at least one false positive across the whole set of tests.

**Sarah:** The simplest correction is the Bonferroni adjustment. Divide the desired family-wise alpha by the number of comparisons. So with five comparisons at family-wise alpha of zero point zero five, each test must reach p less than zero point zero one to claim significance. And subgroup analyses deserve a special warning. Data-driven subgroup analyses generate spurious associations at alarming rates. Only subgroups planned in advance should be analyzed, ideally tested through a single overall interaction term rather than a battery of subgroup-specific tests.
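
The arithmetic behind the Bonferroni example. This sketch assumes the tests are independent and every null hypothesis is true. With correlated tests, Bonferroni is conservative, so the true family-wise rate is lower than computed here.

```python
def familywise_error(alpha_per_test, k):
    """P(at least one false positive) across k independent tests, all nulls true."""
    return 1 - (1 - alpha_per_test) ** k


def bonferroni_threshold(family_alpha, k):
    """Per-test significance threshold under the Bonferroni adjustment."""
    return family_alpha / k


k = 5
print(familywise_error(0.05, k))                           # about 0.226, unadjusted
print(bonferroni_threshold(0.05, k))                       # → 0.01
print(familywise_error(bonferroni_threshold(0.05, k), k))  # just under 0.05
```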

**Kiffer:** There's a special analytic topic for vaccine trials worth spending time on, because vaccines break a key assumption of standard trials.

**Sarah:** Which assumption are we breaking?

**Kiffer:** That the intervention's effect on one subject is independent of the effect on another. In a typical drug trial, my taking the drug doesn't affect whether you respond. We're independent. But for vaccines against communicable infections, that's not true. If I get vaccinated, I'm less likely to be infected, which means less likely to transmit, which means your risk drops even if you weren't vaccinated.

**Sarah:** To get a fuller picture, epidemiologists distinguish three measures of vaccine efficacy.

**Kiffer:** Direct vaccine efficacy is computed within a single population by comparing vaccinated and unvaccinated individuals. The direct effect captures the biological protection conferred by the vaccine on the individual.

**Sarah:** Indirect vaccine efficacy is the herd-immunity effect. It compares unvaccinated individuals in a high-coverage population to unvaccinated individuals in a low-coverage population. The indirect effect measures how much lower the disease risk is among unvaccinated people in the high-coverage area, simply because there's less transmission going around them.

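**Kiffer:** And total vaccine efficacy compares vaccinated individuals in the high-coverage population to unvaccinated individuals in the low-coverage population. The total effect captures the combined direct plus indirect protection a vaccinated person gets from both the vaccine and the community around them. Comparing the average incidence across the whole high-coverage population with the average across the whole low-coverage population is usually called the overall effect.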

**Sarah:** Estimating all three requires data from at least two populations with different vaccination coverage. The lesson illustrates this with a striking example from cholera. Ali and colleagues in 2005, with later analysis by Hudgens and Halloran in 2008, examined an individually randomized, placebo-controlled trial of killed oral cholera vaccines in residential areas, called baris, in Bangladesh. They compared two groups. Group A had more than fifty percent vaccination coverage. Group B had less than twenty-eight percent.

**Kiffer:** In group A, the high-coverage area, the direct vaccine efficacy was just zero point one four. That's about a fourteen percent reduction in disease among the vaccinated relative to the unvaccinated within group A. But in group B, the low-coverage area, the direct efficacy was zero point six two, a sixty-two percent reduction. And the indirect effect, comparing unvaccinated people in group A to unvaccinated people in group B, was zero point seven nine. Herd immunity reduced the risk among the unvaccinated by seventy-nine percent.

**Sarah:** Limiting analysis to the high-coverage population alone would have suggested the vaccine barely worked. Looking across populations with different coverage reveals the dominant role of indirect effects. For any vaccine evaluation that aims to inform public-health policy, you need all three measures.
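
Each measure is one minus a risk ratio. The only thing that changes is which two groups are compared. This sketch uses hypothetical risks per 1,000 people and hypothetical coverage levels; these are not the Bangladesh trial data. The naming of the contrasts varies somewhat across sources, so the code computes both of the following. The "total" contrast compares vaccinated people in the high-coverage population with unvaccinated people in the low-coverage one. The "overall" contrast compares population-wide averages.

```python
def efficacy(risk_compared, risk_reference):
    """Vaccine-efficacy-style measure: 1 minus the risk ratio."""
    return 1 - risk_compared / risk_reference


def vaccine_effects(r_a_vax, r_a_unvax, cov_a, r_b_vax, r_b_unvax, cov_b):
    """Direct, indirect, total and overall contrasts between two populations.

    Population A has high coverage, B low coverage. Risks may be in any common
    unit; cov_a and cov_b are the vaccinated fractions.
    """
    average_a = cov_a * r_a_vax + (1 - cov_a) * r_a_unvax
    average_b = cov_b * r_b_vax + (1 - cov_b) * r_b_unvax
    return {
        "direct_A": efficacy(r_a_vax, r_a_unvax),
        "direct_B": efficacy(r_b_vax, r_b_unvax),
        "indirect": efficacy(r_a_unvax, r_b_unvax),
        "total": efficacy(r_a_vax, r_b_unvax),
        "overall": efficacy(average_a, average_b),
    }


# Hypothetical risks per 1,000 and coverage levels (illustrative only).
effects = vaccine_effects(1.2, 1.5, 0.6, 2.8, 7.0, 0.2)
for name, value in effects.items():
    print(f"{name:9s} {value:.2f}")
```

The pattern mirrors the lesson's point. The direct effect looks weak in the high-coverage population, at 0.20 against 0.60 in the low-coverage one, because herd protection has already pushed down the risk of the unvaccinated comparison group there.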

**Kiffer:** Last piece of Section 3. Reporting. The CONSORT 2010 statement is the dominant reporting guideline for parallel-group randomized trials. It's a twenty-five-item checklist covering the trial design, the participants, the interventions, the outcomes, the sample-size calculation, the randomization sequence and concealment, the blinding, the statistical methods, the participant flow, the recruitment, the baseline data, the numbers analyzed, the effect sizes, the harms, the discussion, and trial registration and funding.

**Sarah:** And the centerpiece of CONSORT is the participant flow diagram. It has four stages. Enrollment, where you assess eligibility and record who was excluded before randomization. Then, separately for each arm: allocation, where enrolled participants are randomized and you record who received the assigned intervention; follow-up, where you track who continued and who was lost; and analysis, where you specify how many participants from each arm contributed to the primary analysis.

**Kiffer:** The flow diagram lets a reader trace exactly what happened to every randomized participant. How many were screened. How many were eligible. How many consented. How many were assigned to each arm. How many actually received the assigned intervention. How many were lost to follow-up, and for what reasons. And how many were included in the analysis.
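One way to make "every randomized participant is accounted for" concrete is a small consistency check over flow-diagram counts. This is a hypothetical sketch: the class and field names are invented for illustration and are not part of CONSORT, and the counts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ArmFlow:
    """Counts for one arm of a CONSORT-style flow diagram (hypothetical field names)."""
    name: str
    allocated: int
    received_allocated_intervention: int
    lost_to_follow_up: int
    analyzed: int
    excluded_from_analysis: int


@dataclass
class TrialFlow:
    """Enrollment counts plus one ArmFlow per arm."""
    assessed_for_eligibility: int
    excluded_before_randomization: int  # ineligible, declined to consent, other reasons
    arms: list[ArmFlow] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List every way the counts fail to account for all randomized participants."""
        issues = []
        randomized = self.assessed_for_eligibility - self.excluded_before_randomization
        allocated = sum(arm.allocated for arm in self.arms)
        if allocated != randomized:
            issues.append(f"{randomized} randomized but {allocated} allocated to arms")
        for arm in self.arms:
            if arm.received_allocated_intervention > arm.allocated:
                issues.append(f"{arm.name}: more received the intervention than were allocated")
            if arm.lost_to_follow_up > arm.allocated:
                issues.append(f"{arm.name}: more lost to follow-up than were allocated")
            accounted = arm.analyzed + arm.excluded_from_analysis
            if accounted != arm.allocated:
                issues.append(
                    f"{arm.name}: {arm.allocated} allocated, {accounted} analyzed or excluded"
                )
        return issues


# Hypothetical counts, invented for illustration only.
example = TrialFlow(500, 100, [
    ArmFlow("intervention", 200, 195, 10, 192, 8),
    ArmFlow("control", 200, 198, 12, 190, 10),
])
print(example.problems())  # an empty list: every randomized participant is accounted for

# The same trial with five control participants who silently vanish from the report.
broken = TrialFlow(500, 100, [
    ArmFlow("intervention", 200, 195, 10, 192, 8),
    ArmFlow("control", 200, 198, 12, 185, 10),
])
print(broken.problems())
```

Lost to follow-up is recorded but not forced into either the analyzed or the excluded bucket, because under intention-to-treat some of those participants may still be analyzed with imputed outcomes.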

**Sarah:** There are CONSORT extensions for cluster trials, non-inferiority and equivalence trials, non-pharmacological treatments, herbal interventions, and pragmatic trials. Each adds items specific to the design type.

**Kiffer:** And CONSORT matters for more than style. Poor reporting is associated with biased estimates of treatment effects. Poor reporting also keeps readers from judging a trial's reliability and makes it hard to extract data for systematic reviews. Following CONSORT during planning, not just at write-up, helps ensure that the methodological choices required for transparent reporting are actually made in the first place.

**Sarah:** Okay. Let's pull the takeaways together.

**Kiffer:** Takeaway one. Random allocation produces exchangeability. Groups comparable on measured and unmeasured factors at baseline. That's what no observational design can match.
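To see exchangeability at work, a toy simulation with made-up numbers compares simple randomization against a hypothetical judgment-based rule, on a baseline trait the investigator never measures. The trait, the rule, and the seeds are all assumptions of this sketch.

```python
import random
import statistics


def allocate_simply(n: int, seed: int) -> list[str]:
    """Simple randomization: an independent fair coin flip for each participant."""
    rng = random.Random(seed)
    return [rng.choice(["intervention", "control"]) for _ in range(n)]


def mean_gap(values: list[float], arms: list[str]) -> float:
    """Mean of `values` in the intervention arm minus the mean in the control arm."""
    treated = [v for v, a in zip(values, arms) if a == "intervention"]
    control = [v for v, a in zip(values, arms) if a == "control"]
    return statistics.fmean(treated) - statistics.fmean(control)


rng = random.Random(42)
# A hypothetical baseline trait that the investigator never measures.
hidden_trait = [rng.gauss(50, 10) for _ in range(10_000)]

random_arms = allocate_simply(len(hidden_trait), seed=7)
# Contrast: a judgment-based rule that happens to track the hidden trait.
judged_arms = ["intervention" if t > 50 else "control" for t in hidden_trait]

print(f"random allocation gap:   {mean_gap(hidden_trait, random_arms):.2f}")
print(f"judgment allocation gap: {mean_gap(hidden_trait, judged_arms):.2f}")
```

With 10,000 participants the randomized gap is a fraction of a point, while the judgment rule opens a gap of roughly 16 points. Balance holds on average: with only a few dozen participants, chance alone can leave visible imbalances.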

**Sarah:** Takeaway two. Trial infrastructure matters. Prospective registration, in a registry that feeds the World Health Organization International Clinical Trials Registry Platform, reduces publication bias. The CONSORT 2010 statement provides a twenty-five-item reporting checklist.

**Kiffer:** Takeaway three. Phases of clinical research. Phase 0 first-in-human, Phase 1 safety, Phase 2 efficacy signal, Phase 3 pivotal efficacy, Phase 4 post-marketing surveillance. Each asks a different question and uses a different sample size.

**Sarah:** Takeaway four. Three nested populations. Target, source, study group. Eligibility criteria balance internal validity, which favors narrow criteria, against generalizability, which favors broad criteria. Use criteria that mirror real-world use.

**Kiffer:** Takeaway five. Six allocation methods. Simple, stratified, cross-over, factorial, cluster, and split-plot. Plus the multicentre extension.

**Sarah:** Takeaway six. Allocation concealment guards the gate at enrollment. Blinding manages knowledge after assignment. Schulz and colleagues in 1995 showed that inadequate or unclear concealment exaggerates treatment effects, measured as odds ratios, by thirty to forty percent. Wood and colleagues in 2008 showed that lack of blinding exaggerates effects on subjective outcomes by fifteen to twenty-five percent. Combined, the exaggeration can reach fifty percent.
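How might the concealment and blinding figures combine? A rough sketch under two stated assumptions: each exaggeration is read as a ratio of odds ratios equal to one minus the exaggeration, and the two biases combine multiplicatively and independently. Neither assumption comes from Schulz or Wood, and the true odds ratio of 0.80 is hypothetical.

```python
def exaggerated_or(true_or: float, *exaggerations: float) -> float:
    """Apply proportional exaggerations to a beneficial odds ratio (below 1).

    Each exaggeration e is treated as a ratio of odds ratios of (1 - e): a 30%
    exaggeration multiplies the odds ratio by 0.70, pushing it further below 1.
    Multiplying several together assumes the biases act independently.
    """
    result = true_or
    for e in exaggerations:
        result *= 1.0 - e
    return result


def combined_exaggeration(*exaggerations: float) -> float:
    """Overall proportional exaggeration implied by several independent biases."""
    return 1.0 - exaggerated_or(1.0, *exaggerations)


# Ranges quoted in the takeaway: concealment 30-40%, blinding 15-25%.
low = combined_exaggeration(0.30, 0.15)   # 1 - 0.70 * 0.85
high = combined_exaggeration(0.40, 0.25)  # 1 - 0.60 * 0.75
print(f"combined exaggeration: {low:.0%} to {high:.0%}")

# A hypothetical true odds ratio of 0.80 under the worst-case combination.
print(f"observed OR: {exaggerated_or(0.80, 0.40, 0.25):.2f}")
```

Under those assumptions the quoted ranges imply a combined exaggeration of roughly 40 to 55 percent, consistent with the figure of up to fifty percent, and a true odds ratio of 0.80 would appear as about 0.36.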

**Kiffer:** Takeaway seven. Sample size calculations rest on alpha, power, effect size, and variability. For cluster trials, multiply by the design effect, which equals one plus the quantity m minus one times the intracluster correlation coefficient, where m is the average cluster size.
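A worked example of that arithmetic. This sketch uses one common normal-approximation formula for comparing two proportions, then inflates the result by the design effect. The planning values (20 percent versus 10 percent risk, clusters of 20, an intracluster correlation of 0.05) are hypothetical, and equal cluster sizes are assumed.

```python
import math
from statistics import NormalDist


def n_per_arm_two_proportions(p1: float, p2: float,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate participants per arm to compare two proportions (individual randomization).

    n = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    Returned unrounded so it can be inflated before rounding up.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2


def design_effect(m: float, icc: float) -> float:
    """Design effect for cluster randomization: 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc


raw = n_per_arm_two_proportions(0.20, 0.10)
deff = design_effect(m=20, icc=0.05)
individual_n = math.ceil(raw)
cluster_n = math.ceil(raw * deff)          # inflate first, then round up
clusters_per_arm = math.ceil(cluster_n / 20)

print(f"individually randomized: {individual_n} per arm")
print(f"design effect: {deff:.2f}")
print(f"cluster randomized: {cluster_n} per arm, {clusters_per_arm} clusters of 20")
```

With these inputs the individually randomized trial needs 197 per arm; a design effect of 1.95 pushes that to 383 per arm, or 20 clusters of 20.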

**Sarah:** Takeaway eight. Intention-to-treat analysis preserves randomization. Per-protocol analysis is vulnerable to the healthy adherer effect, illustrated by the Coronary Drug Project, where placebo adherers had about fifteen percent five-year mortality, compared with about twenty-eight percent among placebo non-adherers.
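The healthy adherer effect can be reproduced in a toy simulation. Every parameter here is invented, and the drug is deliberately given no effect at all: a hypothetical hidden frailty trait lowers adherence and raises mortality in both arms.

```python
import random
from typing import Optional


def simulate_trial(n_per_arm: int, seed: int) -> list[tuple[str, bool, bool]]:
    """Toy trial where the drug has NO effect; returns (arm, adhered, died) per person."""
    rng = random.Random(seed)
    rows = []
    for arm in ("drug", "placebo"):
        for _ in range(n_per_arm):
            frail = rng.random() < 0.30                          # hypothetical hidden frailty
            adhered = rng.random() < (0.50 if frail else 0.85)  # frail people adhere less
            died = rng.random() < (0.30 if frail else 0.10)     # risk ignores the arm
            rows.append((arm, adhered, died))
    return rows


def mortality(rows: list[tuple[str, bool, bool]], arm: str,
              adhered: Optional[bool] = None) -> float:
    """Proportion who died in `arm`, optionally restricted by adherence status."""
    deaths = [died for a, adh, died in rows
              if a == arm and (adhered is None or adh == adhered)]
    return sum(deaths) / len(deaths)


rows = simulate_trial(n_per_arm=20_000, seed=2024)

# Intention-to-treat: compare arms as randomized, regardless of adherence.
itt_difference = mortality(rows, "drug") - mortality(rows, "placebo")

# Coronary Drug Project-style contrast within the placebo arm.
placebo_adherers = mortality(rows, "placebo", adhered=True)
placebo_non_adherers = mortality(rows, "placebo", adhered=False)

print(f"ITT mortality difference: {itt_difference:+.3f}")
print(f"placebo adherers: {placebo_adherers:.3f}, non-adherers: {placebo_non_adherers:.3f}")
```

The intention-to-treat difference sits near zero, as it should for an inert drug, while placebo adherers die noticeably less often than placebo non-adherers. In this toy, adherence works the same way in both arms; when a real drug changes who adheres, through side effects for example, restricting the analysis to adherers breaks the randomized comparison.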

**Kiffer:** Takeaway nine. Vaccine trials require three efficacy measures. Direct, indirect, and total. Estimating all three requires populations with different coverage.

**Sarah:** Takeaway ten. CONSORT 2010 is the reporting standard. Twenty-five items. The flow diagram tracking enrollment, allocation, follow-up, and analysis is the centerpiece. Following CONSORT during design, not just write-up, makes transparent reporting achievable.

**Kiffer:** And one more piece of framing before we close. The randomized trial is the cleanest design we have, and it sits at or near the top of the conventional hierarchy of evidence. But that doesn't mean every question can or should be answered with a trial. You can't randomize people to poverty. You can't randomize them to live in a polluted neighborhood. You can't randomize them to experience racism. The trial sits alongside the observational toolkit, not above it. Each design has its place. Knowing when to reach for which is most of the methodological game.

**Sarah:** And Lesson 11 picks up from there, turning to validity in observational studies. Two reporting checklists anchor that conversation about what makes a study trustworthy: CONSORT, which you've met here, and the STROBE statement for observational research, from earlier lessons.

**Kiffer:** That's Lesson 10. The design, conduct, analysis, and reporting of randomized controlled trials. Thanks for listening.

**Sarah:** See you in Lesson 11.