# Lesson 2 — Systematic Reviews & Meta-Analysis (v3 expanded)

*Companion-podcast transcript • Sarah & Kiffer* 
*~5,450 words • ~29.5 min audio*

---

**Sarah:** Welcome back to Office Hours, the companion podcast. I'm Sarah.

**Kiffer:** And I'm Kiffer. Today we're working through Lesson 2, Systematic Reviews and Meta-Analysis. And before we get into it, I want to flag something about where this lesson sits in the course.

**Sarah:** Yeah, the placement is unusual. Most introductory epidemiology courses save systematic reviews and meta-analysis for the very end. Why are we doing it second?

**Kiffer:** It's a deliberate choice, and the lesson opens by defending it. Lesson 1 set up the foundations — the history, the ways of knowing, the trust scaffolding that lets a body of evidence accumulate. Lesson 1 ended with you committing to evidence-based reasoning as a way of knowing. Lesson 2 picks up at exactly that seam and asks the obvious follow-up: once you commit to evidence, how do you tell stronger evidence from weaker?

**Sarah:** And the lesson argues you need that answer before you start reading individual studies. Otherwise every study you pick up is floating in space without any sense of how it fits into the wider literature.

**Kiffer:** Right. From Lesson 3 onward in this course, we're going to walk through the rungs below the apex — case-control, cohort, ecological, cross-sectional studies, and all the threats that compromise each one. But before we look down at any single study, we need a way of thinking about how the field as a whole organises its evidence. That organising idea is the hierarchy of knowledge, and it's where this lesson starts.

**Sarah:** Okay, so let's start there. Section 1 — Hierarchy of Knowledge and Systematic Reviews.

**Kiffer:** The lesson opens with what's probably the single most recognisable picture in epidemiology — the evidence pyramid. Imagine a triangle. At the base, the easy stuff: expert opinion, individual experience, single case reports. As you move up, you add structured comparison — case series, cross-sectional studies, case-control, cohort. Higher still, you get interventional designs — randomised controlled trials, where the investigator randomly assigns the exposure. And at the very apex, you get the designs that synthesise across everything below: systematic reviews and meta-analyses.

**Sarah:** And the lesson is careful about what the pyramid actually is. It calls it a heuristic, not a verdict.

**Kiffer:** Yeah, this is important. The pyramid is not a strict ranking of every study against every other. What it tells you is how much variation a particular design has *ruled out*. An RCT under ideal conditions rules out confounding because randomisation balances measured and unmeasured variables across arms, on average. A cohort study has structured comparison but can't randomise. A case report rules out nothing.

**Sarah:** And the headline caveat — a poorly conducted systematic review is weaker evidence than a well-designed cohort study.

**Kiffer:** Exactly. The pyramid tells you where to look for the strongest *available* evidence on a question. It does not exempt you from appraising what you actually find. A textbook meta-analysis can still be wrong if the underlying studies are all biased in the same direction, or if half the literature was never published.

**Sarah:** Before we leave the pyramid, the lesson introduces a second hierarchy that I think is one of the most useful frames in the whole course. The DIKW pyramid.

**Kiffer:** Yeah, this one's worth slowing down for. DIKW stands for Data, Information, Knowledge, Wisdom. It comes out of informatics and decision science. The idea is that there are four tiers as you move from raw observations to actionable judgement.

**Sarah:** Walk us through them.

**Kiffer:** Data are raw observations. Counts, lab values, dates, measurements. Information is data with structure. The same numbers, organised, contextualised, compared. Knowledge is information that's been appraised — filtered for quality, integrated across sources, defensible in print. And wisdom is knowledge applied in a particular context, under uncertainty, in service of a particular decision.

**Sarah:** And the punchline is that a meta-analysis is not the same thing as wisdom.

**Kiffer:** Right, and this is the part I want students to really sit with. A meta-analysis is the highest-order *knowledge* layer in the pyramid — the most thoroughly appraised, most heavily aggregated form of knowledge the field knows how to produce. But wisdom is what a clinician, a policymaker, or a community partner does with that knowledge in their particular case. And that step is irreducibly human. It's contextual. It's value-laden.

**Sarah:** And the critical epidemiology call from Lesson 1 — the questions about who benefits, who's harmed, whose priorities — those live at the wisdom layer.

**Kiffer:** They live at the wisdom layer. They don't live in the forest plot. The forest plot is the knowledge layer. Keeping those two hierarchies side by side is a useful corrective against the temptation to treat a pooled estimate as if it answered every question on its own.

**Sarah:** Okay. With the hierarchy in mind, the rest of Section 1 turns to the question — what *is* a systematic review, and how is it different from the kind of review you've probably read a thousand times in undergrad?

**Kiffer:** The lesson contrasts two approaches. Narrative reviews, sometimes called traditional reviews, and systematic reviews. And the contrast is sharp.

**Sarah:** Walk through narrative reviews first.

**Kiffer:** A narrative review is what you typically get when a senior expert writes an essay summarising what they think the literature shows. It's an informal qualitative summary. There's no protocol. There's usually no comprehensive search. There's no formal risk-of-bias appraisal. Each study is considered individually, and the reviewer subjectively assesses the evidence.

**Sarah:** And the problems with that approach are predictable.

**Kiffer:** Predictable and severe. The reviewer brings preconceived opinions. The methodology isn't structured, so you can't tell whether they found everything relevant. Small but well-designed studies often get omitted because they look underpowered. The inclusion criteria are often unstated. And there's a tendency to weight all studies equally, when in fact a 30-patient case series should not get the same vote as a 5,000-patient cohort study.

**Sarah:** And the lesson is pointed about what narrative reviews are good for and what they're not good for.

**Kiffer:** Narrative reviews are fine for orientation. If you want a senior expert's read on where a field is, they're useful. But they should not be used to guide treatment decisions or policy decisions. The risk of bias is too high. That's the headline.

**Sarah:** Systematic reviews are the answer to those problems.

**Kiffer:** Right. A systematic review uses a structured, transparent methodology to identify, evaluate, and synthesise all relevant studies on a specific question. The word *systematic* is doing real work. Every decision is pre-specified, documented, and reproducible. Two researchers given the same protocol should be able to produce broadly the same result. That's the standard.

**Sarah:** And a systematic review may or may not include a meta-analysis. Those are two different things.

**Kiffer:** Yeah, this trips students up. The systematic review is the *process* — identify, appraise, synthesise. The meta-analysis is one particular *synthesis tool*, the quantitative pooling of effect estimates. A systematic review can synthesise qualitatively if the studies are too heterogeneous to pool. Most modern reviews include a meta-analysis when possible.

**Sarah:** Then the lesson walks through the seven steps of a systematic review. These come from Sargeant and colleagues in 2006, but they're the standard sequence across the field.

**Kiffer:** Let me run through them quickly because they organise everything else in the lesson.

**Sarah:** Go for it.

**Kiffer:** Step one. Specify the question. And the question should be driven by a clinical or policy objective, not by what data happen to be available. The lesson nudges you toward broader questions when possible, for example the ability of beta-blockers as a class to reduce heart attack risk rather than one specific drug, because broader questions generalise better.

**Sarah:** And the standard way of structuring the question is the PICO framework. Population, intervention or exposure, comparator, outcome. Sometimes extended with study design and timeframe, which makes it PICOS or PICOT.

**Kiffer:** Right. PICO turns a fuzzy clinical question into a precise, answerable one. *Does drug X work?* becomes *In adults with hypertension, does drug X compared to placebo reduce stroke incidence over five years?* That second version you can actually go searching for.

**Sarah:** Okay, step two.

**Kiffer:** Step two is laying out the protocol. The protocol is the systematic-review equivalent of the Materials and Methods section of a primary study. It specifies everything you'll do — the question, the eligibility criteria, the search strategy, the appraisal plan, the synthesis plan. And critically, it's written *before* you start.

**Sarah:** And this is where PROSPERO comes in.

**Kiffer:** Yeah. PROSPERO is the international prospective register of systematic reviews. It's hosted by the Centre for Reviews and Dissemination at the University of York. You submit your protocol there before you begin screening studies. It's the systematic-review equivalent of the clinical trials registry that primary trials use.

**Sarah:** And the role it plays is the same role pre-registration plays for trials. Time-stamping the plan so the world can tell if you changed it midway.

**Kiffer:** Exactly. It reduces outcome reporting bias. It reduces post-hoc tweaks to the inclusion criteria. It prevents two teams from duplicating effort on the same question. And it gives peer reviewers and editors a way to compare the final review against the planned protocol. Cochrane requires it. Most major journals require or encourage it.

**Sarah:** Right, and step three.

**Kiffer:** Find all the studies. And by *all*, the lesson means *all*. The literature search has to be complete and well-documented. You search the major databases — PubMed, Embase, the Cochrane Library — but you don't stop there. You hand-search reference lists of the papers you identify. You search grey literature, which means conference abstracts, theses, government reports, preprints. Anything that might have been done but not formally published.

**Sarah:** And the reason you go after grey literature is publication bias, which we'll come back to in Section 4.

**Kiffer:** Yeah. If you only search the journals, you systematically miss the studies that found nothing. Those are the ones least likely to make it into print.

**Sarah:** And then step four.

**Kiffer:** Determine relevance. This is where inclusion and exclusion criteria do their work. Inclusion criteria specify the populations, interventions, outcomes, and study designs you want. Exclusion criteria specify the things you'll knock out — language restrictions, date cutoffs, accessibility constraints. And the screening should be done by two independent reviewers — typically a title-and-abstract pass first, then a full-text pass — with disagreements resolved by a third reviewer or by discussion.

**Sarah:** The independence matters because subjective judgement creeps in fast. Two people doing it independently and then reconciling catches a lot of inconsistency.

**Kiffer:** Step five. Evaluate study quality. Each study's internal and external validity gets appraised using a formal risk-of-bias tool. For randomised trials, that's the Cochrane Risk of Bias tool, version 2 — often shortened to RoB 2. It walks you through five domains — the randomisation process, deviations from intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result.

**Sarah:** And for non-randomised studies, there's a parallel tool called ROBINS-I.

**Kiffer:** Right. ROBINS-I — Risk Of Bias In Non-randomised Studies of Interventions. Same idea, but it adds domains for confounding and selection of participants, because those are the threats that randomisation would have handled. You'd use ROBINS-I for cohort or case-control studies. There's also AMSTAR 2 — A Measurement Tool to Assess Systematic Reviews — which is what you'd use to appraise other systematic reviews, so for a review of reviews.

**Sarah:** Okay, step six.

**Kiffer:** Extract the data. From each included study, you need two things — the point estimate of the outcome and a measure of its precision. So that's typically an effect size and its standard error, or a confidence interval you can convert to a standard error. The extraction is done independently by two investigators using a standardised template. Disagreements are reconciled by discussion.
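
For readers following along in text, here is a minimal Python sketch of the conversion Kiffer mentions, turning a reported confidence interval into a standard error. It assumes a 95% interval built from a normal approximation and, for ratio measures, symmetry on the log scale. The function name `se_from_ci` and the example numbers are hypothetical, not taken from the lesson.

```python
import math

def se_from_ci(lower, upper, z=1.96, ratio=True):
    """Approximate a standard error from a reported confidence interval.

    Ratio measures (RR, OR) are assumed symmetric on the log scale, so the
    limits are logged first. z=1.96 assumes a 95% interval.
    """
    if ratio:
        lower, upper = math.log(lower), math.log(upper)
    return (upper - lower) / (2 * z)

# Hypothetical extracted result: OR 0.80 (95% CI 0.64 to 1.00)
log_or = math.log(0.80)
se_log_or = se_from_ci(0.64, 1.00)
print(f"log OR = {log_or:.3f}, SE = {se_log_or:.3f}")
```

The pair `(log_or, se_log_or)` is the "two numbers per study" that a summary-data meta-analysis works from.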

**Sarah:** And watch for duplicate publication. The same trial can appear in multiple papers, sometimes with slightly different reported results.

**Kiffer:** Yeah, you'd hate to count the same data twice. That's a real error in the literature — there are documented cases where the same trial appears under different study names and ends up in a meta-analysis as if it were two independent studies.

**Sarah:** And the final one — step seven.

**Kiffer:** Summarise and synthesise. This is where you decide qualitative versus quantitative. A qualitative synthesis is a structured narrative with a summary table and maybe a figure. A quantitative synthesis is a meta-analysis. The meta-analysis gives you a pooled estimate weighted by each study's precision, plus tools for exploring why studies disagree.

**Sarah:** Two more tools sit at the end of Section 1. PRISMA and GRADE.

**Kiffer:** PRISMA is the reporting standard. It stands for Preferred Reporting Items for Systematic Reviews and Meta-Analyses. The 2020 version has a 27-item checklist and a flow diagram that runs from identification through screening to the included studies; the older 2009 diagram split this into four phases: identification, screening, eligibility, included. You've all seen PRISMA flow diagrams in papers. The boxes with the numbers, *X records identified, Y after duplicates removed, Z screened*, and so on down to the final included set.

**Sarah:** PRISMA doesn't tell you the review is good. It tells you what information should be present so you can decide.

**Kiffer:** Same point we'll make about STROBE in later lessons. Reporting completeness is necessary for evaluation. It doesn't guarantee validity.

**Sarah:** And GRADE.

**Kiffer:** GRADE stands for Grading of Recommendations, Assessment, Development and Evaluation. It's a framework for rating how much confidence the reader should have in the body of evidence for each outcome of a review. The logic is straightforward but the bookkeeping is detailed.

**Sarah:** Walk through how it works.

**Kiffer:** Each outcome starts at a baseline level of certainty depending on the design — randomised trials start at *high*, observational studies start at *low*. Then GRADE asks you to downgrade or upgrade across eight domains. Five domains can downgrade — risk of bias, inconsistency, indirectness, imprecision, and publication bias. Three domains can upgrade — large effect size, dose-response gradient, and plausible confounding that would attenuate rather than create the observed effect.

**Sarah:** And the final rating is one of four levels.

**Kiffer:** High, moderate, low, or very low. High means we're very confident the true effect is close to the estimate. Very low means we have very little confidence — the true effect could be substantially different. And the GRADE rating is usually presented in a Summary of Findings table alongside the pooled effect estimates, so decision-makers can weigh the size of the effect against the trustworthiness of the evidence.
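
The bookkeeping described above can be sketched as a toy tally in Python. This is only an illustration of the start-then-adjust logic: real GRADE ratings are structured judgements made domain by domain, not arithmetic, and the function name and arguments here are hypothetical.

```python
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomised, downgrades=0, upgrades=0):
    """Toy GRADE tally.

    randomised: True starts at 'high', False (observational) starts at 'low'.
    downgrades / upgrades: total levels moved down / up across the domains.
    The result is clamped to the four-level scale.
    """
    start = LEVELS.index("high") if randomised else LEVELS.index("low")
    final = max(0, min(len(LEVELS) - 1, start - downgrades + upgrades))
    return LEVELS[final]

# Randomised trials, downgraded one level (say, for imprecision)
print(grade_certainty(randomised=True, downgrades=1))
# Observational studies, upgraded one level (say, for a large effect)
print(grade_certainty(randomised=False, upgrades=1))
```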

**Sarah:** GRADE is endorsed by Cochrane, by the World Health Organization, by NICE in the UK, and by over a hundred other organisations. It's the de facto standard.

**Kiffer:** That's Section 1. The systematic review is a structured, reproducible process. Section 2 turns to the quantitative half — the meta-analysis itself.

**Sarah:** And the central question in Section 2 is — how do you combine effect estimates from multiple studies into a single pooled estimate?

**Kiffer:** The lesson starts with a definition from Gene Glass in 1976. A meta-analysis is, quote, the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings.

**Sarah:** And it has two objectives.

**Kiffer:** Two objectives. First, provide an overall estimate of the effect by pooling across studies. Second, explore reasons for variation across studies. The first gives you a pooled answer. The second tells you whether that pooled answer is actually meaningful or whether studies are disagreeing so badly that the average is misleading.

**Sarah:** And the lesson catalogues three types of data you can pool.

**Kiffer:** Yeah. Three types, in increasing order of richness and decreasing order of availability.

**Sarah:** Walk us through them.

**Kiffer:** Summary estimate data. This is what most meta-analyses use because it's what's reported in published papers. From each study you grab the point estimate of the effect — a risk ratio, an odds ratio, a mean difference — plus its standard error or confidence interval. That's it. Two numbers per study.

**Sarah:** And summary data are easy to find but limited.

**Kiffer:** Limited because you can only explore *study-level* sources of variation. You can ask whether older studies show different effects from newer ones, but you can't ask whether older *patients* show different effects from younger ones unless every study reported separately by age group.

**Sarah:** And the second type.

**Kiffer:** Group data. The two-by-two table per study. So for a binary outcome, those are the cell counts for treated versus control, with and without the event. For a continuous outcome, it's the number, mean, and standard deviation in each arm. Group data let you compute different effect measures from the same underlying data — you can produce a risk ratio, an odds ratio, or a risk difference depending on what the question calls for.
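
To make the point about group data concrete, here is a small Python sketch that computes three effect measures from one two-by-two table. The cell labels `a, b, c, d` and the example counts are assumptions chosen for illustration.

```python
def effect_measures(a, b, c, d):
    """Effect measures from a 2x2 table.

    a, b: treated participants with / without the event
    c, d: control participants with / without the event
    """
    risk_treated = a / (a + b)
    risk_control = c / (c + d)
    return {
        "risk_ratio": risk_treated / risk_control,
        "odds_ratio": (a * d) / (b * c),
        "risk_difference": risk_treated - risk_control,
    }

# Hypothetical trial: 15 of 100 events on treatment, 30 of 100 on control
measures = effect_measures(15, 85, 30, 70)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With only a published odds ratio and its confidence interval, the other two measures could not be recovered this way, which is why group data are the richer format.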

**Sarah:** And the third.

**Kiffer:** Individual patient data. IPD. This is the raw data — every participant's outcome and every participant's covariates. It's the gold standard. With IPD you can explore patient-level effect modification. You can ask whether the drug works differently in older versus younger patients without depending on whether the original authors happened to publish that subgroup. You can re-analyse with consistent methods across studies.

**Sarah:** But you almost never get it.

**Kiffer:** Yeah. IPD meta-analyses are expensive and time-consuming because you have to negotiate with the original investigators for access to their raw data. Some collaborations build the infrastructure to do this — the IPD meta-analyses of statins, of antihypertensives — but it's the exception, not the rule. Most meta-analyses you'll read use summary data.

**Sarah:** Okay. Now the central design choice in Section 2. Fixed-effects versus random-effects models.

**Kiffer:** And the lesson is really clear that this is not a technical detail. It's a substantive choice about what you think the studies are doing.

**Sarah:** Walk us through fixed-effects first.

**Kiffer:** The fixed-effects model assumes there is a single, true treatment effect that's the same across every study. Call it theta. Each study estimates that one true theta with some sampling error. The only reason studies disagree is that each one is a noisy estimate of the same underlying truth.

**Sarah:** And mathematically it's clean.

**Kiffer:** Yeah, the equation in the textbook — Equation 28.1 — says the observed effect in study *i* is equal to theta plus an error term with within-study variance V sub i. Weights come from inverse variance — W sub i equals one over V sub i. Larger, more precise studies get more weight. The pooled estimate is the weighted average.
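
A minimal Python sketch of the inverse-variance weighting just described, assuming effects are already on an additive scale such as the log odds ratio. The three studies are made up for illustration; this is not the lesson's metafor workflow.

```python
import math

def fixed_effect(effects, variances):
    """Inverse-variance fixed-effect pooling: W_i = 1 / V_i."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))  # standard error of the pooled estimate
    return pooled, se, weights

# Hypothetical log odds ratios and within-study variances
y = [-0.40, -0.10, -0.25]
v = [0.04, 0.01, 0.02]

pooled, se, weights = fixed_effect(y, v)
print(f"pooled = {pooled:.3f}, SE = {se:.3f}")
print("relative weights:", [round(w / sum(weights), 2) for w in weights])
```

The second study has the smallest variance, so it carries the largest share of the weight.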

**Sarah:** And the advantage of the fixed-effects model is simplicity. You don't have to estimate any between-study variance.

**Kiffer:** Right. The disadvantage is that the assumption is often untenable. If your studies were done in different populations, with different doses, different settings, different lengths of follow-up — is the true effect really constant? Almost certainly not. And when you assume it's constant and it isn't, your confidence interval is too narrow. Your significance tests are too aggressive.

**Sarah:** Which brings us to the random-effects model.

**Kiffer:** Random-effects assumes a *distribution* of true treatment effects across studies. There's still a central tendency — call it theta — but each study's true effect bounces around theta according to a normal distribution with variance tau-squared.

**Sarah:** And tau-squared is the heterogeneity parameter.

**Kiffer:** Tau-squared is the between-study variance. It's the new piece. The model now says the observed effect in study *i* equals theta, plus a random-effect term u sub i drawn from a normal with variance tau-squared, plus a sampling error term epsilon with variance V sub i. The weights become one over the sum V sub i plus tau-squared. So compared to fixed-effects, larger studies still get more weight but the weighting is more even. The pooled estimate's confidence interval is wider because it accounts for both within- and between-study variation.
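
Here is a hedged Python sketch of random-effects pooling with weights of one over V sub i plus tau-squared. The transcript does not say how tau-squared is estimated; this sketch assumes the DerSimonian-Laird moment estimator, which is one common choice among several. The data are hypothetical.

```python
import math

def random_effects_dl(effects, variances):
    """Random-effects pooling with a DerSimonian-Laird tau-squared (assumed estimator)."""
    k = len(effects)
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # truncated at zero
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2

# Hypothetical log odds ratios that disagree more than sampling error suggests
y = [-0.60, 0.00, -0.30]
v = [0.04, 0.01, 0.02]

pooled_re, se_re, tau2 = random_effects_dl(y, v)
se_fe = math.sqrt(1 / sum(1 / vi for vi in v))
print(f"tau^2 = {tau2:.4f}")
print(f"random-effects pooled = {pooled_re:.3f}, SE = {se_re:.3f}")
print(f"fixed-effect SE for comparison = {se_fe:.3f}")
```

Adding the same tau-squared to every study's variance flattens the weights, and the pooled standard error comes out larger than the fixed-effect one whenever tau-squared is positive.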

**Sarah:** And the key distinction the lesson draws is about what question each model is answering.

**Kiffer:** Yeah, this is the framing I find most useful. The fixed-effects model is asking, what is the single true effect? The random-effects model is asking, what is the average of the distribution of true effects? Two different questions. Two different answers.

**Sarah:** And the lesson is fairly directive about which to use.

**Kiffer:** Random-effects is now the default, because the assumption of a constant treatment effect across populations and settings is rarely justified. If you're combining studies done in different countries, different decades, different patient mixes — there's heterogeneity. Acknowledge it. Use random-effects. The exception is when you really do think you're pooling near-replications of the same protocol, which is rare.

**Sarah:** Quick note on weighting methods. The lesson mentions a couple of alternatives.

**Kiffer:** Yeah. The default is inverse-variance weighting, which works for both continuous and binary outcomes. For binary outcomes with sparse data — rare events, few studies — the Mantel-Haenszel or Peto methods may be preferred because they're more stable. And for continuous outcomes where studies used different measurement scales — five different depression questionnaires, say — you switch to standardised mean differences. Cohen's d or Hedges' g. These express the effect in standard deviation units so different scales become comparable.
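
As a sketch of the standardised mean difference, the Python below computes Cohen's d from per-arm summaries and applies the usual small-sample correction factor to get Hedges' g. The correction uses the common approximation 1 - 3 / (4 * df - 1); the arm summaries are hypothetical.

```python
import math

def cohens_d(n1, mean1, sd1, n2, mean2, sd2):
    """Mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    return (mean1 - mean2) / pooled_sd

def hedges_g(n1, mean1, sd1, n2, mean2, sd2):
    """Cohen's d with an approximate small-sample bias correction."""
    df = n1 + n2 - 2
    correction = 1 - 3 / (4 * df - 1)
    return correction * cohens_d(n1, mean1, sd1, n2, mean2, sd2)

# Hypothetical depression-scale scores: treatment arm vs control arm
d = cohens_d(20, 10.0, 4.0, 20, 12.0, 4.0)
g = hedges_g(20, 10.0, 4.0, 20, 12.0, 4.0)
print(f"d = {d:.3f}, g = {g:.3f}")
```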

**Sarah:** Okay. That's Section 2. Section 3 turns to the visual and quantitative tools for interpreting what your pooled estimate actually means.

**Kiffer:** And the centrepiece of Section 3 is the forest plot. If you've read any meta-analysis, you've seen one. It's the most important graphical output of the field.

**Sarah:** Walk us through what a forest plot is doing.

**Kiffer:** Imagine a vertical axis with one row per study. Each row has a horizontal line that represents the 95% confidence interval for that study's effect estimate. In the middle of the line is a box marking the point estimate. The *area* of the box is proportional to the study's weight in the pooled analysis — so larger, more precise studies get bigger boxes. At the bottom, a diamond. The horizontal centre of the diamond is the pooled estimate. The horizontal width of the diamond is the confidence interval of the pooled estimate. And a solid vertical line cuts through the plot at the null value — zero for differences, one for ratios.

**Sarah:** And then a dashed vertical line at the pooled estimate.

**Kiffer:** Yeah, often a dashed line is drawn through the pooled estimate so you can visually see how each study's CI relates to the summary.

**Sarah:** And reading it is largely pattern recognition.

**Kiffer:** Largely. If all the study CIs overlap considerably and cluster near the summary diamond, you're looking at low heterogeneity — the studies basically agree. If the CIs are scattered widely and many of them don't overlap, you've got substantial heterogeneity. Studies might be ordered by year, by sample size, by effect magnitude — depending on what pattern you want the reader to see.

**Sarah:** The lesson includes an R box showing you how to do this end-to-end with the metafor package. Three lines of code — and that's a feature.

**Kiffer:** Yeah, this is something the course really wants students to internalise. The end-to-end meta-analysis — pooled estimate, heterogeneity diagnostics, the figure for the paper — is essentially three function calls in metafor. You set up your data, you call *escalc* to compute the log-OR and its variance for each trial, you call *rma* to fit the random-effects model, you call *forest* to plot it. The R activity in the lesson runs through this with simulated smoking-cessation data and walks you through interpreting the output.

**Sarah:** And the output gives you the pooled estimate, tau-squared, I-squared, and Cochran's Q.

**Kiffer:** Which brings us nicely into the heterogeneity section. Because once you've got that pooled diamond, the next question is — should I trust it?

**Sarah:** And the lesson is really clear that heterogeneity always needs to be evaluated.

**Kiffer:** Always. Always. The pooled estimate without a heterogeneity assessment is half an answer. The first distinction the lesson makes is real versus artefactual heterogeneity.

**Sarah:** Real heterogeneity is when there are genuine differences in treatment effects across studies — different populations, different doses, different settings, different durations. The variation is in the world.

**Kiffer:** Artefactual heterogeneity comes from study design issues that aren't tracking real differences. Things like different follow-up durations, different reliability of the outcome measure, lack of blinding in some studies but not others. And one particular cause the lesson flags is the choice of effect measure. Three studies with identical risk ratios can show different odds ratios or risk differences if the baseline risk differs between them.
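
A small numeric sketch of the effect-measure point, in Python: three hypothetical studies share a risk ratio of 0.5 but have different baseline risks, and the odds ratios and risk differences diverge even though nothing about the treatment has changed. The baseline risks are made up for illustration.

```python
risk_ratio = 0.5
rows = []
for baseline in (0.10, 0.30, 0.60):
    treated = risk_ratio * baseline  # risk in the treated arm
    odds_ratio = (treated / (1 - treated)) / (baseline / (1 - baseline))
    risk_difference = treated - baseline
    rows.append((baseline, odds_ratio, risk_difference))

for baseline, odds_ratio, risk_difference in rows:
    print(f"baseline {baseline:.2f}: RR 0.50, OR {odds_ratio:.3f}, RD {risk_difference:+.2f}")
```

Pooled on the risk-ratio scale these studies agree perfectly; pooled on the odds-ratio or risk-difference scale they would look heterogeneous.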

**Sarah:** And the second distinction — clinical versus statistical heterogeneity.

**Kiffer:** Clinical heterogeneity is real differences between populations, interventions, settings. It's almost always present. People are different, settings are different. The question is whether on top of that clinical heterogeneity, the *statistical* heterogeneity — variation in observed effects beyond what sampling error alone can explain — is also present.

**Sarah:** Two main statistics for measuring statistical heterogeneity. Cochran's Q and Higgins I-squared.

**Kiffer:** Cochran's Q is a chi-squared test. Equation 28.7 — Q equals the sum across studies of the weight times the squared deviation of each study's effect from the pooled estimate. Under the null of no heterogeneity, Q follows a chi-squared distribution with k minus one degrees of freedom, where k is the number of studies. A significant Q means heterogeneity. But — and this is the catch — the Q test has *low statistical power* when you have few studies, which is most meta-analyses. So a non-significant Q doesn't prove homogeneity. The convention is often to use a more relaxed P-value cutoff of 0.10 instead of 0.05.

**Sarah:** And I-squared is the more interpretable companion statistic.

**Kiffer:** Yeah. I-squared is one of the most useful statistics in the field because it's percentage-scaled. Equation 28.8: I-squared equals Q minus its degrees of freedom, which is k minus one, all divided by Q, times 100 percent, and it's set to zero if that comes out negative. It tells you the proportion of the total variance in effect estimates that's due to real heterogeneity rather than sampling error. The conventional benchmarks are 25 percent for low, 50 percent for moderate, 75 percent for high. An I-squared above 25 percent should prompt you to investigate possible causes.
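
Cochran's Q and I-squared can be sketched in a few lines of Python. The Q formula follows the spoken description (weight times squared deviation from the fixed-effect pooled estimate); I-squared is Q minus its degrees of freedom, divided by Q, floored at zero. The study data are hypothetical, and the chi-squared p-value is left out because the standard library has no chi-squared distribution.

```python
def q_and_i2(effects, variances):
    """Cochran's Q, its degrees of freedom, and I-squared in percent."""
    w = [1 / v for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2

# Hypothetical log odds ratios and within-study variances
y = [-0.60, 0.00, -0.30]
v = [0.04, 0.01, 0.02]

q, df, i2 = q_and_i2(y, v)
print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.1f}%")
```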

**Sarah:** And tau-squared is the third heterogeneity statistic.

**Kiffer:** Yeah, tau-squared is the actual between-study variance — the parameter we met in the random-effects model. Larger tau-squared means more dispersion of true effects across studies. The three statistics work together. Cochran's Q gives you a hypothesis test. I-squared gives you a percentage. Tau-squared gives you the actual magnitude of dispersion in the units of your effect measure.

**Sarah:** When you do detect heterogeneity, the lesson walks through four ways to investigate its causes.

**Kiffer:** First, subgroup analysis. You identify a characteristic of interest — patient age, region, dose level — and you compute the pooled effect within each subgroup separately. The lesson is careful to warn against over-interpretation. The best estimate for any one subgroup actually comes from considering all the evidence, not just the data from that subgroup — that's called Stein's Paradox. And subgroup analyses should be pre-specified in the protocol. If you go fishing for subgroups after the fact, you'll find some, and they won't replicate.

**Sarah:** And the second.

**Kiffer:** Stratified analysis. Similar idea but more formal — you stratify by a factor thought to influence the effect and run a separate meta-analysis in each stratum. You can test between-strata heterogeneity using a formula that decomposes the total Q into a between-strata component and within-strata components. The disadvantage is that individual strata may contain very few studies.

**Sarah:** And third.

**Kiffer:** The Galbraith plot. This is more diagnostic than analytic. You plot the Z statistic, which is the effect divided by its standard error, on the vertical axis, against the inverse of the standard error on the horizontal. The slope of the regression line fitted through the origin is the fixed-effect estimate. Lines drawn 2 units above and below this line should encompass roughly 95 percent of observations if there's no significant heterogeneity. Points falling outside those bounds are potential outliers.
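
The coordinates of a Galbraith plot can be computed without any plotting library. This Python sketch assumes the fitted line is a least-squares line through the origin, in which case its slope works out to the inverse-variance fixed-effect estimate. The data and the function name are hypothetical.

```python
def galbraith(effects, ses):
    """Galbraith plot coordinates, slope through the origin, and outlier flags."""
    x = [1 / se for se in ses]                   # precision (horizontal axis)
    z = [y / se for y, se in zip(effects, ses)]  # Z statistic (vertical axis)
    # Least-squares slope for a line forced through the origin
    slope = sum(xi * zi for xi, zi in zip(x, z)) / sum(xi ** 2 for xi in x)
    # Flag points more than 2 units above or below the line
    outside = [abs(zi - slope * xi) > 2 for xi, zi in zip(x, z)]
    return x, z, slope, outside

# Hypothetical effects and standard errors
y = [-0.60, 0.00, -0.30]
se = [0.20, 0.10, 0.15]

x, z, slope, outside = galbraith(y, se)
print(f"slope = {slope:.3f}")
print("outside the +/-2 band:", outside)
```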

**Sarah:** Fourth, and most flexible.

**Kiffer:** Meta-regression. This is a weighted regression of the observed effect against study-level predictors. With inverse-variance weights. You can include year, dose, country, risk-of-bias rating — anything coded at the study level. Meta-regression extends the random-effects model by adding predictors. The cautions are real though. Even with RCTs, meta-regression is observational at the study level. You can fit many predictors and inflate Type I error. And it's vulnerable to ecological fallacy, because the predictors are study-level averages, not individual values.
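
A minimal Python sketch of meta-regression with a single study-level predictor, fitted by weighted least squares. The weights are one over the within-study variance plus tau-squared; here tau-squared is simply passed in (defaulting to zero) rather than estimated, which is a simplification. The dose values and effects are invented so that the relationship is exactly linear.

```python
def meta_regression(effects, variances, predictor, tau2=0.0):
    """Weighted least squares of effect on one study-level predictor."""
    w = [1 / (v + tau2) for v in variances]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, predictor)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, predictor))
    sxy = sum(
        wi * (xi - x_bar) * (yi - y_bar)
        for wi, xi, yi in zip(w, predictor, effects)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical studies: effect = 0.1 - 0.02 * dose, with no noise added
dose = [10, 20, 40]
y = [-0.10, -0.30, -0.70]
v = [0.04, 0.01, 0.02]

intercept, slope = meta_regression(y, v, dose)
print(f"intercept = {intercept:.3f}, slope per unit dose = {slope:.3f}")
```

With only three studies a fit like this is purely illustrative, and the cautions in the transcript about overfitting and ecological fallacy apply in full.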

**Sarah:** Okay. That's Section 3. You have a pooled estimate. You've assessed heterogeneity. Section 4 turns to three threats that survive correct technique.

**Kiffer:** And the framing is important. Each of these threats can produce a precise but misleading pooled answer. The confidence interval can look tight, the diamond can look like a real signal, and the actual underlying truth can be somewhere else entirely.

**Sarah:** First threat. Publication bias.

**Kiffer:** Publication bias is the systematic distortion that happens because studies with statistically significant or favourable results are more likely to be published than those with null findings. So the literature you can search isn't a random sample of all the research that's been done. It's a biased subset. And the direction of the bias is consistent — the published literature overestimates effects.

**Sarah:** And the consequence for meta-analysis is direct. If you pool only published studies, and published studies overestimate the effect, the summary estimate will be biased away from the null. You'll think the drug works better, or the exposure is more harmful, than it really is.

**Kiffer:** Yeah. And this is one of the strongest arguments for searching grey literature and for trying to identify unpublished trials through public trial registries. You want every study, not just the ones that found something.

**Sarah:** The main diagnostic tool for publication bias is the funnel plot.

**Kiffer:** Yeah. You plot each study's effect estimate on the horizontal axis against its standard error on the vertical — with the vertical axis usually inverted so precise studies are at the top. In the absence of publication bias, the plot should look like an inverted funnel. Symmetric around the pooled estimate. Small studies — large standard errors — scattered widely at the bottom. Large studies — small standard errors — clustered near the top.

**Sarah:** And the diagnostic signal is asymmetry.

**Kiffer:** Asymmetry, particularly a *gap* on one side at the bottom of the funnel. If small studies with large effects are present but small studies with small or null effects are missing, that's the fingerprint of selective publication. But — and this caveat matters — asymmetry can also come from real heterogeneity, from systematic differences in quality between small and large studies, or from true small-study effects. So you have to interpret cautiously.

**Sarah:** Two formal statistical tests for funnel-plot asymmetry — Begg's and Egger's.

**Kiffer:** Begg's test uses a rank correlation between effect estimates and their standard errors. It's simple but has low power. Egger's test is a linear regression of standardised effects on precision, and what you test is whether the intercept differs from zero. It's generally more powerful for detecting publication bias. But neither test is sensitive when the number of studies is small, and both can produce false positives when there are large treatment effects or when all trials are similar in size.
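
A minimal sketch of the regression behind Egger's test, assuming hypothetical data in which smaller studies report larger effects. It regresses effect/SE on 1/SE by ordinary least squares and returns the intercept, its standard error, and the t statistic. The p-value would come from a t distribution with k minus 2 degrees of freedom, which is left out here to keep the code to the standard library.

```python
import math


def egger_regression(effects, std_errors):
    """OLS of z = effect/SE on precision = 1/SE.

    Returns (intercept, se_intercept, t_statistic). Needs at least 3 studies.
    """
    x = [1.0 / se for se in std_errors]
    z = [y / se for y, se in zip(effects, std_errors)]
    k = len(x)
    x_bar = sum(x) / k
    z_bar = sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    residual_ss = sum((zi - (intercept + slope * xi)) ** 2 for xi, zi in zip(x, z))
    s2 = residual_ss / (k - 2)  # residual variance
    se_intercept = math.sqrt(s2 * (1.0 / k + x_bar ** 2 / sxx))
    return intercept, se_intercept, intercept / se_intercept


# Hypothetical data: the effect grows as the standard error grows.
effects = [0.10, 0.15, 0.30, 0.45, 0.60]
std_errors = [0.05, 0.10, 0.20, 0.30, 0.40]

intercept, se_intercept, t_stat = egger_regression(effects, std_errors)
print(round(intercept, 3), round(se_intercept, 3), round(t_stat, 2))
```

An intercept well away from zero is the asymmetry signal; with five studies, as here, the test has very little power, which matches the caution above.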

**Sarah:** And the trim-and-fill method goes one step further — it doesn't just detect publication bias, it adjusts for it.

**Kiffer:** Yeah. Trim-and-fill comes from Duval and Tweedie in 2000. The logic is in the name. First you *trim* — you sequentially remove the most extreme studies on the asymmetric side of the funnel until the plot looks symmetric. The centre of the trimmed plot gives you an adjusted estimate of the treatment effect. Then you *fill* — you put the trimmed studies back and add their hypothetical counterparts on the other side, mirrored across the new centre line. Then you redo the meta-analysis including both the real and the imputed studies. The difference between the original pooled estimate and the trim-and-fill adjusted estimate tells you how sensitive your result is to potential publication bias.
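
The mirroring step can be sketched in a few lines of Python. This is deliberately not the full Duval and Tweedie procedure: the real method estimates the number of missing studies iteratively with rank-based estimators, whereas this sketch assumes that number, `k0`, is supplied by hand, assumes the excess studies sit on the high side of the funnel, and uses fixed-effect pooling. Data and function names are hypothetical.

```python
def pooled_fixed(effects, std_errors):
    """Inverse-variance weighted mean."""
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def fill_with_known_k0(effects, std_errors, k0):
    """Trim the k0 largest effects, re-centre, mirror them, and re-pool.

    Returns (original estimate, trimmed centre, adjusted estimate).
    """
    original = pooled_fixed(effects, std_errors)
    order = sorted(range(len(effects)), key=lambda i: effects[i])
    kept = order[:len(order) - k0]       # studies left after trimming
    trimmed = order[len(order) - k0:]    # the k0 most extreme on the high side
    centre = pooled_fixed([effects[i] for i in kept],
                          [std_errors[i] for i in kept])
    # Imputed counterparts: reflected across the centre, same standard error.
    mirrored_effects = [2 * centre - effects[i] for i in trimmed]
    mirrored_ses = [std_errors[i] for i in trimmed]
    adjusted = pooled_fixed(list(effects) + mirrored_effects,
                            list(std_errors) + mirrored_ses)
    return original, centre, adjusted


# Hypothetical data: three precise studies plus one small study with a big effect.
effects = [0.1, 0.2, 0.3, 0.9]
std_errors = [0.1, 0.1, 0.1, 0.3]

original, centre, adjusted = fill_with_known_k0(effects, std_errors, k0=1)
print(round(original, 3), round(centre, 3), round(adjusted, 3))
```

The gap between the original and adjusted estimates is the quantity of interest, read as a sensitivity check rather than a correction.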

**Sarah:** The trim-and-fill estimate isn't a corrected truth. It's a sensitivity check.

**Kiffer:** Exactly. It's saying: if the funnel-plot asymmetry really does reflect publication bias, here's roughly what the result would look like if all the missing studies had actually been published. If your trim-and-fill estimate is similar to your original, the result is robust to that concern. If it's substantially different, that's a warning.

**Sarah:** Second threat. Influential studies.

**Kiffer:** The question here is — is the pooled result actually a summary of many studies, or is it really being driven by one big trial that's pulling everything in its direction?

**Sarah:** And the standard tool is leave-one-out sensitivity analysis.

**Kiffer:** Yeah. You sequentially remove each study from the analysis, recompute the pooled estimate without it, and see how much the result moves. If removing any single study shifts the pooled estimate substantially, that study is influential. You'd want to examine it more closely — is it the largest? Is it the most extreme? Is its risk of bias different from the others?
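
The loop described above is short enough to write out. A minimal sketch in Python, assuming hypothetical data and fixed-effect inverse-variance pooling to keep it compact; a random-effects version would re-estimate tau-squared inside each iteration.

```python
def pooled_fixed(effects, std_errors):
    """Inverse-variance weighted mean."""
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, std_errors):
    """Re-pool with each study removed in turn.

    Returns the full estimate and a list of
    (omitted index, estimate without it, shift from the full estimate).
    """
    full = pooled_fixed(effects, std_errors)
    results = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_ses = std_errors[:i] + std_errors[i + 1:]
        estimate = pooled_fixed(rest_effects, rest_ses)
        results.append((i, estimate, estimate - full))
    return full, results


# Hypothetical data: five studies, the last one with an extreme effect.
effects = [-0.20, -0.35, -0.10, -0.30, -1.50]
std_errors = [0.10, 0.15, 0.12, 0.20, 0.25]

full, results = leave_one_out(effects, std_errors)
most_influential = max(results, key=lambda r: abs(r[2]))
for index, estimate, shift in results:
    print(index, round(estimate, 4), round(shift, 4))
print("largest shift when omitting study", most_influential[0])
```

Sorting by the absolute shift only tells you which study to look at first; whether that study is a problem is the design and risk-of-bias question raised above.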

**Sarah:** The lesson uses an example with 25 studies.

**Kiffer:** Yeah. A pooled estimate of negative 2.121. One study, identified as a potential outlier in the Galbraith plot, was removed. The estimate changed to negative 2.011 — about a 5 percent reduction in magnitude. And the I-squared dropped from 95.6 percent down to 88.1 percent. So that one study was contributing both to the magnitude and to a chunk of the heterogeneity.

**Sarah:** And the lesson's framing on what to do is important. If one study is highly influential, you don't automatically drop it. You report the result with and without it. You consider whether its design is fundamentally different. You let the reader weigh it.

**Kiffer:** Yeah. Influential isn't synonymous with wrong. A large, well-conducted trial *should* be influential — it's the most informative single piece of evidence. The issue is when the meta-analysis is doing very little beyond reproducing that one trial.

**Sarah:** Third threat. Outcome-scale issues.

**Kiffer:** This is the one that's easiest to miss because it doesn't show up in a forest plot. Published studies vary substantially in how they report data. Some studies report odds ratios. Others report risk ratios. Others report mean differences on idiosyncratic scales. If you pool studies that look superficially similar but used different effect measures or different scales, the pooled estimate may not mean what you think.

**Sarah:** And the lesson highlights three practical issues.

**Kiffer:** First, computing standard errors. Sometimes a study reports a confidence interval but not a standard error. You can recover the SE: it's the upper limit minus the lower limit, divided by 3.92, which is 2 times 1.96, for a 95 percent CI. For ratio measures like odds ratios, you take the log of both limits first and do the calculation on the log scale. For small samples, use the relevant t critical value instead of 1.96. It's basic but it's the kind of practical fix that comes up constantly when you're extracting data.
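
A small helper makes the recipe concrete. This is a sketch in Python with a hypothetical function name and made-up intervals; the `critical` argument is where a t critical value would go for a small study.

```python
import math


def se_from_ci(lower, upper, critical=1.96, log_scale=False):
    """Recover a standard error from a confidence interval.

    critical: 1.96 for a 95% CI under a normal approximation; pass the
    appropriate t critical value for small samples.
    log_scale: set True for ratio measures (OR, RR), whose limits are
    logged first so the result is the SE of the log ratio.
    """
    if log_scale:
        lower, upper = math.log(lower), math.log(upper)
    return (upper - lower) / (2 * critical)


# Hypothetical mean difference with 95% CI 0.200 to 0.592.
print(round(se_from_ci(0.200, 0.592), 4))

# Hypothetical odds ratio with 95% CI 1.2 to 2.7: SE of the log odds ratio.
print(round(se_from_ci(1.2, 2.7, log_scale=True), 4))

# Small study, 9 degrees of freedom: two-sided 95% t critical value is about 2.262.
print(round(se_from_ci(0.200, 0.592, critical=2.262), 4))
```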

**Sarah:** And the second.

**Kiffer:** Different outcome scales. If three studies measured dental plaque with three different indices, you can't pool the raw mean differences because they're in different units. You convert to a standardised mean difference. The standardised mean difference is the mean difference divided by the pooled standard deviation. It expresses the effect in standard-deviation units, which is comparable across scales. Cohen's d is the basic version. Hedges' g is the small-sample-corrected version. Glass's delta uses just the control group SD.
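
The three variants differ only in the denominator and a correction factor, which a short Python sketch can show side by side. The summary statistics are invented, and the Hedges correction uses the common approximation J = 1 - 3 / (4 * df - 1) rather than the exact gamma-function form.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Mean difference divided by the pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d shrunk by the approximate small-sample correction factor."""
    df = n1 + n2 - 2
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return correction * cohens_d(mean1, sd1, n1, mean2, sd2, n2)


def glass_delta(mean_treated, mean_control, sd_control):
    """Mean difference divided by the control-group SD only."""
    return (mean_treated - mean_control) / sd_control


# Hypothetical plaque-index summary: treated group vs control group.
d = cohens_d(10.0, 4.0, 20, 8.0, 4.0, 20)
g = hedges_g(10.0, 4.0, 20, 8.0, 4.0, 20)
delta = glass_delta(10.0, 8.0, 4.0)
print(round(d, 4), round(g, 4), round(delta, 4))
```

With 20 per group the correction shrinks d by about 2 percent; it matters more as the groups get smaller.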

**Sarah:** And the third.

**Kiffer:** Combining binary and continuous outcomes. Some studies report a binary outcome, like cure, yes or no. Others report a continuous outcome, like a symptom score on a scale of zero to a hundred. There are conversion formulas. For example, you can approximate the standardised mean difference from a log odds ratio: d equals the log odds ratio times the square root of 3, divided by pi. But these conversions involve assumptions and should be used cautiously. Honestly, if the underlying outcomes are conceptually different, pooling them may not be appropriate at all.
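
That conversion is one line of arithmetic, sketched here in Python with hypothetical inputs. The usual justification assumes the continuous outcome underlying the binary one follows a logistic distribution, which is one of the assumptions being flagged. The variance conversion, multiplying by 3 over pi squared, follows from the same scaling.

```python
import math

SCALE = math.sqrt(3) / math.pi  # about 0.5513


def log_or_to_d(log_odds_ratio):
    """Approximate standardised mean difference from a log odds ratio."""
    return log_odds_ratio * SCALE


def var_log_or_to_var_d(var_log_or):
    """Convert the variance of a log odds ratio to the variance of d."""
    return var_log_or * 3.0 / math.pi ** 2


# Hypothetical study reporting an odds ratio of 2.0 with SE(log OR) = 0.25.
d = log_or_to_d(math.log(2.0))
se_d = math.sqrt(var_log_or_to_var_d(0.25 ** 2))
print(round(d, 4), round(se_d, 4))
```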

**Sarah:** And the broader point the lesson is making with all of Section 4 is that a precise pooled estimate is not the same thing as a correct one.

**Kiffer:** Yeah. This is what I want students to take away from Section 4. The forest plot can look clean. The diamond can be narrow. The I-squared can be low. And the underlying truth can still be somewhere else, because publication bias has shifted the literature, because one big trial is driving the result, or because you're pooling outcomes that don't actually share a common scale. The diagnostic tools in this section are how you check.

**Sarah:** Okay. Let me try to pull this together into takeaways.

**Kiffer:** Yeah. Go for it.

**Sarah:** First. Two hierarchies to keep in mind. The evidence pyramid is a heuristic about how much variation a design has ruled out. It tells you where to look for the strongest available evidence — not whether what you find is good. The DIKW hierarchy reminds you that a meta-analysis is the top of the *knowledge* layer, not wisdom. The contextual, value-laden judgement about what to do with the evidence belongs to the people doing the deciding.

**Kiffer:** Second. Systematic reviews are the structured, reproducible answer to the question of how to summarise a body of evidence. Narrative reviews depend on the reviewer's preconceptions. Systematic reviews specify the question, search strategy, inclusion criteria, appraisal, and synthesis transparently. The seven steps from Sargeant and colleagues — specify the question, lay out the protocol, find the studies, determine relevance, evaluate quality, extract data, synthesise — are the workflow. PROSPERO and PRISMA make the workflow auditable. GRADE rates the certainty of what comes out.

**Sarah:** Third. Meta-analysis is the quantitative half of a systematic review. The central design choice is fixed-effects versus random-effects. Fixed-effects assumes one true effect and treats variation as sampling error. Random-effects assumes a distribution of true effects with between-study variance tau-squared. Random-effects is the default because the assumption of a constant effect across populations is rarely justified.

**Kiffer:** Fourth. The forest plot makes the pooled story visible at a glance — each study as a box and CI, the pooled estimate as a diamond. Cochran's Q, I-squared, and tau-squared quantify how much the studies actually disagree. I-squared has rough benchmarks — 25 percent low, 50 moderate, 75 high. When heterogeneity is meaningful, you investigate causes — subgroup analysis, stratification, the Galbraith plot, or meta-regression. Pre-specify these in the protocol, don't go fishing after the fact.

**Sarah:** Fifth. Three threats can produce a precise but misleading pooled estimate. Publication bias — published studies are not a random sample of all research. Detect it with funnel plots, Begg's and Egger's tests; sensitivity-check it with trim-and-fill. Influential studies — leave-one-out diagnostics tell you whether one trial is doing all the work. Outcome-scale issues — the choice of effect measure, different measurement scales, and combining binary and continuous outcomes can all make a clean-looking pooling exercise mean less than it appears to.

**Kiffer:** Sixth. PRISMA is the reporting standard. GRADE is the certainty rating. Both are now de facto requirements for trustworthy reviews. Use them when you write. Use them when you read.

**Sarah:** And one connection to the rest of the course before we wrap.

**Kiffer:** Yeah. This is the part I want to flag. From Lesson 3 onward, you're going to spend the rest of this material working through the individual designs that sit below the apex of the pyramid. Cross-sectional. Case-control. Cohort. Ecological. Plus the threats — sampling, information bias, confounding — that each design has to fight.

**Sarah:** And the question to keep in mind for every one of those lessons is — *and what does the synthesised literature on this question say?*

**Kiffer:** Right. Because no single observational study, no matter how well done, settles a question. The synthesis layer is the lens through which every individual paper should be read. If the literature is heterogeneous and conflicting, the result of a single new study has limited weight on its own. If the literature is consistent, the new study either confirms or, if it dissents, raises a real question about what's different.

**Sarah:** And on the capstone side, this week's milestone has you doing a miniature version of this whole process. PICO question. Reproducible search strategy. Inclusion and exclusion criteria. Tracked screening with stated reasons. Plus reading at least one existing forest plot and saying something correct about what its heterogeneity tells you.

**Kiffer:** Yeah, the capstone is the place where this material becomes operational. You'll be glad you put the scaffolding in now, because every later week of the capstone builds on the search and synthesis foundation you put down here in Week 2.

**Sarah:** Practical advice for anyone listening?

**Kiffer:** Two things. First, don't try to memorise every formula in the lesson. Get the shape of the workflow — the seven steps, the central design choice between fixed- and random-effects, the diagnostic tools for heterogeneity and publication bias. The formulas are reference material you can look up. The workflow is what you have to internalise.

**Sarah:** And second?

**Kiffer:** Run the R box in metafor. Even if you've never used the package before. Three function calls. *escalc*, *rma*, *forest*. Doing the actual pooling once — and producing your own forest plot from your own data — is worth more than a week of reading. Once the workflow is in your hands, the lesson is no longer abstract.

**Sarah:** Next up is Lesson 3. Introduction to Observational Studies. The pivot from framing into actual research practice — where the design choices we evaluate from above in this lesson become the design choices we have to make on the ground.

**Kiffer:** See you there.

**Sarah:** Take care, everyone.
