Systematic Reviews and Meta-Analysis

Evaluating Epidemiological Research

Learning objectives for this lesson:

Place systematic reviews and meta-analyses within the hierarchy of evidence, and explain why the pyramid is a heuristic, not a verdict
Use the DIKW hierarchy (data → information → knowledge → wisdom) to frame the role of evidence synthesis in public-health decision-making
Carry out the steps of a systematic review, from specifying the question to synthesising results
Complete the data-extraction process to provide data suitable for meta-analysis
Calculate summary estimates of effect and evaluate heterogeneity among study results
Choose between fixed- and random-effects models and explain when each is appropriate
Present and interpret forest plots and other graphical displays of meta-analysis results
Evaluate potential causes of heterogeneity using subgroup analysis, stratification, and meta-regression
Evaluate the potential impact of publication bias using funnel plots and related methods
Determine if results have been influenced by an individual study (sensitivity analysis)

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University.

Reference

Glossary: Key Terms, People & Concepts

📚 Reference page, available throughout the lesson

This glossary collects the key concepts, people, and ideas you will meet in this lesson. Use it as a reference while you work through the material, or as a review before assessments.

Key Concepts & Ideas

Hierarchy of Evidence (Evidence Pyramid) A heuristic ranking of study designs by the amount of variation each design typically rules out, from expert opinion and case reports at the base, through cross-sectional, case-control, and cohort studies, to randomised controlled trials, and finally to systematic reviews and meta-analyses at the apex. A useful first sort for "where to look for the strongest available evidence," not a verdict on the quality of any individual study.

DIKW Pyramid A four-tier model from informatics and decision science: Data → Information → Knowledge → Wisdom. Raw data become information when organised; information becomes knowledge when appraised and integrated; knowledge becomes wisdom when exercised in a specific context under uncertainty. A meta-analysis is the highest-order knowledge layer; the wisdom layer is where contextual, value-laden public-health judgement happens.

Systematic Review A structured, protocol-driven review that uses pre-specified, reproducible methods to identify, appraise, and synthesise all studies relevant to a focused research question. Aims to minimise selection and reporting bias in evidence synthesis.

Narrative (Traditional) Review An informal qualitative summary of literature, typically without protocol, comprehensive search, or risk-of-bias appraisal. Useful for orientation but vulnerable to author selection bias.

Meta-Analysis The statistical pooling of effect estimates from multiple studies into a single weighted summary estimate, with quantification of between-study variation.

Protocol & PROSPERO A systematic-review protocol pre-specifies the question, eligibility criteria, search strategy, and analysis plan. PROSPERO is the international prospective register of systematic reviews where protocols are deposited before screening begins.

PICO(S/T) Framework Population, Intervention/Exposure, Comparator, Outcome (and optionally Study design / Time), the structure used to write a focused, answerable review question.

Grey Literature Research published outside conventional peer-reviewed journals (theses, conference abstracts, government reports, preprints). Searching grey literature helps mitigate publication bias.

Heterogeneity Variation in effect estimates across studies beyond what is expected from sampling error. Sources include clinical, methodological, and statistical heterogeneity.

I² Statistic The percentage of total variation across studies attributable to heterogeneity rather than chance (Higgins & Thompson, 2002). Rough benchmarks: 25% low, 50% moderate, 75% high.

Cochran's Q Test A chi-squared test of the null hypothesis of homogeneity across studies. Often underpowered with few studies and overpowered with many; used alongside I² and tau².

Tau² (between-study variance) The estimated variance of the true effect across studies in a random-effects meta-analysis. Larger tau² means greater between-study variation.

Fixed-Effect Model Assumes a single common true effect underlies all studies; differences between studies arise only from sampling error. Inverse-variance weighting gives more weight to larger studies.

Random-Effects Model Assumes the true effect varies across studies according to a distribution. Pooled estimate has a wider confidence interval and weighting is more even across studies (DerSimonian & Laird, 1986).

Publication Bias Systematic distortion of meta-analytic results because studies with statistically significant or favourable findings are more likely to be published, indexed, or available in English.

Risk of Bias A study-level judgement of internal validity that considers selection, performance, detection, attrition, and reporting domains. Drives sensitivity analyses and GRADE downgrading.

GRADE Grading of Recommendations Assessment, Development and Evaluation, a system for rating the certainty of evidence (high, moderate, low, very low) and the strength of resulting recommendations (Guyatt et al., 2008).

Methods, Measures & Tools

PRISMA 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses, the 27-item checklist and flow diagram (identification, screening, included) for transparent reporting of systematic reviews (Page et al., 2021). The earlier 2009 statement used a four-phase diagram that listed eligibility as a separate phase.

Forest Plot A graphical display of effect estimates from each study (squares sized by weight) with their confidence intervals, plus the pooled estimate (diamond) and a line of no effect.

Funnel Plot A scatter plot of effect size against standard error (or precision) used to inspect for asymmetry suggestive of publication bias or small-study effects.

Egger's Test & Begg's Test Statistical tests for funnel-plot asymmetry. Egger's (1997) regresses standardised effect on precision; Begg's uses rank correlation. Both are underpowered with few studies.

Trim-and-Fill A non-parametric method that imputes hypothetical missing studies to make a funnel plot symmetric, then re-estimates the pooled effect to gauge sensitivity to publication bias.

RoB 2 (Risk of Bias 2) The Cochrane risk-of-bias tool for randomised trials, organised by signalling questions across five domains (randomisation, deviations, missing data, measurement, selection of result) (Sterne et al., 2019).

ROBINS-I Risk Of Bias In Non-randomised Studies of Interventions, the Cochrane tool for appraising non-randomised intervention studies, adding domains for confounding and participant selection (Sterne et al., 2016).

AMSTAR 2 A 16-item critical appraisal tool for systematic reviews of randomised and non-randomised studies of healthcare interventions (Shea et al., 2017).

Leave-One-Out Sensitivity Analysis A meta-analytic robustness check that recalculates the pooled estimate with each study removed in turn to identify influential studies.

Meta-Regression & Subgroup Analysis Methods for exploring whether study-level covariates (year, dose, region, risk-of-bias rating) explain heterogeneity in effect sizes across studies.

Key People & Organisations

Archie Cochrane (1909–1988) Scottish epidemiologist who argued in Effectiveness and Efficiency (1972) that medical practice should be guided by systematically appraised RCT evidence; namesake of the Cochrane Collaboration.

Iain Chalmers (1943– ) British health-services researcher who founded the Cochrane Collaboration in 1993 and helped establish the modern infrastructure for systematic reviews.

Cochrane (Collaboration) An international not-for-profit network producing the Cochrane Database of Systematic Reviews and stewarding the standard tools (RoB 2, ROBINS-I, RevMan) for evidence synthesis.

Section 1 of 5

Hierarchy of Knowledge & Systematic Reviews

⏱ Estimated reading time: 55 minutes

The evidence pyramid, the DIKW framework, and the seven-step systematic review process.

Why start here

The apex before the rungs below

Almost every public health guideline rests on a systematic review. Understanding how that synthesis works, and where it can mislead, is the lens for reading every study that follows.

This section sets up that lens before the course works down the pyramid.

The evidence pyramid

Ruling out variation, one tier at a time

Higher tiers rule out more sources of variation, but the hierarchy is a heuristic. Appraise what you find.

The DIKW hierarchy

What the evidence is for

Data

Raw observations: counts, measurements, lab values.

Information

Data given structure, context, and comparison.

Knowledge

Information appraised, integrated, and defensible in print.

Wisdom

Knowledge applied in context, under uncertainty, for a specific decision.

A meta-analysis is the highest knowledge layer. Wisdom is what happens next, and that step is irreducibly human.

Why narrative reviews fail

From subjective summary to documented procedure

Narrative review

Subjective selection. No documented search. All studies weighted equally. Prone to reviewer preconceptions.

Systematic review

Structured, transparent methodology. Reproducible search. Quality appraisal. Pre-specified inclusion criteria.

Antman et al. (1992): expert narrative reviews of myocardial-infarction treatments lagged years behind the cumulative meta-analytic evidence.

Seven steps

The systematic review process (Sargeant et al., 2006)

Specify the question: intervention, population, outcome, comparator
Lay out the protocol: transparent, pre-specified, reproducible
Find all the studies: databases, reference lists, grey literature
Determine relevance: inclusion/exclusion criteria, two independent reviewers
Evaluate quality: RoB 2, ROBINS-I, or equivalent tools
Extract the data: point estimates and precision, standardised template
Synthesise results: qualitatively or quantitatively (meta-analysis)

Quality assurance

PROSPERO and GRADE

PROSPERO

Prospective protocol registration before study selection. Creates a public audit trail. Mirrors the role of ClinicalTrials.gov for primary trials.

GRADE

Rates certainty of evidence per outcome: High, Moderate, Low, or Very Low. Accounts for risk of bias, inconsistency, indirectness, imprecision, and publication bias.

Carry forward

What to take into the next section

The evidence pyramid is a heuristic for where to look, not a verdict on individual studies.
Systematic reviews replace subjective selection with documented, reproducible procedure.
PROSPERO registers the protocol; GRADE rates certainty in the conclusions.
A meta-analysis is knowledge, not wisdom. The decision step is still human.

Introduction and Overview

An earlier lesson set up the foundations of epidemiology, its history, its ways of knowing, and the trust scaffolding (research integrity, reproducibility) that lets a body of evidence accumulate. A later section of that lesson made the case that empiricism is one way of knowing among several, and a later section left you with a personal stance on which of those ways you would lean on when the stakes are public-health decisions. This lesson picks up at exactly that seam: once you commit to evidence-based reasoning, how do you tell stronger evidence from weaker? Before we open up the machinery of systematic reviews and meta-analyses, we need a shared way of organising the answer. That organising idea (the hierarchy of knowledge) is where this lesson starts, and it is the lens through which every later lesson in the series will be read.

The Hierarchy of Knowledge

Before reading further, sit with the prompt below for a moment. The rest of the section is easier to follow if you have already put your own intuitions on paper.

💭 Discussion Prompt, Rank the Evidence

Suppose four sources tell you the same intervention works:

A trusted senior clinician's experience over a 30-year career.
A single, well-written case report.
A well-designed cohort study of 4,000 people followed for five years.
A meta-analysis of twelve randomised trials covering 18,000 participants.

Rank them from weakest to strongest evidence for a public-health recommendation. Then ask: what is doing the ranking work? Is it sample size, design, or the number of independent studies? Could any single feature of one of the weaker sources overturn your ranking?

Epidemiologists answer this question with a layered model usually called the hierarchy of evidence, or simply the evidence pyramid. The pyramid is not a strict ranking of every study against every other; it is a heuristic for thinking about how much variation a particular design has ruled out. At the base sit forms of evidence that are easy to produce but hard to defend on their own: expert opinion, individual experience, single case reports. Moving up, we add structured comparison (case series → cross-sectional → case-control → cohort). Higher still are interventional designs that randomise the exposure (randomised controlled trials), which, when large and well conducted, balance measured and unmeasured confounders across arms in expectation. At the apex sit the designs that synthesise across all of the layers below: systematic reviews and meta-analyses.

Figure 2.1, The traditional hierarchy of evidence. Higher tiers rule out more sources of variation, but the hierarchy is a heuristic, not a verdict on any individual study.

A Complementary Hierarchy: DIKW

The evidence pyramid tells us where to look. A second, complementary hierarchy tells us what the evidence is for. The DIKW pyramid (Data → Information → Knowledge → Wisdom) is widely used in informatics, decision science, and clinical practice:

Data are raw observations: counts, measurements, dates, lab values.
Information is data given structure: organised, contextualised, compared.
Knowledge is information appraised: filtered for quality, integrated across sources, ready to defend in print.
Wisdom is knowledge exercised in a specific context, under uncertainty, in service of a particular decision.

A meta-analysis is not the same thing as wisdom. It is the highest-order knowledge layer in the evidence pyramid, the most thoroughly appraised, most heavily aggregated form of knowledge the field knows how to produce. Wisdom is what a clinician, policy-maker, or community partner does with that knowledge in their particular case, and that step is irreducibly human, contextual, and value-laden. An earlier lesson's call for a critical epidemiology, one that asks "who benefits, who is harmed, whose priorities?", lives at the wisdom layer, not the knowledge layer. Keeping the two hierarchies side-by-side is a useful corrective against treating a forest plot as if it answered every question on its own.

Why Start at the Top of the Pyramid?

This course spends later lessons working through the rungs below the apex: case-control, cohort, ecological, and cross-sectional designs, plus the threats (selection, information, confounding) that compromise each. So why does this lesson begin at the top? Three reasons, each of which shapes the rest of the series:

You need the appraisal lens before you read any single study. When you encounter a primary observational study in later lessons, the first question is not "is this study good?" but "what does the synthesised literature on this question already say, and how does this study fit in?" The synthesis layer is the lens through which every later study should be read.
The hierarchy is a heuristic, not a verdict. A poorly conducted systematic review is weaker evidence than a well-designed cohort study. A well-designed case series of an unprecedented exposure may be the only evidence in existence. The hierarchy tells you where to look for the strongest available evidence; it does not exempt you from appraising what you find. A later section of this lesson will make that point concrete with publication bias, influential studies, and outcome-scale issues that can compromise even a textbook meta-analysis.
Every tool in this lesson exists because the lower tiers can mislead. PROSPERO registration, PRISMA reporting, GRADE certainty ratings, and the Cochrane risk-of-bias appraisals (so named for Archie Cochrane, whose 1972 monograph Effectiveness and Efficiency framed the case for systematically appraised RCT evidence): each is a field-wide response to a specific failure of unstructured evidence accumulation. An earlier lesson's trust ledger and the open-science reforms in its later sections are the wider movement these tools belong to. The hierarchy of evidence is, in a real sense, a history of the lessons epidemiologists learned the hard way.

With the hierarchy in mind, the four content sections of this lesson move from broad to specific. This section sets up systematic reviews as a structured way to identify and appraise all relevant studies; a later section turns to meta-analysis as the quantitative pooling of effect estimates, including the central choice between fixed- and random-effects models; a later section covers the forest plot and heterogeneity analysis; a later section closes with the threats to a meta-analysis that survive correct technique. You will revisit these tools throughout the remainder of this course as you appraise individual observational studies; every appraisal sits inside the question, "and what does the synthesised literature say?"

Learning Objectives

Place systematic reviews and meta-analyses within the hierarchy of evidence, and explain why the pyramid is a heuristic rather than a verdict on any single study.
Articulate how the DIKW hierarchy (data → information → knowledge → wisdom) frames evidence synthesis in public-health decision-making.
Distinguish a narrative review from a systematic review and explain why narrative reviews are unsuitable for guiding policy.
Walk through the seven steps of a systematic review (Sargeant et al., 2006), from question specification through synthesis.
Describe the role of the PRISMA reporting checklist and the Cochrane Risk of Bias tools in producing a trustworthy review.
Identify the inclusion/exclusion criteria, search strategy, and quality-appraisal steps that make a systematic review reproducible.

16.1 Why Systematic Reviews?

When making decisions about health interventions, we want to use all available information. Unfortunately, the literature is often inconclusive and conflicting: individual studies may produce results ranging from statistically significant to inconsequential, and the variation among results may be greater than expected from chance alone. A classic demonstration of the cost of relying on narrative summaries is Antman et al. (1992), who showed that expert reviews of myocardial-infarction treatments lagged years behind what a cumulative meta-analysis of randomised trials had already established.

There are two fundamental approaches to formally reviewing available data: a narrative review and a systematic review (which may include a meta-analysis).

Narrative Reviews

In a narrative review, each study is considered individually, and the reviewer subjectively assesses the evidence. Narrative reviews have several limitations:

They tend to be carried out by subject experts who may bring preconceived opinions, resulting in a biased review
They often lack a structured methodology for identifying and assessing relevant studies
Small but well-designed studies may be omitted if they lack statistical power
Inclusion criteria are often not described in adequate detail
There is a tendency to weight all studies equally, when they should not all receive equal weight

Narrative reviews should only be used to provide an overview of literature, not to guide treatment or policy decisions.

Systematic Reviews

A systematic review uses a structured, transparent methodology to identify, evaluate, and synthesise all relevant studies on a specific question. It minimises bias and provides reproducible results. A systematic review may or may not include a quantitative meta-analysis, depending on the nature and quality of available data.

16.2 Steps of a Systematic Review

A systematic review follows seven key steps (Sargeant et al., 2006):

1. Specify the Question

The question should be driven by a clinical or health-policy objective, not by data availability. It is often more desirable to address a broad question (e.g., the ability of β-blockers as a class to reduce myocardial infarction risk) rather than a narrow one (e.g., one specific drug), to enhance generalisability. The question should specify the intervention(s), outcomes, comparisons, and eligible study designs.

2. Lay Out the Protocol

The review protocol should be objective and transparent: a reader should be able to duplicate it. This corresponds to the “Materials and Methods” of a primary study and covers all subsequent steps. A clear protocol minimises subjective decisions during the review process.

3. Find All the Studies

The literature search must be complete and well-documented. This involves searching major electronic databases (e.g., PubMed/Medline), reviewing reference lists of identified papers, and searching for grey literature (conference proceedings, theses, unpublished studies). The search strategy, databases, date ranges, and keywords must all be documented.

4. Determine Relevance (Inclusion/Exclusion Criteria)

Inclusion criteria specify the intervention(s), population(s), outcome(s), and study types eligible for the review. Exclusion criteria may include language restrictions, publication date cutoffs, or accessibility. Relevance should be assessed independently by two or more reviewers using the title and abstract, followed by full-text review.

5. Evaluate Study Quality

Each study’s internal and external validity must be evaluated. For randomised trials, the Cochrane risk-of-bias tool (now RoB 2; Sterne et al., 2019) assesses five domains: the randomisation process, deviations from intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result. These replace the original tool’s domains of sequence generation, allocation concealment, blinding, incomplete data, and selective reporting. ROBINS-I (Sterne et al., 2016) provides the parallel framework for non-randomised studies. Quality assessment results can be used to exclude studies, weight them differentially, or evaluate quality as a source of heterogeneity.

6. Extract the Relevant Data

From each study, you need the point estimate of the outcome and a measure of its precision (SE or CI). Data extraction should be carried out independently by two investigators using a standardised template, with any differences resolved by discussion. Watch for duplicate reporting of the same data in multiple publications.
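Published reports often give a ratio measure with a 95% confidence interval rather than a standard error. As a minimal sketch, assuming the interval was computed as a symmetric Wald interval on the log scale, the SE can be recovered from the interval width. The function name and the example values are hypothetical.

```python
import math

def log_effect_and_se(estimate, ci_lower, ci_upper, z=1.959964):
    """Convert a ratio measure (OR, RR, HR) and its 95% CI to the log scale.

    Assumes a Wald-type interval: log(estimate) +/- z * SE.
    """
    if min(estimate, ci_lower, ci_upper) <= 0:
        raise ValueError("ratio measures and their limits must be positive")
    log_est = math.log(estimate)
    # The interval spans 2 * z standard errors on the log scale.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return log_est, se

# Hypothetical extracted result: RR 0.80 (95% CI 0.64 to 1.00)
log_rr, se = log_effect_and_se(0.80, 0.64, 1.00)
print(f"log RR = {log_rr:.4f}, SE = {se:.4f}")
```

Recording the log estimate and its SE in the extraction template, rather than only the published interval, keeps every study on the same scale for pooling.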

7. Summarise and Synthesise the Results

Results can be summarised qualitatively (narrative description with tabular/graphical display) or quantitatively (meta-analysis). A quantitative meta-analysis computes a pooled summary estimate of the effect, weighted by the precision of each study, and investigates reasons for variation across studies.

16.2.1 Two Tools That Underpin a Trustworthy Review

Modern systematic reviews are expected to do more than simply follow the seven steps; they must do so transparently and they must communicate how confident readers should be in the conclusions. Two widely adopted tools have become the de facto standards for these expectations: PROSPERO for protocol registration and GRADE for rating the certainty of the evidence.

📝 Call-Out: PROSPERO, Prospective Protocol Registration

What it is. PROSPERO is the international prospective register of systematic reviews, hosted by the Centre for Reviews and Dissemination at the University of York. Reviewers submit their protocol, research question, eligibility criteria, search strategy, planned analyses, and outcomes, before beginning study selection or data extraction.

Why it matters. Prospective registration mirrors the role of ClinicalTrials.gov for primary trials. It creates a public, time-stamped record of the review’s plan, which:

Reduces the risk of outcome-reporting bias and post hoc changes to inclusion criteria
Helps avoid duplication of effort across research teams working on the same question
Allows readers, peer reviewers, and editors to compare the final review against the planned protocol

PROSPERO registration is now required or strongly encouraged by Cochrane, by many journals (e.g., BMJ, JAMA, Annals of Internal Medicine), and by the PRISMA 2020 reporting guideline (Page et al., 2021; original statement: Moher et al., 2009). Pair the registration with PRISMA at submission and you have a fully transparent audit trail from question to conclusion.

🎯 Call-Out: GRADE, Rating the Certainty of the Evidence

What it is. The Grading of Recommendations Assessment, Development and Evaluation (GRADE) framework (Guyatt et al., 2008) provides a structured method for rating how much confidence we should have in the body of evidence for each outcome of a systematic review, separately from rating the strength of any clinical recommendation that follows.

How it works. Each outcome starts at a baseline level of certainty based on study design (RCTs start at high; observational studies start at low), and is then downgraded or upgraded based on eight domains:

Domain	Direction	What it captures
Risk of bias	Downgrade	Methodological limitations of the included studies
Inconsistency	Downgrade	Unexplained heterogeneity across studies
Indirectness	Downgrade	Differences in population, intervention, comparator, or outcome from the question of interest
Imprecision	Downgrade	Wide confidence intervals or few events
Publication bias	Downgrade	Suspected selective reporting of positive findings
Large effect	Upgrade	Very large or very consistent observed effects
Dose-response	Upgrade	Plausible dose-response gradient
Plausible confounding	Upgrade	Residual confounding would reduce rather than create the observed effect

The body of evidence is then summarised in one of four certainty levels:

High: we are very confident the true effect lies close to the estimate
Moderate: the true effect is likely close to the estimate, but could be substantially different
Low: the true effect may be substantially different from the estimate
Very low: we have very little confidence in the estimate

GRADE certainty ratings are typically presented in a Summary of Findings table alongside the pooled effect estimates, allowing decision-makers to weigh the magnitude of the effect against the trustworthiness of the underlying evidence. GRADE is endorsed by Cochrane, the WHO, NICE (UK), and over 100 other organisations worldwide.
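The start-then-adjust logic described above can be sketched in a few lines. This is an illustration only: GRADE is a structured judgement made by reviewers, not an arithmetic rule, and the function name, domain keys, and the one-or-two-level scoring below are assumptions made for the example.

```python
LEVELS = ["very low", "low", "moderate", "high"]
START = {"rct": 3, "observational": 1}  # index into LEVELS
DOWNGRADE = {"risk_of_bias", "inconsistency", "indirectness",
             "imprecision", "publication_bias"}
UPGRADE = {"large_effect", "dose_response", "plausible_confounding"}

def grade_certainty(design, concerns=None, strengths=None):
    """Return a certainty label for one outcome.

    concerns:  dict mapping a downgrade domain to 1 (serious) or 2 (very serious)
    strengths: iterable of upgrade domains judged to apply
    """
    concerns = concerns or {}
    strengths = strengths or []
    unknown = (set(concerns) - DOWNGRADE) | (set(strengths) - UPGRADE)
    if unknown:
        raise ValueError(f"unknown GRADE domains: {sorted(unknown)}")
    score = START[design] - sum(concerns.values()) + len(set(strengths))
    # Certainty cannot fall below "very low" or rise above "high".
    return LEVELS[max(0, min(len(LEVELS) - 1, score))]

# Hypothetical outcome: RCT evidence with serious inconsistency and imprecision
print(grade_certainty("rct", {"inconsistency": 1, "imprecision": 1}))
```

In a real review each adjustment is justified in a footnote to the Summary of Findings table; the sketch only shows how the starting level and the eight domains combine.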

Reflection

Think of a public health question relevant to your interests. How would you specify the question for a systematic review? What databases would you search, and what inclusion/exclusion criteria would you set?

Model answer: A strong response converts the topic into a PICO(S) question (for example, Population: adults with type 2 diabetes; Intervention: SGLT2 inhibitors; Comparator: standard care or DPP-4 inhibitors; Outcome: cardiovascular mortality; Study type: RCTs and large cohorts). Databases should include MEDLINE/PubMed, Embase, CENTRAL (Cochrane), CINAHL for nursing-relevant outcomes, and PsycINFO for behavioural exposures. Grey literature: ClinicalTrials.gov, WHO ICTRP, conference abstracts. Inclusion: study design, year range with justification, language with translation plan, outcome measured by accepted instrument. Exclusion: editorials, animal studies, single-case reports, irrelevant comparators. Document hand-searching of key journals and forward/backward citation tracking. The point is reproducibility: another team should be able to run your search string and arrive at the same hit list.

Knowledge Check: this section

1. Which of the following is a limitation of narrative reviews?

They use a structured, reproducible methodology
They always include a meta-analysis
They may bring preconceived opinions and selectively include studies

Narrative reviews tend to be subjective, with reviewers potentially bringing biased perspectives and selectively including studies that support their opinions. They lack the structured methodology of systematic reviews.

2. What is the first step in conducting a systematic review?

Searching all electronic databases
Specifying the question to be answered
Evaluating study quality

The first step is to specify a clear research question driven by a clinical or health-policy objective. This question guides all subsequent steps of the review, including the search strategy and inclusion criteria.

3. Why should data extraction in a systematic review be carried out by two independent investigators?

To minimise errors and subjective bias in recording study results
To double the amount of data available for analysis
To ensure that narrative and systematic reviews produce the same results

Duplicate independent data extraction minimises errors and subjective bias. The two datasets are then compared, and any differences are resolved by discussion, ensuring the accuracy and reliability of extracted data.

Section 2 of 5

Meta-Analysis: Data Types & Effect Models

⏱ Estimated reading time: 50 minutes

Combining effect estimates; choosing between fixed- and random-effects models.

Definition and purpose

What a meta-analysis does

Glass (1976): the statistical analysis of a large collection of results from individual studies for the purpose of integrating the findings.

Objective 1

Provide an overall estimate of an association or effect, pooled across all included studies.

Objective 2

Explore reasons for variation in the observed effect across studies.

Types of data

Summary, group, and individual patient data

Summary data

Point estimate (RR, OR, MD) plus SE or CI. Most common; extracted from published reports.

Group data

Cell values (2×2 table or group means and SDs). Allows computation of different effect measures.

Individual patient data

Raw outcome values per person. Most flexible for exploring heterogeneity; rarely available.
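To illustrate why group data are more flexible than summary data, the sketch below derives both an odds ratio and a risk ratio, each with a log-scale standard error, from the four cells of a single 2×2 table. The function name, cell labels, and counts are hypothetical; the SE formulas are the usual large-sample approximations and break down when a cell is zero.

```python
import math

def effects_from_2x2(a, b, c, d):
    """Effect measures from one study's 2x2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("zero cells need special handling (not covered here)")
    log_or = math.log((a * d) / (b * c))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_rr = math.log((a / (a + b)) / (c / (c + d)))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return {"log_or": log_or, "se_log_or": se_log_or,
            "log_rr": log_rr, "se_log_rr": se_log_rr}

# Hypothetical trial: 20/100 events in the treated arm, 40/100 in the control arm
res = effects_from_2x2(20, 80, 40, 60)
print(f"OR = {math.exp(res['log_or']):.3f}, RR = {math.exp(res['log_rr']):.3f}")
```

With summary data the reviewer is limited to whichever measure the authors chose to report; with the cell counts, the same study can contribute to an analysis on either scale.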

Fixed-effects model

One true effect for all studies

Fixed-effects model (Eq 28.1)

\[ \color{#0B7B6B}{T_i} = \color{#C2410C}{\theta} + \color{#6D28D9}{\varepsilon_i} \quad \text{where} \quad \color{#6D28D9}{\varepsilon_i} \sim N(0,\, \color{#1D4ED8}{V_i}) \]

Legend: T_i = study effect; θ = common true effect; ε_i = within-study error; V_i = within-study variance.

\(T_i\) is the observed effect in study \(i\); \(\theta\) is the single common true effect; \(V_i = [\text{SE}(T_i)]^2\) is the within-study variance. Study weights are \(W_i = 1/V_i\) (inverse variance weighting).

Limitation: Assumes one true effect across all populations and settings. Violated in most real-world pools.
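The inverse-variance weighting above can be sketched as follows. The function name and the three example studies are hypothetical; effects are assumed to be on a scale where a normal approximation is reasonable (for ratio measures, the log scale).

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse-variance pooled estimate under the fixed-effects model.

    effects:   observed study effects T_i
    variances: within-study variances V_i = SE(T_i)^2
    """
    weights = [1.0 / v for v in variances]          # W_i = 1 / V_i
    total_w = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, effects)) / total_w
    se = math.sqrt(1.0 / total_w)                   # SE of the pooled estimate
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)   # approximate 95% CI
    return pooled, se, ci

# Hypothetical studies: the middle one is four times as precise as the others
pooled, se, ci = fixed_effect_pool([0.10, 0.30, 0.50], [0.04, 0.01, 0.04])
print(f"pooled = {pooled:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Here the precise middle study carries 100 of the 150 weight units, so the pooled estimate sits at its value; that dominance of large studies is characteristic of the fixed-effects model.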

Random-effects model

A distribution of true effects

Random-effects model (Eq 28.4)

\[ \color{#0B7B6B}{T_i} = \color{#C2410C}{\theta} + \color{#BE185D}{u_i} + \color{#6D28D9}{\varepsilon_i} \quad \text{where} \quad \color{#BE185D}{u_i} \sim N(0,\,\color{#047857}{\tau^2}),\; \color{#6D28D9}{\varepsilon_i} \sim N(0,\, \color{#1D4ED8}{V_i}) \]

Legend: T_i = study effect; θ = average true effect; u_i = study deviation; ε_i = within-study error; τ² = between-study variance; V_i = within-study variance.

\(\tau^2\) is the between-study variance. Weights become \(W_i = 1/(V_i + \tau^2)\), giving smaller studies more weight than under fixed effects.

Produces a wider confidence interval that accounts for true between-study variation. DerSimonian & Laird (1986) is the foundational estimator for \(\tau^2\).
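A minimal sketch of the DerSimonian & Laird moment estimator follows, assuming effects and variances are supplied on an appropriate scale. The function name and example data are hypothetical, and other estimators of \(\tau^2\) exist that this sketch does not cover.

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2."""
    k = len(effects)
    w = [1.0 / v for v in variances]                 # fixed-effects weights
    sw = sum(w)
    fixed = sum(wi * ti for wi, ti in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effects estimate
    q = sum(wi * (ti - fixed) ** 2 for wi, ti in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)               # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]   # W_i = 1 / (V_i + tau^2)
    pooled = sum(wi * ti for wi, ti in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q,
            "weights": w_star}

# Hypothetical studies with visibly different effects
res = dersimonian_laird([0.0, 0.30, 0.60], [0.04, 0.01, 0.04])
share = res["weights"][1] / sum(res["weights"])
print(f"tau^2 = {res['tau2']:.4f}, pooled = {res['pooled']:.3f}, "
      f"SE = {res['se']:.3f}, middle-study weight share = {share:.1%}")
```

In this example the most precise study's share of the weight falls from about 67% under fixed effects to about 46%, and the pooled SE grows, which is the widening of the confidence interval described above.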

Carry forward

Choosing between the models

Use fixed effects when…

You have strong grounds to believe all studies estimate one exchangeable true effect. Rare in practice.

Use random effects when…

Studies differ in population, intervention, or setting. The default for most real-world reviews.

For sparse binary data: Mantel-Haenszel or Peto weighting. For different outcome scales: standardised mean differences (Cohen’s d, Hedges’ g).

Introduction and Overview

An earlier section covered how to identify and appraise the studies that go into a review. This section turns to the quantitative half: combining the effect estimates from those studies into a single pooled estimate. The central design choice here (fixed-effects versus random-effects) is not arbitrary; it reflects an underlying assumption about whether all the studies are estimating the same true effect.

Learning Objectives

Define meta-analysis and state its two principal objectives (pooled estimate; explore variation).
Distinguish summary, group, and individual-patient data, and describe the tradeoffs of each.
Compare the assumptions of fixed-effects and random-effects models and state when each is appropriate.
Interpret pooled point estimates, confidence intervals, and study weights produced by either model.

16.3 What Is a Meta-Analysis?

A meta-analysis is “the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings” (Glass, 1976). It is a formal process for combining results from multiple studies and is considered the “gold standard” for providing summary information about health interventions.

Objectives of Meta-Analysis

The objectives are to: (1) provide an overall estimate of an association or effect based on data from multiple studies, and (2) explore reasons for variation in the observed effect across studies. Because it combines data from multiple studies, meta-analysis gains statistical power for detecting effects.

16.3.1 Types of Data in Meta-Analysis

Three types of data can be used in a meta-analysis, each with different capabilities:

Data Type	Binary Outcome	Continuous Outcome
Summary estimate	Point estimate: RR, OR, RD, IR; precision: SE or CI	Point estimate: mean difference (MD); precision: SE or CI
Group data	Cell values for treated and control groups (2×2 table)	Number, mean, and SD in each group
Individual patient data (IPD)	Raw data: outcome value (0 or 1) and individual characteristics	Raw data: outcome value and individual characteristics

Summary data are most commonly used. Group data allow computation of various effect measures. IPD are the most flexible but rarely available; they allow evaluation of study-, group-, and individual-level variables as sources of heterogeneity.
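To make the point about group data concrete, here is a minimal R sketch. It assumes the cell counts of Trial 1 from the hypothetical smoking-cessation data used in this lesson's metafor example (48 of 120 events on drug, 30 of 120 on control); the variable names are illustrative and not part of any package.

```r
# Group data for one hypothetical trial: events / total in each arm
ev_t <- 48; n_t <- 120   # treated arm
ev_c <- 30; n_c <- 120   # control arm

risk_t <- ev_t / n_t
risk_c <- ev_c / n_c

# Three different effect measures computed from the same 2x2 table
rr <- risk_t / risk_c                                # risk ratio
or <- (ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c)  # odds ratio
rd <- risk_t - risk_c                                # risk difference

# Large-sample standard errors (on the log scale for the ratio measures)
se_log_rr <- sqrt(1/ev_t - 1/n_t + 1/ev_c - 1/n_c)
se_log_or <- sqrt(1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c))
se_rd     <- sqrt(risk_t * (1 - risk_t) / n_t + risk_c * (1 - risk_c) / n_c)

round(c(RR = rr, OR = or, RD = rd), 3)
```

A published summary estimate would fix the reviewer to whichever single measure the authors chose; the cell counts leave that choice open.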

16.4 Fixed- vs. Random-Effects Models

A fundamental decision in any meta-analysis is whether to use a fixed-effects or random-effects model:

Fixed-Effects Model

Assumes the true treatment effect is constant across all studies. Any variation among observed study results is due solely to within-study random variation (sampling error).

Fixed-effects model (Eq 28.1)

\[ \color{#0B7B6B}{T_i} = \color{#C2410C}{\theta} + \color{#6D28D9}{\varepsilon_i} \quad\text{where}\quad \color{#6D28D9}{\varepsilon_i} \sim N(0,\, \color{#1D4ED8}{V_i}) \]

The observed effect in a study equals the single true effect shared by all studies plus within-study error, where that error has variance equal to the within-study variance.

Where T_i is the observed effect from study i, θ is the true overall effect, and V_i = [SE(T_i)]² is the known within-study variance. Weights are computed as W_i = 1/V_i (inverse variance weighting).

Advantage: Does not require estimating between-study variance (τ²).

Limitation: The assumption of a constant effect across all studies is often untenable, and ignoring between-study variation can inflate the Type I error rate and produce confidence intervals that are too narrow.
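The inverse-variance recipe can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the companion script: it reuses the eight hypothetical smoking-cessation trials from this lesson's metafor example, takes the log odds ratio as T_i, and uses illustrative variable names.

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

# Study effects T_i (log odds ratios) and within-study variances V_i
yi <- log((ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c))
vi <- 1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c)

# Inverse-variance weights W_i = 1 / V_i
wi <- 1 / vi

theta_fe <- sum(wi * yi) / sum(wi)   # pooled log-OR
se_fe    <- sqrt(1 / sum(wi))        # its standard error
ci_fe    <- theta_fe + c(-1, 1) * qnorm(0.975) * se_fe

round(exp(c(OR = theta_fe, lower = ci_fe[1], upper = ci_fe[2])), 2)
round(100 * wi / sum(wi), 1)         # percentage weight of each study
```

With these data the pooled fixed-effect OR comes out at about 1.59, and the percentage weights show how the largest, most precise trials dominate.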

Random-Effects Model

Assumes a distribution of true treatment effects across studies (heterogeneity), with additional variability beyond within-study sampling error.

Random-effects model (Eq 28.4)

\[ \color{#0B7B6B}{T_i} = \color{#C2410C}{\theta} + \color{#BE185D}{u_i} + \color{#6D28D9}{\varepsilon_i} \quad\text{where}\quad \color{#BE185D}{u_i} \sim N(0,\,\color{#047857}{\tau^2}),\; \color{#6D28D9}{\varepsilon_i} \sim N(0,\,\color{#1D4ED8}{V_i}) \]

The observed effect in a study equals an average true effect plus a study-specific deviation plus within-study error. The deviations vary with between-study variance, and the error with within-study variance.

Where u_i is the random effect for study i, and τ² is the between-study variance (heterogeneity). Weights become W_i = 1/(V_i + τ²).

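Result: Often produces a point estimate similar to the fixed-effects one, but with a confidence interval at least as wide, because it accounts for between-study variation. The point estimates can diverge when small and large studies disagree, since the random-effects weights are more nearly equal. Random-effects models are now more commonly used; the foundational estimator is from DerSimonian & Laird (1986).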
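A minimal base-R sketch of the DerSimonian-Laird method-of-moments estimator follows, again assuming the eight hypothetical trials from this lesson's metafor example and illustrative variable names. It estimates τ² from Cochran's Q, truncates it at zero, and then re-weights each study by 1/(V_i + τ²).

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

yi <- log((ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c))  # log odds ratios
vi <- 1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c)   # within-study variances
k  <- length(yi)

# Fixed-effect quantities needed for the moment estimator
wi       <- 1 / vi
theta_fe <- sum(wi * yi) / sum(wi)
q_stat   <- sum(wi * (yi - theta_fe)^2)

# DerSimonian-Laird estimate of tau^2, truncated at zero
scale_c <- sum(wi) - sum(wi^2) / sum(wi)
tau2    <- max(0, (q_stat - (k - 1)) / scale_c)

# Random-effects weights and pooled estimate
wi_re    <- 1 / (vi + tau2)
theta_re <- sum(wi_re * yi) / sum(wi_re)
se_re    <- sqrt(1 / sum(wi_re))
ci_re    <- theta_re + c(-1, 1) * qnorm(0.975) * se_re

round(c(tau2 = tau2, estimate = theta_re, ci.lb = ci_re[1], ci.ub = ci_re[2]), 4)
round(100 * cbind(fixed = wi / sum(wi), random = wi_re / sum(wi_re)), 1)
```

Under these assumptions the sketch should reproduce, up to rounding, the τ² of about 0.0205 and the pooled log-OR of about 0.46 shown in this lesson's console output. The final table makes the re-weighting visible: the smallest trials gain a little relative weight and the largest lose some.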

Key Distinction

The fixed-effects model asks: “What is the single true effect?” The random-effects model asks: “What is the average of the distribution of true effects?” Random-effects models are generally preferred because the assumption of a constant treatment effect across all studies is rarely justified.

16.4.1 Weighting Methods

The most common weighting procedure is inverse variance weighting, applicable to both continuous and binary outcomes. The name states the recipe: each study's weight is one divided by its variance, so a precise estimate (small variance) earns a large weight and a noisy one earns little, which is why large, well-measured studies dominate the pooled result. For binary outcomes with sparse data, the Mantel-Haenszel procedure (Mantel & Haenszel, 1959) or the Peto method may be preferred. For continuous outcomes, when studies use different scales, standardised mean differences (effect sizes such as Cohen’s d or Hedges’ g) are used.
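For comparison with inverse-variance weighting, the Mantel-Haenszel pooled odds ratio can be sketched in base R as the ratio of two weighted sums over the 2×2 tables. The data are the eight hypothetical trials from this lesson's metafor example, and the variable names are illustrative; the Peto method is not shown.

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

# 2x2 cell counts per trial
cell_a <- ev_t          # events, treated
cell_b <- n_t - ev_t    # non-events, treated
cell_c <- ev_c          # events, control
cell_d <- n_c - ev_c    # non-events, control
n_tot  <- n_t + n_c     # total participants per trial

# Mantel-Haenszel pooled odds ratio: sum(a*d/N) / sum(b*c/N)
or_mh <- sum(cell_a * cell_d / n_tot) / sum(cell_b * cell_c / n_tot)
round(or_mh, 2)
```

Because these trials are not sparse, the Mantel-Haenszel result (about 1.59) is nearly identical to the inverse-variance fixed-effect OR. The two methods part ways mainly when cells are small or empty, where the log-OR variances used for inverse-variance weights become unstable.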

Reflection

Consider a meta-analysis of 10 studies examining the effect of a drug on blood pressure. Five studies were conducted in elderly populations and five in young adults. Would you expect a fixed-effects or random-effects model to be more appropriate? Why?

Model answer: Random effects is more appropriate. The two sub-populations (elderly vs. young adults) almost certainly differ in baseline BP, comorbidity, polypharmacy, and dose-response, so it is implausible that all 10 studies are estimating one common effect. A fixed-effects model assumes a single true effect and weights only by within-study precision; with biologically distinct subgroups that assumption is wrong: the FE pooled estimate becomes a precision-weighted average that describes neither group, and its CI is artificially narrow. A random-effects pool reflects an average true effect plus between-study variance τ². Better still: pre-specify a subgroup analysis (or meta-regression on age) so the heterogeneity is investigated rather than absorbed into a wider CI.

Knowledge Check: this section

1. In a fixed-effects model, what is assumed about the true treatment effect?

It varies randomly across studies following a normal distribution
It is constant across all studies, with variation due only to sampling error
It is zero in all studies (null hypothesis)

The fixed-effects model assumes there is one true effect (θ) common to all studies. Any observed variation in study results is attributed to within-study random variation (sampling error) only.

2. What does τ² represent in a random-effects meta-analysis?

The within-study variance for each individual study
The total variance across all observations
The between-study variance (heterogeneity)

τ² represents the between-study variance, the variability in true treatment effects across studies. It quantifies heterogeneity beyond what would be expected from within-study sampling error alone.

3. Which type of data provides the most flexibility for exploring sources of heterogeneity in a meta-analysis?

Individual patient data (IPD)
Summary estimate data
Group data

IPD allow evaluation of study-, group-, and individual-level variables as sources of heterogeneity. Summary data can only evaluate study-level variables, while group data add some flexibility but not at the individual level.

Section 3 of 5

Forest Plots & Heterogeneity

⏱ Estimated reading time: 50 minutes

Reading the key visual output; measuring and explaining variability across studies.

Anatomy of a forest plot

Reading every element

Cochran's Q

Testing for more variation than chance predicts

Cochran’s Q statistic (Eq 28.7)

\[ \color{#0B7B6B}{Q} = \sum_i \color{#C2410C}{w_i} \left(\color{#6D28D9}{T_i} - \color{#1D4ED8}{\hat{\theta}}\right)^2 \]

Q: heterogeneity statistic; w_i: study weight; T_i: study effect; θ̂: pooled estimate

Under no heterogeneity, \(Q \sim \chi^2_{k-1}\). Low power when the number of studies is small. Use a relaxed threshold of \(p = 0.10\) rather than \(0.05\) to avoid missing real heterogeneity.

Higgins I² and τ²

How much of the spread is real?

Higgins I² (Eq 28.8)

\[ \color{#6D28D9}{I^2} = \frac{\color{#0B7B6B}{Q} - (\color{#1D4ED8}{k}-1)}{\color{#0B7B6B}{Q}} \times 100\% \]

I²: heterogeneity index; Q: Cochran’s Q; k: number of studies

Benchmarks: 25% = low | 50% = moderate | 75% = high. Any value above 25% warrants investigation of causes.

\(\tau^2\) is the between-study variance on the same scale as the effect measure, useful for judging practical magnitude.

Explaining heterogeneity

Four diagnostic approaches

Subgroup & stratified analysis

Compare effects across defined study categories. Pre-specify in the protocol to control Type I error.

Galbraith plot

Plots Z-statistic vs. 1/SE. Points outside ±2 units are potential outliers driving heterogeneity.

Meta-regression

Weighted regression of effects on study-level predictors. Most flexible, but still observational.

Ecological caution

Predictors are study-level averages. Multiple comparisons inflate Type I error. Pre-specify analyses.

Carry forward

What to take into the next section

The forest plot makes the consistency of the evidence visible at a glance.
\(I^2\) above 25% signals real heterogeneity worth explaining, not just absorbing into a wider CI.
Subgroup analysis and meta-regression investigate causes; pre-specify them to avoid inflated error rates.
A pooled estimate is only as trustworthy as the studies behind it, and the variation among them.

Introduction and Overview

An earlier section produced a single pooled estimate. This section introduces the visual and quantitative tools for inspecting the underlying variation that produced that estimate. The forest plot makes the constituent studies and the pooled result visible at a glance; heterogeneity statistics quantify how much the studies actually disagree with one another. Both are essential before trusting the pooled number.

Learning Objectives

Read a forest plot: identify point estimates, confidence intervals, study weights, the null line, and the pooled diamond.
Quantify heterogeneity using Cochran's Q, I-squared, and tau-squared, and interpret the conventional thresholds.
Distinguish statistical heterogeneity from clinical and methodological heterogeneity.
Decide when subgroup analysis or meta-regression is the appropriate response to detected heterogeneity.

16.5 Presentation of Results: The Forest Plot

The forest plot is the most important graphical output of a meta-analysis. It displays the point estimate and confidence interval of the effect observed in each study, along with the summary estimate.

Anatomy of a Forest Plot

Each horizontal line represents one study’s results. The length of the line is the 95% CI. The centre box marks the point estimate, and the area of the box is proportional to the study’s weight. The solid vertical line marks the null value (e.g., 0 for mean difference, 1 for ratio measures), and the dashed vertical line shows the overall summary estimate. The diamond at the bottom represents the pooled estimate: its centre is the point estimate and its horizontal tips mark the limits of the CI.

Interactive story, The Forest & the Funnel: a 6-scene visualization of the systematic-review pipeline (PICO question, PRISMA filter funnel, forest plot construction one study at a time, heterogeneity check, pooled diamond, and the evidence pyramid).

Reading a Forest Plot

If all study CIs overlap considerably and cluster near the summary estimate, there is little heterogeneity. If CIs are widely scattered and many do not overlap, heterogeneity is substantial. Studies may be ordered by publication year (to detect time trends), quality score, or effect size.

R Forest plot & meta-analysis with the metafor package

The companion R script r-activities/HSCI_230_Lesson_2_Systematic_Reviews_and_Meta_Analysis.R runs an end-to-end meta-analysis (pooled estimate, heterogeneity diagnostics, and the figure for the paper) in three metafor calls: escalc(), rma(), and forest().

# install.packages("metafor")
library(metafor)

# Eight hypothetical RCTs of a smoking-cessation drug (events / total per arm)
dat <- data.frame(study  = paste("Trial", 1:8),
  ai = c(48, 54, 31, 84, 21, 64, 36, 15),  # events on drug
  n1i = c(120,160,100,220, 85,240,130,90), # total randomised to drug
  ci = c(30, 32, 21, 48, 18, 60, 28, 14),  # events on control
  n2i = c(120,160,100,220, 85,240,130,90)  # total randomised to control
)

# Compute the log-OR (yi) and its sampling variance (vi) per study.
# By default escalc() adds 0.5 only to tables with a zero cell; none occur here.
es <- escalc(measure = "OR",
             ai = ai, n1i = n1i,
             ci = ci, n2i = n2i, data = dat)

# Random-effects pooled OR (DerSimonian-Laird estimator)
fit <- rma(yi, vi, data = es, method = "DL")
fit

# Forest plot
forest(fit, slab = es$study, transf = exp, refline = 1,
       xlab = "Odds ratio (smoking cessation)")

Console output (key lines)

tau^2 (between-study variance): 0.0205
I^2 (heterogeneity %): 21.1%
Q-statistic: 8.87, df = 7, p = 0.262

Random-Effects Model (k = 8)
estimate  ci.lb  ci.ub
    0.46   0.24   0.68   # log-OR; exp() to get OR ~ 1.58

Reading the output. I² ~ 21% suggests low heterogeneity; the forest plot will show all CIs overlapping with the diamond. Try switching method = "FE" for a fixed-effect model and watch the diamond narrow. Heterogeneity diagnostics, publication-bias plots, and meta-regression all live in metafor.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console / forest plot before answering.

1. The pooled random-effects estimate from rma() was about OR = 1.58 (exp(0.46)). In plain language, what does that pooled odds ratio say about how the smoking-cessation drug compares to control across the 8 trials? Is the effect in a beneficial direction?

Model answer: Exponentiating the pooled log odds ratio of 0.46 gives an OR of about 1.58, meaning that across the 8 trials the odds of successful smoking cessation are roughly 1.6× higher in the drug arm than in the control arm. Because the outcome (cessation) is desirable, an OR > 1 is the beneficial direction here: the drug looks helpful on average. An OR of 1.58 is a clinically meaningful effect; in absolute terms it would shift cessation rates from, say, 10% to ~15%, which over a large smoking population is a major public-health gain.

2. The output reported I² = 21.1% and Q-test p = 0.262. Using the benchmarks from the lesson (25% low, 50% moderate, 75% high), how would you describe the heterogeneity across these 8 trials, and does the Q-test agree?

Model answer: I² = 21.1% sits at the lower end of the lesson's benchmarks, comfortably in the low range, meaning roughly a fifth of the variation across studies is more than chance would predict. The Q-test p-value of 0.262 is non-significant, so the two diagnostics agree: there is no strong evidence of heterogeneity. A single pooled estimate is a reasonable summary, although the random-effects model is still the safer default because it propagates the small amount of between-trial variability that does exist into the CI.

3. Re-run the model with method = "FE" (fixed effects). Compare the width of the diamond on the forest plot to the random-effects version. Why does the fixed-effect CI become narrower, and when would that narrower CI be misleading?

Model answer: The fixed-effect diamond is narrower because the FE model assumes every trial estimates the same single underlying effect and pools only within-study (sampling) variance. The RE model adds the between-study variance τ² to each study's variance before weighting, which widens the CI to reflect that the effect could differ across populations. The narrower FE CI is misleading whenever the assumption of one true effect fails: different populations, doses, follow-up windows, or outcome definitions across trials make the FE CI artificially confident. Practical rule: report RE by default, and use FE only when you have strong grounds to believe the trials are exchangeable.

16.6 Heterogeneity

Heterogeneity refers to variability among study results beyond what would be expected from random variation alone. It should always be evaluated in a meta-analysis.

16.6.1 Real vs. Artifactual Heterogeneity

Real Heterogeneity

Artifactual Heterogeneity

An important distinction is between clinical heterogeneity (real differences between populations, interventions, and settings) and statistical heterogeneity (variation in observed results beyond chance). Clinical heterogeneity is always expected; the key question is whether statistical heterogeneity is also present.

16.6.2 Measuring Heterogeneity: Cochran’s Q and Higgins I²

Cochran’s Q statistic (Eq 28.7)

\[ \color{#0B7B6B}{Q} = \sum_i \color{#C2410C}{w_i}\left(\color{#6D28D9}{T_i} - \color{#1D4ED8}{\hat{\theta}}\right)^2 \]

The heterogeneity statistic is a weighted sum of squared distances between each study effect and the pooled estimate. Larger values indicate more disagreement among studies than chance alone would produce.

Where w_i = 1/V_i are the inverse-variance (fixed-effect) study weights, T_i are the study effects, and θ̂ is the pooled fixed-effect estimate. Under the null hypothesis of no heterogeneity, Q follows a χ² distribution with k−1 degrees of freedom. However, the Q test has low power when the number of studies is small, so a non-significant result does not rule out heterogeneity. Consider using a relaxed P-value threshold (e.g., 0.10 instead of 0.05).

Higgins I² (Eq 28.8)

\[ \color{#6D28D9}{I^2} = \frac{\color{#0B7B6B}{Q} - (\color{#1D4ED8}{k}-1)}{\color{#0B7B6B}{Q}} \times 100\% \]

The heterogeneity index rescales the Q statistic by its degrees of freedom (the number of studies minus one) into the percentage of total variation that reflects real between-study differences rather than chance.

I² (Higgins & Thompson, 2002) quantifies the proportion of variance between studies that is due to heterogeneity rather than chance. Benchmarks: 25% = low, 50% = moderate, 75% = high heterogeneity. An evaluation of possible causes should be undertaken whenever I² exceeds 25%.

The formula is more intuitive than it first looks. If every study were really estimating the same effect, Cochran's Q would on average equal its degrees of freedom, the number of studies minus one. Subtracting that quantity from Q removes the disagreement expected from chance alone, and dividing by Q re-expresses what remains as a fraction of the total spread. When the studies agree no more than chance predicts, that fraction falls to zero and I² is reported as 0%.
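As a worked check, this lesson's metafor console output reported Q = 8.87 for k = 8 trials, so the disagreement expected from chance alone is k − 1 = 7:

\[ I^2 = \frac{8.87 - 7}{8.87} \times 100\% = \frac{1.87}{8.87} \times 100\% \approx 21.1\% \]

This matches the I² of 21.1% in the same output. Had Q been smaller than 7, the numerator would have been negative and I² would have been reported as 0%.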

16.6.3 Evaluating Causes of Heterogeneity

Subgroup Analysis

Identify a specific subgroup of studies defined by a characteristic of interest and examine the effect within that subgroup. However, results should be interpreted with caution: the best estimate for any subgroup is provided by considering all the evidence (Stein’s Paradox) rather than the subgroup data alone. Subgroup analyses should be pre-specified in the review protocol.

Stratified Analysis

Data are stratified by a factor thought to influence the treatment effect, and a separate meta-analysis is carried out in each stratum. The between-strata heterogeneity can be tested using Q_B = Q_T − ΣQ_S. A disadvantage is that individual strata may contain few studies.
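The partition Q_B = Q_T − ΣQ_S can be sketched in base R as follows. The effects come from the eight hypothetical trials in this lesson's metafor example; the two strata ("A" and "B") are an arbitrary, purely hypothetical split invented for illustration, not a pre-specified or substantively meaningful grouping.

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

yi <- log((ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c))  # log odds ratios
vi <- 1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c)   # within-study variances

# HYPOTHETICAL stratifying factor (illustration only): trials 1-4 vs trials 5-8
stratum <- rep(c("A", "B"), each = 4)

# Cochran's Q for any set of effects, using inverse-variance weights
cochran_q <- function(y, v) {
  w <- 1 / v
  sum(w * (y - sum(w * y) / sum(w))^2)
}

q_total   <- cochran_q(yi, vi)
q_within  <- tapply(seq_along(yi), stratum,
                    function(idx) cochran_q(yi[idx], vi[idx]))
q_between <- q_total - sum(q_within)

# Q_B is referred to a chi-squared distribution with (number of strata - 1) df
p_between <- pchisq(q_between, df = length(q_within) - 1, lower.tail = FALSE)

round(c(Q_T = q_total, Q_B = q_between, p_B = p_between), 3)
round(q_within, 3)
```

Because this split was chosen after the fact, its test of Q_B illustrates the arithmetic only; a real stratified analysis would define the strata in the protocol. Note also that each stratum here holds just four trials, which is the small-stratum disadvantage mentioned above.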

Galbraith Plot

A Galbraith plot plots the Z statistic (T_i/SE(T_i)) against the inverse of the SE (1/SE). The slope of the resulting line is the overall fixed-effect estimate, and lines at ±2 units from this line should encompass 95% of observations if there is no significant heterogeneity. Points outside these bounds are potential outliers contributing to heterogeneity.
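The quantities behind a Galbraith plot are simple to compute by hand. The base-R sketch below assumes the eight hypothetical trials from this lesson's metafor example and uses illustrative variable names; it fits the line through the origin, checks which points fall more than 2 units from it, and draws the plot only in an interactive session.

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

yi <- log((ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c))  # log odds ratios
se <- sqrt(1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c))

z_stat    <- yi / se   # vertical axis: Z statistic
precision <- 1 / se    # horizontal axis: 1/SE

# Least-squares line through the origin; its slope is the fixed-effect estimate
slope <- sum(precision * z_stat) / sum(precision^2)

# Vertical distance of each study from the line, in Z units
dist_from_line <- z_stat - slope * precision
outliers <- which(abs(dist_from_line) > 2)

if (interactive()) {
  plot(precision, z_stat, xlim = c(0, max(precision)),
       xlab = "1 / SE", ylab = "Z statistic", pch = 19)
  abline(a = 0,  b = slope)            # fixed-effect line
  abline(a = 2,  b = slope, lty = 2)   # upper bound (+2)
  abline(a = -2, b = slope, lty = 2)   # lower bound (-2)
}

round(dist_from_line, 2)
outliers
```

For these eight trials no study falls outside the ±2 band, which is consistent with the low I² reported in this lesson's console output. The metafor package also provides a radial() function for drawing this kind of plot directly from a fitted model.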

Meta-Regression

Meta-regression is the most flexible approach: a weighted regression of observed treatment effects against study-level predictors (with inverse variance weights). It extends the random-effects model by adding predictors. Cautions: (1) even with RCTs, meta-regression is observational; (2) multiple comparisons inflate Type I error; (3) ecological fallacy applies since predictors are study-level averages.

Reflection

You conduct a meta-analysis of 20 studies and find I² = 82%. The forest plot shows widely scattered effect sizes. What steps would you take to investigate the causes of this high heterogeneity? Which methods from this section would you prioritise and why?

Model answer: With I² = 82% a single pooled estimate is uninformative; the priority is to explain the heterogeneity, not bury it. Steps: (a) re-examine the forest plot for outliers and clusters; (b) pre-specified subgroup analyses on population (age, sex, severity), intervention (dose, duration, formulation), comparator (active vs. placebo), and methodological quality (low vs. high risk of bias); (c) meta-regression on continuous covariates (mean age, baseline severity, year, follow-up length); (d) leave-one-out sensitivity; (e) re-extract outcomes to check for definitional drift (e.g., "response" defined differently across trials). Prioritise meta-regression on substantive moderators because it both quantifies and explains, and prioritise risk-of-bias subgrouping because high-RoB studies often drive heterogeneity for non-substantive reasons. If no moderator explains it, report each subgroup separately rather than forcing a single pooled number.

Knowledge Check: this section

1. In a forest plot, what does the area of the box on each study line represent?

The sample size of the study
The weight assigned to the study in the meta-analysis
The p-value of the study result

In a forest plot, the area of the box is proportional to the weight assigned to that study in the meta-analysis. Studies with more precise estimates (smaller SEs) receive larger weights and thus larger boxes.

2. A meta-analysis reports I² = 75%. How should this be interpreted?

75% of studies found a statistically significant effect
The treatment effect is 75% larger than the control
75% of the variance between studies is due to heterogeneity rather than chance

I² = 75% means that 75% of the observed variance between study results is attributable to real heterogeneity rather than sampling error. This is considered “high” heterogeneity, and an investigation of its causes is warranted.

3. What is meta-regression used for in a meta-analysis?

Evaluating whether study-level characteristics explain heterogeneity in treatment effects
Computing the pooled estimate of effect
Testing whether publication bias is present

Meta-regression is a weighted regression of observed treatment effects against study-level predictors. It is the most flexible approach for evaluating whether specific study characteristics (e.g., study design, population, intervention type) explain heterogeneity.

Section 4 of 5

Publication Bias, Influential Studies & Data Issues

⏱ Estimated reading time: 50 minutes

Three threats that survive correct technique.

Publication bias

When the pool is a biased sample

Studies with significant or favourable results are more likely to be published. A pool that includes only published studies will therefore tend to overestimate the true effect.

The mechanism

Null results sit in file drawers. Positive results reach journals. The published record is a skewed sample of all research conducted.

The defence

Search grey literature and trial registries upstream. Use funnel plots and Egger's test downstream.

Funnel plots

Detecting asymmetry in the evidence pool

Begg's test: rank correlation. Egger's test: regression. Both have low sensitivity below 20 studies.

Trim-and-fill

Adjusting for publication bias (Duval & Tweedie, 2000)

Step 1: Trim

Remove the most extreme studies on the over-represented side until the funnel is roughly symmetric. Estimate a new pooled effect from the trimmed set.

Step 2: Fill

Replace removed studies and add hypothetical mirror counterparts on the sparse side. Re-estimate the pooled effect.

The difference between original and adjusted estimates quantifies potential impact. Treat it as a sensitivity analysis, not a correction.

Influential studies

Leave-one-out sensitivity analysis

Repeat the meta-analysis \(k\) times, each time omitting one study. Watch how the pooled estimate and \(I^2\) shift.

Worked example from the lesson

25-study pool • pooled estimate = −2.121 • \(I^2\) = 95.6%
Remove refid 218: estimate → −2.011 (5% shift) • \(I^2\) → 88.1%
That single study had a documentable and meaningful influence on the result.

Outcome-scale issues

When pooling is inappropriate

Different scales

Continuous outcomes on different instruments require standardised mean differences (Cohen’s d, Hedges’ g), not raw mean differences.

Mixed outcome types

Binary, continuous, and time-to-event data require conversion to a compatible effect measure before pooling.

Duplicate publication

The same patients appearing in multiple papers inflate apparent study count. Check for overlapping cohorts.

Carry forward

What to take into the final assessment

Publication bias shifts pooled estimates away from the null. Funnel plots and trim-and-fill are diagnostics, not cures.
Leave-one-out analysis identifies whether a single study drives your conclusions.
Outcome-scale mismatches make pooling inappropriate regardless of model choice.
All three threats are manageable if the protocol anticipates them before extraction begins.

Introduction and Overview

Earlier sections produced and inspected a pooled estimate. This section addresses three threats that can survive correct technique: publication bias (some studies never make it to the literature), influential studies (a single trial driving the entire pooled result), and outcome-scale issues that make it inappropriate to pool studies that look superficially similar. Each of these is a place where a meta-analysis can produce a precise but misleading answer.

Learning Objectives

Define publication bias and explain why it biases pooled estimates away from the null.
Read a funnel plot, apply Begg's and Egger's tests, and interpret the trim-and-fill adjustment.
Identify influential studies via leave-one-out sensitivity analysis and decide how to report their effect.
Recognize when outcome-scale issues (binary vs continuous, time-to-event, dose-response) make pooling inappropriate.

16.7 Publication Bias

A critical concern in meta-analysis is publication bias: studies with statistically significant or favourable results are more likely to be published than those with null or unfavourable results. Consequently, published studies may represent a biased subset of all work conducted on a topic.

Why Publication Bias Matters

If the meta-analysis only includes published studies, and published studies tend to overestimate the effect, the summary estimate will be biased away from the null. This can lead to erroneous conclusions about the effectiveness of interventions.

16.7.1 Detecting Publication Bias: The Funnel Plot

A funnel plot displays each study’s SE (or its inverse, 1/SE) plotted against its estimated effect. In the absence of publication bias, the plot should resemble an inverted funnel, symmetric around the summary estimate, with small studies (large SEs) scattered widely at the bottom and large studies (small SEs) clustered near the top.

Interpreting a Funnel Plot

Asymmetry in the funnel plot suggests publication bias. For example, if studies with large effects and large SEs are present, but studies with small or null effects and large SEs are missing (a “gap” on one side), this suggests that null-result studies were not published. However, asymmetry can also arise from other factors, so interpretation should be cautious.

16.7.2 Statistical Tests for Publication Bias

Two commonly used tests evaluate the relationship between study results and their precision:

Begg’s test: A rank correlation between effect estimates and their SEs. Simple but low power with few studies.
Egger’s test: A linear regression approach that is generally more sensitive at detecting publication bias (Egger et al., 1997).

Neither test is very sensitive when the number of studies is small (<20), and both may produce false positives when there are large treatment effects, few events per trial, or all trials are of similar size. For comprehensive guidance on interpreting funnel-plot asymmetry, see Sterne et al. (2011).
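The regression idea behind Egger's test can be sketched in base R. This is the classical unweighted form, in which each study's standardised effect (T_i/SE) is regressed on its precision (1/SE) and an intercept far from zero signals asymmetry. The data are the eight hypothetical trials from this lesson's metafor example, and the variable names are illustrative.

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

yi <- log((ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c))  # log odds ratios
se <- sqrt(1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c))

# Classical Egger regression: standardised effect on precision
std_effect <- yi / se
precision  <- 1 / se
egger_fit  <- lm(std_effect ~ precision)

# The test of interest is on the intercept, not the slope
egger_intercept <- coef(summary(egger_fit))["(Intercept)", ]
round(egger_intercept, 3)
```

With only eight trials this test has little power, so a non-significant intercept here would say very little, in line with the caution above about fewer than 20 studies. The metafor package offers regtest() and ranktest() for regression and rank-correlation tests of asymmetry; their default specifications differ from this classical form.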

16.7.3 Trim-and-Fill Method

How Trim-and-Fill Works

The trim-and-fill method (Duval and Tweedie, 2000) is a practical approach to assessing and adjusting for publication bias:

“Trim”: Produce a funnel plot and sequentially omit the most extreme studies on one side until the plot is approximately symmetrical.
Determine the centre of the trimmed, symmetrical plot (a new estimate of the treatment effect).
“Fill”: Replace the omitted studies along with their hypothetical “counterparts” on the other side of the centre line.
Redo the meta-analysis including both the original data and the hypothetical studies.

This provides an estimate of what the treatment effect would be if all studies had been published. The difference between the original and adjusted estimates indicates the potential impact of publication bias.

16.8 Influential Studies

It is important to determine whether individual studies have a profound influence on the summary estimate. A study might be much larger than others or have an extreme effect size. To evaluate this, sequentially delete each study from the meta-analysis and observe how the summary estimate changes.

Example: Sensitivity Analysis

In a meta-analysis of 25 studies with a pooled estimate of −2.121, one study (refid 218) was identified as a potential outlier in the Galbraith plot. Removing it changed the estimate to −2.011 (a 5% reduction in magnitude) and reduced I² from 95.6% to 88.1%. While the heterogeneity remained high, the analysis demonstrated that this single study had a meaningful influence on the results.
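The 25-study data behind that example are not reproduced in this lesson, so the base-R sketch below illustrates the same leave-one-out procedure on the eight hypothetical trials from the metafor example instead. The helper dl_pool() is a hypothetical name introduced here; it returns the DerSimonian-Laird pooled estimate and I² for whatever studies it is given.

```r
# Eight hypothetical trials from this lesson (events / total per arm)
ev_t <- c(48, 54, 31, 84, 21, 64, 36, 15)
n_t  <- c(120, 160, 100, 220, 85, 240, 130, 90)
ev_c <- c(30, 32, 21, 48, 18, 60, 28, 14)
n_c  <- n_t

yi <- log((ev_t * (n_c - ev_c)) / ((n_t - ev_t) * ev_c))  # log odds ratios
vi <- 1/ev_t + 1/(n_t - ev_t) + 1/ev_c + 1/(n_c - ev_c)   # within-study variances

# Hypothetical helper: DerSimonian-Laird pooled estimate and I^2 (in %)
dl_pool <- function(y, v) {
  w     <- 1 / v
  k     <- length(y)
  th_fe <- sum(w * y) / sum(w)
  q     <- sum(w * (y - th_fe)^2)
  tau2  <- max(0, (q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
  w_re  <- 1 / (v + tau2)
  c(estimate = sum(w_re * y) / sum(w_re),
    I2       = max(0, 100 * (q - (k - 1)) / q))
}

full <- dl_pool(yi, vi)

# Re-run the analysis k times, omitting one study each time
loo <- t(sapply(seq_along(yi), function(i) dl_pool(yi[-i], vi[-i])))
rownames(loo) <- paste("omit Trial", seq_along(yi))

round(rbind(all = full, loo), 3)
```

Reading down the table shows how far the pooled log-OR and I² move when each trial is dropped; a row that departs sharply from the "all" row marks a study worth re-examining. The metafor package provides leave1out() to do the same for a fitted model.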

16.9 Outcome Scales and Data Issues

Published studies vary substantially in how they present data. Several practical issues arise:

Computing SEs

Different Scales

Combining Outcomes

Reflection

You produce a funnel plot for your meta-analysis and notice asymmetry; there appear to be “missing” studies with small effects and large standard errors. What are the possible explanations for this pattern beyond publication bias? How would you investigate further?

Model answer: Funnel asymmetry has several explanations beyond publication bias. (1) Small-study effects: small trials are often run in more selected populations, with greater treatment fidelity and shorter follow-up; they really do produce larger effects, independent of selective publication. (2) Methodological quality differences: smaller studies frequently have higher risk of bias (poor randomisation, less blinding), which inflates effects. (3) True heterogeneity: if effects differ by subgroup and smaller studies happen to be drawn from high-effect subgroups, you get asymmetry without selection. (4) Chance, especially with fewer than 10 trials. Investigation: Egger’s test or its variants for formal asymmetry, trim-and-fill as sensitivity, search trial registries and grey literature for unpublished work, and stratify the funnel plot by risk-of-bias categories. Conclude carefully: asymmetry is a flag, not a diagnosis.


Knowledge Check: This Section

1. What pattern in a funnel plot suggests publication bias?

Perfect symmetry around the pooled estimate
Asymmetry, with “missing” studies on one side of the plot
All studies clustered at the bottom of the funnel

An asymmetric funnel plot, where studies with certain characteristics (e.g., small effects and large SEs) appear to be “missing,” suggests publication bias. However, asymmetry can also be caused by other factors such as heterogeneity.

2. What is the purpose of the trim-and-fill method?

To remove low-quality studies from the meta-analysis
To standardise effect sizes across different outcome scales
To estimate the treatment effect adjusted for potential publication bias

The trim-and-fill method creates a symmetrical funnel plot by adding hypothetical “missing” studies, then re-estimates the pooled effect. This provides an adjusted estimate showing what the result might be if all studies had been published.

3. Why is sensitivity analysis (sequentially removing studies) important in meta-analysis?

To determine whether the summary estimate is driven by a single influential study
To increase the statistical power of the meta-analysis
To convert between fixed- and random-effects models

Sensitivity analysis identifies studies that have a disproportionate influence on the summary estimate. If removing one study substantially changes the result, it warrants careful evaluation of that study’s quality and characteristics.

HSCI 230, Lesson 2
