Integrated Appraisal of Epidemiological Research

Evaluating Epidemiological Research

Learning objectives for this lesson:

Read a published study as a structured sequence of inferential decisions
Apply reporting frameworks (STROBE, CONSORT, PRISMA) to evaluate completeness
Conduct stepwise critical appraisal: question clarity, design alignment, internal validity, statistical inference, and external validity
Systematically identify selection bias, measurement error, and confounding threats
Distinguish red flags from quality indicators in epidemiological publications
Synthesize evidence across multiple studies, weighting by methodological rigor
Apply disciplined skepticism grounded in the methodological knowledge built throughout this course

This course was developed by Dr. Kiffer G. Card, Faculty of Health Sciences, Simon Fraser University.

Reference

Glossary: Key Terms, People & Concepts

This glossary collects the key concepts, frameworks, and biases you will use in this capstone lesson, and throughout your epidemiology career, to evaluate research.

Critical Appraisal Concepts

Critical Appraisal The systematic process of evaluating the trustworthiness, value, and relevance of a research study, asking whether the study’s answer is likely to be true, what it actually estimates, and how it should be applied.

Inferential Chain The sequence of decisions linking a research question to a conclusion: question, design, sample, measurement, analysis, interpretation. Each link can break the chain.

Internal Validity Whether the estimated association reflects the true effect within the studied sample, free of confounding, selection, and information bias.

External Validity Whether the findings transfer to other populations, settings, or time periods. Distinct from, and impossible without, internal validity.

Hierarchy of Evidence A ranking of study designs by how well they protect against bias for causal questions (typically: systematic reviews/meta-analyses > RCTs > cohort > case-control > cross-sectional > case series). The hierarchy is a heuristic, not a verdict.

Evidence Synthesis Combining findings across studies to draw a more robust conclusion than any single study supports, through systematic reviews, meta-analyses, or narrative synthesis.

Red Flag A feature of a paper that signals likely methodological weakness, e.g., undisclosed conflicts, vague hypotheses, post-hoc subgroup analyses presented as primary results, or implausible precision.

Quality Indicator A feature that signals methodological rigour, e.g., pre-registration, clear research questions, appropriate design, transparent reporting, and honest discussion of limitations.

Disciplined Skepticism Reading studies neither credulously nor cynically: assuming neither that “published” means “true” nor that all research is unreliable. The stance this lesson aims to cultivate.

Frameworks & Reporting Guidelines

STROBE Strengthening the Reporting of Observational Studies in Epidemiology, a checklist for reporting cohort, case-control, and cross-sectional studies. Useful as both a writing guide and an appraisal lens.

CONSORT Consolidated Standards of Reporting Trials, the analogous reporting framework for randomized controlled trials, including the now-ubiquitous flow diagram.

PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses, the framework for transparent reporting of evidence syntheses, including the search-and-screening flow diagram.

GRADE Grading of Recommendations Assessment, Development and Evaluation, a system for rating certainty in a body of evidence (high/moderate/low/very low) and the strength of recommendations derived from it.

PICO(T) Population, Intervention/Exposure, Comparator, Outcome (and Timeframe), a structure for sharpening research questions and matching them to designs.

Bradford Hill’s Viewpoints Nine considerations Hill (1965) proposed for moving from association to causation: strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy. Heuristics, not a checklist.

Risk of Bias Assessment A structured judgement (e.g., Cochrane RoB 2, ROBINS-I) about whether and how a study’s design and conduct could distort its findings, used in systematic reviews and GRADE ratings.

Bias & Threat Vocabulary (Recap)

Selection Bias Distortion arising when inclusion or retention in a study depends on both exposure and outcome (e.g., Berkson’s bias, healthy worker effect, attrition bias).

Information Bias Distortion arising from how exposures, outcomes, or covariates are measured (e.g., recall bias, observer bias, misclassification, social desirability bias).

Confounding A common cause of exposure and outcome that is not on the causal pathway. The original concern of epidemiology and the workhorse threat to causal inference.

Temporal Biases Biases tied to how time is allocated and measured: immortal time bias, lead-time bias, length bias, prevalence–incidence bias, time-window bias.

Ecological & Atomistic Fallacies Mismatched levels of inference, i.e., drawing individual conclusions from group data (ecological) or group conclusions from purely individual data (atomistic).

Key People

Sir Austin Bradford Hill (1897–1991) Designed the first modern randomised controlled trial (streptomycin, 1948) and proposed the “viewpoints” for causal inference still taught today.

Archie Cochrane (1909–1988) Scottish epidemiologist whose advocacy for randomized trials and synthesis of evidence inspired the Cochrane Collaboration and the modern systematic review.

David Sackett (1934–2015) Clinician-epidemiologist often called the father of evidence-based medicine; pioneered the disciplined integration of best evidence with clinical expertise and patient values.

Kenneth Rothman Author of Modern Epidemiology; clarified causal pies, bias quantification, and the misuse of statistical significance.

Sander Greenland Epidemiologist whose writings on confounding, p-values, and the misuse of statistical significance have shaped contemporary practice.

Miguel Hernán Epidemiologist (Harvard) whose target-trial framework gives observational research a clearer link to causal questions and structured appraisal.

Gordon Guyatt Internist who coined “evidence-based medicine” and led the development of the GRADE framework for rating certainty in evidence.


Section 2 of 4

Stepwise Critical Appraisal

⏱ Estimated reading time: 25 minutes


A five-stage procedure, from question clarity through external validity.

Five stages in order

The appraisal sequence

Stage 3 draws on the full bias inventory from earlier lessons. It is usually the most time-consuming stage.

Stage 1

Question clarity

Exposure defined?

Outcome specific?

Population identified?

Plausibility grounded?

A well-specified question constrains the study design, the measurement strategy, and the analytic approach. Vagueness at stage one propagates through all later stages.

Stage 3 in detail

Internal validity: three threat categories

Selection

Representative sampling? Differential loss to follow-up? Collider bias from restricting or conditioning on a shared effect?

Measurement

Differential misclassification biases toward or away from the null. Nondifferential misclassification generally weakens true effects.

Confounding

Confounders chosen from a directed acyclic graph, or by statistical testing? Could unmeasured or residual confounding remain?

Stage 4

Statistical inference: significance is not importance

A statistically significant hazard ratio of 1.02 with a narrow confidence interval is a precisely estimated, trivially small effect.

Confidence interval

\[ \color{#0B7B6B}{\hat{\theta}} \pm \color{#6D28D9}{z_{\alpha/2}} \cdot \color{#1D4ED8}{\text{SE}(\hat{\theta})} \]

θ̂: point estimate; z_α/2: critical value (1.96 for 95%); SE(θ̂): standard error

The interval tells you about precision. The effect size tells you about relevance. Multiple comparisons without pre-specification inflate false-positive rates regardless of how precise the estimates are.
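The inflation from multiple comparisons can be made concrete with a little arithmetic. The sketch below is an idealisation: it assumes k independent tests of true null hypotheses at α = 0.05 (real subgroup analyses are usually correlated) and computes the chance of at least one false positive. The five p-values passed to base R's p.adjust() are hypothetical.

```r
alpha <- 0.05
k <- c(1, 5, 10, 20)                    # number of independent comparisons
fwer <- 1 - (1 - alpha)^k               # P(at least one false positive)
data.frame(comparisons = k, family_wise_error = round(fwer, 3))

# Hypothetical p-values from five subgroup analyses
p_raw <- c(0.004, 0.020, 0.030, 0.045, 0.200)
p_holm <- p.adjust(p_raw, method = "holm")   # Holm step-down correction
round(p_holm, 3)
```

Under these assumptions, twenty comparisons give roughly a 64% chance of at least one spurious "significant" result, and only one of the five hypothetical p-values stays below 0.05 after the Holm correction.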

Carry forward

The worked example: five stages applied

Retrospective cohort: low blood vitamin D at hospital admission, odds ratio 2.5 (95% confidence interval 1.4 to 4.5) for intensive care admission, adjusted for age, sex, and body mass index.

Collider bias

Restricting to hospitalized patients conditions on a shared effect of vitamin D status and disease severity, opening a spurious association.

Reverse causation

Measuring vitamin D at admission, after illness has set in, cannot establish that low levels came before the intensive care admission.

Introduction and Overview

An earlier section named the seven decisions every paper makes and the three reporting frameworks that audit them. This section converts that conceptual material into a working procedure. The five stages below are how you actually move through a paper from start to finish, in the order that lets you catch problems before they propagate. Each stage maps directly onto material from earlier lessons, so you should recognize the underlying ideas as you work through them.

Learning Objectives

Apply a five-stage stepwise appraisal procedure (question, design, internal validity, statistical inference, external validity) in the right order.
Map each stage onto the relevant prior lessons (e.g., Stage 3 to the bias inventory of earlier lessons).
Follow a worked example end-to-end and produce a structured written appraisal.
Recognize when a study's weakest link makes its strongest claims unsupportable.

A Five-Stage Appraisal Framework

Critical appraisal is most effective when conducted systematically. Rather than reading a study and forming a vague impression, work through five distinct stages, each targeting a specific aspect of inferential quality. This framework integrates concepts from every preceding lesson in this course. The subsections below walk through each stage in order; the worked example that follows applies all five to a real-world COVID-era observational study.

Stage 1: Clarity and Plausibility of the Research Question

Before evaluating methods, assess the question itself:

Exposure: Is it well-defined and measurable? Could it be operationalized differently?
Outcome: Is it specific and clinically or epidemiologically meaningful?
Population: Is the target population clearly identified?
Plausibility: Does the proposed relationship have biological or social plausibility? Is there prior evidence?

A well-specified question constrains the study design, measurement strategy, and analytic approach. If the question is vague, every subsequent decision becomes difficult to evaluate.

Stage 2: Design Alignment

Does the study design appropriately address the research question?

A question about causation is best addressed by an RCT or, when experiments are infeasible, a well-designed cohort study with strong confounding control.
A question about prevalence calls for a cross-sectional design with probability sampling.
A question about rare outcomes is efficiently addressed by a case-control study.
A question requiring evidence synthesis calls for a systematic review or meta-analysis.

Ask: Would an alternative design have provided stronger evidence with fewer threats to validity? Design misalignment does not necessarily invalidate a study, but it limits the strength of conclusions that can be drawn.

Stage 3: Internal Validity, Bias Identification

This is the most detailed stage. Systematically identify potential biases using concepts from earlier lessons:

Selection processes:

Was sampling representative or could selection bias have distorted results?
Could collider bias have been introduced by conditioning on a common effect (e.g., restricting to hospitalized patients, adjusting for an intermediate variable)?
Was there differential loss to follow-up or non-response?
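To see how conditioning on a common effect can create an association, here is a minimal sketch with made-up numbers. It assumes two factors that are independent in the population (labelled low vitamin D and severe illness purely for illustration) and a hypothetical selection rule in which each raises the chance of hospitalization. All probabilities are assumptions, not estimates from any study.

```r
# Hypothetical population: low vitamin D and severe illness are independent
cells <- expand.grid(low_vit_d = c(0, 1), severe = c(0, 1))
p_low <- 0.30
p_severe <- 0.30
cells$p_pop <- ifelse(cells$low_vit_d == 1, p_low, 1 - p_low) *
  ifelse(cells$severe == 1, p_severe, 1 - p_severe)

# Assumed selection mechanism: both factors raise the chance of hospitalization
cells$p_hosp <- 0.05 + 0.40 * cells$low_vit_d + 0.40 * cells$severe
cells$p_in_study <- cells$p_pop * cells$p_hosp

# Odds ratio from the four cell probabilities of a 2x2 table
odds_ratio <- function(p, d) {
  p11 <- p[d$low_vit_d == 1 & d$severe == 1]
  p00 <- p[d$low_vit_d == 0 & d$severe == 0]
  p10 <- p[d$low_vit_d == 1 & d$severe == 0]
  p01 <- p[d$low_vit_d == 0 & d$severe == 1]
  (p11 * p00) / (p10 * p01)
}

or_population <- odds_ratio(cells$p_pop, cells)
or_hospitalized <- odds_ratio(cells$p_in_study, cells)
round(c(population = or_population, hospitalized = or_hospitalized), 2)
```

In the full population the odds ratio is 1 by construction; among the hospitalized it falls to about 0.21. Here the induced association happens to be inverse. Other selection mechanisms can push it in either direction, which is why the selection process has to be reasoned through rather than assumed harmless.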

Measurement error:

Could differential misclassification have biased results toward or away from the null?
Could non-differential misclassification have attenuated a true effect?
Were validated instruments used? Were they validated in the study population?
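The attenuation from non-differential misclassification can be checked with expected counts. The sketch below assumes a hypothetical case-control table and an exposure measure with 80% sensitivity and 90% specificity that performs identically in cases and controls; every number is illustrative.

```r
# Hypothetical true 2x2 table (counts) from a case-control study
true_counts <- c(case_exp = 200, case_unexp = 300, ctrl_exp = 100, ctrl_unexp = 400)

# Assumed exposure-measurement accuracy, identical in cases and controls
sens <- 0.80
spec <- 0.90

# Expected observed counts after misclassifying exposure within one group
misclassify <- function(n_exp, n_unexp, sens, spec) {
  obs_exp <- sens * n_exp + (1 - spec) * n_unexp
  c(exposed = obs_exp, unexposed = n_exp + n_unexp - obs_exp)
}

cases <- misclassify(true_counts[["case_exp"]], true_counts[["case_unexp"]], sens, spec)
controls <- misclassify(true_counts[["ctrl_exp"]], true_counts[["ctrl_unexp"]], sens, spec)

or_true <- (true_counts[["case_exp"]] * true_counts[["ctrl_unexp"]]) /
  (true_counts[["case_unexp"]] * true_counts[["ctrl_exp"]])
or_observed <- (cases[["exposed"]] * controls[["unexposed"]]) /
  (cases[["unexposed"]] * controls[["exposed"]])
round(c(true = or_true, observed = or_observed), 2)
```

With these inputs the true odds ratio of 2.67 is observed as about 1.94: the error is the same in both groups, yet the estimate still moves toward the null.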

Confounding control:

Were confounders identified using a causal model (DAG) or only selected based on statistical significance?
Could unmeasured or residual confounding remain?
Were specification errors present (e.g., adjusting for mediators, adjusting for colliders, incorrect functional forms)?
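A small simulation shows what confounding control is meant to achieve. This is a sketch under stated assumptions: one measured confounder, linear effects, and an exposure with no true effect on the outcome. The coefficients, sample size, and seed are arbitrary choices.

```r
set.seed(230)                            # arbitrary seed, for reproducibility only
n <- 5000
confounder <- rnorm(n)
exposure <- 0.8 * confounder + rnorm(n)  # confounder raises exposure
outcome <- 0.5 * confounder + rnorm(n)   # exposure has NO effect on the outcome

# Crude model ignores the confounder; adjusted model includes it
crude <- coef(lm(outcome ~ exposure))[["exposure"]]
adjusted <- coef(lm(outcome ~ exposure + confounder))[["exposure"]]
round(c(crude = crude, adjusted = adjusted), 2)
```

The crude coefficient should land near 0.24 even though the exposure does nothing, while the adjusted coefficient should sit near zero. The adjustment works here only because the confounder is measured and the model is correctly specified, which is exactly what Stage 3 asks you to question.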

Empirical reanalysis example: When Hernán and colleagues (2008) reanalysed observational data from the Nurses’ Health Study using eligibility criteria and timing conventions that mirrored the Women’s Health Initiative RCT, the observational estimate for hormone therapy and heart disease shifted substantially, illustrating how selection processes and analytic choices can drive results.

Stage 4: Statistical Inference

Even with good design and minimal bias, statistical inference can go wrong:

Model assumptions: Are distributional assumptions justified? Is the sample large enough for asymptotic methods?
Uncertainty quantification: Are confidence intervals reported? Are they appropriately interpreted?
Multiple testing: Were multiple comparisons made without correction? Were subgroup analyses pre-specified or post hoc?
Model selection: Were many models fit and only the “best” reported? Could selective reporting inflate false positive rates?
Effect sizes over significance: Does the study emphasize the magnitude and precision of effects, or does it reduce everything to p < 0.05 vs. p ≥ 0.05?

A study that reports “statistically significant” results with a hazard ratio of 1.02 and a narrow confidence interval has detected a precisely estimated but trivial effect: a 2% higher rate, real enough to detect in a large sample yet far too small to change any decision. Statistical significance does not equal clinical or public health significance (Greenland et al., 2016).
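The hazard ratio of 1.02 can be turned into a quick calculation. The sketch below applies the usual estimate ± critical value × standard error interval on the log scale; the standard error of 0.005 is a hypothetical value chosen to represent a very large study.

```r
hr <- 1.02
se_log_hr <- 0.005                  # hypothetical standard error of log(HR)
z_crit <- qnorm(0.975)              # about 1.96 for a 95% interval

# Wald interval built on the log scale, then back-transformed
ci <- exp(log(hr) + c(-1, 1) * z_crit * se_log_hr)
z_stat <- log(hr) / se_log_hr
p_value <- 2 * pnorm(-abs(z_stat))  # two-sided p-value

round(c(HR = hr, lower = ci[1], upper = ci[2]), 3)
signif(p_value, 2)
```

Under this assumption the interval runs from about 1.010 to 1.030 and the p-value is far below 0.001, yet the effect is still a 2% difference in rate. The arithmetic establishes precision; relevance is a separate judgement.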

Stage 5: External Validity and Transportability

External validity asks whether results apply beyond the study sample:

Sample representativeness: Does the study sample represent the target population? Highly selected samples (academic medical centers, volunteer cohorts) may not.
Contextual differences: Results from one healthcare system may not transport to another. Social determinants, cultural factors, and healthcare access differ across settings.
Effect modification: If the exposure–outcome relationship varies across subgroups, transporting the average effect to a population with a different subgroup distribution could be misleading.
Temporal validity: Medical practice, environmental exposures, and population characteristics change over time. Results from the 1990s may not apply today.

External validity turns on more than sample size. A large but highly selected sample may have less external validity than a smaller but representative one.
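Effect modification and transportability can be illustrated with a weighted average. The sketch below assumes two age strata with different, hypothetical risk differences, and compares the average effect in a study sample that is mostly older with a target population that is mostly younger. All shares and effects are invented for illustration.

```r
# Hypothetical stratum-specific risk differences (exposed minus unexposed)
rd <- c(older = 0.10, younger = 0.02)

# Share of each stratum in the study sample and in a hypothetical target population
w_study <- c(older = 0.70, younger = 0.30)
w_target <- c(older = 0.20, younger = 0.80)

# Average effect = stratum effects weighted by each population's composition
avg_study <- sum(rd * w_study)
avg_target <- sum(rd * w_target)
round(c(study = avg_study, target = avg_target), 3)
```

The same stratum-specific effects give an average risk difference of 0.076 in the study sample but 0.036 in the target population, so carrying the study's average over unchanged would roughly double the expected benefit.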

Applying the Framework: A Worked Example

Case: Observational Study of Vitamin D and COVID-19 Severity

A retrospective cohort study reports that patients with low serum vitamin D levels at hospital admission had 2.5 times the odds of ICU admission compared to those with sufficient levels (OR = 2.5, 95% CI: 1.4–4.5), adjusted for age, sex, and BMI.

Appraisal Stage	Assessment
Question clarity	Reasonably clear: exposure (vitamin D level), outcome (ICU admission), population (hospitalized COVID patients)
Design alignment	Retrospective cohort using hospital records; appropriate for this question but has inherent limitations
Internal validity	Major concerns: collider bias (restricting to hospitalized patients conditions on a collider); confounding by illness severity (sicker patients may have lower vitamin D due to acute-phase response, not baseline deficiency); measurement timing (at admission, not pre-illness)
Statistical inference	Adjusted for only 3 confounders; likely residual confounding; no sensitivity analyses reported
External validity	Single hospital, limited generalizability; hospitalized population does not represent all COVID patients

Conclusion: Despite a statistically significant and seemingly large effect, the inferential chain has several weak links that undermine causal interpretation. Two stand out. Collider bias: studying only already-hospitalized patients can manufacture a vitamin D and severity link that would not hold in the general population. Reverse causation: illness can itself lower vitamin D, so a low level may be a marker of being sick rather than a cause of getting sicker.
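A quick arithmetic check on the reported numbers is a useful habit at Stage 4. The sketch below assumes the published interval is a Wald interval, which is symmetric on the log scale, and back-calculates the standard error from OR = 2.5 (95% CI 1.4 to 4.5). The variable names are illustrative.

```r
or_reported <- 2.5
lci <- 1.4
uci <- 4.5

# Standard error of log(OR) implied by the width of the 95% CI
se_log_or <- (log(uci) - log(lci)) / (2 * qnorm(0.975))

# A Wald interval is centred on the point estimate on the log scale,
# so the geometric mean of the limits should be close to the reported OR
implied_or <- sqrt(lci * uci)
z_stat <- log(or_reported) / se_log_or

round(c(se_log_or = se_log_or, implied_or = implied_or, z = z_stat), 2)
```

The implied midpoint (about 2.51) matches the reported estimate, so the numbers are internally consistent, and the z statistic of roughly 3.1 confirms statistical significance. None of that addresses collider bias or reverse causation, which is why the statistical check comes after, not instead of, the bias assessment.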

Reflection

Think of a health study you have encountered in the news or in a course. Walk through the five appraisal stages. Which stage reveals the most significant threat to the study’s conclusions? How would you communicate this limitation to a non-expert audience?

Model answer: Pick a recent media-prominent study (e.g., the meta-analysis of red meat and CHD; the IARC processed-meat classification; a COVID booster trial) and walk through the five stages in the order this section uses them. Stage 1 (question clarity): are the exposure, outcome, and target population defined precisely enough to evaluate? Stage 2 (design alignment): does the design fit the question, or would another design give stronger evidence? Stage 3 (internal validity): work through selection, information, and confounding biases. Stage 4 (statistical inference): read the effect size and the CI width, and ask whether multiple comparisons or post-hoc subgroups inflate the result. Stage 5 (external validity): does the study sample represent the people you care about? The most significant threat varies by study; commonly it is residual confounding for observational designs, or selection / loss-to-follow-up for trials. Communicating to a non-expert: avoid jargon, name the specific alternative explanation ("healthier eaters tend to do all the other healthy things too, so we cannot tell from this study alone whether the food itself matters"), and end with what evidence would change the conclusion.


Section 3 of 4

Red Flags, Quality Indicators, and Applied Synthesis

⏱ Estimated reading time: 25 minutes


Pattern recognition and evidence synthesis across multiple papers.

Red flags

Warning signs, not disqualifiers

Implausibly large effect sizes for common exposures and outcomes
Inconsistent sample sizes across tables or figures
After-the-fact subgroup analyses presented as primary findings
Adjustment for mediators as if they were confounders
Conclusions that go well beyond what the data support
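Why mediator adjustment is a red flag can be shown with a short simulation. This sketch assumes a simple linear chain in which the exposure affects the outcome only through a mediator; the coefficients, sample size, and seed are arbitrary.

```r
set.seed(12)                             # arbitrary seed, for reproducibility only
n <- 5000
exposure <- rnorm(n)
mediator <- 0.6 * exposure + rnorm(n)    # exposure changes the mediator
outcome <- 0.5 * mediator + rnorm(n)     # mediator changes the outcome

# Total effect of exposure (true value 0.6 * 0.5 = 0.3)
total_effect <- coef(lm(outcome ~ exposure))[["exposure"]]

# Treating the mediator as if it were a confounder blocks the causal pathway
after_mediator_adj <- coef(lm(outcome ~ exposure + mediator))[["exposure"]]
round(c(total = total_effect, mediator_adjusted = after_mediator_adj), 2)
```

The unadjusted estimate should recover a total effect near 0.3, while the mediator-adjusted estimate should fall to about zero. A paper that adjusts this way can report "no association" for an exposure that genuinely matters.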

Ioannidis (2005)

Small samples, flexible designs, undisclosed analytical freedom, and many teams chasing significance together make false positives mathematically likely in certain research environments.

Quality indicators

Features that warrant heightened confidence

Pre-registration

Directed acyclic graph

Validated instruments

Sensitivity analyses

Independent replication

Open data and code

E-value reported

Quality indicators are positive evidence, not merely the absence of red flags. The E-value (VanderWeele and Ding, 2017) quantifies how strong unmeasured confounding would need to be to explain away the observed association.

Applied synthesis

Weighting by design quality, not sample size

Study A: large, weak

Cross-sectional, convenience sample, crude measures, minimal confounding control. A large sample does not resolve the which-came-first problem built into the design.

Study B: smaller, stronger

Prospective design, validated instruments, exposure shown to precede outcome, comprehensive confounding control. Smaller sample, stronger inference.

Meta-analysis mechanics

Inverse-variance pooling

Fixed-effect pooled log odds ratio

\[ \color{#0B7B6B}{\hat{\theta}_{\text{pool}}} = \frac{\sum_i \color{#6D28D9}{w_i} \color{#C2410C}{\hat{\theta}_i}}{\sum_i \color{#6D28D9}{w_i}}, \quad \color{#6D28D9}{w_i} = \frac{1}{\color{#1D4ED8}{\text{SE}_i^2}} \]

θ̂_pool: pooled estimate; θ̂_i: each study's estimate; w_i: study weight (inverse variance); SE_i: standard error of study i

Studies with smaller standard errors contribute more weight. A pooled estimate can still be a precise estimate of a biased number if the underlying studies share a common flaw. The appraisal from earlier sections is the prerequisite for any synthesis.

Figure: Forest plot of five studies with odds ratios and confidence intervals, marker size proportional to study weight, and a pooled diamond near 1.26. Inverse-variance weighting gives the most precise studies the largest markers; the diamond is the pooled estimate. Precision of the pool says nothing about whether the inputs were valid.

A case study in consequences

The Wakefield vaccine retraction

The failures

Twelve-child convenience sample, selective ascertainment, undisclosed financial conflicts of interest, causal claims from a case series without a comparison group.

The consequence

Vaccine hesitancy affecting immunization rates years after the 2010 retraction. Systematic appraisal of the original paper would have identified every inferential failure.

Carry forward

Disciplined skepticism is calibrated, not cynical

The goal is not to dismiss every study that has limitations. Every study has limitations.

Calibrated uncertainty means stating precisely where the inferential chain is weakest, how much that matters for the conclusions, and what evidence would reduce the remaining uncertainty. The capstone reflection and final assessment below are your opportunity to demonstrate that capacity.

Introduction and Overview

An earlier section gave you a step-by-step procedure for working through a single paper. This section adds two complementary tools: pattern-recognition for warning signs that cut across study types, and a procedure for synthesising evidence when multiple studies disagree. The cards below are not a substitute for the five-stage appraisal; they are heuristics that flag which papers warrant the most careful working through.

Learning Objectives

Recognize common red flags, such as implausible effect sizes, inconsistent sample sizes, lack of transparency, post-hoc subgroups, and mediator adjustment, across study designs.
Identify quality indicators (pre-registration, transparent methods, sensitivity analyses, replication) that warrant heightened trust in a study.
Synthesize evidence across multiple studies, weighting by methodological rigor rather than counting positive results.
Articulate calibrated uncertainty: state what the evidence does and does not support, and where residual uncertainty is greatest.

Red Flags in Published Research

With experience, certain patterns signal that a study’s results may be less trustworthy than they appear. These are not definitive disqualifiers, but they warrant heightened scrutiny. Ioannidis (2005) showed analytically why: small samples, flexible designs, multiple teams chasing significance, and selective reporting all inflate the rate of false positives, sometimes to the point where most published findings in a field are wrong. The replication crisis that followed makes the red flags below worth memorising.
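The core of the Ioannidis argument fits in one formula: the positive predictive value of a significant finding, PPV = (1 − β)R / ((1 − β)R + α), where R is the pre-study odds that a tested relationship is real. The sketch below implements that bias-free version only; the two scenarios are hypothetical inputs chosen to contrast a well-powered confirmatory setting with an underpowered exploratory one.

```r
# Positive predictive value of a "significant" finding, ignoring bias
# power = 1 - beta; prior_odds = ratio of true to null relationships tested
ppv <- function(power, alpha, prior_odds) {
  (power * prior_odds) / (power * prior_odds + alpha)
}

# Hypothetical scenarios
well_powered <- ppv(power = 0.80, alpha = 0.05, prior_odds = 1)
exploratory <- ppv(power = 0.20, alpha = 0.05, prior_odds = 0.1)
round(c(well_powered = well_powered, exploratory = exploratory), 2)
```

With these inputs about 94% of significant findings are true in the first scenario and about 29% in the second, before any allowance for flexible analysis or selective reporting, which would lower both figures.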

Implausible Effect Sizes
Inconsistent Sample Sizes
Lack of Transparency
Post Hoc Subgroups
Mediator Adjustment
Interpretive Overreach

Red flags are warning signs. The flip side, the features that increase confidence in a paper, are the quality indicators below. Read them as the positive checklist that should accompany the negative one.

Quality Indicators

Just as red flags suggest potential problems, certain features indicate methodological rigor and transparency.

Quality Indicator	Why It Matters
Explicit DAGs or causal diagrams	Shows the investigators have thought carefully about causal structure, confounders, mediators, and colliders before analyzing data
Transparent, complete reporting	Follows reporting guidelines; includes flow diagrams, all pre-specified analyses, and both positive and null results
Validated measurement instruments	Indicates exposure and outcome were measured using tools with established reliability and validity in the study population
Analytic strategy aligned with design	Statistical methods appropriate for the data structure (e.g., survival analysis for time-to-event, multilevel models for clustered data)
Sensitivity and bias analyses	Tests robustness of results to alternative assumptions (e.g., E-values for unmeasured confounding, which quantify how strong a hidden confounder would need to be to explain away the result (VanderWeele & Ding, 2017); quantitative bias analysis for misclassification)
Open science practices	Pre-registration, data sharing, open-access code, and registered reports reduce opportunities for selective reporting
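The E-value mentioned in the table has a closed form for a risk ratio: E = RR + sqrt(RR × (RR − 1)). The sketch below applies it to the odds ratio from the vitamin D worked example (OR = 2.5, 95% CI 1.4 to 4.5). Because ICU admission among hospitalized patients is unlikely to be rare, it uses the square root of the odds ratio as a rough stand-in for the risk ratio; treat that conversion as an approximation, and the function name as illustrative rather than a package API.

```r
# E-value for a risk ratio; estimates below 1 are inverted first
e_value <- function(rr) {
  rr <- ifelse(rr < 1, 1 / rr, rr)
  rr + sqrt(rr * (rr - 1))
}

or_est <- 2.5
or_lower <- 1.4                     # confidence limit closer to the null

# Assumption: outcome is common, so sqrt(OR) roughly approximates the RR
e_point <- e_value(sqrt(or_est))
e_limit <- e_value(sqrt(or_lower))
round(c(point_estimate = e_point, ci_limit = e_limit), 2)
```

Under these assumptions an unmeasured confounder would need risk-ratio associations of about 2.5 with both exposure and outcome to explain away the point estimate, and only about 1.65 to move the confidence limit to the null. Illness severity could plausibly be that strong, so the finding is fragile.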

R Activity: Pooling published odds ratios with inverse-variance meta-analysis

The companion R script r-activities/HSCI_230_Lesson_12_Integrated_Appraisal_of_Epidemiological_Research.R walks you through a first-look fixed-effect meta-analysis: starting from three published ORs and their 95% CIs, you derive log-OR standard errors, compute inverse-variance weights, and pool the studies into a single summary OR with a CI, then cross-check the calculation against metafor::rma(). As the capstone activity, it makes the central appraisal point concrete: pooling only makes sense after each input study has survived the five-stage appraisal you learned in an earlier section.

Critical appraisal often ends with the question: given several studies, what does the evidence as a whole say? A simple inverse-variance fixed-effect meta-analysis is just a few lines of code.

# Three hypothetical studies of exposure E and outcome Y
study  <- c("Study A", "Study B", "Study C")
or     <- c(1.40, 1.10, 1.55)
lci    <- c(1.05, 0.85, 1.10)
uci    <- c(1.85, 1.43, 2.18)

# SE of log-OR derived from the published 95% CI
log_or <- log(or)
log_se <- (log(uci) - log(lci)) / (2 * 1.96)
w      <- 1 / log_se^2                         # inverse-variance weights

# Pooled log-OR and 95% CI
pool_log <- sum(w * log_or) / sum(w)
pool_se  <- sqrt(1 / sum(w))
pool_or  <- exp(pool_log)
pool_ci  <- exp(pool_log + c(-1, 1) * 1.96 * pool_se)
round(c(pooled_OR = pool_or, lower = pool_ci[1], upper = pool_ci[2]), 2)

Console output

pooled_OR     lower     upper
     1.30      1.10      1.53

Synthesis is a tool, not a verdict. A pooled OR of 1.30 (1.10-1.53) suggests a real but modest association. But the meta-analysis only makes sense if the underlying studies are themselves trustworthy; the appraisal you just learned is the prerequisite.

R Reflect on what you just ran

Use the questions below to interpret the output you produced. Look at your console before answering.

1. The three individual ORs were 1.40 (Study A), 1.10 (Study B), and 1.55 (Study C). The pooled OR came out at 1.30. Which study contributed the most weight to the pooled estimate, and how can you tell that from the published CIs alone (without seeing the raw w values)?

Model answer: Study B contributed the most weight. In an inverse-variance pool a study is weighted by 1 divided by its squared standard error, and the standard error sets the width of the confidence interval, so the narrowest CI marks the most precise study and the largest weight. Compare widths on the ratio scale the interval is built on, by dividing each upper limit by its lower limit: Study A gives 1.85/1.05 = 1.76, Study B gives 1.43/0.85 = 1.68, and Study C gives 2.18/1.10 = 1.98. Study B has the smallest ratio, so its interval is the tightest and it carries the most weight, even though its OR of 1.10 is the lowest of the three. The narrowest CI, not the OR that happens to sit closest to the pooled value, is what tells you which study dominates.

2. The pooled 95% CI (1.10-1.53) is narrower than any of the three individual study CIs. Why does pooling tighten the CI, and what assumption about the underlying studies must hold for that narrower CI to be a fair summary?

Model answer: Pooling tightens the CI because precision adds: variance(weighted-average) = 1 / sum(1/var_i), so combining studies with finite variance always yields a smaller variance than any single study. For that narrower CI to be a fair summary, the studies must be estimating the same true effect, the fixed-effect assumption, or, under random-effects, the random effects must be drawn from a single distribution with constant between-study variance and no systematic bias common to all studies. Heterogeneity in design, populations, or risk-of-bias breaks that assumption, so a narrow pooled CI without a heterogeneity check is overconfident.

3. Suppose Study A had used a cross-sectional design with weak confounder control and Study C had used a prospective cohort with strong adjustment. Why does the lesson warn that a precise pooled OR can be "a precise estimate of a biased number," and how does that warning change how you would report this result?

Model answer: "A precise estimate of a biased number" is the lesson's headline because pooled estimates aggregate both precision and bias. If Study A (cross-sectional, weak confounder control) contributed the most weight, its biases, namely reverse causation, residual confounding, and possibly selection, are baked into the pooled point estimate and amplified by its tight CI. Pooling cannot fix study-level bias; it can only fix sampling variability. Reporting fix: present pooled estimates stratified by design and risk-of-bias category (so the reader sees how much of the pooled OR depends on the weaker studies), report sensitivity analyses excluding high-risk studies, and use ROBINS-I / GRADE to communicate certainty rather than just the numerical CI.
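The heterogeneity check mentioned in the second answer can be sketched in base R. The block below recomputes the inverse-variance weights for the three hypothetical studies and adds Cochran's Q and the I² statistic. With only three studies the Q test has little power, so read the result as a rough indication rather than a verdict.

```r
# Same three hypothetical studies as in the activity
or <- c(1.40, 1.10, 1.55)
lci <- c(1.05, 0.85, 1.10)
uci <- c(1.85, 1.43, 2.18)

log_or <- log(or)
log_se <- (log(uci) - log(lci)) / (2 * 1.96)
w <- 1 / log_se^2                               # inverse-variance weights
pool_log <- sum(w * log_or) / sum(w)            # fixed-effect pooled log-OR

# Cochran's Q: weighted squared deviations from the pooled estimate
q_stat <- sum(w * (log_or - pool_log)^2)
df <- length(or) - 1
p_het <- pchisq(q_stat, df = df, lower.tail = FALSE)

# I-squared: share of total variation attributed to between-study differences
i_squared <- max(0, (q_stat - df) / q_stat)
round(c(Q = q_stat, p = p_het, I2 = i_squared), 2)
```

For these inputs Q is about 2.86 on 2 degrees of freedom (p about 0.24) and I² is about 0.30, which suggests modest heterogeneity that the test cannot distinguish from chance. The statistic speaks only to variation in estimates; it cannot detect a bias that all three studies share.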


Applied Synthesis: Evaluating Conflicting Evidence

In practice, you will encounter studies that reach different conclusions about the same question. Synthesis requires comparing studies on their results and on their methodological rigor.

Exercise: Conflicting Studies on Screen Time and Adolescent Mental Health

Consider two studies examining the association between screen time and depressive symptoms in adolescents:

Feature	Study A	Study B
Design	Cross-sectional survey	Prospective cohort, 2-year follow-up
Sample	n = 50,000 (convenience sample via online platform)	n = 3,200 (population-based, probability sample)
Exposure measure	Single question: “How many hours per day do you use screens?”	Validated time-use diary collected at 3 time points
Outcome measure	Single-item mood rating	PHQ-A (validated depression screener)
Confounders adjusted	Age and sex only	Age, sex, SES, parental mental health, physical activity, sleep, prior depression
Reported effect	r = 0.35, p < 0.001, “Strong link”	β = 0.04, 95% CI: −0.02 to 0.10, “Minimal association”

Synthesis: Study A has a far larger sample but weaker design (cross-sectional, convenience sample, crude measures, minimal confounding control). Study B is smaller but has stronger design, better measurement, temporal ordering, and comprehensive confounding control. A rigorous synthesis would weight Study B’s evidence more heavily despite the smaller sample and less dramatic effect size.
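One reason a p-value from a very large sample deserves little weight on its own is that sample size alone can drive it. The sketch below uses the standard t-test for a Pearson correlation and a deliberately tiny hypothetical correlation of 0.02 (not a value from either study), evaluated at sample sizes matching Study A and Study B.

```r
# Two-sided p-value for a Pearson correlation r observed in n participants
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

r_tiny <- 0.02                                   # hypothetical, explains 0.04% of variance
p_large <- cor_p_value(r = r_tiny, n = 50000)    # Study A-sized sample
p_small <- cor_p_value(r = r_tiny, n = 3200)     # Study B-sized sample
signif(c(n_50000 = p_large, n_3200 = p_small), 2)
```

The same negligible correlation is highly significant at n = 50,000 and unremarkable at n = 3,200. Study A's p < 0.001 therefore adds little to the appraisal; its design, sampling, and measurement carry the weight of the judgement.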

Key Integration Principle

Evidence evaluation is probabilistic, not binary. No single study is perfect; no single study is worthless. The goal is to synthesize across studies, weighting each by its methodological rigor, and to exercise disciplined skepticism grounded in methodological knowledge rather than blanket dismissal or uncritical acceptance. Formal frameworks such as GRADE (Guyatt et al., 2008) and the Cochrane RoB 2 tool for trials (Sterne et al., 2019) operationalise this idea: they rate certainty on a continuum rather than declaring a single study right or wrong.

What Disciplined Skepticism Is Not

Disciplined skepticism is not cynicism or nihilism about research. It does not mean dismissing every study because “it’s just observational” or “you can prove anything with statistics.” Rather, it means applying the specific analytical skills you have developed in this course to identify precisely where evidence is strong and where it is uncertain, and calibrating your confidence accordingly.

Research Integrity: Why Critical Appraisal Matters

The clearest illustration of why appraisal cannot be outsourced is the Wakefield MMR-autism fraud. Wakefield et al.’s (1998) case series in The Lancet claimed an association between MMR vaccination and autism. Several threats to validity were visible on appraisal from the start: a tiny, selectively ascertained convenience sample and a case-series design with no comparison group, which cannot establish an association. Undisclosed conflicts of interest and fabricated data came to light later, and the paper was formally retracted twelve years after publication (The Lancet, 2010). Trained appraisers identified the design problems early; uncritical citation propagated a public-health harm that persists today.

Reflection

Consider a health topic where you have seen conflicting media headlines (e.g., coffee and health, red meat and cancer). How would you apply the synthesis principles from this section to resolve the apparent conflict? What methodological features would you prioritize in deciding which evidence to weight most heavily?

Model answer

For headlines that lurch between "coffee causes cancer" and "coffee prevents Parkinson's," the synthesis principles say: (a) weight by design and bias: randomised, pre-registered, prospective evidence beats cross-sectional dredging; (b) weight by outcome specificity: a meta-analysis of coffee and all-cause mortality across populations is more informative than one cohort's lung-cancer subgroup; (c) look for dose-response; (d) look for biological coherence and replication across diverse populations; (e) distinguish individual vs. population recommendations. Methodological features to prioritise: large prospective cohorts with biomarker validation of exposure, Mendelian randomisation analyses where genetic variants for caffeine metabolism act as natural instruments, and meta-analyses that pre-specify subgroup and dose-response analyses. Weight news headlines by these features, not by how recent or how dramatic the result is.

HSCI 230, Lesson 12
