Evaluating Evidence

How to assess the quality and strength of evidence, understand statistics, and avoid being misled by data.

Why Evidence Evaluation Matters

Evidence is the foundation of justified belief. But not all evidence is equal. A single anecdote, a controlled experiment, and a meta-analysis of 50 studies all count as "evidence," but they tell you very different things with very different levels of reliability.

The ability to distinguish strong evidence from weak evidence is the single most practical critical thinking skill.

Types of Evidence

| Type | What It Is | Strength | Weakness |
|---|---|---|---|
| Anecdotal | Personal stories, individual cases | Vivid, relatable | Not generalizable, prone to bias |
| Testimonial | Expert or witness statements | Quick access to expertise | Depends on expert's reliability |
| Observational | Systematic data collection without intervention | Can study natural behavior | Can't establish causation |
| Experimental | Controlled manipulation of variables | Can establish causation | Artificial conditions, limited scope |
| Statistical | Aggregated numerical data | Quantifiable, comparable | Can be manipulated, context-dependent |
| Documentary | Records, reports, official documents | Verifiable, traceable | May be incomplete or biased |

The Evidence Hierarchy

From strongest (top) to weakest (bottom):

STRONGEST
  ↑  Systematic reviews / Meta-analyses
  |  Randomized controlled trials (RCTs)
  |  Cohort studies
  |  Case-control studies
  |  Cross-sectional studies
  |  Case reports / Case series
  |  Expert opinion
  |  Anecdotal evidence
  ↓  "I heard somewhere that..."
WEAKEST

Why the Hierarchy Exists

| Level | Why It's Ranked There |
|---|---|
| Systematic reviews | Aggregates ALL studies on a topic; accounts for variation |
| RCTs | Random assignment controls for confounders; gold standard for causation |
| Cohort studies | Follows groups over time; good for observing trends |
| Case-control | Compares those with/without an outcome; useful but can't prove causation |
| Cross-sectional | Snapshot in time; shows association only |
| Case reports | Single cases; useful for hypotheses, not conclusions |
| Expert opinion | Valuable but filtered through one person's biases and knowledge |
| Anecdotes | Unreliable for generalizing; heavily influenced by memory and framing |

The hierarchy is a starting point, not a rigid rule. A well-done cohort study can outweigh a poorly designed RCT.

Sample Size and Representativeness

Sample Size

| Sample Size | Reliability | Appropriate For |
|---|---|---|
| n = 1 | Essentially zero | Generating hypotheses only |
| n = 10-30 | Very low | Pilot studies, early exploration |
| n = 100-500 | Moderate | Initial conclusions, with caveats |
| n = 1,000+ | Good | Reliable conclusions for most purposes |
| n = 10,000+ | Very good | Detecting small effects |

Key principle: Larger samples are more likely to reflect the true population. Small samples can easily produce misleading results through chance alone.
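
A minimal Python sketch of this principle (the sample sizes and the true rate are arbitrary illustration values): repeatedly estimate a known 50% rate from samples of different sizes and watch how far the estimates can stray by chance alone.

```python
import random

random.seed(42)
TRUE_RATE = 0.5  # the "population" value we are trying to estimate

def estimate_spread(n, trials=1_000):
    """Draw `trials` samples of size n and return the range of estimates."""
    estimates = [
        sum(random.random() < TRUE_RATE for _ in range(n)) / n
        for _ in range(trials)
    ]
    return min(estimates), max(estimates)

for n in (10, 100, 1_000, 10_000):
    lo, hi = estimate_spread(n)
    print(f"n = {n:>6}: estimates ranged from {lo:.3f} to {hi:.3f}")
# Small samples swing wildly around 0.5; large samples cluster tightly near it.
```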

Representativeness

A large but biased sample is worse than a smaller representative one.

| Biased Sample Problem | Example |
|---|---|
| Self-selection | Only people who care respond to surveys |
| Convenience sampling | Studying only college students, then generalizing to all adults |
| Survivorship bias | Only studying successful companies |
| Volunteer bias | People who volunteer for studies may differ systematically |

Ask: "Who was studied? Who was left out? Does the sample look like the population I'm generalizing to?"

Correlation vs. Causation

The most important statistical distinction in everyday reasoning.

Three Requirements for Causation

| Requirement | What It Means | Example |
|---|---|---|
| Correlation | The variables move together | Ice cream sales and drowning rates both increase in summer |
| Temporal precedence | The cause comes before the effect | Temperature rises before both increase |
| No confounds | No third variable explains the relationship | Summer (the real cause) drives both |

Why Correlation Isn't Causation

| Correlation | Possible Explanation |
|---|---|
| Countries that eat more chocolate win more Nobel Prizes | Wealth drives both chocolate consumption and research funding |
| Children who eat breakfast do better in school | Stable home environment drives both breakfast eating and academic support |
| Phone usage correlates with depression in teens | Depression may cause more phone use, not the reverse |

Causal Traps

| Trap | Description | Example |
|---|---|---|
| Reverse causation | The effect actually causes the "cause" | Do hospitals cause death? (Sicker people go to hospitals) |
| Confounding variable | A third factor causes both | Education correlates with income, but family wealth influences both |
| Coincidence | Pure chance | Correlation between Nicolas Cage films and pool drownings |
| Selection bias | The sample is skewed | Successful people credit their habits, but unsuccessful people may have the same habits |
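
A small simulation of the confounding trap (made-up coefficients; `statistics.correlation` requires Python 3.10+): temperature drives both ice cream sales and drownings, and the two end up strongly correlated even though neither causes the other.

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(7)
# Daily temperature is the hidden common cause (the confounder)
temps = [random.uniform(5, 35) for _ in range(365)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temps]   # driven by temperature
drownings = [0.3 * t + random.gauss(0, 2) for t in temps]   # also driven by temperature

r = statistics.correlation(ice_cream, drownings)
print(f"ice cream vs. drownings: r = {r:.2f}")  # strong correlation, zero causation
```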

Understanding P-Values and Statistical Significance

What a P-Value Actually Means

A p-value is the probability of seeing results at least as extreme as the observed results, IF the null hypothesis is true.

| P-Value | Common Interpretation | What It Actually Means |
|---|---|---|
| p < 0.05 | "Statistically significant" | If there were truly no effect, you'd see results this extreme less than 5% of the time |
| p < 0.01 | "Highly significant" | Less than a 1% chance of data this extreme under the null hypothesis |
| p = 0.06 | "Not significant" | A 6% chance under the null; nearly identical evidence to p = 0.04, yet treated very differently |
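
One way to make the definition concrete is simulation (a sketch assuming SciPy is installed): generate many experiments where the null is true by construction and count how often p dips below 0.05.

```python
import random
from scipy import stats  # assumes SciPy is available

random.seed(1)
trials, false_positives = 2_000, 0
for _ in range(trials):
    # Both groups come from the SAME distribution, so the null hypothesis is true
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"'significant' results with no real effect: {false_positives / trials:.1%}")
# Prints roughly 5%: exactly the false-positive rate p < 0.05 allows under the null.
```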

What P-Values Do NOT Mean

| Common Misunderstanding | Reality |
|---|---|
| "There's a 5% chance the result is due to chance" | No. It's the probability of the data given no effect |
| "The effect is large/important" | No. Statistical significance ≠ practical significance |
| "The result has been proven" | No. P-values say nothing about replication |
| "p > 0.05 means no effect" | No. It means we can't detect an effect with this study |

Statistical Significance vs. Practical Significance

| Scenario | Statistically Significant? | Practically Significant? |
|---|---|---|
| Drug reduces blood pressure by 0.5 mmHg (n = 100,000) | Yes | No. The effect is trivially small |
| New teaching method improves grades by 15% (n = 30) | Maybe not | Yes. If real, this matters a lot |
| Exercise reduces mortality risk by 30% (n = 50,000) | Yes | Yes. Large sample AND meaningful effect |

Always ask: "How big is the effect?" not just "Is it statistically significant?"
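
A stdlib-only sketch of the first row in the table above (invented numbers: a true 0.5 mmHg reduction, individual noise of 10 mmHg): with 100,000 people per arm, even this trivial effect is overwhelmingly "significant".

```python
import math
import random

random.seed(0)
n = 100_000
# Hypothetical trial: drug truly lowers blood pressure by 0.5 mmHg, noise sd = 10 mmHg
control = [random.gauss(0.0, 10.0) for _ in range(n)]
treated = [random.gauss(-0.5, 10.0) for _ in range(n)]

diff = sum(treated) / n - sum(control) / n
se = math.sqrt(2 * 10.0**2 / n)        # standard error of the mean difference
z = diff / se                          # z-test with known variance
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value, normal approximation
print(f"effect = {diff:+.2f} mmHg, p = {p:.1e}")
# Statistically overwhelming, clinically irrelevant: always check the effect size.
```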

Evaluating Study Quality

Red Flags in Research

| Red Flag | Why It's Concerning |
|---|---|
| No control group | Can't compare to baseline |
| No randomization | Groups may differ systematically |
| Small sample size | Results may be due to chance |
| No blinding | Participants and researchers can influence results |
| Conflicts of interest | Funding source may bias results |
| Cherry-picked outcomes | Reporting only positive findings |
| No pre-registration | Hypotheses may have been changed after seeing data |
| Single study, dramatic claim | Extraordinary claims need replication |

Questions to Ask About Any Study

  1. Who funded it? Industry-funded studies are more likely to find favorable results.
  2. How big was the sample? Small samples are unreliable.
  3. Was there a control group? Without one, you can't attribute effects to the treatment.
  4. Was it randomized? Without randomization, differences between groups could explain results.
  5. Was it blinded? If participants or researchers know who gets what, it introduces bias.
  6. Has it been replicated? A single study is a starting point, not a conclusion.
  7. What's the effect size? Statistically significant doesn't mean practically important.
  8. Was the study pre-registered? Pre-registration prevents HARKing (Hypothesizing After Results are Known).

The Replication Crisis

Many published findings, especially in psychology, medicine, and social sciences, don't hold up when other researchers try to replicate them.

Key Facts

| Finding | Source |
|---|---|
| Roughly 60% of 100 psychology studies failed to replicate | Reproducibility Project: Psychology (2015) |
| Only 6 of 53 landmark preclinical cancer studies (~11%) could be reproduced | Amgen (Begley & Ellis, 2012) |
| Studies with surprising results are less likely to replicate | Multiple meta-analyses |
| Small-sample studies are the most likely to fail replication | Basic statistics |

Implications for You

  • Don't treat any single study as definitive
  • Look for replications and meta-analyses
  • Be especially skeptical of surprising or counterintuitive findings from small studies
  • "A new study says..." is one of the weakest forms of evidence

Base Rates and Bayesian Thinking

What Are Base Rates?

The base rate is the underlying frequency of something in the population. Ignoring it leads to wildly wrong conclusions.

Classic example:

A disease affects 1 in 1,000 people. A test is 99% accurate (99% true positive, 1% false positive). You test positive. What's the probability you actually have the disease?

| Group | Size (per 1,000 people) | Test Positive | Actually Have Disease |
|---|---|---|---|
| Has disease | 1 | 0.99 | Yes |
| Doesn't have disease | 999 | 9.99 | No |
| Total positives | | 10.98 | |

Probability you have it: 0.99 / 10.98 ≈ 9%. Not 99%.

Most people guess 99%. The base rate (1 in 1,000) is crucial.
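
The same arithmetic as a reusable Bayes' rule helper (a sketch; the function and parameter names are ours, not from any library):

```python
def p_disease_given_positive(base_rate, sensitivity, false_positive_rate):
    """Bayes' rule: P(disease | positive test)."""
    true_positives = base_rate * sensitivity              # sick AND flagged
    false_positives = (1 - base_rate) * false_positive_rate  # healthy AND flagged
    return true_positives / (true_positives + false_positives)

# 1-in-1,000 disease, 99% sensitivity, 1% false-positive rate
print(f"{p_disease_given_positive(1/1000, 0.99, 0.01):.0%}")  # 9%, not 99%
```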

Bayesian Updating

Instead of treating beliefs as binary (true/false), treat them as probabilities that update with new evidence.

Prior belief → New evidence → Updated belief
  (what you believed)  (what you learned)  (what you now believe)

Practical process:

  1. Start with a prior probability (your initial estimate)
  2. Evaluate the new evidence (how likely is this evidence if your belief is true? If false?)
  3. Update your probability accordingly
  4. Repeat as new evidence arrives

Example:

| Step | Event | Probability It Will Rain |
|---|---|---|
| Prior | Weather forecast says 40% chance | 40% |
| Evidence 1 | You see dark clouds | → 65% |
| Evidence 2 | Barometer is falling | → 80% |
| Evidence 3 | Weather app updates to 20% | → 50% (conflicting evidence) |
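
A sketch of the same process in odds form (posterior odds = prior odds × likelihood ratio); the likelihood ratios below are hypothetical, chosen only to roughly reproduce the table:

```python
def update(prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prob / (1 - prob) * likelihood_ratio
    return odds / (1 + odds)

p = 0.40  # prior: forecast says 40% chance of rain
for evidence, lr in [("dark clouds", 2.8),        # evidence favoring rain
                     ("falling barometer", 2.2),   # more evidence for rain
                     ("app drops to 20%", 0.24)]:  # conflicting evidence against
    p = update(p, lr)
    print(f"after {evidence}: {p:.0%}")
# Prints roughly 65%, 80%, 50%, matching the table above.
```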

Key Principles

  • Strong prior + weak evidence = minor update
  • Weak prior + strong evidence = major update
  • Multiple independent evidence sources compound
  • Update incrementally, not all-or-nothing

Common Evidence Pitfalls

The Narrative Fallacy

We're wired to prefer stories over statistics. A single compelling story can override mountains of data.

| Scenario | Story | Data |
|---|---|---|
| "Smoking isn't that bad" | "My grandfather smoked and lived to 95" | Smokers die on average 10 years earlier |
| "Seatbelts are dangerous" | "My friend was saved by NOT wearing one" | Seatbelts reduce fatality risk by ~45% |
| "Vaccines cause harm" | "My child got sick after vaccination" | Vaccines prevent millions of deaths annually |

The story may be true. But it's one data point. Always check whether the story represents the typical case or an extreme outlier.

Denominator Blindness

We focus on the numerator (events that happened) and ignore the denominator (total opportunities for events).

Example: "5 people died from X this year!" Scary? Depends on the denominator:

  • 5 out of 10 = catastrophic (50%)
  • 5 out of 10,000,000 = trivially rare (0.00005%)

Always ask: "Out of how many?"

The Availability Cascade

Vivid, dramatic evidence gets remembered and shared more, creating the illusion that it's more common.

Why you think shark attacks are common: dramatic news stories.
Reality: you're more likely to be killed by a vending machine.

Practical Evidence Evaluation Framework

When confronted with any claim backed by "evidence," run through this:

| Step | Question | Action |
|---|---|---|
| 1 | What type of evidence is this? | Place it in the hierarchy |
| 2 | What's the sample size? | Small samples = weak evidence |
| 3 | Is there a control group? | No control = can't attribute causation |
| 4 | Who produced this evidence? | Check for conflicts of interest |
| 5 | Has it been replicated? | Single studies are unreliable |
| 6 | What's the effect size? | Significant ≠ important |
| 7 | What's the base rate? | Don't ignore prior probability |
| 8 | Does this fit with other evidence? | Convergent evidence is strongest |

Key Takeaways

  1. Not all evidence is equal. Learn the hierarchy and use it
  2. Correlation ≠ causation. Always look for confounds, reverse causation, and coincidence
  3. P-values are widely misunderstood. Statistical significance doesn't mean practical significance
  4. Single studies prove nothing. Look for replications and systematic reviews
  5. Base rates matter enormously. Ignoring them leads to wildly wrong conclusions
  6. Stories aren't data. Vivid anecdotes can override statistics but shouldn't
  7. Always ask "out of how many?" The denominator changes everything
  8. Update incrementally. Bayesian thinking beats binary true/false