Evaluating Evidence
How to assess the quality and strength of evidence, understand statistics, and avoid being misled by data.
Why Evidence Evaluation Matters
Evidence is the foundation of justified belief. But not all evidence is equal. A single anecdote, a controlled experiment, and a meta-analysis of 50 studies all count as "evidence," but they tell you very different things with very different levels of reliability.
The ability to distinguish strong evidence from weak evidence is the single most practical critical thinking skill.
Types of Evidence
| Type | What It Is | Strength | Weakness |
|---|---|---|---|
| Anecdotal | Personal stories, individual cases | Vivid, relatable | Not generalizable, prone to bias |
| Testimonial | Expert or witness statements | Quick access to expertise | Depends on expert's reliability |
| Observational | Systematic data collection without intervention | Can study natural behavior | Can't establish causation |
| Experimental | Controlled manipulation of variables | Can establish causation | Artificial conditions, limited scope |
| Statistical | Aggregated numerical data | Quantifiable, comparable | Can be manipulated, context-dependent |
| Documentary | Records, reports, official documents | Verifiable, traceable | May be incomplete or biased |
The Evidence Hierarchy
From strongest to weakest:
STRONGEST
↑ Systematic reviews / Meta-analyses
| Randomized controlled trials (RCTs)
| Cohort studies
| Case-control studies
| Cross-sectional studies
| Case reports / Case series
| Expert opinion
| Anecdotal evidence
↓ "I heard somewhere that..."
WEAKEST
Why the Hierarchy Exists
| Level | Why It's Ranked There |
|---|---|
| Systematic reviews | Aggregates ALL studies on a topic; accounts for variation |
| RCTs | Random assignment controls for confounders; gold standard for causation |
| Cohort studies | Follows groups over time; good for observing trends |
| Case-control | Compares those with/without an outcome; useful but can't prove causation |
| Cross-sectional | Snapshot in time; shows association only |
| Case reports | Single cases; useful for hypotheses, not conclusions |
| Expert opinion | Valuable but filtered through one person's biases and knowledge |
| Anecdotes | Unreliable for generalizing; heavily influenced by memory and framing |
The hierarchy is a starting point, not a rigid rule. A well-done cohort study can outweigh a poorly designed RCT.
Sample Size and Representativeness
Sample Size
| Sample Size | Reliability | Appropriate For |
|---|---|---|
| n = 1 | Essentially zero | Generating hypotheses only |
| n = 10-30 | Very low | Pilot studies, early exploration |
| n = 100-500 | Moderate | Initial conclusions, with caveats |
| n = 1,000+ | Good | Reliable conclusions for most purposes |
| n = 10,000+ | Very good | Detecting small effects |
Key principle: Larger samples are more likely to reflect the true population. Small samples can easily produce misleading results through chance alone.
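To make the principle concrete, here is a minimal simulation sketch (plain Python; the 50% true rate and the sample sizes are made-up illustration values). It draws many samples of each size from the same population and shows how much more the small samples swing by chance alone.

```python
import random

random.seed(42)

# Hypothetical setup: a yes/no outcome whose true population rate is 50%.
TRUE_RATE = 0.50

def sample_estimate(n: int) -> float:
    """Estimate the rate from one random sample of size n."""
    hits = sum(random.random() < TRUE_RATE for _ in range(n))
    return hits / n

for n in (10, 100, 1_000, 10_000):
    estimates = [sample_estimate(n) for _ in range(1_000)]
    print(f"n={n:>6}: estimates ranged from {min(estimates):.2f} to {max(estimates):.2f}")
```

At n = 10 the estimates can land anywhere from roughly 10% to 90%; at n = 10,000 they cluster tightly around the true 50%.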
Representativeness
A large but biased sample can be worse than a smaller, representative one.
| Biased Sample Problem | Example |
|---|---|
| Self-selection | Only people who care respond to surveys |
| Convenience sampling | Studying only college students, then generalizing to all adults |
| Survivorship bias | Only studying successful companies |
| Volunteer bias | People who volunteer for studies may differ systematically |
Ask: "Who was studied? Who was left out? Does the sample look like the population I'm generalizing to?"
Correlation vs. Causation
The most important statistical distinction in everyday reasoning.
Three Requirements for Causation
| Requirement | What It Means | Example |
|---|---|---|
| Correlation | The variables move together | Ice cream sales and drowning rates both increase in summer |
| Temporal precedence | The cause comes before the effect | Eating ice cream would have to precede the drownings it supposedly causes |
| No confounds | No third variable explains the relationship | Fails here: summer heat (the real cause) drives both |
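A short simulation sketch of this exact trap, with invented numbers: temperature is the only causal driver in the data-generating process, yet ice cream sales and drownings come out strongly correlated. (Uses `statistics.correlation`, available in Python 3.10+.)

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(1)

# Hypothetical daily data: temperature (the confounder) drives BOTH outcomes;
# neither outcome influences the other anywhere in this process.
temps = [random.uniform(5, 35) for _ in range(365)]
ice_cream_sales = [20 + 4 * t + random.gauss(0, 10) for t in temps]
drownings = [0.05 * t + random.gauss(0, 0.3) for t in temps]

r = statistics.correlation(ice_cream_sales, drownings)
print(f"correlation(ice cream, drownings) = {r:.2f}")  # strongly positive
```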
Why Correlation Isn't Causation
| Correlation | Possible Explanation |
|---|---|
| Countries that eat more chocolate win more Nobel Prizes | Wealth drives both chocolate consumption and research funding |
| Children who eat breakfast do better in school | Stable home environment drives both breakfast eating and academic support |
| Phone usage correlates with depression in teens | Depression may cause more phone use, not the reverse |
Causal Traps
| Trap | Description | Example |
|---|---|---|
| Reverse causation | The effect actually causes the "cause" | Do hospitals cause death? (Sicker people go to hospitals) |
| Confounding variable | A third factor causes both | Education correlates with income, but family wealth influences both |
| Coincidence | Pure chance | Correlation between Nicolas Cage films and pool drownings |
| Selection bias | The sample is skewed | Successful people credit their habits, but unsuccessful people may have the same habits |
Understanding P-Values and Statistical Significance
What a P-Value Actually Means
A p-value is the probability of seeing results at least as extreme as the observed results, IF the null hypothesis is true.
| P-Value | Common Interpretation | What It Actually Means |
|---|---|---|
| p < 0.05 | "Statistically significant" | If there were truly no effect, you'd see results this extreme less than 5% of the time |
| p < 0.01 | "Highly significant" | Less than 1% chance under null hypothesis |
| p = 0.06 | "Not significant" | 6% chance under the null; nearly identical evidence to p = 0.04, yet treated very differently |
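One way to internalize the definition is to simulate it. The sketch below (hypothetical coin-flip numbers) estimates a p-value directly: it asks how often a fair coin would produce a result at least as extreme as an observed 60 heads in 100 flips.

```python
import random

random.seed(7)

# Suppose we observe 60 heads in 100 flips and suspect the coin is biased.
# The p-value asks: if the coin were actually fair (the null hypothesis),
# how often would a result at least this extreme occur?
observed = 60
trials = 100_000

def heads(n: int = 100) -> int:
    return sum(random.random() < 0.5 for _ in range(n))

# Two-sided: count simulated results at least as far from 50 as the observed 60.
extreme = sum(abs(heads() - 50) >= abs(observed - 50) for _ in range(trials))
print(f"simulated p-value ≈ {extreme / trials:.3f}")  # around 0.06
```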
What P-Values Do NOT Mean
| Common Misunderstanding | Reality |
|---|---|
| "There's a 5% chance the result is due to chance" | No. It's the probability of the data given no effect |
| "The effect is large/important" | No. Statistical significance ≠ practical significance |
| "The result has been proven" | No. P-values say nothing about replication |
| "p > 0.05 means no effect" | No. It means we can't detect an effect with this study |
Statistical Significance vs. Practical Significance
| Scenario | Statistically Significant? | Practically Significant? |
|---|---|---|
| Drug reduces blood pressure by 0.5 mmHg (n=100,000) | Yes | No. The effect is trivially small |
| New teaching method improves grades by 15% (n=30) | Maybe not | Yes. If real, this matters a lot |
| Exercise reduces mortality risk by 30% (n=50,000) | Yes | Yes. Large sample AND meaningful effect |
Always ask: "How big is the effect?" not just "Is it statistically significant?"
Evaluating Study Quality
Red Flags in Research
| Red Flag | Why It's Concerning |
|---|---|
| No control group | Can't compare to baseline |
| No randomization | Groups may differ systematically |
| Small sample size | Results may be due to chance |
| No blinding | Participants and researchers can influence results |
| Conflicts of interest | Funding source may bias results |
| Cherry-picked outcomes | Reporting only positive findings |
| No pre-registration | Hypotheses may have been changed after seeing data |
| Single study, dramatic claim | Extraordinary claims need replication |
Questions to Ask About Any Study
- Who funded it? Industry-funded studies are more likely to find favorable results.
- How big was the sample? Small samples are unreliable.
- Was there a control group? Without one, you can't attribute effects to the treatment.
- Was it randomized? Without randomization, differences between groups could explain results.
- Was it blinded? If participants or researchers know who gets what, it introduces bias.
- Has it been replicated? A single study is a starting point, not a conclusion.
- What's the effect size? Statistically significant doesn't mean practically important.
- Was the study pre-registered? Pre-registration prevents HARKing (Hypothesizing After Results are Known).
The Replication Crisis
Many published findings, especially in psychology, medicine, and social sciences, don't hold up when other researchers try to replicate them.
Key Facts
| Finding | Source |
|---|---|
| Over half of 100 psychology studies failed to replicate | Reproducibility Project (Open Science Collaboration, 2015) |
| Only 6 of 53 landmark preclinical cancer studies could be reproduced | Amgen (Begley & Ellis, 2012) |
| Studies with surprising results are less likely to replicate | Multiple meta-analyses |
| Small-sample studies are the most likely to fail replication | Basic statistics |
Implications for You
- Don't treat any single study as definitive
- Look for replications and meta-analyses
- Be especially skeptical of surprising or counterintuitive findings from small studies
- "A new study says..." is one of the weakest forms of evidence
Base Rates and Bayesian Thinking
What Are Base Rates?
The base rate is the underlying frequency of something in the population. Ignoring it leads to wildly wrong conclusions.
Classic example:
A disease affects 1 in 1,000 people. A test is 99% accurate (99% true positive, 1% false positive). You test positive. What's the probability you actually have the disease?
| Group | Size (per 1,000 people) | Test positive |
|---|---|---|
| Has disease | 1 | 0.99 |
| Doesn't have disease | 999 | 9.99 |
| Total | 1,000 | 10.98 |
Probability you have it: 0.99 / 10.98 ≈ 9%. Not 99%.
Most people guess 99%. The base rate (1 in 1,000) is crucial.
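The same arithmetic as a small, reusable function (a minimal sketch of Bayes' rule; the function name and parameters are illustrative):

```python
def posterior(base_rate: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(condition | positive test), per Bayes' rule."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

p = posterior(base_rate=1 / 1000, sensitivity=0.99, false_positive_rate=0.01)
print(f"P(disease | positive test) = {p:.1%}")  # ≈ 9.0%, not 99%
```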
Bayesian Updating
Instead of treating beliefs as binary (true/false), treat them as probabilities that update with new evidence.
Prior belief (what you believed) → New evidence (what you learned) → Updated belief (what you now believe)
Practical process:
- Start with a prior probability (your initial estimate)
- Evaluate the new evidence (how likely is this evidence if your belief is true? If false?)
- Update your probability accordingly
- Repeat as new evidence arrives
Example:
| Step | Event | Probability It Will Rain |
|---|---|---|
| Prior | Weather forecast says 40% chance | 40% |
| Evidence 1 | You see dark clouds | → 65% |
| Evidence 2 | Barometer is falling | → 80% |
| Evidence 3 | Weather app updates to 20% | → 50% (conflicting evidence) |
Key Principles
- Strong prior + weak evidence = minor update
- Weak prior + strong evidence = major update
- Multiple independent evidence sources compound
- Update incrementally, not all-or-nothing
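Here is a minimal sketch of that process using the odds form of Bayes' rule. The likelihood ratios are illustrative guesses chosen to roughly mirror the rain example above, not measured values:

```python
# Each piece of evidence is summarized by a likelihood ratio:
#   LR = P(evidence | rain) / P(evidence | no rain)
# LR > 1 pushes the probability up; LR < 1 pushes it down.

def update(prob: float, likelihood_ratio: float) -> float:
    """One Bayesian update: convert to odds, apply the LR, convert back."""
    odds = prob / (1 - prob)
    odds *= likelihood_ratio
    return odds / (1 + odds)

p = 0.40  # prior: the forecast's 40%
for evidence, lr in [("dark clouds", 2.5),
                     ("falling barometer", 2.0),
                     ("app revises downward", 0.4)]:
    p = update(p, lr)
    print(f"after {evidence:<22} P(rain) = {p:.0%}")
```

Note how the third, conflicting piece of evidence pulls the probability partway back down rather than resetting it, which is exactly the incremental behavior the principles above describe.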
Common Evidence Pitfalls
The Narrative Fallacy
We're wired to prefer stories over statistics. A single compelling story can override mountains of data.
| Scenario | Story | Data |
|---|---|---|
| "Smoking isn't that bad" | "My grandfather smoked and lived to 95" | Smokers die on average 10 years earlier |
| "Seatbelts are dangerous" | "My friend was saved by NOT wearing one" | Seatbelts reduce fatality risk by ~45% |
| "Vaccines cause harm" | "My child got sick after vaccination" | Vaccines prevent millions of deaths annually |
The story may be true. But it's one data point. Always check whether the story represents the typical case or an extreme outlier.
Denominator Blindness
We focus on the numerator (events that happened) and ignore the denominator (total opportunities for events).
Example: "5 people died from X this year!" Scary? Depends on the denominator:
- 5 out of 10 = catastrophic (50%)
- 5 out of 10,000,000 = trivially rare (0.00005%)
Always ask: "Out of how many?"
The Availability Cascade
Vivid, dramatic evidence gets remembered and shared more, creating the illusion that it's more common.
Why you think shark attacks are common: dramatic attacks dominate the news and get shared.
Reality: you're more likely to be killed by a vending machine.
Practical Evidence Evaluation Framework
When confronted with any claim backed by "evidence," run through this:
| Step | Question | Action |
|---|---|---|
| 1 | What type of evidence is this? | Place it in the hierarchy |
| 2 | What's the sample size? | Small samples = weak evidence |
| 3 | Is there a control group? | No control = can't attribute causation |
| 4 | Who produced this evidence? | Check for conflicts of interest |
| 5 | Has it been replicated? | Single studies are unreliable |
| 6 | What's the effect size? | Significant ≠ important |
| 7 | What's the base rate? | Don't ignore prior probability |
| 8 | Does this fit with other evidence? | Convergent evidence is strongest |
Key Takeaways
- Not all evidence is equal. Learn the hierarchy and use it
- Correlation ≠ causation. Always look for confounds, reverse causation, and coincidence
- P-values are widely misunderstood. Statistical significance doesn't mean practical significance
- Single studies prove nothing. Look for replications and systematic reviews
- Base rates matter enormously. Ignoring them leads to wildly wrong conclusions
- Stories aren't data. Vivid anecdotes can override statistics but shouldn't
- Always ask "out of how many?" The denominator changes everything
- Update incrementally. Bayesian thinking beats binary true/false