Evaluating Evidence
How to assess the quality and strength of evidence, understand statistics, and avoid being misled by data.
Why Evidence Evaluation Matters
Evidence is the foundation of justified belief. But not all evidence is equal. A single anecdote, a controlled experiment, and a meta-analysis of 50 studies all count as "evidence," but they tell you very different things with very different levels of reliability.
The ability to distinguish strong evidence from weak evidence is the single most practical critical thinking skill.
Types of Evidence
| Type | What It Is | Strength | Weakness |
|---|---|---|---|
| Anecdotal | Personal stories, individual cases | Vivid, relatable | Not generalizable, prone to bias |
| Testimonial | Expert or witness statements | Quick access to expertise | Depends on expert's reliability |
| Observational | Systematic data collection without intervention | Can study natural behavior | Can't establish causation |
| Experimental | Controlled manipulation of variables | Can establish causation | Artificial conditions, limited scope |
| Statistical | Aggregated numerical data | Quantifiable, comparable | Can be manipulated, context-dependent |
| Documentary | Records, reports, official documents | Verifiable, traceable | May be incomplete or biased |
The Evidence Hierarchy
From strongest to weakest:
STRONGEST
↑ Systematic reviews / Meta-analyses
| Randomized controlled trials (RCTs)
| Cohort studies
| Case-control studies
| Cross-sectional studies
| Case reports / Case series
| Expert opinion
| Anecdotal evidence
↓ "I heard somewhere that..."
WEAKEST
Why the Hierarchy Exists
| Level | Why It's Ranked There |
|---|---|
| Systematic reviews | Aggregates ALL studies on a topic; accounts for variation |
| RCTs | Random assignment controls for confounders; gold standard for causation |
| Cohort studies | Follows groups over time; good for observing trends |
| Case-control | Compares those with/without an outcome; useful but can't prove causation |
| Cross-sectional | Snapshot in time; shows association only |
| Case reports | Single cases; useful for hypotheses, not conclusions |
| Expert opinion | Valuable but filtered through one person's biases and knowledge |
| Anecdotes | Unreliable for generalizing; heavily influenced by memory and framing |
The hierarchy is a starting point, not a rigid rule. A well-done cohort study can outweigh a poorly designed RCT.
Sample Size and Representativeness
Sample Size
| Sample Size | Reliability | Appropriate For |
|---|---|---|
| n = 1 | Essentially zero | Generating hypotheses only |
| n = 10-30 | Very low | Pilot studies, early exploration |
| n = 100-500 | Moderate | Initial conclusions, with caveats |
| n = 1,000+ | Good | Reliable conclusions for most purposes |
| n = 10,000+ | Very good | Detecting small effects |
Key principle: Larger samples are more likely to reflect the true population. Small samples can easily produce misleading results through chance alone.
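To make the principle concrete, here is a minimal simulation sketch (plain Python; the 50% true rate and the sample sizes are made-up illustration values). It draws many samples of each size from the same population and shows how much more the small samples swing by chance alone.

```python
import random

random.seed(42)

# Hypothetical setup: a yes/no outcome whose true population rate is 50%.
TRUE_RATE = 0.50

def sample_estimate(n: int) -> float:
    """Estimate the rate from one random sample of size n."""
    hits = sum(random.random() < TRUE_RATE for _ in range(n))
    return hits / n

for n in (10, 100, 1_000, 10_000):
    estimates = [sample_estimate(n) for _ in range(1_000)]
    print(f"n={n:>6}: estimates ranged from {min(estimates):.2f} to {max(estimates):.2f}")
```

At n = 10 the estimates can land anywhere from roughly 10% to 90%; at n = 10,000 they cluster tightly around the true 50%.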
Representativeness
A large but biased sample can be worse than a smaller, representative one.
| Biased Sample Problem | Example |
|---|---|
| Self-selection | Only people who care respond to surveys |
| Convenience sampling | Studying only college students, then generalizing to all adults |
| Survivorship bias | Only studying successful companies |
| Volunteer bias | People who volunteer for studies may differ systematically |
Ask: "Who was studied? Who was left out? Does the sample look like the population I'm generalizing to?"
Correlation vs. Causation
The most important statistical distinction in everyday reasoning.
Three Requirements for Causation
| Requirement | What It Means | Example |
|---|---|---|
| Correlation | The variables move together | Ice cream sales and drowning rates both increase in summer |
| Temporal precedence | The cause comes before the effect | Eating ice cream would have to precede the drownings it supposedly causes |
| No confounds | No third variable explains the relationship | Fails here: summer heat (the real cause) drives both |
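A short simulation sketch of this exact trap, with invented numbers: temperature is the only causal driver in the data-generating process, yet ice cream sales and drownings come out strongly correlated. (Uses `statistics.correlation`, available in Python 3.10+.)

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(1)

# Hypothetical daily data: temperature (the confounder) drives BOTH outcomes;
# neither outcome influences the other anywhere in this process.
temps = [random.uniform(5, 35) for _ in range(365)]
ice_cream_sales = [20 + 4 * t + random.gauss(0, 10) for t in temps]
drownings = [0.05 * t + random.gauss(0, 0.3) for t in temps]

r = statistics.correlation(ice_cream_sales, drownings)
print(f"correlation(ice cream, drownings) = {r:.2f}")  # strongly positive
```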
Why Correlation Isn't Causation
| Correlation | Possible Explanation |
|---|---|
| Countries that eat more chocolate win more Nobel Prizes | Wealth drives both chocolate consumption and research funding |
| Children who eat breakfast do better in school | Stable home environment drives both breakfast eating and academic support |
| Phone usage correlates with depression in teens | Depression may cause more phone use, not the reverse |
Causal Traps
| Trap | Description | Example |
|---|---|---|
| Reverse causation | The effect actually causes the "cause" | Do hospitals cause death? (Sicker people go to hospitals) |
| Confounding variable | A third factor causes both | Education correlates with income, but family wealth influences both |
| Coincidence | Pure chance | Correlation between Nicolas Cage films and pool drownings |
| Selection bias | The sample is skewed | Successful people credit their habits, but unsuccessful people may have the same habits |
Understanding P-Values and Statistical Significance
What a P-Value Actually Means
A p-value is the probability of seeing results at least as extreme as the observed results, IF the null hypothesis is true.
| P-Value | Common Interpretation | What It Actually Means |
|---|---|---|
| p < 0.05 | "Statistically significant" | If there were truly no effect, you'd see results this extreme less than 5% of the time |
| p < 0.01 | "Highly significant" | Less than 1% chance under null hypothesis |
| p = 0.06 | "Not significant" | 6% chance under the null; nearly identical evidence to p = 0.04, yet treated very differently |
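One way to internalize the definition is to simulate it. The sketch below (hypothetical coin-flip numbers) estimates a p-value directly: it asks how often a fair coin would produce a result at least as extreme as an observed 60 heads in 100 flips.

```python
import random

random.seed(7)

# Suppose we observe 60 heads in 100 flips and suspect the coin is biased.
# The p-value asks: if the coin were actually fair (the null hypothesis),
# how often would a result at least this extreme occur?
observed = 60
trials = 100_000

def heads(n: int = 100) -> int:
    return sum(random.random() < 0.5 for _ in range(n))

# Two-sided: count simulated results at least as far from 50 as the observed 60.
extreme = sum(abs(heads() - 50) >= abs(observed - 50) for _ in range(trials))
print(f"simulated p-value ≈ {extreme / trials:.3f}")  # around 0.06
```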
What P-Values Do NOT Mean
| Common Misunderstanding | Reality |
|---|---|
| "There's a 5% chance the result is due to chance" | No. It's the probability of the data given no effect |
| "The effect is large/important" | No. Statistical significance ≠ practical significance |
| "The result has been proven" | No. P-values say nothing about replication |
| "p > 0.05 means no effect" | No. It means we can't detect an effect with this study |
Statistical Significance vs. Practical Significance
| Scenario | Statistically Significant? | Practically Significant? |
|---|---|---|
| Drug reduces blood pressure by 0.5 mmHg (n=100,000) | Yes | No. The effect is trivially small |
| New teaching method improves grades by 15% (n=30) | Maybe not | Yes. If real, this matters a lot |
| Exercise reduces mortality risk by 30% (n=50,000) | Yes | Yes. Large sample AND meaningful effect |
Always ask: "How big is the effect?" not just "Is it statistically significant?"
Evaluating Study Quality
Red Flags in Research
| Red Flag | Why It's Concerning |
|---|---|
| No control group | Can't compare to baseline |
| No randomization | Groups may differ systematically |
| Small sample size | Results may be due to chance |
| No blinding | Participants and researchers can influence results |
| Conflicts of interest | Funding source may bias results |
| Cherry-picked outcomes | Reporting only positive findings |
| No pre-registration | Hypotheses may have been changed after seeing data |
| Single study, dramatic claim | Extraordinary claims need replication |
Questions to Ask About Any Study
- Who funded it? Industry-funded studies are more likely to find favorable results.
- How big was the sample? Small samples are unreliable.
- Was there a control group? Without one, you can't attribute effects to the treatment.
- Was it randomized? Without randomization, differences between groups could explain results.
- Was it blinded? If participants or researchers know who gets what, it introduces bias.
- Has it been replicated? A single study is a starting point, not a conclusion.
- What's the effect size? Statistically significant doesn't mean practically important.
- Was the study pre-registered? Pre-registration prevents HARKing (Hypothesizing After Results are Known).
The Replication Crisis
Many published findings, especially in psychology, medicine, and social sciences, don't hold up when other researchers try to replicate them.
Key Facts
| Finding | Source |
|---|---|
| Over half of 100 psychology studies failed to replicate | Reproducibility Project (Open Science Collaboration, 2015) |
| Only 6 of 53 landmark preclinical cancer studies could be reproduced | Amgen (Begley & Ellis, 2012) |
| Studies with surprising results are less likely to replicate | Multiple meta-analyses |
| Small-sample studies are the most likely to fail replication | Basic statistics |
Implications for You
- Don't treat any single study as definitive
- Look for replications and meta-analyses
- Be especially skeptical of surprising or counterintuitive findings from small studies
- "A new study says..." is one of the weakest forms of evidence
Base Rates and Bayesian Thinking
What Are Base Rates?
The base rate is the underlying frequency of something in the population. Ignoring it leads to wildly wrong conclusions.
Classic example:
A disease affects 1 in 1,000 people. A test is 99% accurate (99% true positive, 1% false positive). You test positive. What's the probability you actually have the disease?
| Group | Size (per 1,000 people) | Test positive |
|---|---|---|
| Has disease | 1 | 0.99 |
| Doesn't have disease | 999 | 9.99 |
| Total | 1,000 | 10.98 |
Probability you have it: 0.99 / 10.98 ≈ 9%. Not 99%.
Most people guess 99%. The base rate (1 in 1,000) is crucial.
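The same arithmetic as a small, reusable function (a minimal sketch of Bayes' rule; the function name and parameters are illustrative):

```python
def posterior(base_rate: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(condition | positive test), per Bayes' rule."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

p = posterior(base_rate=1 / 1000, sensitivity=0.99, false_positive_rate=0.01)
print(f"P(disease | positive test) = {p:.1%}")  # ≈ 9.0%, not 99%
```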
Bayesian Updating
Instead of treating beliefs as binary (true/false), treat them as probabilities that update with new evidence.
Prior belief (what you believed) → New evidence (what you learned) → Updated belief (what you now believe)
Practical process:
- Start with a prior probability (your initial estimate)
- Evaluate the new evidence (how likely is this evidence if your belief is true? If false?)
- Update your probability accordingly
- Repeat as new evidence arrives
Example:
| Step | Event | Probability It Will Rain |
|---|---|---|
| Prior | Weather forecast says 40% chance | 40% |
| Evidence 1 | You see dark clouds | → 65% |
| Evidence 2 | Barometer is falling | → 80% |
| Evidence 3 | Weather app updates to 20% | → 50% (conflicting evidence) |
Key Principles
- Strong prior + weak evidence = minor update
- Weak prior + strong evidence = major update
- Multiple independent evidence sources compound
- Update incrementally, not all-or-nothing
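Here is a minimal sketch of that process using the odds form of Bayes' rule. The likelihood ratios are illustrative guesses chosen to roughly mirror the rain example above, not measured values:

```python
# Each piece of evidence is summarized by a likelihood ratio:
#   LR = P(evidence | rain) / P(evidence | no rain)
# LR > 1 pushes the probability up; LR < 1 pushes it down.

def update(prob: float, likelihood_ratio: float) -> float:
    """One Bayesian update: convert to odds, apply the LR, convert back."""
    odds = prob / (1 - prob)
    odds *= likelihood_ratio
    return odds / (1 + odds)

p = 0.40  # prior: the forecast's 40%
for evidence, lr in [("dark clouds", 2.5),
                     ("falling barometer", 2.0),
                     ("app revises downward", 0.4)]:
    p = update(p, lr)
    print(f"after {evidence:<22} P(rain) = {p:.0%}")
```

Note how the third, conflicting piece of evidence pulls the probability partway back down rather than resetting it, which is exactly the incremental behavior the principles above describe.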
Common Evidence Pitfalls
The Narrative Fallacy
We're wired to prefer stories over statistics. A single compelling story can override mountains of data.
| Scenario | Story | Data |
|---|---|---|
| "Smoking isn't that bad" | "My grandfather smoked and lived to 95" | Smokers die on average 10 years earlier |
| "Seatbelts are dangerous" | "My friend was saved by NOT wearing one" | Seatbelts reduce fatality risk by ~45% |
| "Vaccines cause harm" | "My child got sick after vaccination" | Vaccines prevent millions of deaths annually |
The story may be true. But it's one data point. Always check whether the story represents the typical case or an extreme outlier.
Denominator Blindness
We focus on the numerator (events that happened) and ignore the denominator (total opportunities for events).
Example: "5 people died from X this year!" Scary? Depends on the denominator:
- 5 out of 10 = catastrophic (50%)
- 5 out of 10,000,000 = trivially rare (0.00005%)
Always ask: "Out of how many?"
The Availability Cascade
Vivid, dramatic evidence gets remembered and shared more, creating the illusion that it's more common.
Why you think shark attacks are common: dramatic attacks dominate the news and get shared.
Reality: you're more likely to be killed by a vending machine.
Practical Evidence Evaluation Framework
When confronted with any claim backed by "evidence," run through this:
| Step | Question | Action |
|---|---|---|
| 1 | What type of evidence is this? | Place it in the hierarchy |
| 2 | What's the sample size? | Small samples = weak evidence |
| 3 | Is there a control group? | No control = can't attribute causation |
| 4 | Who produced this evidence? | Check for conflicts of interest |
| 5 | Has it been replicated? | Single studies are unreliable |
| 6 | What's the effect size? | Significant ≠ important |
| 7 | What's the base rate? | Don't ignore prior probability |
| 8 | Does this fit with other evidence? | Convergent evidence is strongest |
Key Takeaways
- Not all evidence is equal. Learn the hierarchy and use it
- Correlation ≠ causation. Always look for confounds, reverse causation, and coincidence
- P-values are widely misunderstood. Statistical significance doesn't mean practical significance
- Single studies prove nothing. Look for replications and systematic reviews
- Base rates matter enormously. Ignoring them leads to wildly wrong conclusions
- Stories aren't data. Vivid anecdotes can override statistics but shouldn't
- Always ask "out of how many?" The denominator changes everything
- Update incrementally. Bayesian thinking beats binary true/false