Statistical Inference

Introduction

Statistical inference is the process of drawing conclusions about populations from sample data. In computer science, we use inference constantly: A/B testing features, evaluating algorithm performance, detecting anomalies, and validating models. This reading covers estimation, hypothesis testing, and confidence intervals.

Learning Objectives

By the end of this reading, you will be able to:

  • Understand point estimation and properties of estimators
  • Construct and interpret confidence intervals
  • Perform hypothesis tests and understand p-values
  • Apply the Central Limit Theorem
  • Conduct A/B tests correctly
  • Avoid common statistical pitfalls

1. Point Estimation

Estimators

An estimator is a function of sample data used to estimate a population parameter.

Parameter θ (unknown) → Estimator θ̂ (computed from sample)

Examples:

  • Population mean μ → Sample mean X̄ = (1/n)Σxᵢ
  • Population variance σ² → Sample variance S² = (1/(n-1))Σ(xᵢ - X̄)²
  • Population proportion p → Sample proportion p̂ = successes/n

import math

def sample_mean(data):
    """Estimate population mean"""
    return sum(data) / len(data)

def sample_variance(data, ddof=1):
    """
    Estimate population variance

    ddof=1 gives unbiased estimate (Bessel's correction)
    ddof=0 gives MLE estimate
    """
    n = len(data)
    mean = sample_mean(data)
    return sum((x - mean)**2 for x in data) / (n - ddof)

def sample_std(data, ddof=1):
    """Sample standard deviation"""
    return math.sqrt(sample_variance(data, ddof))

Properties of Estimators

Bias: An estimator is unbiased if E[θ̂] = θ

def demonstrate_bias():
    """Show why we divide by (n-1) for sample variance"""
    import random

    true_variance = 1  # True variance of standard normal
    n_samples = 1000
    sample_size = 10

    biased_estimates = []
    unbiased_estimates = []

    for _ in range(n_samples):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]

        # Biased (divide by n)
        biased_estimates.append(sample_variance(sample, ddof=0))

        # Unbiased (divide by n-1)
        unbiased_estimates.append(sample_variance(sample, ddof=1))

    print(f"True variance: {true_variance}")
    print(f"Mean of biased estimates: {sum(biased_estimates)/len(biased_estimates):.4f}")
    print(f"Mean of unbiased estimates: {sum(unbiased_estimates)/len(unbiased_estimates):.4f}")

demonstrate_bias()

Consistency: θ̂ → θ as n → ∞ (converges in probability)

Efficiency: Among unbiased estimators, the one with smallest variance is most efficient.
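
Consistency can be checked empirically: as the sample size grows, the sample mean settles near the true mean. A minimal sketch (the normal distribution, seed, and sample sizes here are arbitrary illustrative choices):

```python
import random

def consistency_demo(true_mu=5.0, sigma=2.0, seed=42):
    """Show the sample mean approaching the true mean as n grows."""
    rng = random.Random(seed)
    errors = {}
    for n in [10, 100, 10000]:
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        errors[n] = abs(sum(sample) / n - true_mu)
    return errors

for n, err in consistency_demo().items():
    print(f"n={n:>6}: |X̄ - μ| = {err:.4f}")
```

The error shrinks roughly like σ/√n, which previews the standard error formula below.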

Maximum Likelihood Estimation (MLE)

MLE finds the parameter values that maximize the likelihood of observing the data.

def mle_bernoulli(data):
    """
    MLE for Bernoulli parameter p

    L(p) = Π p^xᵢ (1-p)^(1-xᵢ)
    log L(p) = Σ xᵢ log(p) + Σ (1-xᵢ) log(1-p)

    Taking derivative and setting to 0:
    p̂ = (1/n) Σ xᵢ = sample proportion
    """
    return sum(data) / len(data)

def mle_normal(data):
    """
    MLE for Normal(μ, σ²)

    μ̂ = sample mean
    σ̂² = (1/n) Σ(xᵢ - x̄)²  (note: biased!)
    """
    n = len(data)
    mu_hat = sum(data) / n
    sigma_sq_hat = sum((x - mu_hat)**2 for x in data) / n
    return mu_hat, sigma_sq_hat

2. The Central Limit Theorem

Statement

For independent samples X₁, ..., Xₙ from a distribution with mean μ and variance σ²:

X̄ = (1/n) Σ Xᵢ

As n → ∞:
(X̄ - μ) / (σ/√n) → N(0, 1)

In words: for large n, the sample mean is approximately normally distributed, regardless of the shape of the original distribution (provided its variance is finite).

Practical Rule

For n ≥ 30, the normal approximation is usually reasonable.

def demonstrate_clt():
    """Visualize Central Limit Theorem"""
    import random

    def sample_means(distribution_sampler, n_samples, sample_size):
        """Generate sample means"""
        return [
            sum(distribution_sampler() for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)
        ]

    # Sample from a uniform distribution (try swapping this in below)
    uniform_sampler = lambda: random.uniform(0, 1)

    # Sample from an exponential distribution with mean 1
    # (1 - random.random() lies in (0, 1], avoiding log(0))
    exp_sampler = lambda: -math.log(1 - random.random())

    # As sample size increases, means become more normal
    for sample_size in [1, 5, 30]:
        means = sample_means(exp_sampler, 10000, sample_size)
        print(f"\nSample size {sample_size}:")
        print(f"  Mean of means: {sum(means)/len(means):.4f} (true: 1.0)")
        print(f"  Std of means: {sample_std(means):.4f} (theory: {1/math.sqrt(sample_size):.4f})")

demonstrate_clt()

Standard Error

The standard error is the standard deviation of an estimator:

SE(X̄) = σ / √n

When σ is unknown, estimate with sample standard deviation:

SE(X̄) ≈ S / √n

def standard_error(data):
    """Compute standard error of the mean"""
    n = len(data)
    s = sample_std(data)
    return s / math.sqrt(n)

3. Confidence Intervals

Concept

A confidence interval provides a range of plausible values for a parameter.

A 95% CI means: If we repeated the sampling many times, 95% of computed intervals would contain the true parameter.

Common misconception: It does NOT mean "95% probability the true value is in this interval."
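
The repeated-sampling interpretation can be demonstrated directly: draw many samples, build an interval from each, and count how often the true mean is covered. A sketch using the known-variance z-interval (trial count, seed, and parameters are illustrative):

```python
import math
import random

def coverage_simulation(n_trials=2000, sample_size=40, mu=10.0, sigma=3.0, seed=1):
    """Fraction of 95% z-intervals that contain the true mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_trials):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        x_bar = sum(sample) / sample_size
        margin = 1.96 * sigma / math.sqrt(sample_size)  # known-sigma interval
        if x_bar - margin <= mu <= x_bar + margin:
            covered += 1
    return covered / n_trials

print(f"Empirical coverage: {coverage_simulation():.3f}")  # close to 0.95
```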

CI for Mean (Known Variance)

CI = X̄ ± z_{α/2} × (σ/√n)

For 95% CI: z_{0.025} = 1.96

CI for Mean (Unknown Variance)

Use t-distribution with n-1 degrees of freedom:

CI = X̄ ± t_{α/2, n-1} × (S/√n)

def t_critical(alpha, df):
    """
    Approximate t critical value
    For df > 30, t ≈ z
    """
    # Simplified approximation (use scipy.stats.t.ppf for accuracy)
    z = {0.10: 1.645, 0.05: 1.96, 0.01: 2.576}

    if df >= 30:
        return z.get(alpha, 1.96)

    # Small sample adjustment (rough approximation)
    correction = 1 + 1/(4*df)
    return z.get(alpha, 1.96) * correction

def confidence_interval_mean(data, confidence=0.95):
    """
    Compute confidence interval for population mean

    Uses t-distribution for unknown variance
    """
    n = len(data)
    x_bar = sample_mean(data)
    se = standard_error(data)

    alpha = 1 - confidence
    t_crit = t_critical(alpha, n - 1)

    margin = t_crit * se

    return (x_bar - margin, x_bar + margin)

# Example
data = [23, 25, 28, 31, 22, 27, 29, 26, 24, 30]
ci = confidence_interval_mean(data, 0.95)
print(f"Sample mean: {sample_mean(data):.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

CI for Proportion

p̂ ± z_{α/2} × √(p̂(1-p̂)/n)

def confidence_interval_proportion(successes, n, confidence=0.95):
    """Confidence interval for proportion"""
    p_hat = successes / n
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]

    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se

    return (max(0, p_hat - margin), min(1, p_hat + margin))

# Example: 60 successes in 100 trials
ci = confidence_interval_proportion(60, 100, 0.95)
print(f"Proportion: 0.60")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
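
The normal (Wald) interval above behaves poorly when p̂ is near 0 or 1 or when n is small. One common alternative is the Wilson score interval; a sketch (z = 1.96 for a 95% interval):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

# A small sample where the Wald interval can misbehave
print(wilson_interval(2, 20))    # stays inside [0, 1] by construction
# The same 60/100 example as above, for comparison
print(wilson_interval(60, 100))
```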

4. Hypothesis Testing

Framework

  1. Null hypothesis H₀: The default/status quo claim
  2. Alternative hypothesis H₁: What we're trying to show
  3. Test statistic: Computed from data
  4. P-value: Probability of observing data at least as extreme as ours, assuming H₀ is true
  5. Decision: Reject H₀ if p-value < α (significance level)

Types of Errors

                     H₀ True             H₀ False
Reject H₀            Type I Error (α)    Correct
Fail to Reject H₀    Correct             Type II Error (β)
  • α = P(Type I) = significance level (typically 0.05)
  • β = P(Type II)
  • Power = 1 - β = P(correctly rejecting false H₀)
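
For a z-test with known σ, power has a closed form: under the alternative, the standardized statistic is shifted by the effect size measured in standard errors. A sketch for a two-sided test at α = 0.05 (the example parameters are illustrative):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function"""
    return (1 + math.erf(x / math.sqrt(2))) / 2

def z_test_power(effect, sigma, n):
    """Power of a two-sided z-test at alpha = 0.05 for a mean shift"""
    shift = effect / (sigma / math.sqrt(n))  # effect in standard-error units
    # Under the alternative, Z ~ N(shift, 1); we reject when |Z| > 1.96
    return (1 - normal_cdf(1.96 - shift)) + normal_cdf(-1.96 - shift)

# Power to detect a 2-unit shift with sigma = 10 and n = 100
print(f"Power: {z_test_power(2, 10, 100):.3f}")
```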

One-Sample t-Test

Testing if population mean equals a specific value.

def one_sample_t_test(data, mu_0, alternative='two-sided'):
    """
    Test H₀: μ = μ₀

    alternative: 'two-sided', 'greater', 'less'
    """
    n = len(data)
    x_bar = sample_mean(data)
    se = standard_error(data)

    # Test statistic
    t_stat = (x_bar - mu_0) / se

    # Degrees of freedom
    df = n - 1

    # P-value calculation (simplified)
    # For production, use scipy.stats.t.sf
    def t_pvalue(t, df, alt):
        # Rough approximation using normal for df > 30
        from math import erf

        def normal_cdf(x):
            return (1 + erf(x / math.sqrt(2))) / 2

        if alt == 'two-sided':
            return 2 * (1 - normal_cdf(abs(t)))
        elif alt == 'greater':
            return 1 - normal_cdf(t)
        else:  # less
            return normal_cdf(t)

    p_value = t_pvalue(t_stat, df, alternative)

    return {
        't_statistic': t_stat,
        'df': df,
        'p_value': p_value,
        'reject_at_0.05': p_value < 0.05
    }

# Example: Test if mean response time = 100ms
response_times = [95, 102, 98, 105, 101, 99, 103, 97, 100, 104]
result = one_sample_t_test(response_times, mu_0=100)
print(f"t-statistic: {result['t_statistic']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Reject H₀: {result['reject_at_0.05']}")

Two-Sample t-Test

Comparing means of two groups.

def two_sample_t_test(data1, data2, alternative='two-sided'):
    """
    Test H₀: μ₁ = μ₂ (independent samples)

    Using Welch's t-test (doesn't assume equal variance)
    """
    n1, n2 = len(data1), len(data2)
    x1_bar, x2_bar = sample_mean(data1), sample_mean(data2)
    s1_sq, s2_sq = sample_variance(data1), sample_variance(data2)

    # Standard error of difference
    se_diff = math.sqrt(s1_sq/n1 + s2_sq/n2)

    # Test statistic
    t_stat = (x1_bar - x2_bar) / se_diff

    # Welch-Satterthwaite degrees of freedom
    df = (s1_sq/n1 + s2_sq/n2)**2 / (
        (s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1)
    )

    return {
        'difference': x1_bar - x2_bar,
        't_statistic': t_stat,
        'df': df,
        'mean1': x1_bar,
        'mean2': x2_bar
    }

# Example: Compare two algorithms
algo_a_times = [102, 98, 105, 101, 99, 103, 97, 100, 104, 96]
algo_b_times = [95, 92, 98, 94, 96, 91, 97, 93, 95, 94]

result = two_sample_t_test(algo_a_times, algo_b_times)
print(f"Algorithm A mean: {result['mean1']:.1f}ms")
print(f"Algorithm B mean: {result['mean2']:.1f}ms")
print(f"Difference: {result['difference']:.1f}ms")
print(f"t-statistic: {result['t_statistic']:.4f}")
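
The function above reports a t statistic but no p-value. For moderate degrees of freedom, a normal approximation gives a rough two-sided p-value (use scipy.stats.t.sf for an exact one):

```python
import math

def approx_two_sided_p(t_stat):
    """Two-sided p-value via the normal approximation to the t distribution"""
    normal_cdf = lambda x: (1 + math.erf(x / math.sqrt(2))) / 2
    return 2 * (1 - normal_cdf(abs(t_stat)))

print(f"p ≈ {approx_two_sided_p(2.5):.4f}")
```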

Chi-Square Test

Test for independence or goodness of fit.

def chi_square_test(observed, expected):
    """
    Chi-square goodness of fit test

    H₀: Data follows expected distribution
    """
    chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1

    return {
        'chi_square': chi_sq,
        'df': df
    }

# Example: Test if die is fair
observed = [18, 16, 14, 20, 17, 15]  # 100 rolls
expected = [100/6] * 6

result = chi_square_test(observed, expected)
print(f"Chi-square: {result['chi_square']:.4f}")
print(f"Degrees of freedom: {result['df']}")
# Compare to chi-square critical value (df=5, α=0.05): 11.07
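
Instead of comparing against a table of critical values, the chi-square p-value can be approximated with the Wilson-Hilferty transformation, a normal approximation to the chi-square distribution (use scipy.stats.chi2.sf in production):

```python
import math

def chi_square_pvalue(chi_sq, df):
    """Approximate upper-tail chi-square p-value (Wilson-Hilferty)"""
    # (X/df)^(1/3) is approximately N(1 - 2/(9 df), 2/(9 df))
    z = ((chi_sq / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    normal_cdf = (1 + math.erf(z / math.sqrt(2))) / 2
    return 1 - normal_cdf

# Fair-die example above: chi-square = 1.40 with df = 5
print(f"p ≈ {chi_square_pvalue(1.40, 5):.3f}")
```

The large p-value here means the data give no evidence against a fair die.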

5. A/B Testing

Framework

A/B testing compares two versions to determine which performs better.

class ABTest:
    """A/B test for conversion rates"""

    def __init__(self, name):
        self.name = name
        self.control = {'visitors': 0, 'conversions': 0}
        self.treatment = {'visitors': 0, 'conversions': 0}

    def record_control(self, converted):
        self.control['visitors'] += 1
        if converted:
            self.control['conversions'] += 1

    def record_treatment(self, converted):
        self.treatment['visitors'] += 1
        if converted:
            self.treatment['conversions'] += 1

    def get_rates(self):
        ctrl_rate = self.control['conversions'] / self.control['visitors']
        treat_rate = self.treatment['conversions'] / self.treatment['visitors']
        return ctrl_rate, treat_rate

    def analyze(self):
        """Analyze A/B test results at the 95% level"""
        n1, c1 = self.control['visitors'], self.control['conversions']
        n2, c2 = self.treatment['visitors'], self.treatment['conversions']

        p1, p2 = c1/n1, c2/n2

        # Pooled proportion under H₀
        p_pool = (c1 + c2) / (n1 + n2)

        # Standard error under H₀
        se_pool = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

        # Z-statistic
        z = (p2 - p1) / se_pool if se_pool > 0 else 0

        # Relative lift
        lift = (p2 - p1) / p1 if p1 > 0 else float('inf')

        # Confidence interval for difference
        se_diff = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
        z_crit = 1.96  # 95% CI
        ci = (p2 - p1 - z_crit*se_diff, p2 - p1 + z_crit*se_diff)

        return {
            'control_rate': p1,
            'treatment_rate': p2,
            'absolute_lift': p2 - p1,
            'relative_lift': lift,
            'z_statistic': z,
            'ci_95': ci,
            'significant': abs(z) > 1.96
        }

# Example
test = ABTest("Checkout Button Color")

# Simulate data
import random
for _ in range(5000):
    test.record_control(random.random() < 0.10)  # 10% baseline
    test.record_treatment(random.random() < 0.12)  # 12% treatment

results = test.analyze()
print(f"Control rate: {results['control_rate']:.1%}")
print(f"Treatment rate: {results['treatment_rate']:.1%}")
print(f"Relative lift: {results['relative_lift']:.1%}")
print(f"95% CI: ({results['ci_95'][0]:.1%}, {results['ci_95'][1]:.1%})")
print(f"Statistically significant: {results['significant']}")

Sample Size Calculation

def sample_size_proportion(p1, mde, alpha=0.05, power=0.80):
    """
    Calculate required sample size for A/B test

    p1: baseline conversion rate
    mde: minimum detectable effect (relative)
    alpha: significance level
    power: statistical power
    """
    p2 = p1 * (1 + mde)  # Expected treatment rate

    # Z-values
    z_alpha = 1.96  # two-sided
    z_beta = 0.84   # 80% power

    # Pooled variance approximation
    p_avg = (p1 + p2) / 2
    var_pooled = 2 * p_avg * (1 - p_avg)

    # Sample size per group
    n = var_pooled * ((z_alpha + z_beta) / (p2 - p1))**2

    return math.ceil(n)

# Example: 5% baseline, want to detect 20% relative lift
n = sample_size_proportion(0.05, 0.20)
print(f"Required sample size per group: {n}")
print(f"Total visitors needed: {2 * n}")
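
A simulation can sanity-check the formula: run many simulated A/B tests at the computed sample size and count how often the pooled z-test rejects. The statistic mirrors the one in ABTest.analyze; the trial count and seed are arbitrary choices:

```python
import math
import random

def simulate_power(p1, p2, n, trials=300, seed=7):
    """Empirical power of the pooled two-proportion z-test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        c1 = sum(rng.random() < p1 for _ in range(n))
        c2 = sum(rng.random() < p2 for _ in range(n))
        p_pool = (c1 + c2) / (2 * n)
        se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
        z = (c2 / n - c1 / n) / se if se > 0 else 0
        if abs(z) > 1.96:
            rejections += 1
    return rejections / trials

# sample_size_proportion(0.05, 0.20) gives about 8150 per group;
# the rejection rate should come out near the target power of 0.80
print(f"Empirical power: {simulate_power(0.05, 0.06, 8150):.2f}")
```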

Multiple Testing Problem

def bonferroni_correction(alpha, num_tests):
    """
    Bonferroni correction for multiple comparisons

    If testing at α level with k tests:
    Adjusted α = α / k
    """
    return alpha / num_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg procedure for controlling FDR

    Returns which hypotheses to reject
    """
    n = len(p_values)
    sorted_idx = sorted(range(n), key=lambda i: p_values[i])

    # Find the LARGEST rank k with p_(k) <= k*alpha/n
    # (stopping at the first failure would be incorrect)
    k_max = 0
    for rank, idx in enumerate(sorted_idx, start=1):
        if p_values[idx] <= rank * alpha / n:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values
    reject = [False] * n
    for rank, idx in enumerate(sorted_idx, start=1):
        if rank <= k_max:
            reject[idx] = True

    return reject

# Example: 10 tests
p_values = [0.01, 0.04, 0.03, 0.06, 0.02, 0.08, 0.05, 0.07, 0.09, 0.04]
print(f"Original α = 0.05: {sum(p < 0.05 for p in p_values)} significant")
print(f"Bonferroni α = {bonferroni_correction(0.05, 10):.4f}")
print(f"Bonferroni significant: {sum(p < 0.005 for p in p_values)}")
print(f"B-H rejections: {benjamini_hochberg(p_values)}")

6. Common Pitfalls

P-value Misconceptions

"""
WRONG interpretations of p = 0.03:
- "There's a 3% chance H₀ is true"
- "There's a 97% chance the effect is real"
- "If we reject H₀, there's only 3% chance we're wrong"

CORRECT interpretation:
- "If H₀ is true, we'd see data this extreme 3% of the time"
"""
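
A fact worth internalizing: when H₀ is true, p-values are uniformly distributed on [0, 1], so "p < 0.05" happens 5% of the time by construction. A quick check with a z-test on null data (seed and parameters are arbitrary):

```python
import math
import random

def null_pvalues(trials=2000, n=50, seed=3):
    """p-values from a z-test when H0 (mu = 0) is actually true."""
    rng = random.Random(seed)
    normal_cdf = lambda x: (1 + math.erf(x / math.sqrt(2))) / 2
    p_values = []
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) / (1 / math.sqrt(n))  # known sigma = 1
        p_values.append(2 * (1 - normal_cdf(abs(z))))
    return p_values

ps = null_pvalues()
print(f"Fraction below 0.05: {sum(p < 0.05 for p in ps) / len(ps):.3f}")  # near 0.05
```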

P-hacking / Data Dredging

def demonstrate_p_hacking():
    """
    Show how testing many things finds 'significant' results by chance
    """
    import random

    def test_correlation(x, y):
        """Simple correlation test"""
        n = len(x)
        mean_x, mean_y = sum(x)/n, sum(y)/n
        cov = sum((xi-mean_x)*(yi-mean_y) for xi, yi in zip(x, y)) / n
        std_x = math.sqrt(sum((xi-mean_x)**2 for xi in x) / n)
        std_y = math.sqrt(sum((yi-mean_y)**2 for yi in y) / n)

        r = cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0

        # A full test would convert r to a t statistic and p-value;
        # here, thresholding |r| below stands in for a significance test
        return r

    # Target variable (random)
    y = [random.gauss(0, 1) for _ in range(100)]

    # Test 100 random "features"
    significant = 0
    for _ in range(100):
        x = [random.gauss(0, 1) for _ in range(100)]
        r = test_correlation(x, y)
        if abs(r) > 0.2:  # Arbitrary "significance" threshold
            significant += 1

    print("Testing 100 random features against a random target:")
    print(f"Found {significant} 'significant' correlations")
    print("These are all spurious!")

demonstrate_p_hacking()

Stopping Rules

def demonstrate_early_stopping_bias():
    """
    Show bias from stopping test when p < 0.05
    """
    import random

    def run_biased_test():
        """Stop as soon as p < 0.05 (WRONG!)"""
        control = []
        treatment = []

        for i in range(1000):
            control.append(random.gauss(100, 10))
            treatment.append(random.gauss(100, 10))  # Same distribution!

            if len(control) >= 20:
                # Calculate p-value
                diff = sum(treatment)/len(treatment) - sum(control)/len(control)
                se = 10 * math.sqrt(2/len(control))
                z = diff / se

                if abs(z) > 1.96:  # "Significant"
                    return True, len(control)

        return False, 1000

    # Run many trials
    false_positives = 0
    for _ in range(100):
        significant, n = run_biased_test()
        if significant:
            false_positives += 1

    print(f"False positive rate with early stopping: {false_positives}%")
    print("(Should be 5% with fixed sample size)")

demonstrate_early_stopping_bias()

Exercises

Basic

  1. A sample of 50 website load times has mean 2.3 seconds and std 0.8 seconds. Construct a 95% CI for the population mean.

  2. An A/B test shows 120/1000 conversions for control and 150/1000 for treatment. Is this statistically significant at α = 0.05?

  3. Explain why "p = 0.03 means there's a 3% chance the null hypothesis is true" is incorrect.

Intermediate

  1. You're testing 20 features for correlation with an outcome. Using Bonferroni correction, what p-value threshold should you use for α = 0.05?

  2. Calculate the sample size needed to detect a 10% relative lift from a 5% baseline with 80% power.

  3. Implement a function that computes the power of a two-sample t-test given effect size, sample size, and α.

Advanced

  1. Implement a Bayesian A/B test using Beta-Binomial model. Compare results to frequentist approach.

  2. Design a sequential testing procedure that controls Type I error while allowing early stopping.

  3. Implement bootstrap confidence intervals and compare to parametric intervals.


Summary

  • Point estimation provides single best guesses for parameters
  • Confidence intervals provide ranges of plausible values
  • Hypothesis testing provides framework for making decisions
  • A/B testing applies inference to compare variants
  • Watch out for multiple testing, early stopping, and p-value misconceptions

Next Reading

Randomized Algorithms →