Statistical Inference

Statistical inference is the process of drawing conclusions about populations from sample data. In computer science, we use inference constantly: A/B testing features, evaluating algorithm performance, detecting anomalies, and validating models. ## 1. Point Estimation

Estimators

An estimator is a function of sample data used to estimate a population parameter.

Parameter θ (unknown) → Estimator θ̂ (computed from sample)

Examples:

  • Population mean μ → Sample mean X̄ = (1/n)Σxᵢ
  • Population variance σ² → Sample variance S² = (1/(n-1))Σ(xᵢ - X̄)²
  • Population proportion p → Sample proportion p̂ = successes/n
import math

def sample_mean(data):
    """Estimate population mean"""
    return sum(data) / len(data)

def sample_variance(data, ddof=1):
    """
    Estimate population variance

    ddof=1 gives unbiased estimate (Bessel's correction)
    ddof=0 gives MLE estimate
    """
    n = len(data)
    mean = sample_mean(data)
    return sum((x - mean)**2 for x in data) / (n - ddof)

def sample_std(data, ddof=1):
    """Sample standard deviation"""
    return math.sqrt(sample_variance(data, ddof))

Properties of Estimators

Bias: An estimator is unbiased if E[θ̂] = θ

def demonstrate_bias():
    """Show why we divide by (n-1) for sample variance"""
    import random

    true_variance = 1  # True variance of standard normal
    n_samples = 1000
    sample_size = 10

    biased_estimates = []
    unbiased_estimates = []

    for _ in range(n_samples):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]

        # Biased (divide by n)
        biased_estimates.append(sample_variance(sample, ddof=0))

        # Unbiased (divide by n-1)
        unbiased_estimates.append(sample_variance(sample, ddof=1))

    print(f"True variance: {true_variance}")
    print(f"Mean of biased estimates: {sum(biased_estimates)/len(biased_estimates):.4f}")
    print(f"Mean of unbiased estimates: {sum(unbiased_estimates)/len(unbiased_estimates):.4f}")

demonstrate_bias()

Consistency: θ̂ → θ as n → ∞ (converges in probability)

Efficiency: Among unbiased estimators, the one with smallest variance is most efficient.

Maximum Likelihood Estimation (MLE)

MLE finds the parameter values that maximize the likelihood of observing the data.

def mle_bernoulli(data):
    """
    MLE for Bernoulli parameter p

    L(p) = Π p^xᵢ (1-p)^(1-xᵢ)
    log L(p) = Σ xᵢ log(p) + Σ (1-xᵢ) log(1-p)

    Taking derivative and setting to 0:
    p̂ = (1/n) Σ xᵢ = sample proportion
    """
    return sum(data) / len(data)

def mle_normal(data):
    """
    MLE for Normal(μ, σ²)

    μ̂ = sample mean
    σ̂² = (1/n) Σ(xᵢ - x̄)²  (note: biased!)
    """
    n = len(data)
    mu_hat = sum(data) / n
    sigma_sq_hat = sum((x - mu_hat)**2 for x in data) / n
    return mu_hat, sigma_sq_hat

2. The Central Limit Theorem

Statement

For independent samples X₁, ..., Xₙ from a distribution with mean μ and variance σ²:

X̄ = (1/n) Σ Xᵢ

As n → ∞:
(X̄ - μ) / (σ/√n) → N(0, 1)

In words: The sample mean is approximately normally distributed, regardless of the original distribution.

Practical Rule

For n ≥ 30, the normal approximation is usually reasonable.

def demonstrate_clt():
    """Visualize Central Limit Theorem"""
    import random

    def sample_means(distribution_sampler, n_samples, sample_size):
        """Generate sample means"""
        return [
            sum(distribution_sampler() for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)
        ]

    # Sample from uniform distribution
    uniform_sampler = lambda: random.uniform(0, 1)

    # Sample from exponential distribution
    exp_sampler = lambda: -math.log(random.random())

    # As sample size increases, means become more normal
    for sample_size in [1, 5, 30]:
        means = sample_means(exp_sampler, 10000, sample_size)
        print(f"\nSample size {sample_size}:")
        print(f"  Mean of means: {sum(means)/len(means):.4f} (true: 1.0)")
        print(f"  Std of means: {sample_std(means):.4f} (theory: {1/math.sqrt(sample_size):.4f})")

demonstrate_clt()

Standard Error

The standard error is the standard deviation of an estimator:

SE(X̄) = σ / √n

When σ is unknown, estimate with sample standard deviation:

SE(X̄) ≈ S / √n
def standard_error(data):
    """Compute standard error of the mean"""
    n = len(data)
    s = sample_std(data)
    return s / math.sqrt(n)

3. Confidence Intervals

Concept

A confidence interval provides a range of plausible values for a parameter.

A 95% CI means: If we repeated the sampling many times, 95% of computed intervals would contain the true parameter.

Common misconception: It does NOT mean "95% probability the true value is in this interval."

CI for Mean (Known Variance)

CI = X̄ ± z_{α/2} × (σ/√n)

For 95% CI: z_{0.025} = 1.96

CI for Mean (Unknown Variance)

Use t-distribution with n-1 degrees of freedom:

CI = X̄ ± t_{α/2, n-1} × (S/√n)
def t_critical(alpha, df):
    """
    Approximate t critical value
    For df > 30, t ≈ z
    """
    # Simplified approximation (use scipy.stats.t.ppf for accuracy)
    z = {0.10: 1.645, 0.05: 1.96, 0.01: 2.576}

    if df >= 30:
        return z.get(alpha, 1.96)

    # Small sample adjustment (rough approximation)
    correction = 1 + 1/(4*df)
    return z.get(alpha, 1.96) * correction

def confidence_interval_mean(data, confidence=0.95):
    """
    Compute confidence interval for population mean

    Uses t-distribution for unknown variance
    """
    n = len(data)
    x_bar = sample_mean(data)
    se = standard_error(data)

    alpha = 1 - confidence
    t_crit = t_critical(alpha, n - 1)

    margin = t_crit * se

    return (x_bar - margin, x_bar + margin)

# Example
data = [23, 25, 28, 31, 22, 27, 29, 26, 24, 30]
ci = confidence_interval_mean(data, 0.95)
print(f"Sample mean: {sample_mean(data):.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

CI for Proportion

p̂ ± z_{α/2} × √(p̂(1-p̂)/n)
def confidence_interval_proportion(successes, n, confidence=0.95):
    """Confidence interval for proportion"""
    p_hat = successes / n
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]

    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se

    return (max(0, p_hat - margin), min(1, p_hat + margin))

# Example: 60 successes in 100 trials
ci = confidence_interval_proportion(60, 100, 0.95)
print(f"Proportion: 0.60")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")

4. Hypothesis Testing

Framework

  1. Null hypothesis H₀: The default/status quo claim
  2. Alternative hypothesis H₁: What we're trying to show
  3. Test statistic: Computed from data
  4. P-value: Probability of observing data as extreme as ours, if H₀ is true
  5. Decision: Reject H₀ if p-value < α (significance level)

Types of Errors

H₀ TrueH₀ False
Reject H₀Type I Error (α)Correct
Fail to Reject H₀CorrectType II Error (β)
  • α = P(Type I) = significance level (typically 0.05)
  • β = P(Type II)
  • Power = 1 - β = P(correctly rejecting false H₀)

One-Sample t-Test

Testing if population mean equals a specific value.

def one_sample_t_test(data, mu_0, alternative='two-sided'):
    """
    Test H₀: μ = μ₀

    alternative: 'two-sided', 'greater', 'less'
    """
    n = len(data)
    x_bar = sample_mean(data)
    se = standard_error(data)

    # Test statistic
    t_stat = (x_bar - mu_0) / se

    # Degrees of freedom
    df = n - 1

    # P-value calculation (simplified)
    # For production, use scipy.stats.t.sf
    def t_pvalue(t, df, alt):
        # Rough approximation using normal for df > 30
        from math import erf

        def normal_cdf(x):
            return (1 + erf(x / math.sqrt(2))) / 2

        if alt == 'two-sided':
            return 2 * (1 - normal_cdf(abs(t)))
        elif alt == 'greater':
            return 1 - normal_cdf(t)
        else:  # less
            return normal_cdf(t)

    p_value = t_pvalue(t_stat, df, alternative)

    return {
        't_statistic': t_stat,
        'df': df,
        'p_value': p_value,
        'reject_at_0.05': p_value < 0.05
    }

# Example: Test if mean response time = 100ms
response_times = [95, 102, 98, 105, 101, 99, 103, 97, 100, 104]
result = one_sample_t_test(response_times, mu_0=100)
print(f"t-statistic: {result['t_statistic']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Reject H₀: {result['reject_at_0.05']}")

Two-Sample t-Test

Comparing means of two groups.

def two_sample_t_test(data1, data2, alternative='two-sided'):
    """
    Test H₀: μ₁ = μ₂ (independent samples)

    Using Welch's t-test (doesn't assume equal variance)
    """
    n1, n2 = len(data1), len(data2)
    x1_bar, x2_bar = sample_mean(data1), sample_mean(data2)
    s1_sq, s2_sq = sample_variance(data1), sample_variance(data2)

    # Standard error of difference
    se_diff = math.sqrt(s1_sq/n1 + s2_sq/n2)

    # Test statistic
    t_stat = (x1_bar - x2_bar) / se_diff

    # Welch-Satterthwaite degrees of freedom
    df = (s1_sq/n1 + s2_sq/n2)**2 / (
        (s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1)
    )

    return {
        'difference': x1_bar - x2_bar,
        't_statistic': t_stat,
        'df': df,
        'mean1': x1_bar,
        'mean2': x2_bar
    }

# Example: Compare two algorithms
algo_a_times = [102, 98, 105, 101, 99, 103, 97, 100, 104, 96]
algo_b_times = [95, 92, 98, 94, 96, 91, 97, 93, 95, 94]

result = two_sample_t_test(algo_a_times, algo_b_times)
print(f"Algorithm A mean: {result['mean1']:.1f}ms")
print(f"Algorithm B mean: {result['mean2']:.1f}ms")
print(f"Difference: {result['difference']:.1f}ms")
print(f"t-statistic: {result['t_statistic']:.4f}")

Chi-Square Test

Test for independence or goodness of fit.

def chi_square_test(observed, expected):
    """
    Chi-square goodness of fit test

    H₀: Data follows expected distribution
    """
    chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1

    return {
        'chi_square': chi_sq,
        'df': df
    }

# Example: Test if die is fair
observed = [18, 16, 14, 20, 17, 15]  # 100 rolls
expected = [100/6] * 6

result = chi_square_test(observed, expected)
print(f"Chi-square: {result['chi_square']:.4f}")
print(f"Degrees of freedom: {result['df']}")
# Compare to chi-square critical value (df=5, α=0.05): 11.07

5. A/B Testing

Framework

A/B testing compares two versions to determine which performs better.

class ABTest:
    """A/B test for conversion rates"""

    def __init__(self, name):
        self.name = name
        self.control = {'visitors': 0, 'conversions': 0}
        self.treatment = {'visitors': 0, 'conversions': 0}

    def record_control(self, converted):
        self.control['visitors'] += 1
        if converted:
            self.control['conversions'] += 1

    def record_treatment(self, converted):
        self.treatment['visitors'] += 1
        if converted:
            self.treatment['conversions'] += 1

    def get_rates(self):
        ctrl_rate = self.control['conversions'] / self.control['visitors']
        treat_rate = self.treatment['conversions'] / self.treatment['visitors']
        return ctrl_rate, treat_rate

    def analyze(self, confidence=0.95):
        """Analyze A/B test results"""
        n1, c1 = self.control['visitors'], self.control['conversions']
        n2, c2 = self.treatment['visitors'], self.treatment['conversions']

        p1, p2 = c1/n1, c2/n2

        # Pooled proportion under H₀
        p_pool = (c1 + c2) / (n1 + n2)

        # Standard error under H₀
        se_pool = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

        # Z-statistic
        z = (p2 - p1) / se_pool if se_pool > 0 else 0

        # Relative lift
        lift = (p2 - p1) / p1 if p1 > 0 else float('inf')

        # Confidence interval for difference
        se_diff = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
        z_crit = 1.96  # 95% CI
        ci = (p2 - p1 - z_crit*se_diff, p2 - p1 + z_crit*se_diff)

        return {
            'control_rate': p1,
            'treatment_rate': p2,
            'absolute_lift': p2 - p1,
            'relative_lift': lift,
            'z_statistic': z,
            'ci_95': ci,
            'significant': abs(z) > 1.96
        }

# Example
test = ABTest("Checkout Button Color")

# Simulate data
import random
for _ in range(5000):
    test.record_control(random.random() < 0.10)  # 10% baseline
    test.record_treatment(random.random() < 0.12)  # 12% treatment

results = test.analyze()
print(f"Control rate: {results['control_rate']:.1%}")
print(f"Treatment rate: {results['treatment_rate']:.1%}")
print(f"Relative lift: {results['relative_lift']:.1%}")
print(f"95% CI: ({results['ci_95'][0]:.1%}, {results['ci_95'][1]:.1%})")
print(f"Statistically significant: {results['significant']}")

Sample Size Calculation

def sample_size_proportion(p1, mde, alpha=0.05, power=0.80):
    """
    Calculate required sample size for A/B test

    p1: baseline conversion rate
    mde: minimum detectable effect (relative)
    alpha: significance level
    power: statistical power
    """
    p2 = p1 * (1 + mde)  # Expected treatment rate

    # Z-values
    z_alpha = 1.96  # two-sided
    z_beta = 0.84   # 80% power

    # Pooled variance approximation
    p_avg = (p1 + p2) / 2
    var_pooled = 2 * p_avg * (1 - p_avg)

    # Sample size per group
    n = var_pooled * ((z_alpha + z_beta) / (p2 - p1))**2

    return math.ceil(n)

# Example: 5% baseline, want to detect 20% relative lift
n = sample_size_proportion(0.05, 0.20)
print(f"Required sample size per group: {n}")
print(f"Total visitors needed: {2 * n}")

Multiple Testing Problem

def bonferroni_correction(alpha, num_tests):
    """
    Bonferroni correction for multiple comparisons

    If testing at α level with k tests:
    Adjusted α = α / k
    """
    return alpha / num_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg procedure for controlling FDR

    Returns which hypotheses to reject
    """
    n = len(p_values)
    sorted_idx = sorted(range(n), key=lambda i: p_values[i])

    reject = [False] * n
    for i, idx in enumerate(sorted_idx):
        threshold = (i + 1) * alpha / n
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break

    return reject

# Example: 10 tests
p_values = [0.01, 0.04, 0.03, 0.06, 0.02, 0.08, 0.05, 0.07, 0.09, 0.04]
print(f"Original α = 0.05: {sum(p < 0.05 for p in p_values)} significant")
print(f"Bonferroni α = {bonferroni_correction(0.05, 10):.4f}")
print(f"Bonferroni significant: {sum(p < 0.005 for p in p_values)}")
print(f"B-H rejections: {benjamini_hochberg(p_values)}")

6. Common Pitfalls

P-value Misconceptions

"""
WRONG interpretations of p = 0.03:
- "There's a 3% chance H₀ is true"
- "There's a 97% chance the effect is real"
- "If we reject H₀, there's only 3% chance we're wrong"

CORRECT interpretation:
- "If H₀ is true, we'd see data this extreme 3% of the time"
"""

P-hacking / Data Dredging

def demonstrate_p_hacking():
    """
    Show how testing many things finds 'significant' results by chance
    """
    import random

    def test_correlation(x, y):
        """Simple correlation test"""
        n = len(x)
        mean_x, mean_y = sum(x)/n, sum(y)/n
        cov = sum((xi-mean_x)*(yi-mean_y) for xi, yi in zip(x, y)) / n
        std_x = math.sqrt(sum((xi-mean_x)**2 for xi in x) / n)
        std_y = math.sqrt(sum((yi-mean_y)**2 for yi in y) / n)

        r = cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0

        # Approximate p-value (simplified)
        t = r * math.sqrt(n-2) / math.sqrt(1-r**2) if abs(r) < 1 else 0
        return abs(r), r

    # Target variable (random)
    y = [random.gauss(0, 1) for _ in range(100)]

    # Test 100 random "features"
    significant = 0
    for i in range(100):
        x = [random.gauss(0, 1) for _ in range(100)]
        r, _ = test_correlation(x, y)
        if r > 0.2:  # Arbitrary "significance"
            significant += 1

    print(f"Testing 100 random features against random target:")
    print(f"Found {significant} 'significant' correlations")
    print("These are all spurious!")

demonstrate_p_hacking()

Stopping Rules

def demonstrate_early_stopping_bias():
    """
    Show bias from stopping test when p < 0.05
    """
    import random

    def run_biased_test():
        """Stop as soon as p < 0.05 (WRONG!)"""
        control = []
        treatment = []

        for i in range(1000):
            control.append(random.gauss(100, 10))
            treatment.append(random.gauss(100, 10))  # Same distribution!

            if len(control) >= 20:
                # Calculate p-value
                diff = sum(treatment)/len(treatment) - sum(control)/len(control)
                se = 10 * math.sqrt(2/len(control))
                z = diff / se

                if abs(z) > 1.96:  # "Significant"
                    return True, len(control)

        return False, 1000

    # Run many trials
    false_positives = 0
    for _ in range(100):
        significant, n = run_biased_test()
        if significant:
            false_positives += 1

    print(f"False positive rate with early stopping: {false_positives}%")
    print("(Should be 5% with fixed sample size)")

demonstrate_early_stopping_bias()

Exercises

Basic

  1. A sample of 50 website load times has mean 2.3 seconds and std 0.8 seconds. Construct a 95% CI for the population mean.

  2. An A/B test shows 120/1000 conversions for control and 150/1000 for treatment. Is this statistically significant at α = 0.05?

  3. Explain why "p = 0.03 means there's a 3% chance the null hypothesis is true" is incorrect.

Intermediate

  1. You're testing 20 features for correlation with an outcome. Using Bonferroni correction, what p-value threshold should you use for α = 0.05?

  2. Calculate the sample size needed to detect a 10% relative lift from a 5% baseline with 80% power.

  3. Implement a function that computes the power of a two-sample t-test given effect size, sample size, and α.

Advanced

  1. Implement a Bayesian A/B test using Beta-Binomial model. Compare results to frequentist approach.

  2. Design a sequential testing procedure that controls Type I error while allowing early stopping.

  3. Implement bootstrap confidence intervals and compare to parametric intervals.


Summary

  • Point estimation provides single best guesses for parameters
  • Confidence intervals provide ranges of plausible values
  • Hypothesis testing provides framework for making decisions
  • A/B testing applies inference to compare variants
  • Watch out for multiple testing, early stopping, and p-value misconceptions

Next Reading

Randomized Algorithms →