Statistical Inference
Introduction
Statistical inference is the process of drawing conclusions about populations from sample data. In computer science, we use inference constantly: A/B testing features, evaluating algorithm performance, detecting anomalies, and validating models. This reading covers estimation, hypothesis testing, and confidence intervals.
Learning Objectives
By the end of this reading, you will be able to:
- Understand point estimation and properties of estimators
- Construct and interpret confidence intervals
- Perform hypothesis tests and understand p-values
- Apply the Central Limit Theorem
- Conduct A/B tests correctly
- Avoid common statistical pitfalls
1. Point Estimation
Estimators
An estimator is a function of sample data used to estimate a population parameter.
Parameter θ (unknown) → Estimator θ̂ (computed from sample)
Examples:
- Population mean μ → Sample mean X̄ = (1/n)Σxᵢ
- Population variance σ² → Sample variance S² = (1/(n-1))Σ(xᵢ - X̄)²
- Population proportion p → Sample proportion p̂ = successes/n
import math

def sample_mean(data):
    """Estimate population mean"""
    return sum(data) / len(data)

def sample_variance(data, ddof=1):
    """
    Estimate population variance
    ddof=1 gives unbiased estimate (Bessel's correction)
    ddof=0 gives MLE estimate
    """
    n = len(data)
    mean = sample_mean(data)
    return sum((x - mean)**2 for x in data) / (n - ddof)

def sample_std(data, ddof=1):
    """Sample standard deviation"""
    return math.sqrt(sample_variance(data, ddof))
Properties of Estimators
Bias: An estimator is unbiased if E[θ̂] = θ
def demonstrate_bias():
    """Show why we divide by (n-1) for sample variance"""
    import random
    true_variance = 1  # True variance of standard normal
    n_samples = 1000
    sample_size = 10
    biased_estimates = []
    unbiased_estimates = []
    for _ in range(n_samples):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]
        # Biased (divide by n)
        biased_estimates.append(sample_variance(sample, ddof=0))
        # Unbiased (divide by n-1)
        unbiased_estimates.append(sample_variance(sample, ddof=1))
    print(f"True variance: {true_variance}")
    print(f"Mean of biased estimates: {sum(biased_estimates)/len(biased_estimates):.4f}")
    print(f"Mean of unbiased estimates: {sum(unbiased_estimates)/len(unbiased_estimates):.4f}")

demonstrate_bias()
Consistency: θ̂ → θ as n → ∞ (converges in probability)
Efficiency: Among unbiased estimators, the one with smallest variance is most efficient.
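Efficiency can be checked empirically. For normal data, the sample mean and the sample median are both unbiased estimators of μ, but the mean has smaller variance (asymptotically, the median's variance is roughly π/2 times larger). The simulation below is an illustrative sketch; the helper name `compare_estimator_variance` is ours, not from a library:

```python
import random
import statistics

def compare_estimator_variance(n_trials=5000, sample_size=25):
    """Compare the spread of the sample mean vs. the sample median
    as estimators of the mean of a Normal(0, 1) population."""
    means, medians = [], []
    for _ in range(n_trials):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
        medians.append(statistics.median(sample))
    # Empirical variance of each estimator across repeated samples
    return statistics.pvariance(means), statistics.pvariance(medians)

var_mean, var_median = compare_estimator_variance()
print(f"Variance of sample mean:   {var_mean:.4f}")
print(f"Variance of sample median: {var_median:.4f}")  # larger: the mean is more efficient
```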
Maximum Likelihood Estimation (MLE)
MLE finds the parameter values that maximize the likelihood of observing the data.
def mle_bernoulli(data):
    """
    MLE for Bernoulli parameter p
    L(p) = Π p^xᵢ (1-p)^(1-xᵢ)
    log L(p) = Σ xᵢ log(p) + Σ (1-xᵢ) log(1-p)
    Taking the derivative and setting it to 0:
    p̂ = (1/n) Σ xᵢ = sample proportion
    """
    return sum(data) / len(data)

def mle_normal(data):
    """
    MLE for Normal(μ, σ²)
    μ̂ = sample mean
    σ̂² = (1/n) Σ(xᵢ - x̄)²  (note: biased!)
    """
    n = len(data)
    mu_hat = sum(data) / n
    sigma_sq_hat = sum((x - mu_hat)**2 for x in data) / n
    return mu_hat, sigma_sq_hat
2. The Central Limit Theorem
Statement
For independent samples X₁, ..., Xₙ from a distribution with mean μ and variance σ²:
X̄ = (1/n) Σ Xᵢ
As n → ∞:
(X̄ - μ) / (σ/√n) → N(0, 1)
In words: The sample mean is approximately normally distributed, regardless of the original distribution.
Practical Rule
For n ≥ 30, the normal approximation is usually reasonable.
def demonstrate_clt():
    """Visualize the Central Limit Theorem"""
    import random

    def sample_means(distribution_sampler, n_samples, sample_size):
        """Generate sample means"""
        return [
            sum(distribution_sampler() for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)
        ]

    # Sample from uniform distribution
    uniform_sampler = lambda: random.uniform(0, 1)
    # Sample from exponential distribution (rate 1, so mean and std are both 1)
    exp_sampler = lambda: -math.log(random.random())

    # As sample size increases, the means become more normal
    for sample_size in [1, 5, 30]:
        means = sample_means(exp_sampler, 10000, sample_size)
        print(f"\nSample size {sample_size}:")
        print(f"  Mean of means: {sum(means)/len(means):.4f} (true: 1.0)")
        print(f"  Std of means: {sample_std(means):.4f} (theory: {1/math.sqrt(sample_size):.4f})")

demonstrate_clt()
Standard Error
The standard error is the standard deviation of an estimator:
SE(X̄) = σ / √n
When σ is unknown, estimate with sample standard deviation:
SE(X̄) ≈ S / √n
def standard_error(data):
    """Compute the standard error of the mean"""
    n = len(data)
    s = sample_std(data)
    return s / math.sqrt(n)
3. Confidence Intervals
Concept
A confidence interval provides a range of plausible values for a parameter.
A 95% CI means: If we repeated the sampling many times, 95% of computed intervals would contain the true parameter.
Common misconception: It does NOT mean "95% probability the true value is in this interval."
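The repeated-sampling interpretation can be verified by simulation. The sketch below (the helper name `coverage_simulation` is ours) builds a simple z-interval from each of many samples and counts how often the interval covers the true mean:

```python
import math
import random

def coverage_simulation(n_intervals=2000, sample_size=40, true_mean=50.0):
    """Repeatedly sample, build a 95% z-interval from each sample,
    and count how often the interval contains the true mean."""
    covered = 0
    for _ in range(n_intervals):
        sample = [random.gauss(true_mean, 5.0) for _ in range(sample_size)]
        x_bar = sum(sample) / sample_size
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (sample_size - 1))
        margin = 1.96 * s / math.sqrt(sample_size)
        if x_bar - margin <= true_mean <= x_bar + margin:
            covered += 1
    return covered / n_intervals

print(f"Empirical coverage: {coverage_simulation():.3f}")  # close to 0.95
```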
CI for Mean (Known Variance)
CI = X̄ ± z_{α/2} × (σ/√n)
For 95% CI: z_{0.025} = 1.96
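A minimal sketch of this formula; the load-time data and the known σ = 0.8 below are made up for illustration:

```python
import math

def confidence_interval_known_sigma(data, sigma, z=1.96):
    """95% CI for the mean when the population sigma is known."""
    n = len(data)
    x_bar = sum(data) / n
    margin = z * sigma / math.sqrt(n)
    return (x_bar - margin, x_bar + margin)

# Hypothetical load times (seconds) with sigma known to be 0.8
lo, hi = confidence_interval_known_sigma([2.1, 2.4, 2.2, 2.5, 2.3], sigma=0.8)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```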
CI for Mean (Unknown Variance)
Use t-distribution with n-1 degrees of freedom:
CI = X̄ ± t_{α/2, n-1} × (S/√n)
def t_critical(alpha, df):
    """
    Approximate two-sided t critical value
    For df > 30, t ≈ z
    """
    # Simplified approximation (use scipy.stats.t.ppf for accuracy)
    z = {0.10: 1.645, 0.05: 1.96, 0.01: 2.576}
    if df >= 30:
        return z.get(alpha, 1.96)
    # Small-sample adjustment (rough approximation)
    correction = 1 + 1/(4*df)
    return z.get(alpha, 1.96) * correction

def confidence_interval_mean(data, confidence=0.95):
    """
    Compute a confidence interval for the population mean
    Uses the t-distribution for unknown variance
    """
    n = len(data)
    x_bar = sample_mean(data)
    se = standard_error(data)
    alpha = 1 - confidence
    t_crit = t_critical(alpha, n - 1)
    margin = t_crit * se
    return (x_bar - margin, x_bar + margin)

# Example
data = [23, 25, 28, 31, 22, 27, 29, 26, 24, 30]
ci = confidence_interval_mean(data, 0.95)
print(f"Sample mean: {sample_mean(data):.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
CI for Proportion
p̂ ± z_{α/2} × √(p̂(1-p̂)/n)
def confidence_interval_proportion(successes, n, confidence=0.95):
    """Confidence interval for a proportion (normal approximation)"""
    p_hat = successes / n
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return (max(0, p_hat - margin), min(1, p_hat + margin))

# Example: 60 successes in 100 trials
ci = confidence_interval_proportion(60, 100, 0.95)
print(f"Proportion: 0.60")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
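The interval above is the normal-approximation (Wald) interval, which misbehaves for small n or proportions near 0 or 1. A commonly preferred alternative is the Wilson score interval; here is a sketch:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; better behaved than
    the Wald interval for small n or p near 0/1."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

print(wilson_interval(60, 100))  # compare to the Wald interval above
```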
4. Hypothesis Testing
Framework
- Null hypothesis H₀: The default/status quo claim
- Alternative hypothesis H₁: What we're trying to show
- Test statistic: Computed from data
- P-value: Probability of observing data as extreme as ours, if H₀ is true
- Decision: Reject H₀ if p-value < α (significance level)
Types of Errors
| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct |
| Fail to Reject H₀ | Correct | Type II Error (β) |
- α = P(Type I) = significance level (typically 0.05)
- β = P(Type II)
- Power = 1 - β = P(correctly rejecting false H₀)
One-Sample t-Test
Testing if population mean equals a specific value.
def one_sample_t_test(data, mu_0, alternative='two-sided'):
    """
    Test H₀: μ = μ₀
    alternative: 'two-sided', 'greater', 'less'
    """
    n = len(data)
    x_bar = sample_mean(data)
    se = standard_error(data)
    # Test statistic
    t_stat = (x_bar - mu_0) / se
    # Degrees of freedom
    df = n - 1

    # P-value calculation (simplified)
    # For production, use scipy.stats.t.sf
    def t_pvalue(t, df, alt):
        # Rough approximation using the normal CDF (good for df > 30)
        from math import erf
        def normal_cdf(x):
            return (1 + erf(x / math.sqrt(2))) / 2
        if alt == 'two-sided':
            return 2 * (1 - normal_cdf(abs(t)))
        elif alt == 'greater':
            return 1 - normal_cdf(t)
        else:  # less
            return normal_cdf(t)

    p_value = t_pvalue(t_stat, df, alternative)
    return {
        't_statistic': t_stat,
        'df': df,
        'p_value': p_value,
        'reject_at_0.05': p_value < 0.05
    }

# Example: Test if mean response time = 100ms
response_times = [95, 102, 98, 105, 101, 99, 103, 97, 100, 104]
result = one_sample_t_test(response_times, mu_0=100)
print(f"t-statistic: {result['t_statistic']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Reject H₀: {result['reject_at_0.05']}")
Two-Sample t-Test
Comparing means of two groups.
def two_sample_t_test(data1, data2):
    """
    Test H₀: μ₁ = μ₂ (independent samples)
    Uses Welch's t-test (doesn't assume equal variances)
    """
    n1, n2 = len(data1), len(data2)
    x1_bar, x2_bar = sample_mean(data1), sample_mean(data2)
    s1_sq, s2_sq = sample_variance(data1), sample_variance(data2)
    # Standard error of the difference
    se_diff = math.sqrt(s1_sq/n1 + s2_sq/n2)
    # Test statistic
    t_stat = (x1_bar - x2_bar) / se_diff
    # Welch-Satterthwaite degrees of freedom
    df = (s1_sq/n1 + s2_sq/n2)**2 / (
        (s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1)
    )
    return {
        'difference': x1_bar - x2_bar,
        't_statistic': t_stat,
        'df': df,
        'mean1': x1_bar,
        'mean2': x2_bar
    }

# Example: Compare two algorithms
algo_a_times = [102, 98, 105, 101, 99, 103, 97, 100, 104, 96]
algo_b_times = [95, 92, 98, 94, 96, 91, 97, 93, 95, 94]
result = two_sample_t_test(algo_a_times, algo_b_times)
print(f"Algorithm A mean: {result['mean1']:.1f}ms")
print(f"Algorithm B mean: {result['mean2']:.1f}ms")
print(f"Difference: {result['difference']:.1f}ms")
print(f"t-statistic: {result['t_statistic']:.4f}")
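The result above reports only the t-statistic. The same normal-CDF shortcut used in the one-sample test converts it to an approximate p-value when df is moderately large; this is a rough sketch, and for exact t-tail areas you would use scipy.stats.t.sf:

```python
import math

def approx_two_sided_p(t_stat):
    """Approximate two-sided p-value via the normal CDF; reasonable
    when the degrees of freedom are moderately large (df > ~30)."""
    normal_cdf = lambda x: (1 + math.erf(x / math.sqrt(2))) / 2
    return 2 * (1 - normal_cdf(abs(t_stat)))

print(f"{approx_two_sided_p(1.96):.4f}")  # ~0.05
```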
Chi-Square Test
Test for independence or goodness of fit.
def chi_square_test(observed, expected):
    """
    Chi-square goodness-of-fit test
    H₀: Data follows the expected distribution
    """
    chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return {
        'chi_square': chi_sq,
        'df': df
    }

# Example: Test if a die is fair
observed = [18, 16, 14, 20, 17, 15]  # 100 rolls
expected = [100/6] * 6
result = chi_square_test(observed, expected)
print(f"Chi-square: {result['chi_square']:.4f}")
print(f"Degrees of freedom: {result['df']}")
# Compare to the chi-square critical value (df=5, α=0.05): 11.07
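To turn the chi-square statistic into an approximate p-value without SciPy, the Wilson-Hilferty cube-root normal approximation works well for moderate df. This is a sketch (the helper name is ours; use scipy.stats.chi2.sf for exact values):

```python
import math

def chi_square_pvalue(chi_sq, df):
    """Approximate upper-tail p-value for a chi-square statistic
    using the Wilson-Hilferty normal approximation."""
    z = ((chi_sq / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    normal_cdf = (1 + math.erf(z / math.sqrt(2))) / 2
    return 1 - normal_cdf

# Die example: chi-square = 1.40 on 5 df is nowhere near significant
print(f"p ≈ {chi_square_pvalue(1.40, 5):.3f}")
```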
5. A/B Testing
Framework
A/B testing compares two versions to determine which performs better.
class ABTest:
    """A/B test for conversion rates"""

    def __init__(self, name):
        self.name = name
        self.control = {'visitors': 0, 'conversions': 0}
        self.treatment = {'visitors': 0, 'conversions': 0}

    def record_control(self, converted):
        self.control['visitors'] += 1
        if converted:
            self.control['conversions'] += 1

    def record_treatment(self, converted):
        self.treatment['visitors'] += 1
        if converted:
            self.treatment['conversions'] += 1

    def get_rates(self):
        ctrl_rate = self.control['conversions'] / self.control['visitors']
        treat_rate = self.treatment['conversions'] / self.treatment['visitors']
        return ctrl_rate, treat_rate

    def analyze(self, confidence=0.95):
        """Analyze A/B test results"""
        n1, c1 = self.control['visitors'], self.control['conversions']
        n2, c2 = self.treatment['visitors'], self.treatment['conversions']
        p1, p2 = c1/n1, c2/n2
        # Pooled proportion under H₀
        p_pool = (c1 + c2) / (n1 + n2)
        # Standard error under H₀
        se_pool = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
        # Z-statistic
        z = (p2 - p1) / se_pool if se_pool > 0 else 0
        # Relative lift
        lift = (p2 - p1) / p1 if p1 > 0 else float('inf')
        # Confidence interval for the difference
        se_diff = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
        z_crit = 1.96  # 95% CI
        ci = (p2 - p1 - z_crit*se_diff, p2 - p1 + z_crit*se_diff)
        return {
            'control_rate': p1,
            'treatment_rate': p2,
            'absolute_lift': p2 - p1,
            'relative_lift': lift,
            'z_statistic': z,
            'ci_95': ci,
            'significant': abs(z) > 1.96
        }

# Example
test = ABTest("Checkout Button Color")

# Simulate data
import random
for _ in range(5000):
    test.record_control(random.random() < 0.10)    # 10% baseline
    test.record_treatment(random.random() < 0.12)  # 12% treatment

results = test.analyze()
print(f"Control rate: {results['control_rate']:.1%}")
print(f"Treatment rate: {results['treatment_rate']:.1%}")
print(f"Relative lift: {results['relative_lift']:.1%}")
print(f"95% CI: ({results['ci_95'][0]:.1%}, {results['ci_95'][1]:.1%})")
print(f"Statistically significant: {results['significant']}")
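Since analyze() runs a z-test, the reported z-statistic converts to an exact two-sided p-value with the normal CDF. A small helper sketch:

```python
import math

def z_to_pvalue(z):
    """Two-sided p-value for a z-statistic.
    erfc(|z|/sqrt(2)) equals 2*(1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2))

print(f"{z_to_pvalue(1.96):.4f}")  # ~0.05
```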
Sample Size Calculation
def sample_size_proportion(p1, mde, alpha=0.05, power=0.80):
    """
    Calculate the required sample size per group for an A/B test
    p1: baseline conversion rate
    mde: minimum detectable effect (relative)
    alpha: significance level
    power: statistical power
    """
    p2 = p1 * (1 + mde)  # Expected treatment rate
    # Z-values (hardcoded for the defaults: alpha=0.05 two-sided, power=0.80)
    z_alpha = 1.96
    z_beta = 0.84
    # Pooled variance approximation
    p_avg = (p1 + p2) / 2
    var_pooled = 2 * p_avg * (1 - p_avg)
    # Sample size per group
    n = var_pooled * ((z_alpha + z_beta) / (p2 - p1))**2
    return math.ceil(n)

# Example: 5% baseline, want to detect a 20% relative lift
n = sample_size_proportion(0.05, 0.20)
print(f"Required sample size per group: {n}")
print(f"Total visitors needed: {2 * n}")
Multiple Testing Problem
def bonferroni_correction(alpha, num_tests):
    """
    Bonferroni correction for multiple comparisons
    If testing at level α with k tests:
    Adjusted α = α / k
    """
    return alpha / num_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg procedure for controlling the FDR
    Returns which hypotheses to reject
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Step-up: find the LARGEST rank k with p_(k) <= k*alpha/n
    # (don't stop at the first failure - that would be too conservative)
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / n:
            k_max = rank
    # Reject every hypothesis with rank <= k_max
    reject = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject

# Example: 10 tests
p_values = [0.01, 0.04, 0.03, 0.06, 0.02, 0.08, 0.05, 0.07, 0.09, 0.04]
print(f"Original α = 0.05: {sum(p < 0.05 for p in p_values)} significant")
print(f"Bonferroni α = {bonferroni_correction(0.05, 10):.4f}")
print(f"Bonferroni significant: {sum(p < 0.005 for p in p_values)}")
print(f"B-H rejections: {benjamini_hochberg(p_values)}")
6. Common Pitfalls
P-value Misconceptions
Wrong interpretations of p = 0.03:
- "There's a 3% chance H₀ is true"
- "There's a 97% chance the effect is real"
- "If we reject H₀, there's only a 3% chance we're wrong"

Correct interpretation:
- "If H₀ is true, we'd see data at least this extreme 3% of the time"
P-hacking / Data Dredging
def demonstrate_p_hacking():
    """
    Show how testing many things finds 'significant' results by chance
    """
    import random

    def correlation(x, y):
        """Pearson correlation coefficient"""
        n = len(x)
        mean_x, mean_y = sum(x)/n, sum(y)/n
        cov = sum((xi-mean_x)*(yi-mean_y) for xi, yi in zip(x, y)) / n
        std_x = math.sqrt(sum((xi-mean_x)**2 for xi in x) / n)
        std_y = math.sqrt(sum((yi-mean_y)**2 for yi in y) / n)
        return cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0

    # Target variable (pure noise)
    y = [random.gauss(0, 1) for _ in range(100)]
    # Test 100 random "features" against it
    significant = 0
    for _ in range(100):
        x = [random.gauss(0, 1) for _ in range(100)]
        r = correlation(x, y)
        # For n = 100, |r| > 0.2 corresponds roughly to p < 0.05
        if abs(r) > 0.2:
            significant += 1
    print("Testing 100 random features against a random target:")
    print(f"Found {significant} 'significant' correlations")
    print("These are all spurious!")

demonstrate_p_hacking()
Stopping Rules
def demonstrate_early_stopping_bias():
    """
    Show the bias from stopping a test as soon as p < 0.05
    """
    import random

    def run_biased_test():
        """Stop as soon as p < 0.05 (WRONG!)"""
        control = []
        treatment = []
        for i in range(1000):
            control.append(random.gauss(100, 10))
            treatment.append(random.gauss(100, 10))  # Same distribution!
            if len(control) >= 20:
                # Peek at the p-value after every new observation
                diff = sum(treatment)/len(treatment) - sum(control)/len(control)
                se = 10 * math.sqrt(2/len(control))
                z = diff / se
                if abs(z) > 1.96:  # "Significant"
                    return True, len(control)
        return False, 1000

    # Run many trials
    false_positives = 0
    for _ in range(100):
        significant, n = run_biased_test()
        if significant:
            false_positives += 1
    print(f"False positive rate with early stopping: {false_positives}%")
    print("(Should be 5% with a fixed sample size)")

demonstrate_early_stopping_bias()
Exercises
Basic
1. A sample of 50 website load times has mean 2.3 seconds and std 0.8 seconds. Construct a 95% CI for the population mean.
2. An A/B test shows 120/1000 conversions for control and 150/1000 for treatment. Is this statistically significant at α = 0.05?
3. Explain why "p = 0.03 means there's a 3% chance the null hypothesis is true" is incorrect.
Intermediate
1. You're testing 20 features for correlation with an outcome. Using the Bonferroni correction, what p-value threshold should you use for α = 0.05?
2. Calculate the sample size needed to detect a 10% relative lift from a 5% baseline with 80% power.
3. Implement a function that computes the power of a two-sample t-test given effect size, sample size, and α.
Advanced
1. Implement a Bayesian A/B test using a Beta-Binomial model. Compare the results to the frequentist approach.
2. Design a sequential testing procedure that controls Type I error while allowing early stopping.
3. Implement bootstrap confidence intervals and compare them to parametric intervals.
Summary
- Point estimation provides single best guesses for parameters
- Confidence intervals provide ranges of plausible values
- Hypothesis testing provides framework for making decisions
- A/B testing applies inference to compare variants
- Watch out for multiple testing, early stopping, and p-value misconceptions