Statistical Inference
Statistical inference is the process of drawing conclusions about populations from sample data. In computer science, we use inference constantly: A/B testing features, evaluating algorithm performance, detecting anomalies, and validating models. ## 1. Point Estimation
Estimators
An estimator is a function of sample data used to estimate a population parameter.
Parameter θ (unknown) → Estimator θ̂ (computed from sample)
Examples:
- Population mean μ → Sample mean X̄ = (1/n)Σxᵢ
- Population variance σ² → Sample variance S² = (1/(n-1))Σ(xᵢ - X̄)²
- Population proportion p → Sample proportion p̂ = successes/n
import math
def sample_mean(data):
"""Estimate population mean"""
return sum(data) / len(data)
def sample_variance(data, ddof=1):
"""
Estimate population variance
ddof=1 gives unbiased estimate (Bessel's correction)
ddof=0 gives MLE estimate
"""
n = len(data)
mean = sample_mean(data)
return sum((x - mean)**2 for x in data) / (n - ddof)
def sample_std(data, ddof=1):
"""Sample standard deviation"""
return math.sqrt(sample_variance(data, ddof))
Properties of Estimators
Bias: An estimator is unbiased if E[θ̂] = θ
def demonstrate_bias():
"""Show why we divide by (n-1) for sample variance"""
import random
true_variance = 1 # True variance of standard normal
n_samples = 1000
sample_size = 10
biased_estimates = []
unbiased_estimates = []
for _ in range(n_samples):
sample = [random.gauss(0, 1) for _ in range(sample_size)]
# Biased (divide by n)
biased_estimates.append(sample_variance(sample, ddof=0))
# Unbiased (divide by n-1)
unbiased_estimates.append(sample_variance(sample, ddof=1))
print(f"True variance: {true_variance}")
print(f"Mean of biased estimates: {sum(biased_estimates)/len(biased_estimates):.4f}")
print(f"Mean of unbiased estimates: {sum(unbiased_estimates)/len(unbiased_estimates):.4f}")
demonstrate_bias()
Consistency: θ̂ → θ as n → ∞ (converges in probability)
Efficiency: Among unbiased estimators, the one with smallest variance is most efficient.
Maximum Likelihood Estimation (MLE)
MLE finds the parameter values that maximize the likelihood of observing the data.
def mle_bernoulli(data):
"""
MLE for Bernoulli parameter p
L(p) = Π p^xᵢ (1-p)^(1-xᵢ)
log L(p) = Σ xᵢ log(p) + Σ (1-xᵢ) log(1-p)
Taking derivative and setting to 0:
p̂ = (1/n) Σ xᵢ = sample proportion
"""
return sum(data) / len(data)
def mle_normal(data):
"""
MLE for Normal(μ, σ²)
μ̂ = sample mean
σ̂² = (1/n) Σ(xᵢ - x̄)² (note: biased!)
"""
n = len(data)
mu_hat = sum(data) / n
sigma_sq_hat = sum((x - mu_hat)**2 for x in data) / n
return mu_hat, sigma_sq_hat
2. The Central Limit Theorem
Statement
For independent samples X₁, ..., Xₙ from a distribution with mean μ and variance σ²:
X̄ = (1/n) Σ Xᵢ
As n → ∞:
(X̄ - μ) / (σ/√n) → N(0, 1)
In words: The sample mean is approximately normally distributed, regardless of the original distribution.
Practical Rule
For n ≥ 30, the normal approximation is usually reasonable.
def demonstrate_clt():
"""Visualize Central Limit Theorem"""
import random
def sample_means(distribution_sampler, n_samples, sample_size):
"""Generate sample means"""
return [
sum(distribution_sampler() for _ in range(sample_size)) / sample_size
for _ in range(n_samples)
]
# Sample from uniform distribution
uniform_sampler = lambda: random.uniform(0, 1)
# Sample from exponential distribution
exp_sampler = lambda: -math.log(random.random())
# As sample size increases, means become more normal
for sample_size in [1, 5, 30]:
means = sample_means(exp_sampler, 10000, sample_size)
print(f"\nSample size {sample_size}:")
print(f" Mean of means: {sum(means)/len(means):.4f} (true: 1.0)")
print(f" Std of means: {sample_std(means):.4f} (theory: {1/math.sqrt(sample_size):.4f})")
demonstrate_clt()
Standard Error
The standard error is the standard deviation of an estimator:
SE(X̄) = σ / √n
When σ is unknown, estimate with sample standard deviation:
SE(X̄) ≈ S / √n
def standard_error(data):
"""Compute standard error of the mean"""
n = len(data)
s = sample_std(data)
return s / math.sqrt(n)
3. Confidence Intervals
Concept
A confidence interval provides a range of plausible values for a parameter.
A 95% CI means: If we repeated the sampling many times, 95% of computed intervals would contain the true parameter.
Common misconception: It does NOT mean "95% probability the true value is in this interval."
CI for Mean (Known Variance)
CI = X̄ ± z_{α/2} × (σ/√n)
For 95% CI: z_{0.025} = 1.96
CI for Mean (Unknown Variance)
Use t-distribution with n-1 degrees of freedom:
CI = X̄ ± t_{α/2, n-1} × (S/√n)
def t_critical(alpha, df):
"""
Approximate t critical value
For df > 30, t ≈ z
"""
# Simplified approximation (use scipy.stats.t.ppf for accuracy)
z = {0.10: 1.645, 0.05: 1.96, 0.01: 2.576}
if df >= 30:
return z.get(alpha, 1.96)
# Small sample adjustment (rough approximation)
correction = 1 + 1/(4*df)
return z.get(alpha, 1.96) * correction
def confidence_interval_mean(data, confidence=0.95):
"""
Compute confidence interval for population mean
Uses t-distribution for unknown variance
"""
n = len(data)
x_bar = sample_mean(data)
se = standard_error(data)
alpha = 1 - confidence
t_crit = t_critical(alpha, n - 1)
margin = t_crit * se
return (x_bar - margin, x_bar + margin)
# Example
data = [23, 25, 28, 31, 22, 27, 29, 26, 24, 30]
ci = confidence_interval_mean(data, 0.95)
print(f"Sample mean: {sample_mean(data):.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
CI for Proportion
p̂ ± z_{α/2} × √(p̂(1-p̂)/n)
def confidence_interval_proportion(successes, n, confidence=0.95):
"""Confidence interval for proportion"""
p_hat = successes / n
z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * se
return (max(0, p_hat - margin), min(1, p_hat + margin))
# Example: 60 successes in 100 trials
ci = confidence_interval_proportion(60, 100, 0.95)
print(f"Proportion: 0.60")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
4. Hypothesis Testing
Framework
- Null hypothesis H₀: The default/status quo claim
- Alternative hypothesis H₁: What we're trying to show
- Test statistic: Computed from data
- P-value: Probability of observing data as extreme as ours, if H₀ is true
- Decision: Reject H₀ if p-value < α (significance level)
Types of Errors
| H₀ True | H₀ False | |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct |
| Fail to Reject H₀ | Correct | Type II Error (β) |
- α = P(Type I) = significance level (typically 0.05)
- β = P(Type II)
- Power = 1 - β = P(correctly rejecting false H₀)
One-Sample t-Test
Testing if population mean equals a specific value.
def one_sample_t_test(data, mu_0, alternative='two-sided'):
"""
Test H₀: μ = μ₀
alternative: 'two-sided', 'greater', 'less'
"""
n = len(data)
x_bar = sample_mean(data)
se = standard_error(data)
# Test statistic
t_stat = (x_bar - mu_0) / se
# Degrees of freedom
df = n - 1
# P-value calculation (simplified)
# For production, use scipy.stats.t.sf
def t_pvalue(t, df, alt):
# Rough approximation using normal for df > 30
from math import erf
def normal_cdf(x):
return (1 + erf(x / math.sqrt(2))) / 2
if alt == 'two-sided':
return 2 * (1 - normal_cdf(abs(t)))
elif alt == 'greater':
return 1 - normal_cdf(t)
else: # less
return normal_cdf(t)
p_value = t_pvalue(t_stat, df, alternative)
return {
't_statistic': t_stat,
'df': df,
'p_value': p_value,
'reject_at_0.05': p_value < 0.05
}
# Example: Test if mean response time = 100ms
response_times = [95, 102, 98, 105, 101, 99, 103, 97, 100, 104]
result = one_sample_t_test(response_times, mu_0=100)
print(f"t-statistic: {result['t_statistic']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Reject H₀: {result['reject_at_0.05']}")
Two-Sample t-Test
Comparing means of two groups.
def two_sample_t_test(data1, data2, alternative='two-sided'):
"""
Test H₀: μ₁ = μ₂ (independent samples)
Using Welch's t-test (doesn't assume equal variance)
"""
n1, n2 = len(data1), len(data2)
x1_bar, x2_bar = sample_mean(data1), sample_mean(data2)
s1_sq, s2_sq = sample_variance(data1), sample_variance(data2)
# Standard error of difference
se_diff = math.sqrt(s1_sq/n1 + s2_sq/n2)
# Test statistic
t_stat = (x1_bar - x2_bar) / se_diff
# Welch-Satterthwaite degrees of freedom
df = (s1_sq/n1 + s2_sq/n2)**2 / (
(s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1)
)
return {
'difference': x1_bar - x2_bar,
't_statistic': t_stat,
'df': df,
'mean1': x1_bar,
'mean2': x2_bar
}
# Example: Compare two algorithms
algo_a_times = [102, 98, 105, 101, 99, 103, 97, 100, 104, 96]
algo_b_times = [95, 92, 98, 94, 96, 91, 97, 93, 95, 94]
result = two_sample_t_test(algo_a_times, algo_b_times)
print(f"Algorithm A mean: {result['mean1']:.1f}ms")
print(f"Algorithm B mean: {result['mean2']:.1f}ms")
print(f"Difference: {result['difference']:.1f}ms")
print(f"t-statistic: {result['t_statistic']:.4f}")
Chi-Square Test
Test for independence or goodness of fit.
def chi_square_test(observed, expected):
"""
Chi-square goodness of fit test
H₀: Data follows expected distribution
"""
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
return {
'chi_square': chi_sq,
'df': df
}
# Example: Test if die is fair
observed = [18, 16, 14, 20, 17, 15] # 100 rolls
expected = [100/6] * 6
result = chi_square_test(observed, expected)
print(f"Chi-square: {result['chi_square']:.4f}")
print(f"Degrees of freedom: {result['df']}")
# Compare to chi-square critical value (df=5, α=0.05): 11.07
5. A/B Testing
Framework
A/B testing compares two versions to determine which performs better.
class ABTest:
"""A/B test for conversion rates"""
def __init__(self, name):
self.name = name
self.control = {'visitors': 0, 'conversions': 0}
self.treatment = {'visitors': 0, 'conversions': 0}
def record_control(self, converted):
self.control['visitors'] += 1
if converted:
self.control['conversions'] += 1
def record_treatment(self, converted):
self.treatment['visitors'] += 1
if converted:
self.treatment['conversions'] += 1
def get_rates(self):
ctrl_rate = self.control['conversions'] / self.control['visitors']
treat_rate = self.treatment['conversions'] / self.treatment['visitors']
return ctrl_rate, treat_rate
def analyze(self, confidence=0.95):
"""Analyze A/B test results"""
n1, c1 = self.control['visitors'], self.control['conversions']
n2, c2 = self.treatment['visitors'], self.treatment['conversions']
p1, p2 = c1/n1, c2/n2
# Pooled proportion under H₀
p_pool = (c1 + c2) / (n1 + n2)
# Standard error under H₀
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
# Z-statistic
z = (p2 - p1) / se_pool if se_pool > 0 else 0
# Relative lift
lift = (p2 - p1) / p1 if p1 > 0 else float('inf')
# Confidence interval for difference
se_diff = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
z_crit = 1.96 # 95% CI
ci = (p2 - p1 - z_crit*se_diff, p2 - p1 + z_crit*se_diff)
return {
'control_rate': p1,
'treatment_rate': p2,
'absolute_lift': p2 - p1,
'relative_lift': lift,
'z_statistic': z,
'ci_95': ci,
'significant': abs(z) > 1.96
}
# Example
test = ABTest("Checkout Button Color")
# Simulate data
import random
for _ in range(5000):
test.record_control(random.random() < 0.10) # 10% baseline
test.record_treatment(random.random() < 0.12) # 12% treatment
results = test.analyze()
print(f"Control rate: {results['control_rate']:.1%}")
print(f"Treatment rate: {results['treatment_rate']:.1%}")
print(f"Relative lift: {results['relative_lift']:.1%}")
print(f"95% CI: ({results['ci_95'][0]:.1%}, {results['ci_95'][1]:.1%})")
print(f"Statistically significant: {results['significant']}")
Sample Size Calculation
def sample_size_proportion(p1, mde, alpha=0.05, power=0.80):
"""
Calculate required sample size for A/B test
p1: baseline conversion rate
mde: minimum detectable effect (relative)
alpha: significance level
power: statistical power
"""
p2 = p1 * (1 + mde) # Expected treatment rate
# Z-values
z_alpha = 1.96 # two-sided
z_beta = 0.84 # 80% power
# Pooled variance approximation
p_avg = (p1 + p2) / 2
var_pooled = 2 * p_avg * (1 - p_avg)
# Sample size per group
n = var_pooled * ((z_alpha + z_beta) / (p2 - p1))**2
return math.ceil(n)
# Example: 5% baseline, want to detect 20% relative lift
n = sample_size_proportion(0.05, 0.20)
print(f"Required sample size per group: {n}")
print(f"Total visitors needed: {2 * n}")
Multiple Testing Problem
def bonferroni_correction(alpha, num_tests):
"""
Bonferroni correction for multiple comparisons
If testing at α level with k tests:
Adjusted α = α / k
"""
return alpha / num_tests
def benjamini_hochberg(p_values, alpha=0.05):
"""
Benjamini-Hochberg procedure for controlling FDR
Returns which hypotheses to reject
"""
n = len(p_values)
sorted_idx = sorted(range(n), key=lambda i: p_values[i])
reject = [False] * n
for i, idx in enumerate(sorted_idx):
threshold = (i + 1) * alpha / n
if p_values[idx] <= threshold:
reject[idx] = True
else:
break
return reject
# Example: 10 tests
p_values = [0.01, 0.04, 0.03, 0.06, 0.02, 0.08, 0.05, 0.07, 0.09, 0.04]
print(f"Original α = 0.05: {sum(p < 0.05 for p in p_values)} significant")
print(f"Bonferroni α = {bonferroni_correction(0.05, 10):.4f}")
print(f"Bonferroni significant: {sum(p < 0.005 for p in p_values)}")
print(f"B-H rejections: {benjamini_hochberg(p_values)}")
6. Common Pitfalls
P-value Misconceptions
"""
WRONG interpretations of p = 0.03:
- "There's a 3% chance H₀ is true"
- "There's a 97% chance the effect is real"
- "If we reject H₀, there's only 3% chance we're wrong"
CORRECT interpretation:
- "If H₀ is true, we'd see data this extreme 3% of the time"
"""
P-hacking / Data Dredging
def demonstrate_p_hacking():
"""
Show how testing many things finds 'significant' results by chance
"""
import random
def test_correlation(x, y):
"""Simple correlation test"""
n = len(x)
mean_x, mean_y = sum(x)/n, sum(y)/n
cov = sum((xi-mean_x)*(yi-mean_y) for xi, yi in zip(x, y)) / n
std_x = math.sqrt(sum((xi-mean_x)**2 for xi in x) / n)
std_y = math.sqrt(sum((yi-mean_y)**2 for yi in y) / n)
r = cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0
# Approximate p-value (simplified)
t = r * math.sqrt(n-2) / math.sqrt(1-r**2) if abs(r) < 1 else 0
return abs(r), r
# Target variable (random)
y = [random.gauss(0, 1) for _ in range(100)]
# Test 100 random "features"
significant = 0
for i in range(100):
x = [random.gauss(0, 1) for _ in range(100)]
r, _ = test_correlation(x, y)
if r > 0.2: # Arbitrary "significance"
significant += 1
print(f"Testing 100 random features against random target:")
print(f"Found {significant} 'significant' correlations")
print("These are all spurious!")
demonstrate_p_hacking()
Stopping Rules
def demonstrate_early_stopping_bias():
"""
Show bias from stopping test when p < 0.05
"""
import random
def run_biased_test():
"""Stop as soon as p < 0.05 (WRONG!)"""
control = []
treatment = []
for i in range(1000):
control.append(random.gauss(100, 10))
treatment.append(random.gauss(100, 10)) # Same distribution!
if len(control) >= 20:
# Calculate p-value
diff = sum(treatment)/len(treatment) - sum(control)/len(control)
se = 10 * math.sqrt(2/len(control))
z = diff / se
if abs(z) > 1.96: # "Significant"
return True, len(control)
return False, 1000
# Run many trials
false_positives = 0
for _ in range(100):
significant, n = run_biased_test()
if significant:
false_positives += 1
print(f"False positive rate with early stopping: {false_positives}%")
print("(Should be 5% with fixed sample size)")
demonstrate_early_stopping_bias()
Exercises
Basic
A sample of 50 website load times has mean 2.3 seconds and std 0.8 seconds. Construct a 95% CI for the population mean.
An A/B test shows 120/1000 conversions for control and 150/1000 for treatment. Is this statistically significant at α = 0.05?
Explain why "p = 0.03 means there's a 3% chance the null hypothesis is true" is incorrect.
Intermediate
You're testing 20 features for correlation with an outcome. Using Bonferroni correction, what p-value threshold should you use for α = 0.05?
Calculate the sample size needed to detect a 10% relative lift from a 5% baseline with 80% power.
Implement a function that computes the power of a two-sample t-test given effect size, sample size, and α.
Advanced
Implement a Bayesian A/B test using Beta-Binomial model. Compare results to frequentist approach.
Design a sequential testing procedure that controls Type I error while allowing early stopping.
Implement bootstrap confidence intervals and compare to parametric intervals.
Summary
- Point estimation provides single best guesses for parameters
- Confidence intervals provide ranges of plausible values
- Hypothesis testing provides framework for making decisions
- A/B testing applies inference to compare variants
- Watch out for multiple testing, early stopping, and p-value misconceptions