Machine Learning Fundamentals
Introduction
Machine learning enables computers to learn from data without being explicitly programmed. This reading covers core ML concepts, the learning process, and fundamental principles that underlie all ML algorithms.
Learning Objectives
By the end of this reading, you will be able to:
- Define machine learning and its types
- Understand the bias-variance tradeoff
- Implement basic evaluation metrics
- Apply cross-validation techniques
- Recognize overfitting and underfitting
1. What is Machine Learning?
Definition
Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. More formally, a computer program is said to learn from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with experience E.
"""
Traditional Programming:
Input + Rules → Output
Machine Learning:
Input + Output → Rules (Model)
"""
# Traditional: Explicit rules
def is_spam_traditional(email: str) -> bool:
    spam_words = ['free', 'winner', 'click here', 'urgent']
    return any(word in email.lower() for word in spam_words)

# ML Approach: Learn from examples
class SpamClassifier:
    def __init__(self):
        self.model = None

    def train(self, emails: list, labels: list):
        """Learn from labeled examples"""
        # Model learns patterns from data
        pass

    def predict(self, email: str) -> bool:
        """Apply learned patterns"""
        return self.model.predict(email)
Types of Machine Learning
"""
1. SUPERVISED LEARNING
- Learn from labeled data
- Input → Output mapping
- Examples: Classification, Regression
2. UNSUPERVISED LEARNING
- Learn from unlabeled data
- Find hidden patterns
- Examples: Clustering, Dimensionality Reduction
3. REINFORCEMENT LEARNING
- Learn from interaction with environment
- Maximize cumulative reward
- Examples: Game playing, Robotics
4. SEMI-SUPERVISED LEARNING
- Mix of labeled and unlabeled data
- Use structure in unlabeled data
5. SELF-SUPERVISED LEARNING
- Create labels from data itself
- Examples: Language models, Contrastive learning
"""
# Supervised: Regression example
def supervised_regression_example():
    """Predict house price from features"""
    # X: features (size, bedrooms, location)
    # y: labels (price)
    X = [[1500, 3, 1], [2000, 4, 2], [1200, 2, 1]]
    y = [300000, 450000, 250000]
    # Learn: f(X) → y
    return X, y

# Supervised: Classification example
def supervised_classification_example():
    """Classify email as spam or not"""
    # X: email features
    # y: labels (0=not spam, 1=spam)
    X = [[10, 0.5, 3], [2, 0.1, 0], [15, 0.8, 5]]
    y = [1, 0, 1]
    # Learn: f(X) → y ∈ {0, 1}
    return X, y

# Unsupervised: Clustering example
def unsupervised_clustering_example():
    """Group customers by behavior"""
    # X: customer features (no labels)
    X = [[25, 50000], [45, 80000], [23, 45000], [47, 85000]]
    # Learn: group similar customers
    return X
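The clustering example above stops at a comment. As a concrete (simplified) illustration of what "group similar customers" means, here is a minimal k-means sketch on those same four customers — the `kmeans` function and its parameters are illustrative, not from any library:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Customers as (age, income); the two income groups separate cleanly
X = np.array([[25, 50000], [45, 80000], [23, 45000], [47, 85000]], dtype=float)
labels, centroids = kmeans(X, k=2)
```

On this toy data, customers 0 and 2 (lower incomes) end up in one cluster and customers 1 and 3 in the other — no labels were needed.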
2. The Learning Process
Training Pipeline
import numpy as np
from typing import Tuple, List
from dataclasses import dataclass

@dataclass
class Dataset:
    X: np.ndarray  # Features
    y: np.ndarray  # Labels (if supervised)

def ml_pipeline(data: Dataset):
    """Standard ML pipeline (preprocess, extract_features, and
    select_model are placeholders for your own implementations)"""
    # 1. Data Preprocessing
    X_clean = preprocess(data.X)
    # 2. Feature Engineering
    X_features = extract_features(X_clean)
    # 3. Train/Test Split
    X_train, X_test, y_train, y_test = train_test_split(
        X_features, data.y, test_size=0.2
    )
    # 4. Model Selection
    model = select_model()
    # 5. Training
    model.fit(X_train, y_train)
    # 6. Evaluation
    score = model.evaluate(X_test, y_test)
    # 7. Prediction
    predictions = model.predict(X_test)
    return model, score
def train_test_split(
    X: np.ndarray,
    y: np.ndarray,
    test_size: float = 0.2,
    random_state: int = None
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split data into training and test sets"""
    if random_state is not None:  # 0 is a valid seed, so compare against None
        np.random.seed(random_state)
    n = len(X)
    indices = np.random.permutation(n)
    test_count = int(n * test_size)
    test_idx = indices[:test_count]
    train_idx = indices[test_count:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
Feature Scaling
class StandardScaler:
    """Standardize features to zero mean and unit variance"""
    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X: np.ndarray) -> 'StandardScaler':
        """Compute mean and std from training data"""
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        # Avoid division by zero
        self.std_[self.std_ == 0] = 1
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Apply standardization"""
        return (X - self.mean_) / self.std_

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step"""
        return self.fit(X).transform(X)
class MinMaxScaler:
    """Scale features to [0, 1] range"""
    def __init__(self):
        self.min_ = None
        self.max_ = None

    def fit(self, X: np.ndarray) -> 'MinMaxScaler':
        self.min_ = np.min(X, axis=0)
        self.max_ = np.max(X, axis=0)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        range_ = self.max_ - self.min_
        range_[range_ == 0] = 1
        return (X - self.min_) / range_
# Why scaling matters
"""
Without scaling:
- Feature 1: age (0-100)
- Feature 2: income (0-1,000,000)
Income dominates because of larger scale!
Many algorithms (gradient descent, SVM, KNN) are sensitive to scale.
"""
3. Bias-Variance Tradeoff
Understanding Error
"""
Total Error = Bias² + Variance + Irreducible Error
BIAS: Error from wrong assumptions in the model
- High bias: Model is too simple (underfitting)
- Low bias: Model captures true relationship
VARIANCE: Error from sensitivity to training data
- High variance: Model fits noise (overfitting)
- Low variance: Model is stable across datasets
IRREDUCIBLE ERROR: Noise in the data itself
- Cannot be eliminated
"""
def demonstrate_bias_variance():
    """Visual demonstration of bias-variance tradeoff"""
    # True function (unknown in practice)
    def true_function(x):
        return np.sin(x)

    # Generate noisy data
    np.random.seed(42)
    X = np.linspace(0, 2*np.pi, 20)
    y = true_function(X) + np.random.normal(0, 0.3, 20)

    # High bias (underfitting): Linear model
    # Assumes y = ax + b, but true function is sin(x)
    # Will have large error because model is too simple

    # High variance (overfitting): High-degree polynomial
    # Fits every point including noise
    # Will have large error on new data

    # Good balance: Moderate complexity
    # Captures sin-like shape without fitting noise
    return X, y

class BiasVarianceDemo:
    """Demonstrate bias-variance with polynomial regression"""
    def __init__(self, degree: int):
        self.degree = degree
        self.coefficients = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit polynomial of given degree"""
        # Create polynomial features
        X_poly = np.column_stack([X**i for i in range(self.degree + 1)])
        # Least-squares fit
        self.coefficients = np.linalg.lstsq(X_poly, y, rcond=None)[0]

    def predict(self, X: np.ndarray) -> np.ndarray:
        X_poly = np.column_stack([X**i for i in range(self.degree + 1)])
        return X_poly @ self.coefficients
# degree=1: High bias (underfitting) - straight line can't fit sin
# degree=15: High variance (overfitting) - wiggles through every point
# degree=5: Good balance - captures curve shape
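The decomposition can also be measured, not just described. The sketch below (an assumption-laden illustration, using NumPy's domain-rescaled `Polynomial.fit` rather than the class above for numerical stability at high degrees) refits polynomials to many noisy resamples of sin(x) and estimates bias² and variance empirically:

```python
import numpy as np

def true_function(x):
    return np.sin(x)

def bias_variance(degree, n_datasets=200, n_points=20, noise=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit empirically."""
    rng = np.random.default_rng(seed)
    X = np.linspace(0, 2 * np.pi, n_points)
    x_eval = np.linspace(0, 2 * np.pi, 50)
    preds = np.empty((n_datasets, len(x_eval)))
    for i in range(n_datasets):
        # Fresh noisy dataset from the same true function
        y = true_function(X) + rng.normal(0, noise, n_points)
        # Polynomial.fit rescales the domain internally, keeping
        # high-degree fits numerically stable
        poly = np.polynomial.Polynomial.fit(X, y, degree)
        preds[i] = poly(x_eval)
    avg_pred = preds.mean(axis=0)
    # Bias^2: squared gap between the average fit and the truth
    bias_sq = np.mean((avg_pred - true_function(x_eval)) ** 2)
    # Variance: how much individual fits scatter around their average
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

b1, v1 = bias_variance(degree=1)     # underfit: high bias, low variance
b15, v15 = bias_variance(degree=15)  # overfit: low bias, high variance
```

Running this shows the tradeoff numerically: the straight line has far larger bias², while the degree-15 fit has far larger variance.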
Regularization
class RidgeRegression:
    """Linear regression with L2 regularization (reduces variance)"""
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # Regularization strength
        self.weights = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit with L2 penalty on weights"""
        n_features = X.shape[1]
        # Add bias column
        X_b = np.column_stack([np.ones(len(X)), X])
        # Ridge solution: (X'X + αI)^(-1) X'y
        I = np.eye(n_features + 1)
        I[0, 0] = 0  # Don't regularize bias
        self.weights = np.linalg.solve(
            X_b.T @ X_b + self.alpha * I,
            X_b.T @ y
        )

    def predict(self, X: np.ndarray) -> np.ndarray:
        X_b = np.column_stack([np.ones(len(X)), X])
        return X_b @ self.weights
class LassoRegression:
    """Linear regression with L1 regularization (feature selection)"""
    def __init__(self, alpha: float = 1.0, max_iter: int = 1000):
        self.alpha = alpha
        self.max_iter = max_iter
        self.weights = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit with L1 penalty (coordinate descent)"""
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        for _ in range(self.max_iter):
            for j in range(n_features):
                # Residual without feature j
                residual = y - X @ self.weights + X[:, j] * self.weights[j]
                # Correlation
                rho = X[:, j] @ residual
                # Soft thresholding
                if rho < -self.alpha:
                    self.weights[j] = (rho + self.alpha) / (X[:, j] @ X[:, j])
                elif rho > self.alpha:
                    self.weights[j] = (rho - self.alpha) / (X[:, j] @ X[:, j])
                else:
                    self.weights[j] = 0  # Feature dropped!

    def predict(self, X: np.ndarray) -> np.ndarray:
        return X @ self.weights
"""
Regularization effects:
- L2 (Ridge): Shrinks weights toward zero, keeps all features
- L1 (Lasso): Can set weights exactly to zero (feature selection)
- Higher α: More regularization, simpler model, higher bias
- Lower α: Less regularization, complex model, higher variance
"""
4. Model Evaluation
Classification Metrics
class ClassificationMetrics:
    """Metrics for evaluating classifiers"""

    @staticmethod
    def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
        """
        Compute confusion matrix.
        [[TN, FP],
         [FN, TP]]
        """
        TP = np.sum((y_true == 1) & (y_pred == 1))
        TN = np.sum((y_true == 0) & (y_pred == 0))
        FP = np.sum((y_true == 0) & (y_pred == 1))
        FN = np.sum((y_true == 1) & (y_pred == 0))
        return np.array([[TN, FP], [FN, TP]])

    @staticmethod
    def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """(TP + TN) / Total"""
        return np.mean(y_true == y_pred)

    @staticmethod
    def precision(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """TP / (TP + FP) - Of predicted positives, how many are correct?"""
        TP = np.sum((y_true == 1) & (y_pred == 1))
        FP = np.sum((y_true == 0) & (y_pred == 1))
        return TP / (TP + FP) if (TP + FP) > 0 else 0.0

    @staticmethod
    def recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """TP / (TP + FN) - Of actual positives, how many did we catch?"""
        TP = np.sum((y_true == 1) & (y_pred == 1))
        FN = np.sum((y_true == 1) & (y_pred == 0))
        return TP / (TP + FN) if (TP + FN) > 0 else 0.0

    @staticmethod
    def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Harmonic mean of precision and recall"""
        p = ClassificationMetrics.precision(y_true, y_pred)
        r = ClassificationMetrics.recall(y_true, y_pred)
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
# Example
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
metrics = ClassificationMetrics()
print(f"Accuracy: {metrics.accuracy(y_true, y_pred):.3f}")
print(f"Precision: {metrics.precision(y_true, y_pred):.3f}")
print(f"Recall: {metrics.recall(y_true, y_pred):.3f}")
print(f"F1 Score: {metrics.f1_score(y_true, y_pred):.3f}")
Regression Metrics
class RegressionMetrics:
    """Metrics for evaluating regression models"""

    @staticmethod
    def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Mean Squared Error"""
        return np.mean((y_true - y_pred) ** 2)

    @staticmethod
    def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Root Mean Squared Error"""
        return np.sqrt(RegressionMetrics.mse(y_true, y_pred))

    @staticmethod
    def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Mean Absolute Error"""
        return np.mean(np.abs(y_true - y_pred))

    @staticmethod
    def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        R² (Coefficient of Determination)
        1.0 = perfect prediction
        0.0 = predicting mean
        <0 = worse than mean
        """
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
# Example
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])
metrics = RegressionMetrics()
print(f"MSE: {metrics.mse(y_true, y_pred):.3f}")
print(f"RMSE: {metrics.rmse(y_true, y_pred):.3f}")
print(f"MAE: {metrics.mae(y_true, y_pred):.3f}")
print(f"R²: {metrics.r2_score(y_true, y_pred):.3f}")
5. Cross-Validation
K-Fold Cross-Validation
class KFoldCV:
    """K-Fold Cross-Validation"""
    def __init__(self, n_splits: int = 5, shuffle: bool = True):
        self.n_splits = n_splits
        self.shuffle = shuffle

    def split(self, X: np.ndarray):
        """Generate train/test indices for each fold"""
        n_samples = len(X)
        indices = np.arange(n_samples)
        if self.shuffle:
            np.random.shuffle(indices)
        fold_sizes = np.full(self.n_splits, n_samples // self.n_splits)
        fold_sizes[:n_samples % self.n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            test_idx = indices[current:current + fold_size]
            train_idx = np.concatenate([
                indices[:current],
                indices[current + fold_size:]
            ])
            yield train_idx, test_idx
            current += fold_size
def cross_val_score(model, X: np.ndarray, y: np.ndarray,
                    cv: int = 5, scoring='accuracy') -> np.ndarray:
    """Evaluate model using cross-validation"""
    kfold = KFoldCV(n_splits=cv)
    scores = []
    for train_idx, test_idx in kfold.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Refit the model on this fold's training data
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Calculate score
        if scoring == 'accuracy':
            score = np.mean(y_test == y_pred)
        elif scoring == 'mse':
            score = -np.mean((y_test - y_pred) ** 2)  # Negative for consistency
        scores.append(score)
    return np.array(scores)
# Usage
"""
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
"""
Stratified K-Fold
class StratifiedKFold:
    """K-Fold that preserves class distribution in each fold"""
    def __init__(self, n_splits: int = 5):
        self.n_splits = n_splits

    def split(self, X: np.ndarray, y: np.ndarray):
        """Generate stratified train/test indices"""
        classes = np.unique(y)
        # Get indices for each class
        class_indices = {c: np.where(y == c)[0] for c in classes}
        # Shuffle within each class
        for c in classes:
            np.random.shuffle(class_indices[c])
        # Distribute each class across folds
        folds = [[] for _ in range(self.n_splits)]
        for c in classes:
            indices = class_indices[c]
            fold_sizes = np.full(self.n_splits, len(indices) // self.n_splits)
            fold_sizes[:len(indices) % self.n_splits] += 1
            current = 0
            for i, size in enumerate(fold_sizes):
                folds[i].extend(indices[current:current + size])
                current += size
        # Generate train/test splits
        for i in range(self.n_splits):
            test_idx = np.array(folds[i])
            train_idx = np.concatenate([folds[j] for j in range(self.n_splits) if j != i])
            yield train_idx, test_idx
# Why stratified?
"""
Imbalanced dataset: 90% class A, 10% class B
Regular K-Fold might create:
- Fold 1: 95% A, 5% B
- Fold 2: 85% A, 15% B
Different distributions = unreliable evaluation
Stratified K-Fold ensures:
- Fold 1: 90% A, 10% B
- Fold 2: 90% A, 10% B
Consistent distributions across folds
"""
6. Overfitting and Underfitting
Detection
def diagnose_fit(train_error: float, test_error: float,
                 baseline_error: float) -> str:
    """
    Diagnose model fitting issues.
    train_error: Error on training data
    test_error: Error on test data
    baseline_error: Error of simple baseline (e.g., predicting mean)
    """
    if train_error > baseline_error * 0.9:
        return "UNDERFITTING: High train error. Model too simple."
    gap = test_error - train_error
    if gap > train_error * 0.5:
        return "OVERFITTING: Large gap between train and test error."
    if test_error < baseline_error * 0.7:
        return "GOOD FIT: Low error, small gap."
    return "MODERATE: Consider more data or different model."
class LearningCurve:
    """Plot learning curves to diagnose fitting"""

    @staticmethod
    def compute(model, X: np.ndarray, y: np.ndarray,
                train_sizes: List[float] = None) -> dict:
        """
        Compute train and test scores for different training set sizes.
        """
        if train_sizes is None:
            train_sizes = [0.1, 0.25, 0.5, 0.75, 1.0]
        n_samples = len(X)
        results = {'train_sizes': [], 'train_scores': [], 'test_scores': []}
        # Split into train and test
        split_idx = int(n_samples * 0.8)
        X_train_full, X_test = X[:split_idx], X[split_idx:]
        y_train_full, y_test = y[:split_idx], y[split_idx:]
        for size in train_sizes:
            n_train = int(len(X_train_full) * size)
            X_train = X_train_full[:n_train]
            y_train = y_train_full[:n_train]
            model.fit(X_train, y_train)
            train_score = model.score(X_train, y_train)
            test_score = model.score(X_test, y_test)
            results['train_sizes'].append(n_train)
            results['train_scores'].append(train_score)
            results['test_scores'].append(test_score)
        return results
"""
Learning curve interpretation:
UNDERFITTING:
- Train and test scores both low
- Scores converge to similar (low) value
- Solution: More complex model, more features
OVERFITTING:
- Train score high, test score low
- Large gap that doesn't close with more data
- Solution: Regularization, less complex model, more data
GOOD FIT:
- Both scores high
- Small gap between them
- Gap closes as data increases
"""
Solutions
"""
UNDERFITTING SOLUTIONS:
1. Use more complex model
2. Add more features
3. Reduce regularization
4. Train longer (for iterative methods)
OVERFITTING SOLUTIONS:
1. Get more training data
2. Reduce model complexity
3. Add regularization (L1, L2)
4. Dropout (for neural networks)
5. Early stopping
6. Data augmentation
7. Feature selection
8. Ensemble methods
"""
class EarlyStopping:
    """Stop training when validation loss stops improving"""
    def __init__(self, patience: int = 5, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss: float, model) -> bool:
        """
        Returns True if training should stop.
        """
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = model.get_weights()  # Save best
            return False
        else:
            self.counter += 1
            if self.counter >= self.patience:
                model.set_weights(self.best_weights)  # Restore best
                return True
            return False
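To make the patience logic concrete without needing a real model, here is a minimal, self-contained sketch of the same rule on a synthetic validation-loss curve (the loss values are made up for illustration):

```python
# Synthetic validation losses: improve for a while, then get worse
val_losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]

patience, min_delta = 3, 0.001
best_loss, counter, stop_epoch = float('inf'), 0, None

for epoch, loss in enumerate(val_losses):
    if loss < best_loss - min_delta:
        best_loss, counter = loss, 0  # improvement: reset patience
    else:
        counter += 1
        if counter >= patience:
            stop_epoch = epoch  # no improvement for `patience` epochs
            break

print(stop_epoch, best_loss)
```

The best loss (0.55) occurs at epoch 3; training then stops at epoch 6 after three epochs without improvement, and in the class above the weights from epoch 3 would be restored.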
Exercises
Basic
- Implement accuracy, precision, and recall from scratch.
- Split a dataset into 80% training and 20% test, ensuring class balance.
- Explain why you should never evaluate on training data.
Intermediate
- Implement 5-fold cross-validation and compute the mean and standard deviation of the scores.
- Create learning curves for a model and diagnose the fit.
- Compare L1 and L2 regularization on a dataset with correlated features.
Advanced
- Implement stratified K-fold for multi-class classification.
- Design an experiment to demonstrate the bias-variance tradeoff.
- Build an automated model selection pipeline with cross-validation.
Summary
- ML learns patterns from data instead of explicit rules
- Supervised learning uses labeled data; unsupervised finds hidden structure
- Bias-variance tradeoff: simple models underfit, complex models overfit
- Regularization reduces overfitting by penalizing model complexity
- Cross-validation provides reliable model evaluation
- Learning curves help diagnose fitting issues