Machine Learning Fundamentals

Introduction

Machine learning enables computers to learn from data without being explicitly programmed. This reading covers core ML concepts, the learning process, and fundamental principles that underlie all ML algorithms.

Learning Objectives

By the end of this reading, you will be able to:

  • Define machine learning and its types
  • Understand the bias-variance tradeoff
  • Implement basic evaluation metrics
  • Apply cross-validation techniques
  • Recognize overfitting and underfitting

1. What is Machine Learning?

Definition

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). More formally, in Tom Mitchell's definition: a computer program is said to learn from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with experience E.

"""
Traditional Programming:
    Input + Rules → Output

Machine Learning:
    Input + Output → Rules (Model)
"""

# Traditional: Explicit rules
def is_spam_traditional(email: str) -> bool:
    spam_words = ['free', 'winner', 'click here', 'urgent']
    return any(word in email.lower() for word in spam_words)

# ML Approach: Learn from examples
class SpamClassifier:
    def __init__(self):
        self.model = None

    def train(self, emails: list, labels: list):
        """Learn from labeled examples"""
        # Model learns patterns from data
        pass

    def predict(self, email: str) -> bool:
        """Apply learned patterns"""
        if self.model is None:
            raise RuntimeError("train() must be called before predict()")
        return self.model.predict(email)

Types of Machine Learning

"""
1. SUPERVISED LEARNING
   - Learn from labeled data
   - Input → Output mapping
   - Examples: Classification, Regression

2. UNSUPERVISED LEARNING
   - Learn from unlabeled data
   - Find hidden patterns
   - Examples: Clustering, Dimensionality Reduction

3. REINFORCEMENT LEARNING
   - Learn from interaction with environment
   - Maximize cumulative reward
   - Examples: Game playing, Robotics

4. SEMI-SUPERVISED LEARNING
   - Mix of labeled and unlabeled data
   - Use structure in unlabeled data

5. SELF-SUPERVISED LEARNING
   - Create labels from data itself
   - Examples: Language models, Contrastive learning
"""

# Supervised: Regression example
def supervised_regression_example():
    """Predict house price from features"""
    # X: features (size, bedrooms, location)
    # y: labels (price)
    X = [[1500, 3, 1], [2000, 4, 2], [1200, 2, 1]]
    y = [300000, 450000, 250000]
    # Learn: f(X) → y
    return X, y

# Supervised: Classification example
def supervised_classification_example():
    """Classify email as spam or not"""
    # X: email features (e.g., link count, capital-letter ratio, spam-word count)
    # y: labels (0=not spam, 1=spam)
    X = [[10, 0.5, 3], [2, 0.1, 0], [15, 0.8, 5]]
    y = [1, 0, 1]
    # Learn: f(X) → y ∈ {0, 1}
    return X, y

# Unsupervised: Clustering example
def unsupervised_clustering_example():
    """Group customers by behavior"""
    # X: customer features (e.g., age, income) — no labels
    X = [[25, 50000], [45, 80000], [23, 45000], [47, 85000]]
    # Learn: group similar customers
    return X
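To make the supervised idea concrete, here is a minimal least-squares fit on the toy house data above (a sketch only; a real pipeline would scale features and use far more than three samples):

```python
import numpy as np

# Features: (size, bedrooms, location); label: price
X = np.array([[1500, 3, 1], [2000, 4, 2], [1200, 2, 1]], dtype=float)
y = np.array([300_000, 450_000, 250_000], dtype=float)

# Learn f(X) → y by ordinary least squares (with a bias column)
X_b = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X_b, y, rcond=None)

pred = X_b @ w
print(np.round(pred))  # with only 3 points, the fit passes (nearly) through them
```

With more samples than parameters the fit would no longer be exact, and the residual error becomes the quantity the rest of this reading is about measuring.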

2. The Learning Process

Training Pipeline

import numpy as np
from typing import Tuple, List
from dataclasses import dataclass

@dataclass
class Dataset:
    X: np.ndarray  # Features
    y: np.ndarray  # Labels (if supervised)

def ml_pipeline(data: Dataset):
    """Standard ML pipeline"""

    # 1. Data Preprocessing
    X_clean = preprocess(data.X)

    # 2. Feature Engineering
    X_features = extract_features(X_clean)

    # 3. Train/Test Split
    X_train, X_test, y_train, y_test = train_test_split(
        X_features, data.y, test_size=0.2
    )

    # 4. Model Selection
    model = select_model()

    # 5. Training
    model.fit(X_train, y_train)

    # 6. Evaluation
    score = model.evaluate(X_test, y_test)

    # 7. Prediction
    predictions = model.predict(X_test)

    return model, score, predictions

def train_test_split(
    X: np.ndarray,
    y: np.ndarray,
    test_size: float = 0.2,
    random_state: int = None
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split data into training and test sets"""
    if random_state is not None:  # plain `if random_state:` would skip seed 0
        np.random.seed(random_state)

    n = len(X)
    indices = np.random.permutation(n)
    test_count = int(n * test_size)

    test_idx = indices[:test_count]
    train_idx = indices[test_count:]

    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
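A quick sanity check of the split logic (inlined here so the snippet runs standalone; the toy arrays are illustrative): the train and test partitions are disjoint and together cover every sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Same idea as train_test_split above: permute indices, then slice
indices = rng.permutation(len(X))
test_count = int(len(X) * 0.2)
test_idx, train_idx = indices[:test_count], indices[test_count:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```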

Feature Scaling

class StandardScaler:
    """Standardize features to zero mean and unit variance"""

    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X: np.ndarray) -> 'StandardScaler':
        """Compute mean and std from training data"""
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        # Avoid division by zero
        self.std_[self.std_ == 0] = 1
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Apply standardization"""
        return (X - self.mean_) / self.std_

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step"""
        return self.fit(X).transform(X)

class MinMaxScaler:
    """Scale features to [0, 1] range"""

    def __init__(self):
        self.min_ = None
        self.max_ = None

    def fit(self, X: np.ndarray) -> 'MinMaxScaler':
        self.min_ = np.min(X, axis=0)
        self.max_ = np.max(X, axis=0)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        range_ = self.max_ - self.min_
        range_[range_ == 0] = 1
        return (X - self.min_) / range_

# Why scaling matters
"""
Without scaling:
- Feature 1: age (0-100)
- Feature 2: income (0-1,000,000)

Income dominates because of larger scale!
Many algorithms (gradient descent, SVM, KNN) are sensitive to scale.
"""

3. Bias-Variance Tradeoff

Understanding Error

"""
Total Error = Bias² + Variance + Irreducible Error

BIAS: Error from wrong assumptions in the model
- High bias: Model is too simple (underfitting)
- Low bias: Model captures true relationship

VARIANCE: Error from sensitivity to training data
- High variance: Model fits noise (overfitting)
- Low variance: Model is stable across datasets

IRREDUCIBLE ERROR: Noise in the data itself
- Cannot be eliminated
"""

def demonstrate_bias_variance():
    """Visual demonstration of bias-variance tradeoff"""

    # True function (unknown in practice)
    def true_function(x):
        return np.sin(x)

    # Generate noisy data
    np.random.seed(42)
    X = np.linspace(0, 2*np.pi, 20)
    y = true_function(X) + np.random.normal(0, 0.3, 20)

    # High bias (underfitting): Linear model
    # Assumes y = ax + b, but true function is sin(x)
    # Will have large error because model is too simple

    # High variance (overfitting): High-degree polynomial
    # Fits every point including noise
    # Will have large error on new data

    # Good balance: Moderate complexity
    # Captures sin-like shape without fitting noise

    return X, y

class BiasVarianceDemo:
    """Demonstrate bias-variance with polynomial regression"""

    def __init__(self, degree: int):
        self.degree = degree
        self.coefficients = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit polynomial of given degree"""
        # Create polynomial features
        X_poly = np.column_stack([X**i for i in range(self.degree + 1)])
        # Solve normal equations
        self.coefficients = np.linalg.lstsq(X_poly, y, rcond=None)[0]

    def predict(self, X: np.ndarray) -> np.ndarray:
        X_poly = np.column_stack([X**i for i in range(self.degree + 1)])
        return X_poly @ self.coefficients

# degree=1: High bias (underfitting) - straight line can't fit sin
# degree=15: High variance (overfitting) - wiggles through every point
# degree=5: Good balance - captures curve shape
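The effect shows up numerically too (a sketch; the noisy sine data mirrors demonstrate_bias_variance above). Training error falls as degree grows, which is precisely why training error alone cannot detect overfitting:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.linspace(0, 2 * np.pi, 20)
y = np.sin(X) + rng.normal(0, 0.3, 20)

def poly_train_mse(X, y, degree):
    """Fit a polynomial by least squares and return its training MSE."""
    X_poly = np.column_stack([X**i for i in range(degree + 1)])
    coef = np.linalg.lstsq(X_poly, y, rcond=None)[0]
    return np.mean((X_poly @ coef - y) ** 2)

errors = {d: poly_train_mse(X, y, d) for d in (1, 5, 15)}
for d, e in errors.items():
    print(f"degree={d:2d}  train MSE={e:.4f}")
```

Only held-out (test) error reveals that the degree-15 model has bought its low training error by fitting noise.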

Regularization

class RidgeRegression:
    """Linear regression with L2 regularization (reduces variance)"""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # Regularization strength
        self.weights = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit with L2 penalty on weights"""
        n_features = X.shape[1]

        # Add bias column
        X_b = np.column_stack([np.ones(len(X)), X])

        # Ridge solution: (X'X + αI)^(-1) X'y
        I = np.eye(n_features + 1)
        I[0, 0] = 0  # Don't regularize bias

        self.weights = np.linalg.solve(
            X_b.T @ X_b + self.alpha * I,
            X_b.T @ y
        )

    def predict(self, X: np.ndarray) -> np.ndarray:
        X_b = np.column_stack([np.ones(len(X)), X])
        return X_b @ self.weights

class LassoRegression:
    """Linear regression with L1 regularization (feature selection)"""

    def __init__(self, alpha: float = 1.0, max_iter: int = 1000):
        self.alpha = alpha
        self.max_iter = max_iter
        self.weights = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit with L1 penalty (coordinate descent)"""
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)

        for _ in range(self.max_iter):
            for j in range(n_features):
                # Residual without feature j
                residual = y - X @ self.weights + X[:, j] * self.weights[j]

                # Correlation
                rho = X[:, j] @ residual

                # Soft thresholding
                if rho < -self.alpha:
                    self.weights[j] = (rho + self.alpha) / (X[:, j] @ X[:, j])
                elif rho > self.alpha:
                    self.weights[j] = (rho - self.alpha) / (X[:, j] @ X[:, j])
                else:
                    self.weights[j] = 0  # Feature dropped!

    def predict(self, X: np.ndarray) -> np.ndarray:
        return X @ self.weights

"""
Regularization effects:
- L2 (Ridge): Shrinks weights toward zero, keeps all features
- L1 (Lasso): Can set weights exactly to zero (feature selection)
- Higher α: More regularization, simpler model, higher bias
- Lower α: Less regularization, complex model, higher variance
"""

4. Model Evaluation

Classification Metrics

class ClassificationMetrics:
    """Metrics for evaluating classifiers"""

    @staticmethod
    def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
        """
        Compute confusion matrix.

        [[TN, FP],
         [FN, TP]]
        """
        TP = np.sum((y_true == 1) & (y_pred == 1))
        TN = np.sum((y_true == 0) & (y_pred == 0))
        FP = np.sum((y_true == 0) & (y_pred == 1))
        FN = np.sum((y_true == 1) & (y_pred == 0))
        return np.array([[TN, FP], [FN, TP]])

    @staticmethod
    def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """(TP + TN) / Total"""
        return np.mean(y_true == y_pred)

    @staticmethod
    def precision(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """TP / (TP + FP) - Of predicted positives, how many are correct?"""
        TP = np.sum((y_true == 1) & (y_pred == 1))
        FP = np.sum((y_true == 0) & (y_pred == 1))
        return TP / (TP + FP) if (TP + FP) > 0 else 0

    @staticmethod
    def recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """TP / (TP + FN) - Of actual positives, how many did we catch?"""
        TP = np.sum((y_true == 1) & (y_pred == 1))
        FN = np.sum((y_true == 1) & (y_pred == 0))
        return TP / (TP + FN) if (TP + FN) > 0 else 0

    @staticmethod
    def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Harmonic mean of precision and recall"""
        p = ClassificationMetrics.precision(y_true, y_pred)
        r = ClassificationMetrics.recall(y_true, y_pred)
        return 2 * p * r / (p + r) if (p + r) > 0 else 0

# Example: 4 TP, 4 TN, 1 FP, 1 FN
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

metrics = ClassificationMetrics()
print(f"Accuracy:  {metrics.accuracy(y_true, y_pred):.3f}")   # 0.800
print(f"Precision: {metrics.precision(y_true, y_pred):.3f}")  # 0.800
print(f"Recall:    {metrics.recall(y_true, y_pred):.3f}")     # 0.800
print(f"F1 Score:  {metrics.f1_score(y_true, y_pred):.3f}")   # 0.800

Regression Metrics

class RegressionMetrics:
    """Metrics for evaluating regression models"""

    @staticmethod
    def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Mean Squared Error"""
        return np.mean((y_true - y_pred) ** 2)

    @staticmethod
    def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Root Mean Squared Error"""
        return np.sqrt(RegressionMetrics.mse(y_true, y_pred))

    @staticmethod
    def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Mean Absolute Error"""
        return np.mean(np.abs(y_true - y_pred))

    @staticmethod
    def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        R² (Coefficient of Determination)
        1.0 = perfect prediction
        0.0 = predicting mean
        <0  = worse than mean
        """
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

# Example
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

metrics = RegressionMetrics()
print(f"MSE:  {metrics.mse(y_true, y_pred):.3f}")       # 0.375
print(f"RMSE: {metrics.rmse(y_true, y_pred):.3f}")      # 0.612
print(f"MAE:  {metrics.mae(y_true, y_pred):.3f}")       # 0.500
print(f"R²:   {metrics.r2_score(y_true, y_pred):.3f}")  # 0.949

5. Cross-Validation

K-Fold Cross-Validation

class KFoldCV:
    """K-Fold Cross-Validation"""

    def __init__(self, n_splits: int = 5, shuffle: bool = True):
        self.n_splits = n_splits
        self.shuffle = shuffle

    def split(self, X: np.ndarray):
        """Generate train/test indices for each fold"""
        n_samples = len(X)
        indices = np.arange(n_samples)

        if self.shuffle:
            np.random.shuffle(indices)

        fold_sizes = np.full(self.n_splits, n_samples // self.n_splits)
        fold_sizes[:n_samples % self.n_splits] += 1

        current = 0
        for fold_size in fold_sizes:
            test_idx = indices[current:current + fold_size]
            train_idx = np.concatenate([
                indices[:current],
                indices[current + fold_size:]
            ])
            yield train_idx, test_idx
            current += fold_size

def cross_val_score(model, X: np.ndarray, y: np.ndarray,
                   cv: int = 5, scoring='accuracy') -> np.ndarray:
    """Evaluate model using cross-validation"""
    kfold = KFoldCV(n_splits=cv)
    scores = []

    for train_idx, test_idx in kfold.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Refit the model on this fold's training data
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Calculate score
        if scoring == 'accuracy':
            score = np.mean(y_test == y_pred)
        elif scoring == 'mse':
            score = -np.mean((y_test - y_pred) ** 2)  # Negative so higher is better
        else:
            raise ValueError(f"Unknown scoring: {scoring}")

        scores.append(score)

    return np.array(scores)

# Usage
"""
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
"""

Stratified K-Fold

class StratifiedKFold:
    """K-Fold that preserves class distribution in each fold"""

    def __init__(self, n_splits: int = 5):
        self.n_splits = n_splits

    def split(self, X: np.ndarray, y: np.ndarray):
        """Generate stratified train/test indices"""
        classes = np.unique(y)

        # Get indices for each class
        class_indices = {c: np.where(y == c)[0] for c in classes}

        # Shuffle within each class
        for c in classes:
            np.random.shuffle(class_indices[c])

        # Distribute each class across folds
        folds = [[] for _ in range(self.n_splits)]
        for c in classes:
            indices = class_indices[c]
            fold_sizes = np.full(self.n_splits, len(indices) // self.n_splits)
            fold_sizes[:len(indices) % self.n_splits] += 1

            current = 0
            for i, size in enumerate(fold_sizes):
                folds[i].extend(indices[current:current + size])
                current += size

        # Generate train/test splits
        for i in range(self.n_splits):
            test_idx = np.array(folds[i])
            train_idx = np.concatenate([folds[j] for j in range(self.n_splits) if j != i])
            yield train_idx, test_idx

# Why stratified?
"""
Imbalanced dataset: 90% class A, 10% class B

Regular K-Fold might create:
- Fold 1: 95% A, 5% B
- Fold 2: 85% A, 15% B
Different distributions = unreliable evaluation

Stratified K-Fold ensures:
- Fold 1: 90% A, 10% B
- Fold 2: 90% A, 10% B
Consistent distributions across folds
"""

6. Overfitting and Underfitting

Detection

def diagnose_fit(train_error: float, test_error: float,
                baseline_error: float) -> str:
    """
    Diagnose model fitting issues.

    train_error: Error on training data
    test_error: Error on test data
    baseline_error: Error of simple baseline (e.g., predicting mean)
    """
    if train_error > baseline_error * 0.9:
        return "UNDERFITTING: High train error. Model too simple."

    gap = test_error - train_error

    if gap > train_error * 0.5:
        return "OVERFITTING: Large gap between train and test error."

    if test_error < baseline_error * 0.7:
        return "GOOD FIT: Low error, small gap."

    return "MODERATE: Consider more data or different model."

class LearningCurve:
    """Plot learning curves to diagnose fitting"""

    @staticmethod
    def compute(model, X: np.ndarray, y: np.ndarray,
               train_sizes: List[float] = None) -> dict:
        """
        Compute train and test scores for different training set sizes.
        """
        if train_sizes is None:
            train_sizes = [0.1, 0.25, 0.5, 0.75, 1.0]

        n_samples = len(X)
        results = {'train_sizes': [], 'train_scores': [], 'test_scores': []}

        # Hold out the last 20% for testing (assumes rows are already shuffled)
        split_idx = int(n_samples * 0.8)
        X_train_full, X_test = X[:split_idx], X[split_idx:]
        y_train_full, y_test = y[:split_idx], y[split_idx:]

        for size in train_sizes:
            n_train = int(len(X_train_full) * size)
            X_train = X_train_full[:n_train]
            y_train = y_train_full[:n_train]

            model.fit(X_train, y_train)

            train_score = model.score(X_train, y_train)
            test_score = model.score(X_test, y_test)

            results['train_sizes'].append(n_train)
            results['train_scores'].append(train_score)
            results['test_scores'].append(test_score)

        return results

"""
Learning curve interpretation:

UNDERFITTING:
- Train and test scores both low
- Scores converge to similar (low) value
- Solution: More complex model, more features

OVERFITTING:
- Train score high, test score low
- Large gap that doesn't close with more data
- Solution: Regularization, less complex model, more data

GOOD FIT:
- Both scores high
- Small gap between them
- Gap closes as data increases
"""

Solutions

"""
UNDERFITTING SOLUTIONS:
1. Use more complex model
2. Add more features
3. Reduce regularization
4. Train longer (for iterative methods)

OVERFITTING SOLUTIONS:
1. Get more training data
2. Reduce model complexity
3. Add regularization (L1, L2)
4. Dropout (for neural networks)
5. Early stopping
6. Data augmentation
7. Feature selection
8. Ensemble methods
"""

class EarlyStopping:
    """Stop training when validation loss stops improving"""

    def __init__(self, patience: int = 5, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss: float, model) -> bool:
        """
        Returns True if training should stop.
        """
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = model.get_weights()  # Save best
            return False
        else:
            self.counter += 1
            if self.counter >= self.patience:
                model.set_weights(self.best_weights)  # Restore best
                return True
            return False
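A usage sketch (ToyModel and the synthetic loss sequence are invented for illustration; EarlyStopping is repeated so the snippet runs standalone): training stops once validation loss has not improved for `patience` epochs, and the best weights are restored.

```python
class EarlyStopping:
    """Same logic as above."""
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = model.get_weights()
            return False
        self.counter += 1
        if self.counter >= self.patience:
            model.set_weights(self.best_weights)
            return True
        return False

class ToyModel:
    """Stand-in exposing the get/set_weights interface EarlyStopping expects."""
    def __init__(self):
        self.weights = 0
    def get_weights(self):
        return self.weights
    def set_weights(self, w):
        self.weights = w

# Validation loss improves for 4 epochs, then plateaus
losses = [1.0, 0.8, 0.6, 0.5, 0.52, 0.51, 0.53]
model, stopper = ToyModel(), EarlyStopping(patience=3)
for epoch, loss in enumerate(losses):
    model.weights = epoch  # pretend the weights change every epoch
    if stopper(loss, model):
        print(f"Stopped at epoch {epoch}; best loss {stopper.best_loss}")
        break
```

Training halts at epoch 6 and the weights from epoch 3 (the best-loss epoch) are restored.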

Exercises

Basic

  1. Implement accuracy, precision, and recall from scratch.

  2. Split a dataset into 80% training and 20% test, ensuring class balance.

  3. Explain why you should never evaluate on training data.

Intermediate

  1. Implement 5-fold cross-validation and compute mean and std of scores.

  2. Create learning curves for a model and diagnose the fit.

  3. Compare L1 and L2 regularization on a dataset with correlated features.

Advanced

  1. Implement stratified K-fold for multi-class classification.

  2. Design an experiment to demonstrate the bias-variance tradeoff.

  3. Build an automated model selection pipeline with cross-validation.


Summary

  • ML learns patterns from data instead of explicit rules
  • Supervised learning uses labeled data; unsupervised finds hidden structure
  • Bias-variance tradeoff: simple models underfit, complex models overfit
  • Regularization reduces overfitting by penalizing model complexity
  • Cross-validation provides reliable model evaluation
  • Learning curves help diagnose fitting issues

Next Reading

Supervised Learning →