Memory Hierarchy

Learning Objectives

By the end of this reading, you will be able to:

  • Understand the memory hierarchy and the trade-offs between speed, size, and cost
  • Explain how cache memory improves system performance
  • Understand cache organization: direct-mapped, set-associative, and fully associative
  • Apply principles of temporal and spatial locality
  • Understand virtual memory and address translation
  • Explain page tables, TLB, and memory management
  • Calculate cache hit rates and effective access times
  • Recognize memory optimization strategies in programming

Introduction

Memory is organized as a hierarchy, with faster but smaller storage closer to the CPU and slower but larger storage farther away. Understanding this hierarchy is crucial for writing efficient programs.

Why memory hierarchy matters:

  • The CPU-memory speed gap is enormous (CPU is 100+ times faster than RAM)
  • Well-designed memory systems can approach the speed of the fastest memory at the cost of the cheapest
  • Memory access patterns dramatically affect program performance
  • Understanding memory helps you write cache-friendly code

The Memory Hierarchy Pyramid

              ▲ Faster, Smaller, More Expensive (per byte)
              │
          ┌────┴─────┐
          │ Registers│ ← ~1 cycle, ~100 bytes
          ├──────────┤
          │ L1 Cache │ ← ~4 cycles, ~64 KB
          ├──────────┤
          │ L2 Cache │ ← ~10 cycles, ~256 KB - 1 MB
          ├──────────┤
          │ L3 Cache │ ← ~40 cycles, ~8 MB - 32 MB
          ├──────────┤
          │ Main Mem.│ ← ~100-300 cycles, ~8-64 GB
          │   (RAM)  │
          ├──────────┤
          │   SSD    │ ← ~10,000+ cycles, ~256 GB - 2 TB
          ├──────────┤
          │ Hard Disk│ ← ~1,000,000+ cycles, ~1-8 TB
          └────┬─────┘
              │
              ▼ Slower, Larger, Cheaper (per byte)

Characteristics by Level

┌───────────┬─────────┬──────────┬────────────┬──────────┐
│ Level     │ Size    │ Speed    │ Cost/GB    │ Tech     │
├───────────┼─────────┼──────────┼────────────┼──────────┤
│ Registers │ < 1 KB  │ 0.25 ns  │ N/A        │ SRAM     │
│ L1 Cache  │ 32-64KB │ 1 ns     │ Very High  │ SRAM     │
│ L2 Cache  │ 256KB-  │ 3-10 ns  │ Very High  │ SRAM     │
│           │ 1MB     │          │            │          │
│ L3 Cache  │ 4-32MB  │ 10-20 ns │ High       │ SRAM     │
│ Main Mem. │ 4-64GB  │ 50-100ns │ ~$5        │ DRAM     │
│ SSD       │ 256GB+  │ 10-100μs │ ~$0.10     │ Flash    │
│ HDD       │ 1TB+    │ 5-10 ms  │ ~$0.02     │ Magnetic │
└───────────┴─────────┴──────────┴────────────┴──────────┘

Key Principles

1. Principle of Locality

  • Programs tend to access a small portion of their address space at any time
  • Well-written programs exhibit good locality

2. Principle of Inclusion

  • Data in higher levels is typically a subset of data in lower levels
  • L1 ⊂ L2 ⊂ L3 ⊂ RAM ⊂ Disk

3. Caching Principle

  • Keep frequently accessed data in faster, smaller memory
  • Brings performance closer to the fast memory at the cost of the slow memory

Types of Locality

Temporal Locality

Principle: If data is accessed, it's likely to be accessed again soon.

Example:

int sum = 0;
for (int i = 0; i < n; i++) {
    sum += array[i];  // 'sum' accessed repeatedly
}

Variable sum is accessed in every iteration → high temporal locality.

Exploitation: Keep recently accessed data in cache.

Spatial Locality

Principle: If data is accessed, nearby data is likely to be accessed soon.

Example:

int array[1000];
for (int i = 0; i < 1000; i++) {
    array[i] = 0;  // Sequential access: array[0], array[1], array[2]...
}

Accessing array[i] suggests array[i+1] will be accessed next.

Exploitation: Fetch data in blocks (cache lines).

Examples of Good vs Poor Locality

Good Spatial Locality (Row-major):

// C stores arrays in row-major order
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        array[i][j] = 0;  // Sequential memory access
    }
}

Poor Spatial Locality (Column-major in C):

for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        array[i][j] = 0;  // Strided access, poor cache performance
    }
}

Cache Memory

A cache is a small, fast memory placed between the CPU and main memory.

Cache Terminology

┌──────────────────────────────────────────┐
│            Cache Line / Block            │
│ ┌─────┬──────────────────────┬─────────┐ │
│ │ Tag │    Data Block        │  Valid  │ │
│ │     │  (e.g., 64 bytes)    │   Bit   │ │
│ └─────┴──────────────────────┴─────────┘ │
└──────────────────────────────────────────┘

Block/Line: Unit of transfer between cache and memory (typically 64 bytes)

Tag: Identifies which memory block is in this cache line

Valid Bit: Indicates if the cache line contains valid data

Dirty Bit: Indicates if data has been modified (write-back caches)

Cache Operations

Cache Hit: Data found in cache

CPU requests address 0x1000
→ Check cache
→ Found! (HIT)
→ Return data in ~1-10 cycles

Cache Miss: Data not in cache

CPU requests address 0x2000
→ Check cache
→ Not found (MISS)
→ Fetch from main memory (~100+ cycles)
→ Store in cache
→ Return data

Types of Misses:

  1. Compulsory Miss (Cold Miss): First access to data
  2. Capacity Miss: Cache is too small to hold all needed data
  3. Conflict Miss: Multiple addresses map to same cache location

Cache Performance Metrics

Hit Rate: Percentage of accesses found in cache

Hit Rate = Hits / (Hits + Misses)

Miss Rate: Percentage of accesses not found in cache

Miss Rate = 1 - Hit Rate = Misses / (Hits + Misses)

Average Memory Access Time (AMAT):

AMAT = Hit Time + (Miss Rate × Miss Penalty)

Example:
Hit Time = 1 cycle
Miss Penalty = 100 cycles
Hit Rate = 95%

AMAT = 1 + (0.05 × 100) = 1 + 5 = 6 cycles
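The arithmetic above can be checked with a couple of lines of Python:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time for a single cache level (in cycles)."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 100))  # 1 + 0.05 * 100 = 6.0 cycles
```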

Cache Organization

1. Direct-Mapped Cache

Each memory address maps to exactly one cache line.

Mapping:

Cache Line Index = (Address / Block Size) % Number of Cache Lines

Address Breakdown:

┌────────┬────────────┬──────────────┐
│  Tag   │   Index    │ Block Offset │
└────────┴────────────┴──────────────┘
│        │            │
│        │            └─ Byte within block
│        └─────────────── Cache line number
└──────────────────────── Identifies which memory block

Example: 16-line cache, 4-byte blocks

Address = 0x4C (binary: 0000 0000 0100 1100)

Block size = 4 bytes = 2² → 2-bit offset
Cache lines = 16 = 2⁴ → 4-bit index
Tag = remaining bits

     Tag      Index  Offset
┌──────────┬──────┬────┐
│0000000001│ 0011 │ 00 │
└──────────┴──────┴────┘
      1        3     0

Maps to cache line 3

Direct-Mapped Cache Structure:

Index  Valid  Tag      Data
───────────────────────────────
  0     1    0x02    [data...]
  1     0    ----    [data...]
  2     1    0x15    [data...]
  3     1    0x01    [data...]  ← Address 0x4C maps here
  4     0    ----    [data...]
  ...
  15    1    0x0A    [data...]
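The mapping arithmetic can be reproduced directly; this sketch computes the same tag 1, line 3, offset 0 as the worked example above:

```python
def split_address(addr, block_size, num_lines):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr % block_size
    block_address = addr // block_size
    index = block_address % num_lines   # Which cache line
    tag = block_address // num_lines    # Which memory block in that line
    return tag, index, offset

# Address 0x4C with 4-byte blocks and 16 lines, as in the worked example
print(split_address(0x4C, 4, 16))  # (1, 3, 0): tag 1, line 3, offset 0
```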

Advantages:

  • Simple and fast
  • Low hardware cost

Disadvantages:

  • High conflict miss rate
  • Poor utilization if multiple addresses map to same line

2. Fully Associative Cache

Any memory block can go in any cache line.

Operation:

  • Search all cache lines in parallel
  • Compare tag with all tags simultaneously

Structure:

Line  Valid  Tag      Data
────────────────────────────────
  0    1    0x1234   [data...]
  1    1    0x5678   [data...]  ← Any address can be here
  2    0    ------   [data...]
  3    1    0xABCD   [data...]
  ...

Address Breakdown:

┌──────────────────┬──────────────┐
│       Tag        │ Block Offset │
└──────────────────┴──────────────┘
        │                 │
        │                 └─ Byte within block
        └─────────────────── Identifies memory block

Advantages:

  • No conflict misses
  • Best cache utilization

Disadvantages:

  • Expensive (needs comparators for all lines)
  • Slower (parallel search)
  • Needs replacement policy

3. Set-Associative Cache

Compromise between direct-mapped and fully associative.

N-way set associative: Each address maps to a set of N lines.

Example: 2-way set-associative

Set      Way 0                 Way 1
    ┌──┬────┬──────┐   ┌──┬────┬──────┐
 0  │V │Tag │ Data │   │V │Tag │ Data │
    ├──┼────┼──────┤   ├──┼────┼──────┤
 1  │V │Tag │ Data │   │V │Tag │ Data │
    ├──┼────┼──────┤   ├──┼────┼──────┤
 2  │V │Tag │ Data │   │V │Tag │ Data │
    ├──┼────┼──────┤   ├──┼────┼──────┤
 3  │V │Tag │ Data │   │V │Tag │ Data │
    └──┴────┴──────┘   └──┴────┴──────┘

Address Breakdown:

┌────────┬────────────┬──────────────┐
│  Tag   │ Set Index  │ Block Offset │
└────────┴────────────┴──────────────┘

Operation:

  1. Use set index to select set
  2. Compare tag with all ways in the set (parallel)
  3. If match, return data from matching way

Advantages:

  • Fewer conflict misses than direct-mapped
  • Less expensive than fully associative
  • Good balance of performance and cost

Common Configurations:

  • 2-way, 4-way, 8-way, 16-way set-associative

Cache Replacement Policies

When every candidate line is occupied, which one should be evicted? (A direct-mapped cache has no choice; replacement policies apply to associative caches.)

1. LRU (Least Recently Used)

  • Evict the line that hasn't been used for the longest time
  • Good performance, exploits temporal locality
  • Requires tracking access history

2. FIFO (First In, First Out)

  • Evict the oldest line
  • Simple to implement
  • Doesn't consider access patterns

3. Random

  • Evict a random line
  • Very simple, surprisingly effective
  • No bookkeeping required

4. LFU (Least Frequently Used)

  • Evict the line used least often
  • Requires usage counters
  • Can be ineffective if access patterns change
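As an illustrative sketch (not modeled on any particular hardware), a set-associative cache with LRU replacement can be simulated with one ordered dict per set:

```python
from collections import OrderedDict

class SetAssociativeCache:
    """N-way set-associative cache with LRU replacement (sketch)."""
    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # Each set maps tag -> None, ordered from least to most recently used
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)        # Refresh LRU order on a hit
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # Evict the least recently used way
        s[tag] = None
        return False                  # Miss

cache = SetAssociativeCache(num_sets=4, ways=2, block_size=64)
# Address 0 survives two conflicting blocks (4096, 8192) thanks to LRU
hits = sum(cache.access(a) for a in [0, 64, 0, 4096, 0, 8192, 0])
print(hits)  # 3 hits
```

With 2 ways per set, address 0 keeps hitting even though 4096 and 8192 map to the same set; a direct-mapped cache would conflict-miss on every return to 0.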

Write Policies

What should happen when the CPU writes to an address that is in the cache?

Write-Through

Strategy: Write to both cache and memory simultaneously.

CPU writes to address 0x1000
  ↓
Write to cache (fast)
  ↓
Write to memory (slow) ← Performance bottleneck

Advantages:

  • Memory always up-to-date
  • Simpler consistency

Disadvantages:

  • Every write is slow
  • High memory traffic

Optimization: A write buffer queues pending writes to memory so the CPU can continue without waiting for each one.

Write-Back

Strategy: Write only to cache; write to memory when line is evicted.

CPU writes to address 0x1000
  ↓
Write to cache only (fast)
  ↓
Set dirty bit
  ↓
(Later, when evicted)
  ↓
Write to memory if dirty

Advantages:

  • Faster writes
  • Reduced memory traffic
  • Multiple writes to same location only written to memory once

Disadvantages:

  • Memory may be stale
  • More complex
  • Needs dirty bit
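A toy model (hypothetical counts, one cached address) shows why write-back reduces memory traffic when the same location is written repeatedly:

```python
def memory_writes(policy, num_writes):
    """Memory-bus writes for repeated writes to one cached address (toy model)."""
    if policy == "write-through":
        return num_writes   # Every write also goes to memory
    elif policy == "write-back":
        return 1            # Only the eventual eviction writes the dirty line back
    raise ValueError(policy)

for p in ("write-through", "write-back"):
    print(p, memory_writes(p, 1000))
# write-through performs 1000 memory writes; write-back just 1 (on eviction)
```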

Write Miss Policies

What if write misses in cache?

Write-Allocate (Fetch on Write):

  • Fetch block from memory into cache
  • Then write to cache
  • Common with write-back

No-Write-Allocate (Write Around):

  • Write directly to memory
  • Don't fetch into cache
  • Common with write-through

Multi-Level Caches

Modern CPUs have multiple cache levels.

┌───────────────────────────────────┐
│            CPU Core               │
├───────────────────────────────────┤
│  L1 Instruction    L1 Data        │
│  Cache (32KB)      Cache (32KB)   │ ← 3-4 cycles
├───────────────────────────────────┤
│       L2 Unified Cache            │
│           (256KB)                 │ ← 10-12 cycles
├───────────────────────────────────┤
│       L3 Shared Cache             │
│         (8-32MB)                  │ ← 30-40 cycles
└────────────┬──────────────────────┘
             │
             ▼
        Main Memory (RAM)             ← 100+ cycles
           (8-64GB)

Inclusion Policies

Inclusive Cache:

  • Data in L1 is also in L2, L3
  • Easier coherency
  • Some duplication

Exclusive Cache:

  • Data in L1 is NOT in L2
  • Better total capacity
  • More complex

Non-Inclusive:

  • No guarantee either way
  • Most flexible

Multi-Level AMAT

AMAT = L1_Hit_Time
       + L1_Miss_Rate × (L2_Hit_Time + L2_Miss_Rate × Memory_Time)

Example:
L1: 1 cycle, 95% hit rate
L2: 10 cycles, 80% hit rate (of L1 misses)
Memory: 100 cycles

AMAT = 1 + 0.05 × (10 + 0.20 × 100)
     = 1 + 0.05 × (10 + 20)
     = 1 + 0.05 × 30
     = 1 + 1.5
     = 2.5 cycles (average)

Virtual Memory

Virtual memory provides each process with its own address space.

Motivation

Problems with physical addresses:

  • Programs must know exact memory locations
  • Can't move programs in memory
  • No protection between processes
  • Limited by physical memory size

Virtual memory solves:

  • Process isolation
  • Memory protection
  • Larger address space than physical RAM
  • Simplified programming

Address Translation

Virtual Address → [Translation] → Physical Address

Virtual Address: Used by programs
Physical Address: Actual location in RAM

Example:

Process A sees: 0x0000 - 0xFFFF (virtual)
    ↓
Maps to: 0x10000 - 0x1FFFF (physical)

Process B sees: 0x0000 - 0xFFFF (virtual)
    ↓
Maps to: 0x20000 - 0x2FFFF (physical)

Both processes use same virtual addresses,
but access different physical memory!

Paging

Memory is divided into fixed-size pages.

Page size: Typically 4 KB (4096 bytes)

Virtual Memory:           Physical Memory:
┌──────────┐             ┌──────────┐
│  Page 0  │             │ Frame 5  │
├──────────┤             ├──────────┤
│  Page 1  │───────┐     │ Frame 2  │
├──────────┤       │     ├──────────┤
│  Page 2  │       └────►│ Frame 7  │
├──────────┤             ├──────────┤
│  Page 3  │             │ Frame 1  │
└──────────┘             └──────────┘

Virtual pages map to physical frames

Page Table

Page Table: Data structure mapping virtual pages to physical frames.

Virtual Page # → Page Table → Physical Frame #

Page Table Entry (PTE):

┌──────┬─────┬─────┬──────┬──────┬──────────────┐
│Valid │ Ref │Dirty│ R/W  │ U/S  │ Frame Number │
└──────┴─────┴─────┴──────┴──────┴──────────────┘
  │      │     │      │      │            │
  │      │     │      │      │            └─ Physical frame
  │      │     │      │      └─ User/Supervisor
  │      │     │      └─ Read/Write permission
  │      │     └─ Modified (dirty bit)
  │      └─ Referenced recently
  └─ Page present in memory

Address Translation Example:

Virtual Address: 32-bit, 4KB pages

┌────────────────────┬──────────────┐
│  Page Number (20)  │ Offset (12)  │
└────────────────────┴──────────────┘
         │                  │
         ▼                  │
    Page Table              │
         │                  │
         ▼                  │
   Frame Number             │
         │                  │
         ▼                  ▼
┌────────────────────┬──────────────┐
│  Frame Number (20) │ Offset (12)  │
└────────────────────┴──────────────┘
     Physical Address
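For the 32-bit, 4 KB-page layout above, the split is simple bit arithmetic (the sample address here is arbitrary):

```python
PAGE_SIZE = 4096  # 4 KB pages → 12 offset bits

def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a 4 KB-page system."""
    return vaddr >> 12, vaddr & (PAGE_SIZE - 1)

vpn, offset = split_virtual_address(0xCAFEB123)
print(hex(vpn), hex(offset))  # 0xcafeb 0x123
```

The offset passes through translation unchanged; only the page number is looked up in the page table.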

Multi-Level Page Tables

Problem: Single-level page table is huge!

  • 32-bit address, 4KB pages: 2²⁰ entries × 4 bytes = 4MB per process
  • 64-bit address: impractically large (a flat table for even a 48-bit address space would need 2³⁶ entries)

Solution: Multi-level page tables (typically 2-4 levels)

┌──────────┬──────────┬─────────┐
│  Level 1 │  Level 2 │ Offset  │
│  (10 bit)│  (10 bit)│ (12 bit)│
└──────────┴──────────┴─────────┘
      │          │
      ▼          │
  ┌───────┐      │
  │  PD   │──────┘
  └───────┘      │
                 ▼
             ┌───────┐
             │  PT   │────►┌───────┐
             └───────┘     │ Frame │
                           └───────┘

(Classic two-level 32-bit x86 layout: PD = page directory, PT = page table.)

Benefits:

  • Only allocate page tables for used memory
  • Much smaller memory footprint
  • Still provides full address space
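For the classic 32-bit x86 two-level scheme (10-bit directory index, 10-bit table index, 12-bit offset), the three fields fall out of simple shifts and masks; the sample address is arbitrary:

```python
def split_two_level(vaddr):
    """Split a 32-bit virtual address into (dir_index, table_index, offset)."""
    offset = vaddr & 0xFFF                 # Low 12 bits: byte within the 4 KB page
    table_index = (vaddr >> 12) & 0x3FF    # Next 10 bits: entry in the page table
    dir_index = (vaddr >> 22) & 0x3FF      # Top 10 bits: entry in the page directory
    return dir_index, table_index, offset

print([hex(x) for x in split_two_level(0xC0ABCDEF)])  # ['0x302', '0x2bc', '0xdef']
```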

Translation Lookaside Buffer (TLB)

Problem: Page table in memory → Every memory access requires 2+ memory accesses!

Solution: TLB caches recent page table entries.

Virtual Address
      │
      ▼
 ┌──────────┐
 │   TLB    │  ← Cache of page table entries
 │  (fast)  │
 └────┬─────┘
      │
    Hit? ── Yes → Physical Address (fast!)
      │
      No
      ▼
 ┌──────────┐
 │Page Table│  ← Access page table in memory
 │ (slow)   │
 └────┬─────┘
      │
      ▼
 Physical Address
 Update TLB

TLB Characteristics:

  • Small (64-512 entries)
  • Fully associative
  • Very high hit rate (>99%)

TLB Miss Handling:

1. Hardware walks page table (x86)
   OR
2. Software-managed: trap to OS handler (MIPS, Alpha)

Page Faults

Page Fault: Requested page not in physical memory.

Handling:

1. CPU accesses virtual address
2. Page table shows "not present"
3. Hardware raises page fault exception
4. OS page fault handler:
   a. Find page on disk
   b. Find free frame (evict if necessary)
   c. Read page from disk to frame
   d. Update page table
5. Resume instruction

Demand Paging: Pages loaded only when accessed (lazy loading).

Page Replacement Algorithms

When physical memory is full, which page should be evicted?

1. Optimal (OPT)

  • Replace page that won't be used for longest time
  • Impossible to implement (requires future knowledge)
  • Theoretical best case for comparison

2. FIFO (First In, First Out)

  • Replace oldest page
  • Simple, but suffers from Belady's anomaly

3. LRU (Least Recently Used)

  • Replace page not used for longest time
  • Good performance
  • Expensive to implement perfectly

4. Clock (Second Chance)

  • Approximation of LRU
  • Uses reference bit
  • Practical and effective

5. Working Set

  • Keep pages in process's working set
  • Based on temporal locality
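The policies above can be compared with a minimal simulation. The reference string here is illustrative (deliberately not the exercise sequence) and has strong temporal locality around page 1, which favors LRU:

```python
from collections import OrderedDict, deque

def lru_faults(refs, frames):
    """Count page faults under LRU replacement."""
    mem, faults = OrderedDict(), 0
    for page in refs:
        if page in mem:
            mem.move_to_end(page)           # Mark as most recently used
        else:
            faults += 1
            if len(mem) >= frames:
                mem.popitem(last=False)     # Evict least recently used
            mem[page] = None
    return faults

def fifo_faults(refs, frames):
    """Count page faults under FIFO replacement."""
    mem, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in mem:
            faults += 1
            if len(mem) >= frames:
                mem.discard(queue.popleft())  # Evict oldest page
            mem.add(page)
            queue.append(page)
    return faults

refs = [1, 2, 1, 3, 1, 4, 1, 5]
print(lru_faults(refs, 3), fifo_faults(refs, 3))  # LRU: 5 faults, FIFO: 6
```

LRU keeps the hot page 1 resident; FIFO evicts it once it becomes the oldest entry, causing an extra fault.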

Memory Performance Optimization

Cache-Friendly Code

1. Maximize Spatial Locality

// Bad: Column-major access in C (row-major language)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];  // Stride of N, poor cache use
    }
}

// Good: Row-major access
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];  // Sequential, excellent cache use
    }
}

2. Maximize Temporal Locality

// Bad: Two separate passes; by the second pass the data has been evicted
for (int i = 0; i < HUGE; i++) {
    process(array[i]);
}
for (int i = 0; i < HUGE; i++) {
    process_again(array[i]);
}

// Good: Process in cache-sized blocks, reusing data while it's resident
for (int block = 0; block < HUGE; block += BLOCK_SIZE) {
    for (int i = block; i < block + BLOCK_SIZE; i++) {
        process(array[i]);
        process_again(array[i]);  // Second use hits in cache
    }
}

3. Loop Tiling/Blocking

Matrix multiplication example:

// Standard (poor cache behavior for large matrices)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// Blocked (better cache behavior; assumes N is a multiple of BLOCK)
for (int ii = 0; ii < N; ii += BLOCK) {
    for (int jj = 0; jj < N; jj += BLOCK) {
        for (int kk = 0; kk < N; kk += BLOCK) {
            // Process block
            for (int i = ii; i < ii + BLOCK; i++) {
                for (int j = jj; j < jj + BLOCK; j++) {
                    for (int k = kk; k < kk + BLOCK; k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
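Tiling reorders the iterations but computes exactly the same result. This Python sketch (with N and BLOCK chosen so BLOCK divides N, matching the C code's assumption) checks the equivalence on a small case:

```python
def matmul_naive(A, B, n):
    """Standard triple-loop matrix multiply."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, n, block):
    """Tiled matrix multiply; assumes block divides n evenly."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                for i in range(ii, ii + block):
                    for j in range(jj, jj + block):
                        for k in range(kk, kk + block):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n, block = 4, 2
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 3 for j in range(n)] for i in range(n)]
assert matmul_naive(A, B, n) == matmul_blocked(A, B, n, block)
print("blocked result matches naive result")
```

In Python the tiling gains nothing (no hardware cache effect dominates interpreter overhead); the point is only that the transformation is result-preserving.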

Programming Examples

Python: Cache Performance Simulation

import random

class DirectMappedCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.cache = [None] * num_lines
        self.tags = [None] * num_lines
        self.valid = [False] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Access cache at given address"""
        # Calculate index and tag
        block_address = address // self.block_size
        index = block_address % self.num_lines
        tag = block_address // self.num_lines

        # Check for hit
        if self.valid[index] and self.tags[index] == tag:
            self.hits += 1
            return True  # Hit
        else:
            # Miss: load into cache
            self.misses += 1
            self.tags[index] = tag
            self.valid[index] = True
            return False  # Miss

    def get_stats(self):
        """Return cache statistics"""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'total': total,
            'hit_rate': hit_rate
        }

# Simulate different access patterns
cache = DirectMappedCache(num_lines=16, block_size=64)

print("=== Sequential Access (Good Locality) ===")
for addr in range(0, 1000, 4):  # Access every 4 bytes
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

# Reset
cache = DirectMappedCache(num_lines=16, block_size=64)

print("\n=== Random Access (Poor Locality) ===")
for _ in range(250):
    addr = random.randint(0, 10000)
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

# Reset
cache = DirectMappedCache(num_lines=16, block_size=64)

print("\n=== Strided Access (Moderate Locality) ===")
for i in range(100):
    addr = i * 256  # Large stride
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

Python: AMAT Calculator

def calculate_amat(l1_hit_time, l1_miss_rate,
                   l2_hit_time, l2_miss_rate,
                   memory_time):
    """Calculate Average Memory Access Time for 2-level cache"""

    # AMAT = L1_time + L1_miss_rate × (L2_time + L2_miss_rate × Mem_time)
    l2_access_time = l2_hit_time + l2_miss_rate * memory_time
    amat = l1_hit_time + l1_miss_rate * l2_access_time

    return amat

# Example system
l1_hit = 1        # 1 cycle
l1_miss = 0.05    # 5% miss rate
l2_hit = 10       # 10 cycles
l2_miss = 0.20    # 20% of L1 misses also miss L2
mem = 100         # 100 cycles

amat = calculate_amat(l1_hit, l1_miss, l2_hit, l2_miss, mem)
print(f"Average Memory Access Time: {amat:.2f} cycles")

# Compare configurations
print("\n=== Comparing Cache Configurations ===")
configs = [
    ("Small L1, No L2", 1, 0.10, 0, 0, 100),
    ("Large L1, No L2", 2, 0.05, 0, 0, 100),
    ("Small L1, L2", 1, 0.10, 10, 0.30, 100),
    ("Large L1, L2", 2, 0.05, 10, 0.20, 100),
]

for name, *params in configs:
    if params[2] == 0:  # No L2
        amat = params[0] + params[1] * params[4]
    else:
        amat = calculate_amat(*params)
    print(f"{name:20s}: {amat:6.2f} cycles")

Exercises

Basic Exercises

  1. Locality Analysis

    • Identify temporal and spatial locality in: for(i=0;i<n;i++) sum += arr[i];
    • Which has better spatial locality: row-major or column-major traversal?
    • Give an example of code with poor locality
  2. Cache Mapping

    • 4KB direct-mapped cache, 64-byte blocks. Where does address 0x4C8 map?
    • How many bits for tag, index, offset? (Assume 32-bit addresses)
    • What addresses conflict with 0x4C8?
  3. Hit Rate Calculation

    • 100 accesses: 92 hits, 8 misses. Calculate hit rate and miss rate.
    • Calculate AMAT: hit time=2ns, miss penalty=100ns, hit rate=95%
    • If miss rate decreases from 10% to 5%, how much does AMAT improve?

Intermediate Exercises

  1. Cache Design

    • Design a 4-way set-associative cache: 16KB, 32-byte blocks
    • How many sets? How many bits for tag/index/offset?
    • Compare with direct-mapped cache of same size
  2. Virtual Memory

    • 32-bit virtual address, 4KB pages. How many page table entries?
    • Calculate overhead of single-level vs 2-level page table
    • Trace translation for virtual address 0x12345678 (show page number, offset)
  3. TLB Performance

    • TLB hit time=1ns, miss penalty=20ns, hit rate=98%. Calculate AMAT for address translation.
    • If TLB is enlarged to improve hit rate to 99.5%, what's the speedup?
    • Why is TLB hit rate typically higher than cache hit rate?

Advanced Exercises

  1. Cache Optimization

    • Optimize matrix multiplication for cache (assume 8KB L1, 64-byte lines)
    • Calculate ideal block size for tiling
    • Estimate speedup from optimization
  2. Memory Hierarchy Analysis

    • System: L1(1 cycle, 5% miss), L2(10 cycles, 20% miss), RAM(100 cycles)
    • Calculate overall AMAT
    • Program runs 1B instructions, 30% are memory accesses. Calculate total memory time.
    • If L2 is doubled in size (miss rate → 10%), what's the speedup?
  3. Page Replacement

    • Simulate LRU for access sequence: 1,2,3,4,1,2,5,1,2,3,4,5 (3 frames)
    • Calculate page fault rate
    • Compare with FIFO for same sequence
  4. System Design

    • Design a cache hierarchy for: 3GHz CPU, 100ns RAM, 4GB memory
    • Choose cache sizes, associativity, block sizes
    • Justify your design with calculations
    • Estimate performance for typical workload

Summary

In this reading, we explored the memory hierarchy and caching:

Key Concepts:

  • Memory hierarchy balances speed, size, and cost
  • Locality (temporal and spatial) makes caching effective
  • Cache organization: direct-mapped, set-associative, fully associative
  • Virtual memory provides isolation, protection, and larger address spaces
  • TLB accelerates address translation
  • Page replacement algorithms manage limited physical memory

Important Skills:

  • Calculating cache performance (hit rate, AMAT)
  • Understanding cache mapping and address breakdown
  • Tracing virtual-to-physical address translation
  • Writing cache-friendly code

Why This Matters:

  • Memory access is the dominant performance bottleneck in modern systems
  • Exploiting locality can make memory-bound code 10-100× faster
  • Essential for systems programming, performance tuning, and architecture design
  • Foundation for understanding multithreading, coherency, and distributed systems

Further Reading

  • "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson
  • Cache coherence protocols (MESI, MOESI)
  • NUMA (Non-Uniform Memory Access) architectures
  • Memory consistency models

Next Steps

Now that you understand how memory works, you're ready to learn assembly language and see how high-level code translates to machine operations.

Next Reading: 05-assembly.md - Assembly Language Basics


Module 5: Computer Architecture | Reading 4 of 5