Memory Hierarchy

Learning Objectives

By the end of this reading, you will be able to:

  • Understand the memory hierarchy and the trade-offs between speed, size, and cost
  • Explain how cache memory improves system performance
  • Understand cache organization: direct-mapped, set-associative, and fully associative
  • Apply principles of temporal and spatial locality
  • Understand virtual memory and address translation
  • Explain page tables, TLB, and memory management
  • Calculate cache hit rates and effective access times
  • Recognize memory optimization strategies in programming

Introduction

Memory is organized as a hierarchy, with faster but smaller storage closer to the CPU and slower but larger storage farther away. Understanding this hierarchy is crucial for writing efficient programs.

Why memory hierarchy matters:

  • The CPU-memory speed gap is enormous (CPU is 100+ times faster than RAM)
  • Well-designed memory systems can approach the speed of the fastest memory at the cost of the cheapest
  • Memory access patterns dramatically affect program performance
  • Understanding memory helps you write cache-friendly code

The Memory Hierarchy Pyramid

              ▲ Faster, Smaller, More Expensive (per byte)
              │
          ┌────┴─────┐
          │ Registers│ ← ~1 cycle, ~100 bytes
          ├──────────┤
          │ L1 Cache │ ← ~4 cycles, ~64 KB
          ├──────────┤
          │ L2 Cache │ ← ~10 cycles, ~256 KB - 1 MB
          ├──────────┤
          │ L3 Cache │ ← ~40 cycles, ~8 MB - 32 MB
          ├──────────┤
          │ Main Mem.│ ← ~100-300 cycles, ~8-64 GB
          │   (RAM)  │
          ├──────────┤
          │   SSD    │ ← ~10,000+ cycles, ~256 GB - 2 TB
          ├──────────┤
          │ Hard Disk│ ← ~1,000,000+ cycles, ~1-8 TB
          └────┬─────┘
              │
              ▼ Slower, Larger, Cheaper (per byte)

Characteristics by Level

┌───────────┬─────────┬──────────┬────────────┬──────────┐
│ Level     │ Size    │ Speed    │ Cost/GB    │ Tech     │
├───────────┼─────────┼──────────┼────────────┼──────────┤
│ Registers │ < 1 KB  │ 0.25 ns  │ N/A        │ SRAM     │
│ L1 Cache  │ 32-64KB │ 1 ns     │ Very High  │ SRAM     │
│ L2 Cache  │ 256KB-  │ 3-10 ns  │ Very High  │ SRAM     │
│           │ 1MB     │          │            │          │
│ L3 Cache  │ 4-32MB  │ 10-20 ns │ High       │ SRAM     │
│ Main Mem. │ 4-64GB  │ 50-100ns │ ~$5        │ DRAM     │
│ SSD       │ 256GB+  │ 10-100μs │ ~$0.10     │ Flash    │
│ HDD       │ 1TB+    │ 5-10 ms  │ ~$0.02     │ Magnetic │
└───────────┴─────────┴──────────┴────────────┴──────────┘

Key Principles

1. Principle of Locality

  • Programs tend to access a small portion of their address space at any time
  • Well-written programs exhibit good locality

2. Principle of Inclusion

  • Data in higher levels is typically a subset of data in lower levels
  • L1 ⊂ L2 ⊂ L3 ⊂ RAM ⊂ Disk

3. Caching Principle

  • Keep frequently accessed data in faster, smaller memory
  • Brings performance closer to the fast memory at the cost of the slow memory

Types of Locality

Temporal Locality

Principle: If data is accessed, it's likely to be accessed again soon.

Example:

int sum = 0;
for (int i = 0; i < n; i++) {
    sum += array[i];  // 'sum' accessed repeatedly
}

Variable sum is accessed in every iteration → high temporal locality.

Exploitation: Keep recently accessed data in cache.

Spatial Locality

Principle: If data is accessed, nearby data is likely to be accessed soon.

Example:

int array[1000];
for (int i = 0; i < 1000; i++) {
    array[i] = 0;  // Sequential access: array[0], array[1], array[2]...
}

Accessing array[i] suggests array[i+1] will be accessed next.

Exploitation: Fetch data in blocks (cache lines).

Examples of Good vs Poor Locality

Good Spatial Locality (Row-major):

// C stores arrays in row-major order
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        array[i][j] = 0;  // Sequential memory access
    }
}

Poor Spatial Locality (Column-major in C):

for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        array[i][j] = 0;  // Strided access, poor cache performance
    }
}

Cache Memory

A cache is a small, fast memory placed between the CPU and main memory.

Cache Terminology

┌──────────────────────────────────────────┐
│            Cache Line / Block            │
│ ┌─────┬──────────────────────┬─────────┐ │
│ │ Tag │    Data Block        │  Valid  │ │
│ │     │  (e.g., 64 bytes)    │   Bit   │ │
│ └─────┴──────────────────────┴─────────┘ │
└──────────────────────────────────────────┘

Block/Line: Unit of transfer between cache and memory (typically 64 bytes)

Tag: Identifies which memory block is in this cache line

Valid Bit: Indicates if the cache line contains valid data

Dirty Bit: Indicates if data has been modified (write-back caches)

Cache Operations

Cache Hit: Data found in cache

CPU requests address 0x1000
→ Check cache
→ Found! (HIT)
→ Return data in ~1-10 cycles

Cache Miss: Data not in cache

CPU requests address 0x2000
→ Check cache
→ Not found (MISS)
→ Fetch from main memory (~100+ cycles)
→ Store in cache
→ Return data

Types of Misses:

  1. Compulsory Miss (Cold Miss): First access to data
  2. Capacity Miss: Cache is too small to hold all needed data
  3. Conflict Miss: Multiple addresses map to same cache location

Cache Performance Metrics

Hit Rate: Percentage of accesses found in cache

Hit Rate = Hits / (Hits + Misses)

Miss Rate: Percentage of accesses not found in cache

Miss Rate = 1 - Hit Rate = Misses / (Hits + Misses)

Average Memory Access Time (AMAT):

AMAT = Hit Time + (Miss Rate × Miss Penalty)

Example:
Hit Time = 1 cycle
Miss Penalty = 100 cycles
Hit Rate = 95%

AMAT = 1 + (0.05 × 100) = 1 + 5 = 6 cycles
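The arithmetic above can be checked with a couple of lines of Python:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time for a single cache level (in cycles)."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 100))  # 1 + 0.05 * 100 = 6.0 cycles
```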

Cache Organization

1. Direct-Mapped Cache

Each memory address maps to exactly one cache line.

Mapping:

Cache Line Index = (Address / Block Size) % Number of Cache Lines

Address Breakdown:

┌────────┬────────────┬──────────────┐
│  Tag   │   Index    │ Block Offset │
└────────┴────────────┴──────────────┘
│        │            │
│        │            └─ Byte within block
│        └─────────────── Cache line number
└──────────────────────── Identifies which memory block

Example: 16-line cache, 4-byte blocks

Address = 0x4C (binary: 0000 0000 0100 1100)

Block size = 4 bytes = 2² → 2-bit offset
Cache lines = 16 = 2⁴ → 4-bit index
Tag = remaining bits

     Tag      Index  Offset
┌──────────┬──────┬────┐
│0000000001│ 0011 │ 00 │
└──────────┴──────┴────┘
      1        3     0

Maps to cache line 3

Direct-Mapped Cache Structure:

Index  Valid  Tag      Data
───────────────────────────────
  0     1    0x02    [data...]
  1     0    ----    [data...]
  2     1    0x15    [data...]
  3     1    0x01    [data...]  ← Address 0x4C maps here
  4     0    ----    [data...]
  ...
  15    1    0x0A    [data...]
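The mapping arithmetic can be reproduced directly; this sketch computes the same tag 1, line 3, offset 0 as the worked example above:

```python
def split_address(addr, block_size, num_lines):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr % block_size
    block_address = addr // block_size
    index = block_address % num_lines   # Which cache line
    tag = block_address // num_lines    # Which memory block in that line
    return tag, index, offset

# Address 0x4C with 4-byte blocks and 16 lines, as in the worked example
print(split_address(0x4C, 4, 16))  # (1, 3, 0): tag 1, line 3, offset 0
```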

Advantages:

  • Simple and fast
  • Low hardware cost

Disadvantages:

  • High conflict miss rate
  • Poor utilization if multiple addresses map to same line

2. Fully Associative Cache

Any memory block can go in any cache line.

Operation:

  • Search all cache lines in parallel
  • Compare tag with all tags simultaneously

Structure:

Line  Valid  Tag      Data
────────────────────────────────
  0    1    0x1234   [data...]
  1    1    0x5678   [data...]  ← Any address can be here
  2    0    ------   [data...]
  3    1    0xABCD   [data...]
  ...

Address Breakdown:

┌──────────────────┬──────────────┐
│       Tag        │ Block Offset │
└──────────────────┴──────────────┘
        │                 │
        │                 └─ Byte within block
        └─────────────────── Identifies memory block

Advantages:

  • No conflict misses
  • Best cache utilization

Disadvantages:

  • Expensive (needs comparators for all lines)
  • Slower (parallel search)
  • Needs replacement policy

3. Set-Associative Cache

Compromise between direct-mapped and fully associative.

N-way set associative: Each address maps to a set of N lines.

Example: 2-way set-associative

Set      Way 0                 Way 1
    ┌──┬────┬──────┐   ┌──┬────┬──────┐
 0  │V │Tag │ Data │   │V │Tag │ Data │
    ├──┼────┼──────┤   ├──┼────┼──────┤
 1  │V │Tag │ Data │   │V │Tag │ Data │
    ├──┼────┼──────┤   ├──┼────┼──────┤
 2  │V │Tag │ Data │   │V │Tag │ Data │
    ├──┼────┼──────┤   ├──┼────┼──────┤
 3  │V │Tag │ Data │   │V │Tag │ Data │
    └──┴────┴──────┘   └──┴────┴──────┘

Address Breakdown:

┌────────┬────────────┬──────────────┐
│  Tag   │ Set Index  │ Block Offset │
└────────┴────────────┴──────────────┘

Operation:

  1. Use set index to select set
  2. Compare tag with all ways in the set (parallel)
  3. If match, return data from matching way

Advantages:

  • Fewer conflict misses than direct-mapped
  • Less expensive than fully associative
  • Good balance of performance and cost

Common Configurations:

  • 2-way, 4-way, 8-way, 16-way set-associative

Cache Replacement Policies

When every candidate line is occupied, which one should be evicted? (A direct-mapped cache has no choice; replacement policies apply to associative caches.)

1. LRU (Least Recently Used)

  • Evict the line that hasn't been used for the longest time
  • Good performance, exploits temporal locality
  • Requires tracking access history

2. FIFO (First In, First Out)

  • Evict the oldest line
  • Simple to implement
  • Doesn't consider access patterns

3. Random

  • Evict a random line
  • Very simple, surprisingly effective
  • No bookkeeping required

4. LFU (Least Frequently Used)

  • Evict the line used least often
  • Requires usage counters
  • Can be ineffective if access patterns change
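As an illustrative sketch (not modeled on any particular hardware), a set-associative cache with LRU replacement can be simulated with one ordered dict per set:

```python
from collections import OrderedDict

class SetAssociativeCache:
    """N-way set-associative cache with LRU replacement (sketch)."""
    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # Each set maps tag -> None, ordered from least to most recently used
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)        # Refresh LRU order on a hit
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # Evict the least recently used way
        s[tag] = None
        return False                  # Miss

cache = SetAssociativeCache(num_sets=4, ways=2, block_size=64)
# Address 0 survives two conflicting blocks (4096, 8192) thanks to LRU
hits = sum(cache.access(a) for a in [0, 64, 0, 4096, 0, 8192, 0])
print(hits)  # 3 hits
```

With 2 ways per set, address 0 keeps hitting even though 4096 and 8192 map to the same set; a direct-mapped cache would conflict-miss on every return to 0.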

Write Policies

What should happen when the CPU writes to an address that is in the cache?

Write-Through

Strategy: Write to both cache and memory simultaneously.

CPU writes to address 0x1000
  ↓
Write to cache (fast)
  ↓
Write to memory (slow) ← Performance bottleneck

Advantages:

  • Memory always up-to-date
  • Simpler consistency

Disadvantages:

  • Every write is slow
  • High memory traffic

Optimization: A write buffer queues pending writes to memory so the CPU can continue without waiting for each one.

Write-Back

Strategy: Write only to cache; write to memory when line is evicted.

CPU writes to address 0x1000
  ↓
Write to cache only (fast)
  ↓
Set dirty bit
  ↓
(Later, when evicted)
  ↓
Write to memory if dirty

Advantages:

  • Faster writes
  • Reduced memory traffic
  • Multiple writes to same location only written to memory once

Disadvantages:

  • Memory may be stale
  • More complex
  • Needs dirty bit
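A toy model (hypothetical counts, one cached address) shows why write-back reduces memory traffic when the same location is written repeatedly:

```python
def memory_writes(policy, num_writes):
    """Memory-bus writes for repeated writes to one cached address (toy model)."""
    if policy == "write-through":
        return num_writes   # Every write also goes to memory
    elif policy == "write-back":
        return 1            # Only the eventual eviction writes the dirty line back
    raise ValueError(policy)

for p in ("write-through", "write-back"):
    print(p, memory_writes(p, 1000))
# write-through performs 1000 memory writes; write-back just 1 (on eviction)
```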

Write Miss Policies

What if write misses in cache?

Write-Allocate (Fetch on Write):

  • Fetch block from memory into cache
  • Then write to cache
  • Common with write-back

No-Write-Allocate (Write Around):

  • Write directly to memory
  • Don't fetch into cache
  • Common with write-through

Multi-Level Caches

Modern CPUs have multiple cache levels.

┌───────────────────────────────────┐
│            CPU Core               │
├───────────────────────────────────┤
│  L1 Instruction    L1 Data        │
│  Cache (32KB)      Cache (32KB)   │ ← 3-4 cycles
├───────────────────────────────────┤
│       L2 Unified Cache            │
│           (256KB)                 │ ← 10-12 cycles
├───────────────────────────────────┤
│       L3 Shared Cache             │
│         (8-32MB)                  │ ← 30-40 cycles
└────────────┬──────────────────────┘
             │
             ▼
        Main Memory (RAM)             ← 100+ cycles
           (8-64GB)

Inclusion Policies

Inclusive Cache:

  • Data in L1 is also in L2, L3
  • Easier coherency
  • Some duplication

Exclusive Cache:

  • Data in L1 is NOT in L2
  • Better total capacity
  • More complex

Non-Inclusive:

  • No guarantee either way
  • Most flexible

Multi-Level AMAT

AMAT = L1_Hit_Time
       + L1_Miss_Rate × (L2_Hit_Time + L2_Miss_Rate × Memory_Time)

Example:
L1: 1 cycle, 95% hit rate
L2: 10 cycles, 80% hit rate (of L1 misses)
Memory: 100 cycles

AMAT = 1 + 0.05 × (10 + 0.20 × 100)
     = 1 + 0.05 × (10 + 20)
     = 1 + 0.05 × 30
     = 1 + 1.5
     = 2.5 cycles (average)

Virtual Memory

Virtual memory provides each process with its own address space.

Motivation

Problems with physical addresses:

  • Programs must know exact memory locations
  • Can't move programs in memory
  • No protection between processes
  • Limited by physical memory size

Virtual memory solves:

  • Process isolation
  • Memory protection
  • Larger address space than physical RAM
  • Simplified programming

Address Translation

Virtual Address → [Translation] → Physical Address

Virtual Address: Used by programs
Physical Address: Actual location in RAM

Example:

Process A sees: 0x0000 - 0xFFFF (virtual)
    ↓
Maps to: 0x10000 - 0x1FFFF (physical)

Process B sees: 0x0000 - 0xFFFF (virtual)
    ↓
Maps to: 0x20000 - 0x2FFFF (physical)

Both processes use same virtual addresses,
but access different physical memory!

Paging

Memory is divided into fixed-size pages.

Page size: Typically 4 KB (4096 bytes)

Virtual Memory:           Physical Memory:
┌──────────┐             ┌──────────┐
│  Page 0  │             │ Frame 5  │
├──────────┤             ├──────────┤
│  Page 1  │───────┐     │ Frame 2  │
├──────────┤       │     ├──────────┤
│  Page 2  │       └────►│ Frame 7  │
├──────────┤             ├──────────┤
│  Page 3  │             │ Frame 1  │
└──────────┘             └──────────┘

Virtual pages map to physical frames

Page Table

Page Table: Data structure mapping virtual pages to physical frames.

Virtual Page # → Page Table → Physical Frame #

Page Table Entry (PTE):

┌──────┬─────┬─────┬──────┬──────┬──────────────┐
│Valid │ Ref │Dirty│ R/W  │ U/S  │ Frame Number │
└──────┴─────┴─────┴──────┴──────┴──────────────┘
  │      │     │      │      │            │
  │      │     │      │      │            └─ Physical frame
  │      │     │      │      └─ User/Supervisor
  │      │     │      └─ Read/Write permission
  │      │     └─ Modified (dirty bit)
  │      └─ Referenced recently
  └─ Page present in memory

Address Translation Example:

Virtual Address: 32-bit, 4KB pages

┌────────────────────┬──────────────┐
│  Page Number (20)  │ Offset (12)  │
└────────────────────┴──────────────┘
         │                  │
         ▼                  │
    Page Table              │
         │                  │
         ▼                  │
   Frame Number             │
         │                  │
         ▼                  ▼
┌────────────────────┬──────────────┐
│  Frame Number (20) │ Offset (12)  │
└────────────────────┴──────────────┘
     Physical Address
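For the 32-bit, 4 KB-page layout above, the split is simple bit arithmetic (the sample address here is arbitrary):

```python
PAGE_SIZE = 4096  # 4 KB pages → 12 offset bits

def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a 4 KB-page system."""
    return vaddr >> 12, vaddr & (PAGE_SIZE - 1)

vpn, offset = split_virtual_address(0xCAFEB123)
print(hex(vpn), hex(offset))  # 0xcafeb 0x123
```

The offset passes through translation unchanged; only the page number is looked up in the page table.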

Multi-Level Page Tables

Problem: Single-level page table is huge!

  • 32-bit address, 4KB pages: 2²⁰ entries × 4 bytes = 4MB per process
  • 64-bit address: impractically large (a flat table for even a 48-bit address space would need 2³⁶ entries)

Solution: Multi-level page tables (typically 2-4 levels)

┌──────────┬──────────┬─────────┐
│  Level 1 │  Level 2 │ Offset  │
│  (10 bit)│  (10 bit)│ (12 bit)│
└──────────┴──────────┴─────────┘
      │          │
      ▼          │
  ┌───────┐      │
  │  PD   │──────┘
  └───────┘      │
                 ▼
             ┌───────┐
             │  PT   │────►┌───────┐
             └───────┘     │ Frame │
                           └───────┘

(Classic two-level 32-bit x86 layout: PD = page directory, PT = page table.)

Benefits:

  • Only allocate page tables for used memory
  • Much smaller memory footprint
  • Still provides full address space
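For the classic 32-bit x86 two-level scheme (10-bit directory index, 10-bit table index, 12-bit offset), the three fields fall out of simple shifts and masks; the sample address is arbitrary:

```python
def split_two_level(vaddr):
    """Split a 32-bit virtual address into (dir_index, table_index, offset)."""
    offset = vaddr & 0xFFF                 # Low 12 bits: byte within the 4 KB page
    table_index = (vaddr >> 12) & 0x3FF    # Next 10 bits: entry in the page table
    dir_index = (vaddr >> 22) & 0x3FF      # Top 10 bits: entry in the page directory
    return dir_index, table_index, offset

print([hex(x) for x in split_two_level(0xC0ABCDEF)])  # ['0x302', '0x2bc', '0xdef']
```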

Translation Lookaside Buffer (TLB)

Problem: Page table in memory → Every memory access requires 2+ memory accesses!

Solution: TLB caches recent page table entries.

Virtual Address
      │
      ▼
 ┌──────────┐
 │   TLB    │  ← Cache of page table entries
 │  (fast)  │
 └────┬─────┘
      │
    Hit? ── Yes → Physical Address (fast!)
      │
      No
      ▼
 ┌──────────┐
 │Page Table│  ← Access page table in memory
 │ (slow)   │
 └────┬─────┘
      │
      ▼
 Physical Address
 Update TLB

TLB Characteristics:

  • Small (64-512 entries)
  • Fully associative
  • Very high hit rate (>99%)

TLB Miss Handling:

1. Hardware walks page table (x86)
   OR
2. Software-managed: trap to OS handler (MIPS, Alpha)

Page Faults

Page Fault: Requested page not in physical memory.

Handling:

1. CPU accesses virtual address
2. Page table shows "not present"
3. Hardware raises page fault exception
4. OS page fault handler:
   a. Find page on disk
   b. Find free frame (evict if necessary)
   c. Read page from disk to frame
   d. Update page table
5. Resume instruction

Demand Paging: Pages loaded only when accessed (lazy loading).

Page Replacement Algorithms

When physical memory is full, which page should be evicted?

1. Optimal (OPT)

  • Replace page that won't be used for longest time
  • Impossible to implement (requires future knowledge)
  • Theoretical best case for comparison

2. FIFO (First In, First Out)

  • Replace oldest page
  • Simple, but suffers from Belady's anomaly

3. LRU (Least Recently Used)

  • Replace page not used for longest time
  • Good performance
  • Expensive to implement perfectly

4. Clock (Second Chance)

  • Approximation of LRU
  • Uses reference bit
  • Practical and effective

5. Working Set

  • Keep pages in process's working set
  • Based on temporal locality
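The policies above can be compared with a minimal simulation. The reference string here is illustrative (deliberately not the exercise sequence) and has strong temporal locality around page 1, which favors LRU:

```python
from collections import OrderedDict, deque

def lru_faults(refs, frames):
    """Count page faults under LRU replacement."""
    mem, faults = OrderedDict(), 0
    for page in refs:
        if page in mem:
            mem.move_to_end(page)           # Mark as most recently used
        else:
            faults += 1
            if len(mem) >= frames:
                mem.popitem(last=False)     # Evict least recently used
            mem[page] = None
    return faults

def fifo_faults(refs, frames):
    """Count page faults under FIFO replacement."""
    mem, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in mem:
            faults += 1
            if len(mem) >= frames:
                mem.discard(queue.popleft())  # Evict oldest page
            mem.add(page)
            queue.append(page)
    return faults

refs = [1, 2, 1, 3, 1, 4, 1, 5]
print(lru_faults(refs, 3), fifo_faults(refs, 3))  # LRU: 5 faults, FIFO: 6
```

LRU keeps the hot page 1 resident; FIFO evicts it once it becomes the oldest entry, causing an extra fault.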

Memory Performance Optimization

Cache-Friendly Code

1. Maximize Spatial Locality

// Bad: Column-major access in C (row-major language)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];  // Stride of N, poor cache use
    }
}

// Good: Row-major access
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];  // Sequential, excellent cache use
    }
}

2. Maximize Temporal Locality

// Bad: Two separate passes; by the second pass the data has been evicted
for (int i = 0; i < HUGE; i++) {
    process(array[i]);
}
for (int i = 0; i < HUGE; i++) {
    process_again(array[i]);
}

// Good: Process in cache-sized blocks, reusing data while it's resident
for (int block = 0; block < HUGE; block += BLOCK_SIZE) {
    for (int i = block; i < block + BLOCK_SIZE; i++) {
        process(array[i]);
        process_again(array[i]);  // Second use hits in cache
    }
}

3. Loop Tiling/Blocking

Matrix multiplication example:

// Standard (poor cache behavior for large matrices)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// Blocked (better cache behavior; assumes N is a multiple of BLOCK)
for (int ii = 0; ii < N; ii += BLOCK) {
    for (int jj = 0; jj < N; jj += BLOCK) {
        for (int kk = 0; kk < N; kk += BLOCK) {
            // Process block
            for (int i = ii; i < ii + BLOCK; i++) {
                for (int j = jj; j < jj + BLOCK; j++) {
                    for (int k = kk; k < kk + BLOCK; k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
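Tiling reorders the iterations but computes exactly the same result. This Python sketch (with N and BLOCK chosen so BLOCK divides N, matching the C code's assumption) checks the equivalence on a small case:

```python
def matmul_naive(A, B, n):
    """Standard triple-loop matrix multiply."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, n, block):
    """Tiled matrix multiply; assumes block divides n evenly."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                for i in range(ii, ii + block):
                    for j in range(jj, jj + block):
                        for k in range(kk, kk + block):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n, block = 4, 2
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 3 for j in range(n)] for i in range(n)]
assert matmul_naive(A, B, n) == matmul_blocked(A, B, n, block)
print("blocked result matches naive result")
```

In Python the tiling gains nothing (no hardware cache effect dominates interpreter overhead); the point is only that the transformation is result-preserving.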

Programming Examples

Python: Cache Performance Simulation

import random

class DirectMappedCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.cache = [None] * num_lines
        self.tags = [None] * num_lines
        self.valid = [False] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Access cache at given address"""
        # Calculate index and tag
        block_address = address // self.block_size
        index = block_address % self.num_lines
        tag = block_address // self.num_lines

        # Check for hit
        if self.valid[index] and self.tags[index] == tag:
            self.hits += 1
            return True  # Hit
        else:
            # Miss: load into cache
            self.misses += 1
            self.tags[index] = tag
            self.valid[index] = True
            return False  # Miss

    def get_stats(self):
        """Return cache statistics"""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'total': total,
            'hit_rate': hit_rate
        }

# Simulate different access patterns
cache = DirectMappedCache(num_lines=16, block_size=64)

print("=== Sequential Access (Good Locality) ===")
for addr in range(0, 1000, 4):  # Access every 4 bytes
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

# Reset
cache = DirectMappedCache(num_lines=16, block_size=64)

print("\n=== Random Access (Poor Locality) ===")
for _ in range(250):
    addr = random.randint(0, 10000)
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

# Reset
cache = DirectMappedCache(num_lines=16, block_size=64)

print("\n=== Strided Access (Moderate Locality) ===")
for i in range(100):
    addr = i * 256  # Large stride
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

Python: AMAT Calculator

def calculate_amat(l1_hit_time, l1_miss_rate,
                   l2_hit_time, l2_miss_rate,
                   memory_time):
    """Calculate Average Memory Access Time for 2-level cache"""

    # AMAT = L1_time + L1_miss_rate × (L2_time + L2_miss_rate × Mem_time)
    l2_access_time = l2_hit_time + l2_miss_rate * memory_time
    amat = l1_hit_time + l1_miss_rate * l2_access_time

    return amat

# Example system
l1_hit = 1        # 1 cycle
l1_miss = 0.05    # 5% miss rate
l2_hit = 10       # 10 cycles
l2_miss = 0.20    # 20% of L1 misses also miss L2
mem = 100         # 100 cycles

amat = calculate_amat(l1_hit, l1_miss, l2_hit, l2_miss, mem)
print(f"Average Memory Access Time: {amat:.2f} cycles")

# Compare configurations
print("\n=== Comparing Cache Configurations ===")
configs = [
    ("Small L1, No L2", 1, 0.10, 0, 0, 100),
    ("Large L1, No L2", 2, 0.05, 0, 0, 100),
    ("Small L1, L2", 1, 0.10, 10, 0.30, 100),
    ("Large L1, L2", 2, 0.05, 10, 0.20, 100),
]

for name, *params in configs:
    if params[2] == 0:  # No L2
        amat = params[0] + params[1] * params[4]
    else:
        amat = calculate_amat(*params)
    print(f"{name:20s}: {amat:6.2f} cycles")

Exercises

Basic Exercises

  1. Locality Analysis

    • Identify temporal and spatial locality in: for(i=0;i<n;i++) sum += arr[i];
    • Which has better spatial locality: row-major or column-major traversal?
    • Give an example of code with poor locality
  2. Cache Mapping

    • 4KB direct-mapped cache, 64-byte blocks. Where does address 0x4C8 map?
    • How many bits for tag, index, offset? (Assume 32-bit addresses)
    • What addresses conflict with 0x4C8?
  3. Hit Rate Calculation

    • 100 accesses: 92 hits, 8 misses. Calculate hit rate and miss rate.
    • Calculate AMAT: hit time=2ns, miss penalty=100ns, hit rate=95%
    • If miss rate decreases from 10% to 5%, how much does AMAT improve?

Intermediate Exercises

  1. Cache Design

    • Design a 4-way set-associative cache: 16KB, 32-byte blocks
    • How many sets? How many bits for tag/index/offset?
    • Compare with direct-mapped cache of same size
  2. Virtual Memory

    • 32-bit virtual address, 4KB pages. How many page table entries?
    • Calculate overhead of single-level vs 2-level page table
    • Trace translation for virtual address 0x12345678 (show page number, offset)
  3. TLB Performance

    • TLB hit time=1ns, miss penalty=20ns, hit rate=98%. Calculate AMAT for address translation.
    • If TLB is enlarged to improve hit rate to 99.5%, what's the speedup?
    • Why is TLB hit rate typically higher than cache hit rate?

Advanced Exercises

  1. Cache Optimization

    • Optimize matrix multiplication for cache (assume 8KB L1, 64-byte lines)
    • Calculate ideal block size for tiling
    • Estimate speedup from optimization
  2. Memory Hierarchy Analysis

    • System: L1(1 cycle, 5% miss), L2(10 cycles, 20% miss), RAM(100 cycles)
    • Calculate overall AMAT
    • Program runs 1B instructions, 30% are memory accesses. Calculate total memory time.
    • If L2 is doubled in size (miss rate → 10%), what's the speedup?
  3. Page Replacement

    • Simulate LRU for access sequence: 1,2,3,4,1,2,5,1,2,3,4,5 (3 frames)
    • Calculate page fault rate
    • Compare with FIFO for same sequence
  4. System Design

    • Design a cache hierarchy for: 3GHz CPU, 100ns RAM, 4GB memory
    • Choose cache sizes, associativity, block sizes
    • Justify your design with calculations
    • Estimate performance for typical workload

Summary

In this reading, we explored the memory hierarchy and caching:

Key Concepts:

  • Memory hierarchy balances speed, size, and cost
  • Locality (temporal and spatial) makes caching effective
  • Cache organization: direct-mapped, set-associative, fully associative
  • Virtual memory provides isolation, protection, and larger address spaces
  • TLB accelerates address translation
  • Page replacement algorithms manage limited physical memory

Important Skills:

  • Calculating cache performance (hit rate, AMAT)
  • Understanding cache mapping and address breakdown
  • Tracing virtual-to-physical address translation
  • Writing cache-friendly code

Why This Matters:

  • Memory access is the dominant performance bottleneck in modern systems
  • Exploiting locality can make memory-bound code 10-100× faster
  • Essential for systems programming, performance tuning, and architecture design
  • Foundation for understanding multithreading, coherency, and distributed systems

Further Reading

  • "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson
  • Cache coherence protocols (MESI, MOESI)
  • NUMA (Non-Uniform Memory Access) architectures
  • Memory consistency models

Next Steps

Now that you understand how memory works, you're ready to learn assembly language and see how high-level code translates to machine operations.

Next Reading: 05-assembly.md - Assembly Language Basics


Module 5: Computer Architecture | Reading 4 of 5