Memory Hierarchy
Learning Objectives
By the end of this reading, you will be able to:
- Understand the memory hierarchy and the trade-offs between speed, size, and cost
- Explain how cache memory improves system performance
- Understand cache organization: direct-mapped, set-associative, and fully associative
- Apply principles of temporal and spatial locality
- Understand virtual memory and address translation
- Explain page tables, TLB, and memory management
- Calculate cache hit rates and effective access times
- Recognize memory optimization strategies in programming
Introduction
Memory is organized as a hierarchy, with faster but smaller storage closer to the CPU and slower but larger storage farther away. Understanding this hierarchy is crucial for writing efficient programs.
Why memory hierarchy matters:
- The CPU-memory speed gap is enormous (a single main-memory access can cost 100+ CPU cycles)
- Well-designed memory systems can approach the speed of the fastest memory at the cost of the cheapest
- Memory access patterns dramatically affect program performance
- Understanding memory helps you write cache-friendly code
The Memory Hierarchy Pyramid
▲ Faster, Smaller, More Expensive (per byte)
│
┌────┴────┐
│ Registers│ ← ~1 cycle, ~100 bytes
├─────────┤
│ L1 Cache │ ← ~4 cycles, ~64 KB
├─────────┤
│ L2 Cache │ ← ~10 cycles, ~256 KB - 1 MB
├─────────┤
│ L3 Cache │ ← ~40 cycles, ~8 MB - 32 MB
├─────────┤
│Main Mem. │ ← ~100-300 cycles, ~8-64 GB
│ (RAM) │
├─────────┤
│ SSD │ ← ~10,000+ cycles, ~256 GB - 2 TB
├─────────┤
│Hard Disk│ ← ~1,000,000+ cycles, ~1-8 TB
└─────────┘
│
▼ Slower, Larger, Cheaper (per byte)
Characteristics by Level
┌───────────┬─────────┬──────────┬────────────┬──────────┐
│ Level │ Size │ Speed │ Cost/GB │ Tech │
├───────────┼─────────┼──────────┼────────────┼──────────┤
│ Registers │ < 1 KB │ 0.25 ns │ N/A │ SRAM │
│ L1 Cache │ 32-64KB │ 1 ns │ Very High │ SRAM │
│ L2 Cache │ 256KB- │ 3-10 ns │ Very High │ SRAM │
│ │ 1MB │ │ │ │
│ L3 Cache │ 4-32MB │ 10-20 ns │ High │ SRAM │
│ Main Mem. │ 4-64GB │ 50-100ns │ ~$5 │ DRAM │
│ SSD │ 256GB+ │ 10-100μs │ ~$0.10 │ Flash │
│ HDD │ 1TB+ │ 5-10 ms │ ~$0.02 │ Magnetic │
└───────────┴─────────┴──────────┴────────────┴──────────┘
Key Principles
1. Principle of Locality
- Programs tend to access a small portion of their address space at any time
- Well-written programs exhibit good locality
2. Principle of Inclusion
- Data in higher levels is typically a subset of data in lower levels
- L1 ⊂ L2 ⊂ L3 ⊂ RAM ⊂ Disk
3. Caching Principle
- Keep frequently accessed data in faster, smaller memory
- Brings performance closer to the fast memory at the cost of the slow memory
Types of Locality
Temporal Locality
Principle: If data is accessed, it's likely to be accessed again soon.
Example:
int sum = 0;
for (int i = 0; i < n; i++) {
    sum += array[i];  // 'sum' accessed repeatedly
}
Variable sum is accessed in every iteration → high temporal locality.
Exploitation: Keep recently accessed data in cache.
Spatial Locality
Principle: If data is accessed, nearby data is likely to be accessed soon.
Example:
int array[1000];
for (int i = 0; i < 1000; i++) {
    array[i] = 0;  // Sequential access: array[0], array[1], array[2]...
}
Accessing array[i] suggests array[i+1] will be accessed next.
Exploitation: Fetch data in blocks (cache lines).
Examples of Good vs Poor Locality
Good Spatial Locality (Row-major):
// C stores arrays in row-major order
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        array[i][j] = 0;  // Sequential memory access
    }
}
Poor Spatial Locality (Column-major in C):
for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        array[i][j] = 0;  // Strided access, poor cache performance
    }
}
Cache Memory
Cache is small, fast memory between the CPU and main memory.
Cache Terminology
┌──────────────────────────────────────────┐
│ Cache Line / Block │
│ ┌────┬──────────────────────┬─────────┐│
│ │Tag │ Data Block │ Valid ││
│ │ │ (e.g., 64 bytes) │ Bit ││
│ └────┴──────────────────────┴─────────┘│
└──────────────────────────────────────────┘
Block/Line: Unit of transfer between cache and memory (typically 64 bytes)
Tag: Identifies which memory block is in this cache line
Valid Bit: Indicates if the cache line contains valid data
Dirty Bit: Indicates if data has been modified (write-back caches)
Cache Operations
Cache Hit: Data found in cache
CPU requests address 0x1000
→ Check cache
→ Found! (HIT)
→ Return data in ~1-10 cycles
Cache Miss: Data not in cache
CPU requests address 0x2000
→ Check cache
→ Not found (MISS)
→ Fetch from main memory (~100+ cycles)
→ Store in cache
→ Return data
Types of Misses:
- Compulsory Miss (Cold Miss): First access to data
- Capacity Miss: Cache is too small to hold all needed data
- Conflict Miss: Multiple addresses map to same cache location
Cache Performance Metrics
Hit Rate: Percentage of accesses found in cache
Hit Rate = Hits / (Hits + Misses)
Miss Rate: Percentage of accesses not found in cache
Miss Rate = 1 - Hit Rate = Misses / (Hits + Misses)
Average Memory Access Time (AMAT):
AMAT = Hit Time + (Miss Rate × Miss Penalty)
Example:
Hit Time = 1 cycle
Miss Penalty = 100 cycles
Hit Rate = 95%
AMAT = 1 + (0.05 × 100) = 1 + 5 = 6 cycles
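The arithmetic above can be checked in a couple of lines of Python:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time for a single cache level."""
    return hit_time + miss_rate * miss_penalty

# Values from the example above: 1-cycle hit, 5% misses, 100-cycle penalty
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # 6.0
```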
Cache Organization
1. Direct-Mapped Cache
Each memory address maps to exactly one cache line.
Mapping:
Cache Line Index = (Address / Block Size) % Number of Cache Lines
Address Breakdown:
┌────────┬────────────┬──────────────┐
│ Tag │ Index │ Block Offset │
└────────┴────────────┴──────────────┘
│ │ │
│ │ └─ Byte within block
│ └─────────────── Cache line number
└──────────────────────── Identifies which memory block
Example: 16-line cache, 4-byte blocks
Address = 0x4C (binary: 0000 0000 0100 1100)
Block size = 4 bytes = 2² → 2-bit offset
Cache lines = 16 = 2⁴ → 4-bit index
Tag = remaining bits
    Tag       Index  Offset
┌────────────┬──────┬────┐
│ 0000000001 │ 0011 │ 00 │
└────────────┴──────┴────┘
      1          3     0
(For a 16-bit address, the tag is the remaining 16 - 4 - 2 = 10 bits.)
Maps to cache line 3
Direct-Mapped Cache Structure:
Index Valid Tag Data
───────────────────────────────
0 1 0x02 [data...]
1 0 ---- [data...]
2 1 0x15 [data...]
3 1 0x01 [data...] ← Address 0x4C maps here
4 0 ---- [data...]
...
15 1 0x0A [data...]
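The breakdown above can also be computed programmatically; here is a small sketch using the example's parameters (16 lines, 4-byte blocks):

```python
def split_address(addr, block_size=4, num_lines=16):
    """Split an address into (tag, index, offset) for a direct-mapped
    cache. Assumes block_size and num_lines are powers of two."""
    offset = addr % block_size
    block_addr = addr // block_size
    index = block_addr % num_lines      # which cache line
    tag = block_addr // num_lines       # identifies the memory block
    return tag, index, offset

print(split_address(0x4C))  # (1, 3, 0): address 0x4C maps to cache line 3
```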
Advantages:
- Simple and fast
- Low hardware cost
Disadvantages:
- High conflict miss rate
- Poor utilization if multiple addresses map to same line
2. Fully Associative Cache
Any memory block can go in any cache line.
Operation:
- Search all cache lines in parallel
- Compare tag with all tags simultaneously
Structure:
Line Valid Tag Data
────────────────────────────────
0 1 0x1234 [data...]
1 1 0x5678 [data...] ← Any address can be here
2 0 ------ [data...]
3 1 0xABCD [data...]
...
Address Breakdown:
┌──────────────────┬──────────────┐
│ Tag │ Block Offset │
└──────────────────┴──────────────┘
│ │
│ └─ Byte within block
└─────────────────── Identifies memory block
Advantages:
- No conflict misses
- Best cache utilization
Disadvantages:
- Expensive (needs comparators for all lines)
- Slower (parallel search)
- Needs replacement policy
3. Set-Associative Cache
Compromise between direct-mapped and fully associative.
N-way set associative: Each address maps to a set of N lines.
Example: 2-way set-associative
Set Way 0 Way 1
┌──────────────┐ ┌──────────────┐
0 │V│Tag │ Data │ │V│Tag │ Data │
├──┼────┼──────┤ ├──┼────┼──────┤
1 │V│Tag │ Data │ │V│Tag │ Data │
├──┼────┼──────┤ ├──┼────┼──────┤
2 │V│Tag │ Data │ │V│Tag │ Data │
├──┼────┼──────┤ ├──┼────┼──────┤
3 │V│Tag │ Data │ │V│Tag │ Data │
└──┴────┴──────┘ └──┴────┴──────┘
Address Breakdown:
┌────────┬────────────┬──────────────┐
│ Tag │ Set Index │ Block Offset │
└────────┴────────────┴──────────────┘
Operation:
- Use set index to select set
- Compare tag with all ways in the set (parallel)
- If match, return data from matching way
Advantages:
- Fewer conflict misses than direct-mapped
- Less expensive than fully associative
- Good balance of performance and cost
Common Configurations:
- 2-way, 4-way, 8-way, 16-way set-associative
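The set-selection and per-set LRU behavior described above can be modeled in a few lines of Python. This is a simulation sketch, not a description of any particular hardware; the sizes and addresses below are assumptions chosen for illustration:

```python
class SetAssociativeCache:
    """Minimal N-way set-associative cache model with per-set LRU.
    Tracks hits/misses only; data payloads are not modeled."""
    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # Each set is a list of tags; front = least recently used
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            self.hits += 1
            tags.remove(tag)        # refresh: move to MRU position
            tags.append(tag)
            return True
        self.misses += 1
        if len(tags) >= self.ways:  # set full: evict LRU (front)
            tags.pop(0)
        tags.append(tag)
        return False

# Two addresses that would conflict in a direct-mapped cache can
# coexist in a 2-way set: with 16 sets and 64-byte blocks, 0x000 and
# 0x400 share set index 0 but have different tags.
cache = SetAssociativeCache(num_sets=16, ways=2, block_size=64)
for _ in range(4):
    cache.access(0x000)
    cache.access(0x400)
print(cache.hits, cache.misses)  # 6 2 (only the two first accesses miss)
```

Under direct mapping the same alternating pattern would miss on every access; 2-way associativity removes the conflict entirely.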
Cache Replacement Policies
When cache is full, which line should be evicted?
1. LRU (Least Recently Used)
- Evict the line that hasn't been used for the longest time
- Good performance, exploits temporal locality
- Requires tracking access history
2. FIFO (First In, First Out)
- Evict the oldest line
- Simple to implement
- Doesn't consider access patterns
3. Random
- Evict a random line
- Very simple, surprisingly effective
- No bookkeeping required
4. LFU (Least Frequently Used)
- Evict the line used least often
- Requires usage counters
- Can be ineffective if access patterns change
Write Policies
What happens when CPU writes to cache?
Write-Through
Strategy: Write to both cache and memory simultaneously.
CPU writes to address 0x1000
↓
Write to cache (fast)
↓
Write to memory (slow) ← Performance bottleneck
Advantages:
- Memory always up-to-date
- Simpler consistency
Disadvantages:
- Every write is slow
- High memory traffic
Optimization: Write buffer (queue writes to memory)
Write-Back
Strategy: Write only to cache; write to memory when line is evicted.
CPU writes to address 0x1000
↓
Write to cache only (fast)
↓
Set dirty bit
↓
(Later, when evicted)
↓
Write to memory if dirty
Advantages:
- Faster writes
- Reduced memory traffic
- Multiple writes to same location only written to memory once
Disadvantages:
- Memory may be stale
- More complex
- Needs dirty bit
Write Miss Policies
What if write misses in cache?
Write-Allocate (Fetch on Write):
- Fetch block from memory into cache
- Then write to cache
- Common with write-back
No-Write-Allocate (Write Around):
- Write directly to memory
- Don't fetch into cache
- Common with write-through
Multi-Level Caches
Modern CPUs have multiple cache levels.
┌───────────────────────────────────┐
│ CPU Core │
├───────────────────────────────────┤
│ L1 Instruction L1 Data │
│ Cache (32KB) Cache (32KB) │ ← 3-4 cycles
├───────────────────────────────────┤
│ L2 Unified Cache │
│ (256KB) │ ← 10-12 cycles
├───────────────────────────────────┤
│ L3 Shared Cache │
│ (8-32MB) │ ← 30-40 cycles
└────────────┬──────────────────────┘
│
▼
Main Memory (RAM) ← 100+ cycles
(8-64GB)
Inclusion Policies
Inclusive Cache:
- Data in L1 is also in L2, L3
- Easier coherency
- Some duplication
Exclusive Cache:
- Data in L1 is NOT in L2
- Better total capacity
- More complex
Non-Inclusive:
- No guarantee either way
- Most flexible
Multi-Level AMAT
AMAT = L1_Hit_Time
+ L1_Miss_Rate × (L2_Hit_Time + L2_Miss_Rate × Memory_Time)
Example:
L1: 1 cycle, 95% hit rate
L2: 10 cycles, 80% hit rate (of L1 misses)
Memory: 100 cycles
AMAT = 1 + 0.05 × (10 + 0.20 × 100)
= 1 + 0.05 × (10 + 20)
= 1 + 0.05 × 30
= 1 + 1.5
= 2.5 cycles (average)
Virtual Memory
Virtual memory provides each process with its own address space.
Motivation
Problems with physical addresses:
- Programs must know exact memory locations
- Can't move programs in memory
- No protection between processes
- Limited by physical memory size
Virtual memory solves:
- Process isolation
- Memory protection
- Larger address space than physical RAM
- Simplified programming
Address Translation
Virtual Address → [Translation] → Physical Address
Virtual Address: Used by programs
Physical Address: Actual location in RAM
Example:
Process A sees: 0x0000 - 0xFFFF (virtual)
↓
Maps to: 0x10000 - 0x1FFFF (physical)
Process B sees: 0x0000 - 0xFFFF (virtual)
↓
Maps to: 0x20000 - 0x2FFFF (physical)
Both processes use same virtual addresses,
but access different physical memory!
Paging
Memory is divided into fixed-size pages.
Page size: Typically 4 KB (4096 bytes)
Virtual Memory: Physical Memory:
┌──────────┐ ┌──────────┐
│ Page 0 │ │ Frame 5 │
├──────────┤ ├──────────┤
│ Page 1 │───────┐ │ Frame 2 │
├──────────┤ │ ├──────────┤
│ Page 2 │ └────►│ Frame 7 │
├──────────┤ ├──────────┤
│ Page 3 │ │ Frame 1 │
└──────────┘ └──────────┘
Virtual pages map to physical frames
Page Table
Page Table: Data structure mapping virtual pages to physical frames.
Virtual Page # → Page Table → Physical Frame #
Page Table Entry (PTE):
┌──────┬─────┬─────┬──────┬──────┬──────────────┐
│Valid │ Ref │Dirty│ R/W │ U/S │ Frame Number │
└──────┴─────┴─────┴──────┴──────┴──────────────┘
│ │ │ │ │ │
│ │ │ │ │ └─ Physical frame
│ │ │ │ └─ User/Supervisor
│ │ │ └─ Read/Write permission
│ │ └─ Modified (dirty bit)
│ └─ Referenced recently
└─ Page present in memory
Address Translation Example:
Virtual Address: 32-bit, 4KB pages
┌────────────────────┬──────────────┐
│ Page Number (20) │ Offset (12) │
└────────────────────┴──────────────┘
│ │
▼ │
Page Table │
│ │
▼ │
Frame Number │
│ │
▼ ▼
┌────────────────────┬──────────────┐
│ Frame Number (20) │ Offset (12) │
└────────────────────┴──────────────┘
Physical Address
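The translation above can be sketched with a flat page table held in a dict; the page-to-frame mapping below is hypothetical:

```python
PAGE_BITS = 12            # 4 KB pages
PAGE_SIZE = 1 << PAGE_BITS

def translate(vaddr, page_table):
    """Translate a virtual address using a flat page table, given as a
    dict {virtual page number: physical frame number}.
    A missing key (KeyError) models a page fault."""
    vpn = vaddr >> PAGE_BITS           # upper bits: virtual page number
    offset = vaddr & (PAGE_SIZE - 1)   # lower 12 bits pass through
    frame = page_table[vpn]
    return (frame << PAGE_BITS) | offset

# Hypothetical mapping: virtual page 0x12345 -> physical frame 0x7
pt = {0x12345: 0x7}
print(hex(translate(0x12345678, pt)))  # 0x7678
```

Note the offset is copied unchanged; only the page number is translated.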
Multi-Level Page Tables
Problem: Single-level page table is huge!
- 32-bit address, 4KB pages: 2²⁰ entries × 4 bytes = 4MB per process
- 64-bit address: Impossibly large!
Solution: Multi-level page tables (typically 2-4 levels)
┌──────────┬──────────┬──────────┐
│ Level 1  │ Level 2  │  Offset  │
│ (10 bit) │ (10 bit) │ (12 bit) │
└──────────┴──────────┴──────────┘
     │          │           │
     ▼          │           │
 ┌───────┐      │           │
 │  PD   │──────┘           │
 └───────┘      ▼           │
            ┌───────┐       │
            │  PT   │───────┘
            └───────┘       ▼
                        ┌───────┐
                        │ Frame │
                        └───────┘
(Two-level example for a 32-bit address: 10 + 10 + 12 = 32 bits. The
Level 1 field indexes the page directory (PD), Level 2 indexes a page
table (PT); 64-bit systems use more levels.)
Benefits:
- Only allocate page tables for used memory
- Much smaller memory footprint
- Still provides full address space
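Assuming the common 32-bit x86-style split (10 + 10 + 12 bits), extracting the per-level indices is just bit manipulation:

```python
def split_two_level(vaddr):
    """Split a 32-bit virtual address into (directory index, table
    index, offset) for a two-level page table (10 + 10 + 12 bits)."""
    offset = vaddr & 0xFFF           # low 12 bits: byte within page
    l2 = (vaddr >> 12) & 0x3FF       # next 10 bits: page-table index
    l1 = (vaddr >> 22) & 0x3FF       # top 10 bits: directory index
    return l1, l2, offset

print(split_two_level(0x12345678))  # (72, 837, 0x678)
```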
Translation Lookaside Buffer (TLB)
Problem: Page table in memory → Every memory access requires 2+ memory accesses!
Solution: TLB caches recent page table entries.
Virtual Address
│
▼
┌──────────┐
│ TLB │ ← Cache of page table entries
│ (fast) │
└────┬─────┘
│
Hit?├─ Yes → Physical Address (fast!)
│
No
▼
┌──────────┐
│Page Table│ ← Access page table in memory
│ (slow) │
└────┬─────┘
│
▼
Physical Address
Update TLB
TLB Characteristics:
- Small (64-512 entries)
- Fully associative
- Very high hit rate (>99%)
TLB Miss Handling:
1. Hardware walks page table (x86)
OR
2. Software trap to OS (MIPS, some ARM)
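The benefit of the TLB can be quantified much like AMAT; the timings below are assumed for illustration, not measured:

```python
def effective_translation_time(tlb_hit_time, tlb_hit_rate, walk_time):
    """Average address-translation time with a TLB in front of the
    page table: every access pays the TLB lookup, and misses add a
    page-table walk."""
    return tlb_hit_time + (1 - tlb_hit_rate) * walk_time

# Assumed: 1 ns TLB hit, 99% hit rate, 20 ns page-table walk
print(effective_translation_time(1, 0.99, 20))  # ≈ 1.2 ns on average
```

Because the hit rate is so high, translation costs barely more than the TLB lookup itself.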
Page Faults
Page Fault: Requested page not in physical memory.
Handling:
1. CPU accesses virtual address
2. Page table shows "not present"
3. Hardware raises page fault exception
4. OS page fault handler:
a. Find page on disk
b. Find free frame (evict if necessary)
c. Read page from disk to frame
d. Update page table
5. Resume instruction
Demand Paging: Pages loaded only when accessed (lazy loading).
Page Replacement Algorithms
When memory is full, which page to evict?
1. Optimal (OPT)
- Replace page that won't be used for longest time
- Impossible to implement (requires future knowledge)
- Theoretical best case for comparison
2. FIFO (First In, First Out)
- Replace oldest page
- Simple, but suffers from Belady's anomaly
3. LRU (Least Recently Used)
- Replace page not used for longest time
- Good performance
- Expensive to implement perfectly
4. Clock (Second Chance)
- Approximation of LRU
- Uses reference bit
- Practical and effective
5. Working Set
- Keep pages in process's working set
- Based on temporal locality
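The FIFO and LRU policies above can be compared on a small reference string; the sequence and frame count below are illustrative:

```python
def count_faults(refs, frames, policy="FIFO"):
    """Count page faults for a reference string with a fixed number of
    frames, under FIFO or LRU replacement (simulation sketch)."""
    mem = []                       # front of list = next victim
    faults = 0
    for page in refs:
        if page in mem:
            if policy == "LRU":    # refresh recency on a hit
                mem.remove(page)
                mem.append(page)
            continue
        faults += 1
        if len(mem) >= frames:
            mem.pop(0)             # evict front (oldest / least recent)
        mem.append(page)
    return faults

refs = [0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2]
print("FIFO:", count_faults(refs, 3, "FIFO"))  # 9 faults
print("LRU: ", count_faults(refs, 3, "LRU"))   # 8 faults
```

On this string LRU beats FIFO because re-referenced pages (like 0) stay resident; on other strings the policies can tie or even reverse.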
Memory Performance Optimization
Cache-Friendly Code
1. Maximize Spatial Locality
// Bad: Column-major access in C (a row-major language)
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];  // Stride of N, poor cache use
    }
}

// Good: Row-major access
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];  // Sequential, excellent cache use
    }
}
2. Maximize Temporal Locality
// Bad: Large array, each element touched once per pass
for (int i = 0; i < HUGE; i++) {
    process(array[i]);
}

// Good: Block the data so it is reused while still in cache
for (int block = 0; block < HUGE; block += BLOCK_SIZE) {
    for (int i = block; i < block + BLOCK_SIZE; i++) {
        process(array[i]);
        process_again(array[i]);  // Reuse while in cache
    }
}
3. Loop Tiling/Blocking
Matrix multiplication example:
// Standard (poor cache behavior for large matrices)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// Blocked (better cache behavior)
for (int ii = 0; ii < N; ii += BLOCK) {
    for (int jj = 0; jj < N; jj += BLOCK) {
        for (int kk = 0; kk < N; kk += BLOCK) {
            // Process one BLOCK x BLOCK tile
            for (int i = ii; i < ii + BLOCK; i++) {
                for (int j = jj; j < jj + BLOCK; j++) {
                    for (int k = kk; k < kk + BLOCK; k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
Programming Examples
Python: Cache Performance Simulation
import random

class DirectMappedCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines
        self.valid = [False] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Access cache at given address; return True on a hit"""
        # Calculate index and tag
        block_address = address // self.block_size
        index = block_address % self.num_lines
        tag = block_address // self.num_lines
        # Check for hit
        if self.valid[index] and self.tags[index] == tag:
            self.hits += 1
            return True  # Hit
        else:
            # Miss: load into cache
            self.misses += 1
            self.tags[index] = tag
            self.valid[index] = True
            return False  # Miss

    def get_stats(self):
        """Return cache statistics"""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'total': total,
            'hit_rate': hit_rate,
        }

# Simulate different access patterns
cache = DirectMappedCache(num_lines=16, block_size=64)
print("=== Sequential Access (Good Locality) ===")
for addr in range(0, 1000, 4):  # Access every 4th byte
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

# Reset
cache = DirectMappedCache(num_lines=16, block_size=64)
print("\n=== Random Access (Poor Locality) ===")
for _ in range(250):
    addr = random.randint(0, 10000)
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")

# Reset
cache = DirectMappedCache(num_lines=16, block_size=64)
print("\n=== Strided Access (Poor Locality) ===")
for i in range(100):
    addr = i * 256  # Stride larger than the block size: every access misses
    cache.access(addr)
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate']:.2%}")
Python: AMAT Calculator
def calculate_amat(l1_hit_time, l1_miss_rate,
                   l2_hit_time, l2_miss_rate,
                   memory_time):
    """Calculate Average Memory Access Time for a 2-level cache"""
    # AMAT = L1_time + L1_miss_rate × (L2_time + L2_miss_rate × Mem_time)
    l2_access_time = l2_hit_time + l2_miss_rate * memory_time
    amat = l1_hit_time + l1_miss_rate * l2_access_time
    return amat

# Example system
l1_hit = 1      # 1 cycle
l1_miss = 0.05  # 5% miss rate
l2_hit = 10     # 10 cycles
l2_miss = 0.20  # 20% of L1 misses also miss L2
mem = 100       # 100 cycles

amat = calculate_amat(l1_hit, l1_miss, l2_hit, l2_miss, mem)
print(f"Average Memory Access Time: {amat:.2f} cycles")

# Compare configurations
print("\n=== Comparing Cache Configurations ===")
configs = [
    ("Small L1, No L2", 1, 0.10, 0, 0, 100),
    ("Large L1, No L2", 2, 0.05, 0, 0, 100),
    ("Small L1, L2",    1, 0.10, 10, 0.30, 100),
    ("Large L1, L2",    2, 0.05, 10, 0.20, 100),
]
for name, *params in configs:
    if params[2] == 0:  # No L2: AMAT = hit time + miss rate × memory time
        amat = params[0] + params[1] * params[4]
    else:
        amat = calculate_amat(*params)
    print(f"{name:20s}: {amat:6.2f} cycles")
Exercises
Basic Exercises
Locality Analysis
- Identify temporal and spatial locality in: for (i = 0; i < n; i++) sum += arr[i];
- Which has better spatial locality: row-major or column-major traversal?
- Give an example of code with poor locality
Cache Mapping
- 4KB direct-mapped cache, 64-byte blocks. Where does address 0x4C8 map?
- How many bits for tag, index, offset? (Assume 32-bit addresses)
- What addresses conflict with 0x4C8?
Hit Rate Calculation
- 100 accesses: 92 hits, 8 misses. Calculate hit rate and miss rate.
- Calculate AMAT: hit time=2ns, miss penalty=100ns, hit rate=95%
- If miss rate decreases from 10% to 5%, how much does AMAT improve?
Intermediate Exercises
Cache Design
- Design a 4-way set-associative cache: 16KB, 32-byte blocks
- How many sets? How many bits for tag/index/offset?
- Compare with direct-mapped cache of same size
Virtual Memory
- 32-bit virtual address, 4KB pages. How many page table entries?
- Calculate overhead of single-level vs 2-level page table
- Trace translation for virtual address 0x12345678 (show page number, offset)
TLB Performance
- TLB hit time=1ns, miss penalty=20ns, hit rate=98%. Calculate AMAT for address translation.
- If TLB is enlarged to improve hit rate to 99.5%, what's the speedup?
- Why is TLB hit rate typically higher than cache hit rate?
Advanced Exercises
Cache Optimization
- Optimize matrix multiplication for cache (assume 8KB L1, 64-byte lines)
- Calculate ideal block size for tiling
- Estimate speedup from optimization
Memory Hierarchy Analysis
- System: L1(1 cycle, 5% miss), L2(10 cycles, 20% miss), RAM(100 cycles)
- Calculate overall AMAT
- Program runs 1B instructions, 30% are memory accesses. Calculate total memory time.
- If L2 is doubled in size (miss rate → 10%), what's the speedup?
Page Replacement
- Simulate LRU for access sequence: 1,2,3,4,1,2,5,1,2,3,4,5 (3 frames)
- Calculate page fault rate
- Compare with FIFO for same sequence
System Design
- Design a cache hierarchy for: 3GHz CPU, 100ns RAM, 4GB memory
- Choose cache sizes, associativity, block sizes
- Justify your design with calculations
- Estimate performance for typical workload
Summary
In this reading, we explored the memory hierarchy and caching:
Key Concepts:
- Memory hierarchy balances speed, size, and cost
- Locality (temporal and spatial) makes caching effective
- Cache organization: direct-mapped, set-associative, fully associative
- Virtual memory provides isolation, protection, and larger address spaces
- TLB accelerates address translation
- Page replacement algorithms manage limited physical memory
Important Skills:
- Calculating cache performance (hit rate, AMAT)
- Understanding cache mapping and address breakdown
- Tracing virtual-to-physical address translation
- Writing cache-friendly code
Why This Matters:
- Memory access is the dominant performance bottleneck in modern systems
- Understanding memory helps you write 10-100× faster code
- Essential for systems programming, performance tuning, and architecture design
- Foundation for understanding multithreading, coherency, and distributed systems
Further Reading
- "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson
- Cache coherence protocols (MESI, MOESI)
- NUMA (Non-Uniform Memory Access) architectures
- Memory consistency models
Next Steps
Now that you understand how memory works, you're ready to learn assembly language and see how high-level code translates to machine operations.
Next Reading: 05-assembly.md - Assembly Language Basics
Module 5: Computer Architecture | Reading 4 of 5