CPU Architecture
Learning Objectives
By the end of this reading, you will be able to:
- Understand the Von Neumann and Harvard architectures
- Explain the fetch-decode-execute cycle
- Identify the major components of a CPU: ALU, Control Unit, and Registers
- Understand instruction set architecture (ISA) and machine code
- Trace instruction execution through the CPU
- Understand pipelining and its impact on performance
- Recognize different CPU architectures (CISC vs RISC)
Introduction
The Central Processing Unit (CPU) is the "brain" of a computer. It executes instructions stored in memory, performing calculations, making decisions, and coordinating the entire system.
Understanding CPU architecture is crucial because:
- It reveals how software ultimately executes on hardware
- It helps you write more efficient code
- It's fundamental to understanding computer performance
- It bridges the gap between high-level languages and machine operation
Historical Context: Von Neumann Architecture
In 1945, John von Neumann proposed a computer architecture that became the foundation for modern computers.
Von Neumann Architecture Components
┌─────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ │ │ │ │
│ │ Control │◄────►│ Arithmetic │ │
│ │ Unit │ │ Logic Unit │ │
│ │ │ │ (ALU) │ │
│ └──────┬───────┘ └─────────────────┘ │
│ │ │
│ │ CPU │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Registers │ │
│ └──────────────┘ │
│ │
└─────────┬───────────────────────────────────┘
│
│ System Bus (Address, Data, Control)
│
┌─────┴──────┐
│ │
┌───▼──────┐ ┌──▼──────┐
│ │ │ │
│ Memory │ │ Input/ │
│ │ │ Output │
│ │ │ │
└──────────┘ └─────────┘
Key Principles
- Stored-Program Concept: Both program instructions and data are stored in the same memory
- Sequential Execution: Instructions are fetched and executed one after another
- Binary Representation: All information is represented in binary
- Shared Bus: Single bus for both instructions and data (Von Neumann bottleneck)
Harvard Architecture
An alternative architecture that separates instruction and data memory:
┌─────────────┐
│ │
│ CPU │
│ │
└──┬────────┬─┘
│ │
┌───────┘ └────────┐
│ │
│ Instruction Bus │ Data Bus
│ │
┌───▼──────────┐ ┌───────▼─────┐
│ │ │ │
│ Instruction │ │ Data │
│ Memory │ │ Memory │
│ │ │ │
└──────────────┘ └─────────────┘
Advantages:
- Can fetch instructions and data simultaneously
- No Von Neumann bottleneck
- Common in embedded systems and microcontrollers
Modern CPUs often use a modified Harvard architecture: a unified main memory (Von Neumann style) with separate L1 instruction and data caches.
Major CPU Components
1. Arithmetic Logic Unit (ALU)
The ALU performs arithmetic and logical operations.
Arithmetic Operations:
- Addition, Subtraction
- Multiplication, Division (sometimes separate unit)
- Increment, Decrement
Logical Operations:
- AND, OR, XOR, NOT
- Bit shifting, rotation
- Comparisons
ALU Structure:
A (operand 1) B (operand 2)
│ │
▼ ▼
┌────────────────────────┐
│ │
│ ALU │
│ ┌────────────┐ │
│ │ Adder │ │
│ │ Logic │ │ ◄─── Opcode (operation selector)
│ │ Shifter │ │
│ │ etc. │ │
│ └────────────┘ │
│ │
└───┬──────────────┬─────┘
│ │
▼ ▼
Result Flags
(Z, C, N, V)
Flags (Status Register):
- Zero (Z): Result is zero
- Carry (C): Carry out from MSB
- Negative (N): Result is negative
- Overflow (V): Signed overflow occurred
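The flag rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming 8-bit two's-complement operands; real ALUs compute these signals in hardware, in parallel with the result.

```python
def add8(a, b):
    """Add two 8-bit values and compute the Z, C, N, V flags.

    A sketch assuming 8-bit two's-complement operands (0..255 as raw bit
    patterns); real ALUs produce these signals directly in hardware.
    """
    raw = a + b
    result = raw & 0xFF                  # keep the low 8 bits
    flags = {
        'Z': int(result == 0),           # Zero: all result bits are 0
        'C': int(raw > 0xFF),            # Carry: carry out of the MSB
        'N': (result >> 7) & 1,          # Negative: MSB of the result
        # Overflow: both operands have the same sign, but the result's
        # sign differs from theirs
        'V': int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }
    return result, flags
```

For example, 0x7F + 0x01 (127 + 1) wraps to -128 in signed 8-bit arithmetic, so N and V are set, while 0xFF + 0x01 (-1 + 1) produces zero with a carry but no signed overflow.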
2. Control Unit (CU)
The Control Unit directs the operation of the processor.
Responsibilities:
- Fetches instructions from memory
- Decodes instructions
- Generates control signals for other components
- Coordinates timing using the clock
Control Signals:
Control Unit outputs:
├── Memory Read/Write
├── ALU operation select
├── Register select
├── Bus control
└── Interrupt handling
3. Registers
Registers are ultra-fast storage locations inside the CPU.
Common Registers:
┌──────────────────────────────────────────┐
│ General-Purpose Registers │
│ ┌────┬────┬────┬────┬─────┬─────────┐ │
│ │ R0 │ R1 │ R2 │ R3 │ ... │ R15/16 │ │
│ └────┴────┴────┴────┴─────┴─────────┘ │
│ │
│ Special-Purpose Registers │
│ ┌────────────────────────────────────┐ │
│ │ PC - Program Counter │ │
│ │ IR - Instruction Register │ │
│ │ MAR - Memory Address Register │ │
│ │ MDR - Memory Data Register │ │
│ │ SP - Stack Pointer │ │
│ │ SR - Status Register (Flags) │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
Register Functions:
- Program Counter (PC): Holds address of next instruction
- Instruction Register (IR): Holds current instruction being executed
- Memory Address Register (MAR): Holds memory address to read/write
- Memory Data Register (MDR): Holds data read from or written to memory
- Stack Pointer (SP): Points to top of stack
- Accumulator: Stores intermediate results (some architectures)
The Fetch-Decode-Execute Cycle
The fundamental operating cycle of a CPU:
┌──────────┐
│ FETCH │ ← Retrieve instruction from memory
└────┬─────┘
│
▼
┌──────────┐
│ DECODE │ ← Interpret the instruction
└────┬─────┘
│
▼
┌──────────┐
│ EXECUTE │ ← Perform the operation
└────┬─────┘
│
└──────► (repeat)
Detailed Cycle Steps
1. FETCH
1. Copy PC to MAR
2. Send memory read signal
3. Copy instruction from memory to MDR
4. Copy MDR to IR
5. Increment PC
2. DECODE
1. Control unit examines IR
2. Determines operation (opcode)
3. Identifies operands
4. Prepares control signals
3. EXECUTE
1. Activate appropriate functional units
2. Perform the operation (ALU, memory access, etc.)
3. Store result in register or memory
4. Update status flags if necessary
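The FETCH micro-steps above can be written out as plain register transfers. This is an illustrative sketch using ordinary variables for the registers; it assumes a word-addressed memory, so the PC increments by 1 rather than 4.

```python
# FETCH as register transfers (names follow the text; word-addressed
# memory assumed, so PC advances by 1 per instruction).
memory = {0x1000: 0x1123}   # one 16-bit instruction at address 0x1000

pc = 0x1000
mar = pc                    # 1. Copy PC to MAR
mdr = memory[mar]           # 2-3. Memory read: MDR <- Memory[MAR]
ir = mdr                    # 4. Copy MDR to IR
pc += 1                     # 5. Increment PC
```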
Example: ADD Instruction
Instruction: ADD R1, R2, R3 (R1 = R2 + R3)
Cycle:
┌─────────┬──────────────────────────────────────────┐
│ FETCH │ 1. MAR ← PC (e.g., 0x1000) │
│ │ 2. MDR ← Memory[MAR] │
│ │ 3. IR ← MDR (instruction loaded) │
│ │ 4. PC ← PC + 4 (next instruction) │
├─────────┼──────────────────────────────────────────┤
│ DECODE │ 1. Opcode = ADD │
│ │ 2. Source1 = R2 │
│ │ 3. Source2 = R3 │
│ │ 4. Destination = R1 │
├─────────┼──────────────────────────────────────────┤
│ EXECUTE │ 1. Read values from R2 and R3 │
│ │ 2. Send to ALU with ADD operation │
│ │ 3. ALU computes result │
│ │ 4. Write result to R1 │
│ │ 5. Update flags (Z, C, N, V) │
└─────────┴──────────────────────────────────────────┘
Instruction Set Architecture (ISA)
The ISA defines the interface between software and hardware.
Components of an Instruction
┌─────────┬──────────┬──────────┬──────────┐
│ Opcode │ Operand1 │ Operand2 │ Operand3 │
└─────────┴──────────┴──────────┴──────────┘
Opcode: What operation to perform
Operands: Where to get data / where to store result
Instruction Types
1. Data Movement
LOAD R1, [address] ; Load from memory
STORE R1, [address] ; Store to memory
MOVE R1, R2 ; Copy between registers
2. Arithmetic
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R1, R2, R3 ; R1 = R2 - R3
MUL R1, R2, R3 ; R1 = R2 × R3
DIV R1, R2, R3 ; R1 = R2 ÷ R3
INC R1 ; R1 = R1 + 1
3. Logical
AND R1, R2, R3 ; R1 = R2 & R3
OR R1, R2, R3 ; R1 = R2 | R3
XOR R1, R2, R3 ; R1 = R2 ^ R3
NOT R1, R2 ; R1 = ~R2
4. Shift/Rotate
SHL R1, R2, 2 ; Shift left 2 bits
SHR R1, R2, 2 ; Shift right 2 bits
ROL R1, R2, 1 ; Rotate left 1 bit
5. Control Flow
JMP label ; Unconditional jump
JZ label ; Jump if zero
JNZ label ; Jump if not zero
CALL function ; Function call
RET ; Return from function
6. Comparison
CMP R1, R2 ; Compare R1 and R2 (sets flags)
TEST R1, R2 ; Bitwise AND without storing result
Addressing Modes
Different ways to specify operands:
1. Immediate Addressing
LOAD R1, #5 ; R1 = 5 (value is in instruction)
2. Direct/Absolute Addressing
LOAD R1, [1000] ; R1 = Memory[1000]
3. Register Addressing
ADD R1, R2, R3 ; R1 = R2 + R3
4. Indirect Addressing
LOAD R1, [R2] ; R1 = Memory[R2]
5. Indexed Addressing
LOAD R1, [R2 + 100] ; R1 = Memory[R2 + 100]
; (base + offset)
6. Register Indirect with Offset
LOAD R1, [R2 + R3] ; R1 = Memory[R2 + R3]
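Each addressing mode above reduces to a different way of computing an effective address. The sketch below simulates them with a Python list as memory and a dict as the register file; the particular addresses and values are made up for illustration.

```python
# Simulating the addressing modes with a list as memory and a dict as
# the register file (addresses and values are illustrative).
memory = [0] * 2048
registers = {'R2': 1000, 'R3': 4}
memory[1000] = 11    # target of direct and indirect loads
memory[1100] = 22    # target of indexed load (base 1000 + offset 100)
memory[1004] = 33    # target of register + register load (1000 + 4)

immediate  = 5                                           # LOAD R1, #5
direct     = memory[1000]                                # LOAD R1, [1000]
indirect   = memory[registers['R2']]                     # LOAD R1, [R2]
indexed    = memory[registers['R2'] + 100]               # LOAD R1, [R2 + 100]
reg_offset = memory[registers['R2'] + registers['R3']]   # LOAD R1, [R2 + R3]
```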
Machine Code and Encoding
Instructions are encoded as binary numbers.
Example: Simple 16-bit Instruction Format
┌────────┬──────┬──────┬──────┐
│ Opcode │ Rd │ Rs │ Rt │
│ 4 bits │ 4 bits│4 bits│4 bits│
└────────┴──────┴──────┴──────┘
Example: ADD R1, R2, R3
Opcode: 0001 (ADD)
Rd: 0001 (R1 - destination)
Rs: 0010 (R2 - source 1)
Rt: 0011 (R3 - source 2)
Machine Code: 0001 0001 0010 0011 = 0x1123
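The field packing above can be checked with a small encoder/decoder. This is a sketch of the 16-bit format shown (4-bit opcode, then Rd, Rs, Rt); the opcode numbering is the one used in this reading's examples.

```python
OPCODES = {'ADD': 0x1, 'SUB': 0x2}   # opcode numbers from this reading

def encode(op, rd, rs, rt):
    """Pack a 16-bit instruction: 4-bit opcode, then Rd, Rs, Rt."""
    return (OPCODES[op] << 12) | (rd << 8) | (rs << 4) | rt

def decode(word):
    """Unpack a 16-bit instruction into (opcode, rd, rs, rt)."""
    return ((word >> 12) & 0xF, (word >> 8) & 0xF,
            (word >> 4) & 0xF, word & 0xF)
```

Encoding ADD R1, R2, R3 with this helper reproduces the 0x1123 worked out above, and decoding 0x1123 recovers the four fields.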
Instruction Formats
Fixed-Length (RISC)
- All instructions same size (e.g., 32 bits)
- Simpler decoding
- May waste bits for simple operations
Variable-Length (CISC)
- Instructions can be different sizes
- More efficient encoding
- More complex decoding
CISC vs RISC
CISC (Complex Instruction Set Computer)
Examples: x86, x86-64
Characteristics:
- Large instruction set (hundreds of instructions)
- Variable-length instructions
- Complex addressing modes
- Instructions can take multiple cycles
- Fewer instructions per program
Example CISC Instructions:
MULT [R1], [R2+100] ; Memory[R1] = Memory[R1] * Memory[R2+100]
; (single instruction does a lot)
Advantages:
- Fewer instructions for complex operations
- Smaller code size
- Better use of memory bandwidth
Disadvantages:
- Complex hardware
- Harder to pipeline
- Variable execution times
RISC (Reduced Instruction Set Computer)
Examples: ARM, MIPS, RISC-V
Characteristics:
- Small instruction set (dozens of core instructions rather than hundreds)
- Fixed-length instructions
- Simple addressing modes
- Most instructions execute in one cycle
- Load/Store architecture (only LOAD/STORE access memory)
Example RISC Instructions:
LOAD R1, [R2] ; R1 = Memory[R2]
LOAD R3, [R4+100] ; R3 = Memory[R4+100]
MULT R1, R1, R3 ; R1 = R1 * R3
STORE R1, [R2] ; Memory[R2] = R1
; (same operation as CISC example, but 4 instructions)
Advantages:
- Simpler hardware
- Easier to pipeline
- Predictable execution time
- Lower power consumption
Disadvantages:
- More instructions per program
- Larger code size
Modern Trend: Hybrid Approach
Modern processors (even x86) use RISC-like execution internally:
- Decode complex instructions into micro-operations
- Execute micro-ops on RISC-like core
- Get benefits of both approaches
CPU Performance
Clock Speed
The CPU clock synchronizes operations.
Clock Cycle:
┌──┐ ┌──┐ ┌──┐ ┌──┐
─────┘ └──┘ └──┘ └──┘ └─────
Frequency: Number of cycles per second (Hz)
Period: Time per cycle (seconds)
1 GHz = 1,000,000,000 cycles/second
Period = 1/frequency = 1 nanosecond
Clock Speed Examples:
- 1 GHz = 1 billion cycles/second
- 3.5 GHz = 3.5 billion cycles/second
Note: Higher clock speed ≠ always faster (architecture matters!)
Cycles Per Instruction (CPI)
Different instructions take different numbers of cycles:
Simple instruction (ADD): 1 cycle
Memory load (LOAD): 3-5 cycles
Division (DIV): 10-40 cycles
Performance Metrics
Execution Time:
Time = (Instructions × CPI × Clock Period)
Or:
Time = (Instructions × CPI) / Clock Frequency
Example:
Program: 1,000,000 instructions
Average CPI: 2
Clock: 2 GHz
Time = (1,000,000 × 2) / (2 × 10⁹)
= 2,000,000 / 2,000,000,000
= 0.001 seconds = 1 millisecond
MIPS (Million Instructions Per Second):
MIPS = Clock Frequency (MHz) / CPI
Example:
2000 MHz / 2 CPI = 1000 MIPS
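The two formulas above are easy to turn into helper functions; this sketch reproduces the worked examples.

```python
def execution_time(instructions, cpi, clock_hz):
    """Execution time in seconds: (instructions x CPI) / clock frequency."""
    return instructions * cpi / clock_hz

def mips(clock_mhz, cpi):
    """Million instructions per second: clock frequency (MHz) / CPI."""
    return clock_mhz / cpi
```

For the example program (1,000,000 instructions, CPI 2, 2 GHz clock) this gives 0.001 seconds, and 2000 MHz at CPI 2 gives 1000 MIPS, matching the figures above.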
Pipelining
Pipelining overlaps instruction execution for better throughput.
Non-Pipelined Execution
Time:   1    2    3    4    5    6    7    8    9   10   11   12
      ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
Inst1 │ F  │ D  │ E  │    │    │    │    │    │    │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Inst2 │    │    │    │ F  │ D  │ E  │    │    │    │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Inst3 │    │    │    │    │    │    │ F  │ D  │ E  │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Inst4 │    │    │    │    │    │    │    │    │    │ F  │ D  │ E  │
      └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
F = Fetch, D = Decode, E = Execute
Total time: 12 cycles for 4 instructions
Pipelined Execution
Time:   1    2    3    4    5    6
      ┌────┬────┬────┬────┬────┬────┐
Inst1 │ F  │ D  │ E  │    │    │    │
      ├────┼────┼────┼────┼────┼────┤
Inst2 │    │ F  │ D  │ E  │    │    │
      ├────┼────┼────┼────┼────┼────┤
Inst3 │    │    │ F  │ D  │ E  │    │
      ├────┼────┼────┼────┼────┼────┤
Inst4 │    │    │    │ F  │ D  │ E  │
      └────┴────┴────┴────┴────┴────┘
Total time: 6 cycles for 4 instructions
Speedup: 12/6 = 2×
Ideal Speedup: For n stages, speedup approaches n for long programs.
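The cycle counts above follow a simple pattern: a pipeline needs n_stages cycles to fill, then completes one instruction per cycle. A small sketch of the ideal (hazard-free) model:

```python
def pipeline_cycles(n_instructions, n_stages):
    """Ideal pipelined cycle count: fill the pipe, then one completion
    per cycle (assumes no stalls or hazards)."""
    return n_stages + (n_instructions - 1)

def pipeline_speedup(n_instructions, n_stages):
    """Speedup over non-pipelined execution, where each instruction
    takes n_stages cycles on its own."""
    return (n_instructions * n_stages) / pipeline_cycles(n_instructions, n_stages)
```

For the 4-instruction, 3-stage example this gives 6 cycles and a speedup of 2×; for very long programs the speedup approaches the stage count, as noted above.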
Pipeline Stages (5-stage classic RISC)
1. IF - Instruction Fetch
2. ID - Instruction Decode
3. EX - Execute
4. MEM - Memory Access
5. WB - Write Back
┌────┬────┬────┬────┬────┐
│ IF │ ID │ EX │MEM │ WB │ Instruction 1
└────┴────┴────┴────┴────┘
┌────┬────┬────┬────┬────┐
│ IF │ ID │ EX │MEM │ WB │ Instruction 2
└────┴────┴────┴────┴────┘
┌────┬────┬────┬────┬────┐
│ IF │ ID │ EX │MEM │ WB │ Instruction 3
└────┴────┴────┴────┴────┘
Pipeline Hazards
1. Structural Hazards
- Resource conflict (e.g., memory access)
- Solution: Separate instruction/data caches
2. Data Hazards
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R4, R1, R5 ; R4 = R1 - R5 (needs R1 from previous!)
- Solutions: Forwarding, stalling, reordering
3. Control Hazards
BEQ R1, R2, label ; Branch instruction
ADD R3, R4, R5 ; What if we branch? This shouldn't execute!
- Solutions: Branch prediction, delayed branches
Multi-Core and Parallel Processing
Modern CPUs have multiple cores:
┌─────────────────────────────────────┐
│ CPU Package │
│ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Core 0 │ │ Core 1 │ │
│ │ ┌───┐ │ │ ┌───┐ │ │
│ │ │ALU│ │ │ │ALU│ │ │
│ │ ├───┤ │ │ ├───┤ │ │
│ │ │Reg│ │ │ │Reg│ │ │
│ │ └───┘ │ │ └───┘ │ │
│ │ L1 Cache│ │ L1 Cache│ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│       └────────┬───────┘        │
│ │ │
│ ┌────▼────┐ │
│ │ L2/L3 │ │
│ │ Cache │ │
│ └─────────┘ │
└─────────────────────────────────────┘
Benefits:
- True parallel execution
- Better multitasking
- Higher throughput
Challenges:
- Synchronization
- Cache coherency
- Programming complexity
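How much a program gains from multiple cores is bounded by its serial fraction, a result known as Amdahl's Law (revisited in the exercises below). A quick sketch:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
    fraction of the work that can run in parallel."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n_cores)
```

A program that is 90% parallelizable gains only about 3.1× on 4 cores; the 10% serial portion dominates as core counts grow.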
Practical Example: Tracing Execution
Simple Program
; Calculate: result = (a + b) - c
LOAD R1, [a_addr] ; R1 = a
LOAD R2, [b_addr] ; R2 = b
ADD R3, R1, R2 ; R3 = a + b
LOAD R4, [c_addr] ; R4 = c
SUB R5, R3, R4 ; R5 = (a+b) - c
STORE R5, [result] ; result = R5
Detailed Trace
Initial State:
Memory[a_addr] = 10
Memory[b_addr] = 20
Memory[c_addr] = 5
PC = 0x1000
Instruction 1: LOAD R1, [a_addr]
FETCH: IR ← Memory[PC], PC ← PC + 4
DECODE: Opcode = LOAD, Dest = R1, Addr = a_addr
EXECUTE: MAR ← a_addr, MDR ← Memory[MAR], R1 ← MDR
Result: R1 = 10
Instruction 2: LOAD R2, [b_addr]
Result: R2 = 20
Instruction 3: ADD R3, R1, R2
FETCH: IR ← Memory[PC], PC ← PC + 4
DECODE: Opcode = ADD, Dest = R3, Src1 = R1, Src2 = R2
EXECUTE: ALU ← R1 + R2, R3 ← ALU_Result
Result: R3 = 30
Instruction 4: LOAD R4, [c_addr]
Result: R4 = 5
Instruction 5: SUB R5, R3, R4
EXECUTE: ALU ← R3 - R4, R5 ← ALU_Result
Result: R5 = 25
Instruction 6: STORE R5, [result]
EXECUTE: MAR ← result, MDR ← R5, Memory[MAR] ← MDR
Result: Memory[result] = 25
Programming Example
Python: Simple CPU Simulator
class SimpleCPU:
    def __init__(self):
        self.registers = [0] * 8      # 8 general-purpose registers
        self.pc = 0                   # Program counter
        self.ir = 0                   # Instruction register
        self.flags = {                # Status flags
            'Z': 0,  # Zero
            'N': 0,  # Negative
            'C': 0,  # Carry
            'V': 0   # Overflow
        }
        self.memory = [0] * 1024      # 1K words of memory
        self.running = True

    def fetch(self):
        """Fetch instruction from memory"""
        self.ir = self.memory[self.pc]
        self.pc += 1

    def decode_execute(self):
        """Decode and execute instruction"""
        opcode = (self.ir >> 12) & 0xF
        rd = (self.ir >> 8) & 0xF
        rs = (self.ir >> 4) & 0xF
        rt = self.ir & 0xF

        if opcode == 0x0:    # HALT
            self.running = False
        elif opcode == 0x1:  # ADD
            result = self.registers[rs] + self.registers[rt]
            self.registers[rd] = result & 0xFFFF
            self.update_flags(result)
        elif opcode == 0x2:  # SUB
            result = self.registers[rs] - self.registers[rt]
            self.registers[rd] = result & 0xFFFF
            self.update_flags(result)
        elif opcode == 0x3:  # LOAD immediate (low 8 bits hold the value)
            value = self.ir & 0xFF
            self.registers[rd] = value
        elif opcode == 0x4:  # STORE (address taken from Rs)
            addr = self.registers[rs]
            self.memory[addr] = self.registers[rd]

    def update_flags(self, result):
        """Update status flags based on result (C and V omitted here
        for simplicity; a full ALU would set them too)"""
        self.flags['Z'] = 1 if result == 0 else 0
        self.flags['N'] = 1 if result < 0 else 0

    def run(self):
        """Run the fetch-decode-execute cycle"""
        cycle = 0
        while self.running and cycle < 1000:
            print(f"\n--- Cycle {cycle} ---")
            self.fetch()
            print(f"Fetched: {hex(self.ir)} from address {self.pc - 1}")
            self.decode_execute()
            print(f"Registers: {self.registers[:4]}")
            print(f"Flags: Z={self.flags['Z']}, N={self.flags['N']}")
            cycle += 1

# Example usage
cpu = SimpleCPU()

# Program: Add two numbers
#   LOAD R0, #10
#   LOAD R1, #20
#   ADD  R2, R0, R1
#   HALT
cpu.memory[0] = 0x300A  # LOAD R0, #10
cpu.memory[1] = 0x3114  # LOAD R1, #20
cpu.memory[2] = 0x1201  # ADD R2, R0, R1
cpu.memory[3] = 0x0000  # HALT

cpu.run()
print(f"\nFinal result in R2: {cpu.registers[2]}")
Exercises
Basic Exercises
Fetch-Decode-Execute
- Trace the execution of: LOAD R1, #50; ADD R2, R1, R1
- Identify which CPU component is active at each stage
- Draw a timing diagram for the execution
Instruction Encoding
- Encode ADD R3, R1, R2 in a 16-bit format (4-bit opcode, 4-bit per register)
- Decode: 0x2315 (opcode: 2=SUB, registers: 3,1,5)
- Design a 3-bit opcode instruction set with 8 operations
Registers
- Why are registers faster than RAM?
- What happens to PC after a jump instruction?
- Trace PC values for a 5-instruction sequence with one jump
Intermediate Exercises
Pipeline Analysis
- Draw a pipeline diagram for 6 instructions (3-stage pipeline)
- Calculate speedup vs. non-pipelined execution
- Identify a data hazard in: ADD R1,R2,R3; SUB R4,R1,R5
Performance Calculation
- Program: 1M instructions, CPI=1.5, Clock=3GHz. Find execution time.
- Compare: 2GHz RISC (CPI=1) vs. 3GHz CISC (CPI=3) for same program
- Calculate MIPS rating for a 2.5GHz processor with CPI of 2
ISA Design
- Design addressing modes for: array access, function call, global variable
- Compare instruction count: RISC vs CISC for a simple loop
- Justify LOAD/STORE architecture advantages
Advanced Exercises
Complex Tracing
- Trace execution with branch: if (a>b) c=a else c=b
- Show pipeline diagram including a branch misprediction
- Calculate performance impact of branch misprediction
Architecture Comparison
- Design both CISC and RISC instruction sets for: array[i] = array[i] + 5
- Compare code size, instruction count, and complexity
- Analyze trade-offs for embedded vs desktop systems
Multi-Core
- Design a task distribution strategy for 4 cores
- Identify where cache coherency problems might occur
- Calculate theoretical speedup (consider Amdahl's Law)
CPU Design
- Design a minimal ISA (8 instructions) for basic computation
- Create a state machine for a 3-stage pipeline
- Estimate logic gate count for a simple ALU (4-bit, ADD/SUB/AND/OR)
Summary
In this reading, we explored the architecture and operation of the CPU:
Key Concepts:
- Von Neumann architecture provides the foundation for modern computers
- CPU components: ALU (computation), Control Unit (coordination), Registers (fast storage)
- Fetch-Decode-Execute cycle is the fundamental operation loop
- ISA defines the interface between hardware and software
- Pipelining improves throughput by overlapping instruction execution
- CISC vs RISC represent different design philosophies
Important Skills:
- Tracing instruction execution through the CPU
- Understanding how instructions are encoded
- Analyzing performance based on clock speed, CPI, and instruction count
- Recognizing pipeline hazards and their solutions
Why This Matters:
- Foundation for understanding how programs actually execute
- Essential for writing performance-critical code
- Helps debug low-level issues
- Critical for systems programming and compiler design
Further Reading
- "Computer Organization and Design" by Patterson & Hennessy
- Study specific architectures: x86-64, ARM, RISC-V
- Superscalar execution and out-of-order processing
- Branch prediction algorithms
Next Steps
Now that you understand how the CPU processes instructions, you're ready to learn about the memory hierarchy that stores the data and instructions the CPU needs.
Next Reading: 04-memory.md - Memory Hierarchy
Module 5: Computer Architecture | Reading 3 of 5