CPU Architecture
Learning Objectives
By the end of this reading, you will be able to:
- Understand the Von Neumann and Harvard architectures
- Explain the fetch-decode-execute cycle
- Identify the major components of a CPU: ALU, Control Unit, and Registers
- Understand instruction set architecture (ISA) and machine code
- Trace instruction execution through the CPU
- Understand pipelining and its impact on performance
- Recognize different CPU architectures (CISC vs RISC)
Introduction
The Central Processing Unit (CPU) is the "brain" of a computer. It executes instructions stored in memory, performing calculations, making decisions, and coordinating the entire system.
Understanding CPU architecture is crucial because:
- It reveals how software ultimately executes on hardware
- It helps you write more efficient code
- It's fundamental to understanding computer performance
- It bridges the gap between high-level languages and machine operation
Historical Context: Von Neumann Architecture
In 1945, John von Neumann proposed a computer architecture that became the foundation for modern computers.
Von Neumann Architecture Components
┌─────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ │ │ │ │
│ │ Control │◄────►│ Arithmetic │ │
│ │ Unit │ │ Logic Unit │ │
│ │ │ │ (ALU) │ │
│ └──────┬───────┘ └─────────────────┘ │
│ │ │
│ │ CPU │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Registers │ │
│ └──────────────┘ │
│ │
└─────────┬───────────────────────────────────┘
│
│ System Bus (Address, Data, Control)
│
┌─────┴──────┐
│ │
┌───▼──────┐ ┌──▼──────┐
│ │ │ │
│ Memory │ │ Input/ │
│ │ │ Output │
│ │ │ │
└──────────┘ └─────────┘
Key Principles
- Stored-Program Concept: Both program instructions and data are stored in the same memory
- Sequential Execution: Instructions are fetched and executed one after another
- Binary Representation: All information is represented in binary
- Shared Bus: Single bus for both instructions and data (Von Neumann bottleneck)
Harvard Architecture
An alternative architecture that separates instruction and data memory:
┌─────────────┐
│ │
│ CPU │
│ │
└──┬────────┬─┘
│ │
┌───────┘ └────────┐
│ │
│ Instruction Bus │ Data Bus
│ │
┌───▼──────────┐ ┌───────▼─────┐
│ │ │ │
│ Instruction │ │ Data │
│ Memory │ │ Memory │
│ │ │ │
└──────────────┘ └─────────────┘
Advantages:
- Can fetch instructions and data simultaneously
- No Von Neumann bottleneck
- Common in embedded systems and microcontrollers
Modern CPUs often use a modified Harvard architecture: a unified main memory (Von Neumann style) with separate L1 instruction and data caches.
Major CPU Components
1. Arithmetic Logic Unit (ALU)
The ALU performs arithmetic and logical operations.
Arithmetic Operations:
- Addition, Subtraction
- Multiplication, Division (sometimes separate unit)
- Increment, Decrement
Logical Operations:
- AND, OR, XOR, NOT
- Bit shifting, rotation
- Comparisons
ALU Structure:
A (operand 1) B (operand 2)
│ │
▼ ▼
┌────────────────────────┐
│ │
│ ALU │
│ ┌────────────┐ │
│ │ Adder │ │
│ │ Logic │ │ ◄─── Opcode (operation selector)
│ │ Shifter │ │
│ │ etc. │ │
│ └────────────┘ │
│ │
└───┬──────────────┬─────┘
│ │
▼ ▼
Result Flags
(Z, C, N, V)
Flags (Status Register):
- Zero (Z): Result is zero
- Carry (C): Carry out from MSB
- Negative (N): Result is negative
- Overflow (V): Signed overflow occurred
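The flag rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming 8-bit two's-complement operands; real ALUs compute these signals in hardware, in parallel with the result.

```python
def add8(a, b):
    """Add two 8-bit values and compute the Z, C, N, V flags.

    A sketch assuming 8-bit two's-complement operands (0..255 as raw bit
    patterns); real ALUs produce these signals directly in hardware.
    """
    raw = a + b
    result = raw & 0xFF                  # keep the low 8 bits
    flags = {
        'Z': int(result == 0),           # Zero: all result bits are 0
        'C': int(raw > 0xFF),            # Carry: carry out of the MSB
        'N': (result >> 7) & 1,          # Negative: MSB of the result
        # Overflow: both operands have the same sign, but the result's
        # sign differs from theirs
        'V': int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }
    return result, flags
```

For example, 0x7F + 0x01 (127 + 1) wraps to -128 in signed 8-bit arithmetic, so N and V are set, while 0xFF + 0x01 (-1 + 1) produces zero with a carry but no signed overflow.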
2. Control Unit (CU)
The Control Unit directs the operation of the processor.
Responsibilities:
- Fetches instructions from memory
- Decodes instructions
- Generates control signals for other components
- Coordinates timing using the clock
Control Signals:
Control Unit outputs:
├── Memory Read/Write
├── ALU operation select
├── Register select
├── Bus control
└── Interrupt handling
3. Registers
Registers are ultra-fast storage locations inside the CPU.
Common Registers:
┌──────────────────────────────────────────┐
│ General-Purpose Registers │
│ ┌────┬────┬────┬────┬─────┬─────────┐ │
│ │ R0 │ R1 │ R2 │ R3 │ ... │ R15/16 │ │
│ └────┴────┴────┴────┴─────┴─────────┘ │
│ │
│ Special-Purpose Registers │
│ ┌────────────────────────────────────┐ │
│ │ PC - Program Counter │ │
│ │ IR - Instruction Register │ │
│ │ MAR - Memory Address Register │ │
│ │ MDR - Memory Data Register │ │
│ │ SP - Stack Pointer │ │
│ │ SR - Status Register (Flags) │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
Register Functions:
- Program Counter (PC): Holds address of next instruction
- Instruction Register (IR): Holds current instruction being executed
- Memory Address Register (MAR): Holds memory address to read/write
- Memory Data Register (MDR): Holds data read from or written to memory
- Stack Pointer (SP): Points to top of stack
- Accumulator: Stores intermediate results (some architectures)
The Fetch-Decode-Execute Cycle
The fundamental operating cycle of a CPU:
┌──────────┐
│ FETCH │ ← Retrieve instruction from memory
└────┬─────┘
│
▼
┌──────────┐
│ DECODE │ ← Interpret the instruction
└────┬─────┘
│
▼
┌──────────┐
│ EXECUTE │ ← Perform the operation
└────┬─────┘
│
└──────► (repeat)
Detailed Cycle Steps
1. FETCH
1. Copy PC to MAR
2. Send memory read signal
3. Copy instruction from memory to MDR
4. Copy MDR to IR
5. Increment PC
2. DECODE
1. Control unit examines IR
2. Determines operation (opcode)
3. Identifies operands
4. Prepares control signals
3. EXECUTE
1. Activate appropriate functional units
2. Perform the operation (ALU, memory access, etc.)
3. Store result in register or memory
4. Update status flags if necessary
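The FETCH micro-steps above can be written out as plain register transfers. This is an illustrative sketch using ordinary variables for the registers; it assumes a word-addressed memory, so the PC increments by 1 rather than 4.

```python
# FETCH as register transfers (names follow the text; word-addressed
# memory assumed, so PC advances by 1 per instruction).
memory = {0x1000: 0x1123}   # one 16-bit instruction at address 0x1000

pc = 0x1000
mar = pc                    # 1. Copy PC to MAR
mdr = memory[mar]           # 2-3. Memory read: MDR <- Memory[MAR]
ir = mdr                    # 4. Copy MDR to IR
pc += 1                     # 5. Increment PC
```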
Example: ADD Instruction
Instruction: ADD R1, R2, R3 (R1 = R2 + R3)
Cycle:
┌─────────┬──────────────────────────────────────────┐
│ FETCH │ 1. MAR ← PC (e.g., 0x1000) │
│ │ 2. MDR ← Memory[MAR] │
│ │ 3. IR ← MDR (instruction loaded) │
│ │ 4. PC ← PC + 4 (next instruction) │
├─────────┼──────────────────────────────────────────┤
│ DECODE │ 1. Opcode = ADD │
│ │ 2. Source1 = R2 │
│ │ 3. Source2 = R3 │
│ │ 4. Destination = R1 │
├─────────┼──────────────────────────────────────────┤
│ EXECUTE │ 1. Read values from R2 and R3 │
│ │ 2. Send to ALU with ADD operation │
│ │ 3. ALU computes result │
│ │ 4. Write result to R1 │
│ │ 5. Update flags (Z, C, N, V) │
└─────────┴──────────────────────────────────────────┘
Instruction Set Architecture (ISA)
The ISA defines the interface between software and hardware.
Components of an Instruction
┌─────────┬──────────┬──────────┬──────────┐
│ Opcode │ Operand1 │ Operand2 │ Operand3 │
└─────────┴──────────┴──────────┴──────────┘
Opcode: What operation to perform
Operands: Where to get data / where to store result
Instruction Types
1. Data Movement
LOAD R1, [address] ; Load from memory
STORE R1, [address] ; Store to memory
MOVE R1, R2 ; Copy between registers
2. Arithmetic
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R1, R2, R3 ; R1 = R2 - R3
MUL R1, R2, R3 ; R1 = R2 × R3
DIV R1, R2, R3 ; R1 = R2 ÷ R3
INC R1 ; R1 = R1 + 1
3. Logical
AND R1, R2, R3 ; R1 = R2 & R3
OR R1, R2, R3 ; R1 = R2 | R3
XOR R1, R2, R3 ; R1 = R2 ^ R3
NOT R1, R2 ; R1 = ~R2
4. Shift/Rotate
SHL R1, R2, 2 ; Shift left 2 bits
SHR R1, R2, 2 ; Shift right 2 bits
ROL R1, R2, 1 ; Rotate left 1 bit
5. Control Flow
JMP label ; Unconditional jump
JZ label ; Jump if zero
JNZ label ; Jump if not zero
CALL function ; Function call
RET ; Return from function
6. Comparison
CMP R1, R2 ; Compare R1 and R2 (sets flags)
TEST R1, R2 ; Bitwise AND without storing result
Addressing Modes
Different ways to specify operands:
1. Immediate Addressing
LOAD R1, #5 ; R1 = 5 (value is in instruction)
2. Direct/Absolute Addressing
LOAD R1, [1000] ; R1 = Memory[1000]
3. Register Addressing
ADD R1, R2, R3 ; R1 = R2 + R3
4. Indirect Addressing
LOAD R1, [R2] ; R1 = Memory[R2]
5. Indexed Addressing
LOAD R1, [R2 + 100] ; R1 = Memory[R2 + 100]
; (base + offset)
6. Register Indirect with Offset
LOAD R1, [R2 + R3] ; R1 = Memory[R2 + R3]
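Each addressing mode above reduces to a different way of computing an effective address. The sketch below simulates them with a Python list as memory and a dict as the register file; the particular addresses and values are made up for illustration.

```python
# Simulating the addressing modes with a list as memory and a dict as
# the register file (addresses and values are illustrative).
memory = [0] * 2048
registers = {'R2': 1000, 'R3': 4}
memory[1000] = 11    # target of direct and indirect loads
memory[1100] = 22    # target of indexed load (base 1000 + offset 100)
memory[1004] = 33    # target of register + register load (1000 + 4)

immediate  = 5                                           # LOAD R1, #5
direct     = memory[1000]                                # LOAD R1, [1000]
indirect   = memory[registers['R2']]                     # LOAD R1, [R2]
indexed    = memory[registers['R2'] + 100]               # LOAD R1, [R2 + 100]
reg_offset = memory[registers['R2'] + registers['R3']]   # LOAD R1, [R2 + R3]
```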
Machine Code and Encoding
Instructions are encoded as binary numbers.
Example: Simple 16-bit Instruction Format
┌────────┬──────┬──────┬──────┐
│ Opcode │ Rd │ Rs │ Rt │
│ 4 bits │ 4 bits│4 bits│4 bits│
└────────┴──────┴──────┴──────┘
Example: ADD R1, R2, R3
Opcode: 0001 (ADD)
Rd: 0001 (R1 - destination)
Rs: 0010 (R2 - source 1)
Rt: 0011 (R3 - source 2)
Machine Code: 0001 0001 0010 0011 = 0x1123
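The field packing above can be checked with a small encoder/decoder. This is a sketch of the 16-bit format shown (4-bit opcode, then Rd, Rs, Rt); the opcode numbering is the one used in this reading's examples.

```python
OPCODES = {'ADD': 0x1, 'SUB': 0x2}   # opcode numbers from this reading

def encode(op, rd, rs, rt):
    """Pack a 16-bit instruction: 4-bit opcode, then Rd, Rs, Rt."""
    return (OPCODES[op] << 12) | (rd << 8) | (rs << 4) | rt

def decode(word):
    """Unpack a 16-bit instruction into (opcode, rd, rs, rt)."""
    return ((word >> 12) & 0xF, (word >> 8) & 0xF,
            (word >> 4) & 0xF, word & 0xF)
```

Encoding ADD R1, R2, R3 with this helper reproduces the 0x1123 worked out above, and decoding 0x1123 recovers the four fields.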
Instruction Formats
Fixed-Length (RISC)
- All instructions same size (e.g., 32 bits)
- Simpler decoding
- May waste bits for simple operations
Variable-Length (CISC)
- Instructions can be different sizes
- More efficient encoding
- More complex decoding
CISC vs RISC
CISC (Complex Instruction Set Computer)
Examples: x86, x86-64
Characteristics:
- Large instruction set (hundreds of instructions)
- Variable-length instructions
- Complex addressing modes
- Instructions can take multiple cycles
- Fewer instructions per program
Example CISC Instructions:
MULT [R1], [R2+100] ; Memory[R1] = Memory[R1] * Memory[R2+100]
; (single instruction does a lot)
Advantages:
- Fewer instructions for complex operations
- Smaller code size
- Better use of memory bandwidth
Disadvantages:
- Complex hardware
- Harder to pipeline
- Variable execution times
RISC (Reduced Instruction Set Computer)
Examples: ARM, MIPS, RISC-V
Characteristics:
- Small instruction set (dozens of core instructions rather than hundreds)
- Fixed-length instructions
- Simple addressing modes
- Most instructions execute in one cycle
- Load/Store architecture (only LOAD/STORE access memory)
Example RISC Instructions:
LOAD R1, [R2] ; R1 = Memory[R2]
LOAD R3, [R4+100] ; R3 = Memory[R4+100]
MULT R1, R1, R3 ; R1 = R1 * R3
STORE R1, [R2] ; Memory[R2] = R1
; (same operation as CISC example, but 4 instructions)
Advantages:
- Simpler hardware
- Easier to pipeline
- Predictable execution time
- Lower power consumption
Disadvantages:
- More instructions per program
- Larger code size
Modern Trend: Hybrid Approach
Modern processors (even x86) use RISC-like execution internally:
- Decode complex instructions into micro-operations
- Execute micro-ops on RISC-like core
- Get benefits of both approaches
CPU Performance
Clock Speed
The CPU clock synchronizes operations.
Clock Cycle:
┌──┐ ┌──┐ ┌──┐ ┌──┐
─────┘ └──┘ └──┘ └──┘ └─────
Frequency: Number of cycles per second (Hz)
Period: Time per cycle (seconds)
1 GHz = 1,000,000,000 cycles/second
Period = 1/frequency = 1 nanosecond
Clock Speed Examples:
- 1 GHz = 1 billion cycles/second
- 3.5 GHz = 3.5 billion cycles/second
Note: Higher clock speed ≠ always faster (architecture matters!)
Cycles Per Instruction (CPI)
Different instructions take different numbers of cycles:
Simple instruction (ADD): 1 cycle
Memory load (LOAD): 3-5 cycles
Division (DIV): 10-40 cycles
Performance Metrics
Execution Time:
Time = (Instructions × CPI × Clock Period)
Or:
Time = (Instructions × CPI) / Clock Frequency
Example:
Program: 1,000,000 instructions
Average CPI: 2
Clock: 2 GHz
Time = (1,000,000 × 2) / (2 × 10⁹)
= 2,000,000 / 2,000,000,000
= 0.001 seconds = 1 millisecond
MIPS (Million Instructions Per Second):
MIPS = Clock Frequency (MHz) / CPI
Example:
2000 MHz / 2 CPI = 1000 MIPS
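The two formulas above are easy to turn into helper functions; this sketch reproduces the worked examples.

```python
def execution_time(instructions, cpi, clock_hz):
    """Execution time in seconds: (instructions x CPI) / clock frequency."""
    return instructions * cpi / clock_hz

def mips(clock_mhz, cpi):
    """Million instructions per second: clock frequency (MHz) / CPI."""
    return clock_mhz / cpi
```

For the example program (1,000,000 instructions, CPI 2, 2 GHz clock) this gives 0.001 seconds, and 2000 MHz at CPI 2 gives 1000 MIPS, matching the figures above.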
Pipelining
Pipelining overlaps instruction execution for better throughput.
Non-Pipelined Execution
Time:   1    2    3    4    5    6    7    8    9   10   11   12
      ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
Inst1 │ F  │ D  │ E  │    │    │    │    │    │    │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Inst2 │    │    │    │ F  │ D  │ E  │    │    │    │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Inst3 │    │    │    │    │    │    │ F  │ D  │ E  │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Inst4 │    │    │    │    │    │    │    │    │    │ F  │ D  │ E  │
      └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
F = Fetch, D = Decode, E = Execute
Total time: 12 cycles for 4 instructions
Pipelined Execution
Time:   1    2    3    4    5    6
      ┌────┬────┬────┬────┬────┬────┐
Inst1 │ F  │ D  │ E  │    │    │    │
      ├────┼────┼────┼────┼────┼────┤
Inst2 │    │ F  │ D  │ E  │    │    │
      ├────┼────┼────┼────┼────┼────┤
Inst3 │    │    │ F  │ D  │ E  │    │
      ├────┼────┼────┼────┼────┼────┤
Inst4 │    │    │    │ F  │ D  │ E  │
      └────┴────┴────┴────┴────┴────┘
Total time: 6 cycles for 4 instructions
Speedup: 12/6 = 2×
Ideal Speedup: For n stages, speedup approaches n for long programs.
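The cycle counts above follow a simple pattern: a pipeline needs n_stages cycles to fill, then completes one instruction per cycle. A small sketch of the ideal (hazard-free) model:

```python
def pipeline_cycles(n_instructions, n_stages):
    """Ideal pipelined cycle count: fill the pipe, then one completion
    per cycle (assumes no stalls or hazards)."""
    return n_stages + (n_instructions - 1)

def pipeline_speedup(n_instructions, n_stages):
    """Speedup over non-pipelined execution, where each instruction
    takes n_stages cycles on its own."""
    return (n_instructions * n_stages) / pipeline_cycles(n_instructions, n_stages)
```

For the 4-instruction, 3-stage example this gives 6 cycles and a speedup of 2×; for very long programs the speedup approaches the stage count, as noted above.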
Pipeline Stages (5-stage classic RISC)
1. IF - Instruction Fetch
2. ID - Instruction Decode
3. EX - Execute
4. MEM - Memory Access
5. WB - Write Back
┌────┬────┬────┬────┬────┐
│ IF │ ID │ EX │MEM │ WB │ Instruction 1
└────┴────┴────┴────┴────┘
┌────┬────┬────┬────┬────┐
│ IF │ ID │ EX │MEM │ WB │ Instruction 2
└────┴────┴────┴────┴────┘
┌────┬────┬────┬────┬────┐
│ IF │ ID │ EX │MEM │ WB │ Instruction 3
└────┴────┴────┴────┴────┘
Pipeline Hazards
1. Structural Hazards
- Resource conflict (e.g., memory access)
- Solution: Separate instruction/data caches
2. Data Hazards
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R4, R1, R5 ; R4 = R1 - R5 (needs R1 from previous!)
- Solutions: Forwarding, stalling, reordering
3. Control Hazards
BEQ R1, R2, label ; Branch instruction
ADD R3, R4, R5 ; What if we branch? This shouldn't execute!
- Solutions: Branch prediction, delayed branches
Multi-Core and Parallel Processing
Modern CPUs have multiple cores:
┌─────────────────────────────────────┐
│ CPU Package │
│ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Core 0 │ │ Core 1 │ │
│ │ ┌───┐ │ │ ┌───┐ │ │
│ │ │ALU│ │ │ │ALU│ │ │
│ │ ├───┤ │ │ ├───┤ │ │
│ │ │Reg│ │ │ │Reg│ │ │
│ │ └───┘ │ │ └───┘ │ │
│ │ L1 Cache│ │ L1 Cache│ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│       └────────┬───────┘        │
│ │ │
│ ┌────▼────┐ │
│ │ L2/L3 │ │
│ │ Cache │ │
│ └─────────┘ │
└─────────────────────────────────────┘
Benefits:
- True parallel execution
- Better multitasking
- Higher throughput
Challenges:
- Synchronization
- Cache coherency
- Programming complexity
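How much a program gains from multiple cores is bounded by its serial fraction, a result known as Amdahl's Law (revisited in the exercises below). A quick sketch:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
    fraction of the work that can run in parallel."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n_cores)
```

A program that is 90% parallelizable gains only about 3.1× on 4 cores; the 10% serial portion dominates as core counts grow.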
Practical Example: Tracing Execution
Simple Program
; Calculate: result = (a + b) - c
LOAD R1, [a_addr] ; R1 = a
LOAD R2, [b_addr] ; R2 = b
ADD R3, R1, R2 ; R3 = a + b
LOAD R4, [c_addr] ; R4 = c
SUB R5, R3, R4 ; R5 = (a+b) - c
STORE R5, [result] ; result = R5
Detailed Trace
Initial State:
Memory[a_addr] = 10
Memory[b_addr] = 20
Memory[c_addr] = 5
PC = 0x1000
Instruction 1: LOAD R1, [a_addr]
FETCH: IR ← Memory[PC], PC ← PC + 4
DECODE: Opcode = LOAD, Dest = R1, Addr = a_addr
EXECUTE: MAR ← a_addr, MDR ← Memory[MAR], R1 ← MDR
Result: R1 = 10
Instruction 2: LOAD R2, [b_addr]
Result: R2 = 20
Instruction 3: ADD R3, R1, R2
FETCH: IR ← Memory[PC], PC ← PC + 4
DECODE: Opcode = ADD, Dest = R3, Src1 = R1, Src2 = R2
EXECUTE: ALU ← R1 + R2, R3 ← ALU_Result
Result: R3 = 30
Instruction 4: LOAD R4, [c_addr]
Result: R4 = 5
Instruction 5: SUB R5, R3, R4
EXECUTE: ALU ← R3 - R4, R5 ← ALU_Result
Result: R5 = 25
Instruction 6: STORE R5, [result]
EXECUTE: MAR ← result, MDR ← R5, Memory[MAR] ← MDR
Result: Memory[result] = 25
Programming Example
Python: Simple CPU Simulator
class SimpleCPU:
    def __init__(self):
        self.registers = [0] * 8      # 8 general-purpose registers
        self.pc = 0                   # Program counter
        self.ir = 0                   # Instruction register
        self.flags = {                # Status flags
            'Z': 0,  # Zero
            'N': 0,  # Negative
            'C': 0,  # Carry
            'V': 0   # Overflow
        }
        self.memory = [0] * 1024      # 1K words of memory
        self.running = True

    def fetch(self):
        """Fetch instruction from memory"""
        self.ir = self.memory[self.pc]
        self.pc += 1

    def decode_execute(self):
        """Decode and execute instruction"""
        opcode = (self.ir >> 12) & 0xF
        rd = (self.ir >> 8) & 0xF
        rs = (self.ir >> 4) & 0xF
        rt = self.ir & 0xF

        if opcode == 0x0:    # HALT
            self.running = False
        elif opcode == 0x1:  # ADD
            result = self.registers[rs] + self.registers[rt]
            self.registers[rd] = result & 0xFFFF
            self.update_flags(result)
        elif opcode == 0x2:  # SUB
            result = self.registers[rs] - self.registers[rt]
            self.registers[rd] = result & 0xFFFF
            self.update_flags(result)
        elif opcode == 0x3:  # LOAD immediate (low 8 bits hold the value)
            value = self.ir & 0xFF
            self.registers[rd] = value
        elif opcode == 0x4:  # STORE (address taken from Rs)
            addr = self.registers[rs]
            self.memory[addr] = self.registers[rd]

    def update_flags(self, result):
        """Update status flags based on result (C and V omitted here
        for simplicity; a full ALU would set them too)"""
        self.flags['Z'] = 1 if result == 0 else 0
        self.flags['N'] = 1 if result < 0 else 0

    def run(self):
        """Run the fetch-decode-execute cycle"""
        cycle = 0
        while self.running and cycle < 1000:
            print(f"\n--- Cycle {cycle} ---")
            self.fetch()
            print(f"Fetched: {hex(self.ir)} from address {self.pc - 1}")
            self.decode_execute()
            print(f"Registers: {self.registers[:4]}")
            print(f"Flags: Z={self.flags['Z']}, N={self.flags['N']}")
            cycle += 1

# Example usage
cpu = SimpleCPU()

# Program: Add two numbers
#   LOAD R0, #10
#   LOAD R1, #20
#   ADD  R2, R0, R1
#   HALT
cpu.memory[0] = 0x300A  # LOAD R0, #10
cpu.memory[1] = 0x3114  # LOAD R1, #20
cpu.memory[2] = 0x1201  # ADD R2, R0, R1
cpu.memory[3] = 0x0000  # HALT

cpu.run()
print(f"\nFinal result in R2: {cpu.registers[2]}")
Exercises
Basic Exercises
Fetch-Decode-Execute
- Trace the execution of: LOAD R1, #50; ADD R2, R1, R1
- Identify which CPU component is active at each stage
- Draw a timing diagram for the execution
Instruction Encoding
- Encode ADD R3, R1, R2 in a 16-bit format (4-bit opcode, 4-bit per register)
- Decode: 0x2315 (opcode: 2=SUB, registers: 3,1,5)
- Design a 3-bit opcode instruction set with 8 operations
Registers
- Why are registers faster than RAM?
- What happens to PC after a jump instruction?
- Trace PC values for a 5-instruction sequence with one jump
Intermediate Exercises
Pipeline Analysis
- Draw a pipeline diagram for 6 instructions (3-stage pipeline)
- Calculate speedup vs. non-pipelined execution
- Identify a data hazard in: ADD R1,R2,R3; SUB R4,R1,R5
Performance Calculation
- Program: 1M instructions, CPI=1.5, Clock=3GHz. Find execution time.
- Compare: 2GHz RISC (CPI=1) vs. 3GHz CISC (CPI=3) for same program
- Calculate MIPS rating for a 2.5GHz processor with CPI of 2
ISA Design
- Design addressing modes for: array access, function call, global variable
- Compare instruction count: RISC vs CISC for a simple loop
- Justify LOAD/STORE architecture advantages
Advanced Exercises
Complex Tracing
- Trace execution with branch: if (a>b) c=a else c=b
- Show pipeline diagram including a branch misprediction
- Calculate performance impact of branch misprediction
Architecture Comparison
- Design both CISC and RISC instruction sets for: array[i] = array[i] + 5
- Compare code size, instruction count, and complexity
- Analyze trade-offs for embedded vs desktop systems
Multi-Core
- Design a task distribution strategy for 4 cores
- Identify where cache coherency problems might occur
- Calculate theoretical speedup (consider Amdahl's Law)
CPU Design
- Design a minimal ISA (8 instructions) for basic computation
- Create a state machine for a 3-stage pipeline
- Estimate logic gate count for a simple ALU (4-bit, ADD/SUB/AND/OR)
Summary
In this reading, we explored the architecture and operation of the CPU:
Key Concepts:
- Von Neumann architecture provides the foundation for modern computers
- CPU components: ALU (computation), Control Unit (coordination), Registers (fast storage)
- Fetch-Decode-Execute cycle is the fundamental operation loop
- ISA defines the interface between hardware and software
- Pipelining improves throughput by overlapping instruction execution
- CISC vs RISC represent different design philosophies
Important Skills:
- Tracing instruction execution through the CPU
- Understanding how instructions are encoded
- Analyzing performance based on clock speed, CPI, and instruction count
- Recognizing pipeline hazards and their solutions
Why This Matters:
- Foundation for understanding how programs actually execute
- Essential for writing performance-critical code
- Helps debug low-level issues
- Critical for systems programming and compiler design
Further Reading
- "Computer Organization and Design" by Patterson & Hennessy
- Study specific architectures: x86-64, ARM, RISC-V
- Superscalar execution and out-of-order processing
- Branch prediction algorithms
Next Steps
Now that you understand how the CPU processes instructions, you're ready to learn about the memory hierarchy that stores the data and instructions the CPU needs.
Next Reading: 04-memory.md - Memory Hierarchy
Module 5: Computer Architecture | Reading 3 of 5