GPU and Parallel Hardware
Introduction
GPUs (Graphics Processing Units) offer massive parallelism with thousands of cores optimized for data-parallel workloads. This reading covers GPU architecture, programming models, and parallel algorithms for accelerators.
Learning Objectives
By the end of this reading, you will be able to:
- Understand GPU architecture and memory hierarchy
- Write data-parallel programs conceptually
- Apply GPU programming patterns
- Optimize memory access patterns
- Understand when to use GPUs vs CPUs
1. GPU Architecture
CPU vs GPU
CPU (Few powerful cores):
┌─────────────┐
│ Core 0 │ - Complex control logic
│ [ALU][Cache]│ - Out-of-order execution
├─────────────┤ - Branch prediction
│ Core 1 │ - Large caches
│ [ALU][Cache]│ - Optimized for latency
├─────────────┤
│ Core 2-7... │
└─────────────┘
GPU (Many simple cores):
┌────────────────────────────────────┐
│ SM 0 │
│ [C][C][C][C][C][C][C][C]...32 cores│
│ [Shared Memory] [Registers] │
├────────────────────────────────────┤
│ SM 1 │ - Simple cores
│ [C][C][C][C][C][C][C][C]...32 cores│ - SIMT execution
├────────────────────────────────────┤ - Small caches
│ SM 2-N... │ - Optimized for throughput
└────────────────────────────────────┘
┌────────────────────────────────────┐
│            Global Memory           │
└────────────────────────────────────┘
Key Concepts
"""
GPU Programming Concepts:
1. KERNEL: Function that runs on GPU
- Launched with many threads
- Each thread runs same code on different data
2. THREAD: Single execution unit
- Has unique ID (threadIdx)
- Can access shared and global memory
3. BLOCK (Thread Block): Group of threads
- Share fast shared memory
- Can synchronize within block
- Has unique ID (blockIdx)
4. GRID: Collection of blocks
- Organizes all threads for kernel
- Can be 1D, 2D, or 3D
5. WARP: 32 threads executed in lockstep (NVIDIA)
- All threads in warp execute same instruction
- Divergence causes serialization
"""
# Conceptual thread indexing
def get_thread_index_1d(block_idx, block_dim, thread_idx):
"""1D global thread index"""
return block_idx * block_dim + thread_idx
def get_thread_index_2d(block_idx, block_dim, thread_idx):
"""2D global thread index"""
global_x = block_idx[0] * block_dim[0] + thread_idx[0]
global_y = block_idx[1] * block_dim[1] + thread_idx[1]
return (global_x, global_y)
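As a quick check of the 1D formula (redefined here so the snippet stands alone), the thread at position 5 inside block 2, with 256 threads per block, gets global index 2 * 256 + 5:

```python
def get_thread_index_1d(block_idx, block_dim, thread_idx):
    """1D global thread index: same formula as above."""
    return block_idx * block_dim + thread_idx

# Block 2 of dimension 256, local thread 5
print(get_thread_index_1d(2, 256, 5))  # 517
```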
Memory Hierarchy
"""
GPU Memory Hierarchy (fastest to slowest):
1. REGISTERS
- Per-thread private
- Fastest access (~1 cycle)
- Limited quantity
2. SHARED MEMORY
- Per-block, shared by threads in block
- Fast (~5 cycles)
- User-managed cache
- ~48KB per block by default (up to ~100-164KB per SM on recent architectures)
3. L1/L2 CACHE
- Hardware-managed
- L1 is per-SM, L2 is shared
4. GLOBAL MEMORY (VRAM)
- Accessible by all threads
- Slowest (~400-800 cycles)
- Large (8-80GB)
5. CONSTANT MEMORY
- Read-only, cached
- Broadcast to all threads
6. TEXTURE MEMORY
- Read-only, cached
- Optimized for spatial locality
"""
class MemoryAccessPattern:
"""Demonstrate memory access patterns"""
@staticmethod
def coalesced_access(thread_id, data):
"""
GOOD: Adjacent threads access adjacent memory.
GPU can combine into single memory transaction.
Thread 0: data[0]
Thread 1: data[1]
Thread 2: data[2]
...
"""
return data[thread_id]
@staticmethod
def strided_access(thread_id, data, stride):
"""
BAD: Threads access non-adjacent memory.
Causes multiple memory transactions.
Thread 0: data[0]
Thread 1: data[stride]
Thread 2: data[2*stride]
...
"""
return data[thread_id * stride]
@staticmethod
def random_access(thread_id, data, indices):
"""
WORST: Random access pattern.
No coalescing possible.
Thread 0: data[indices[0]] # could be anywhere
Thread 1: data[indices[1]] # could be anywhere
...
"""
return data[indices[thread_id]]
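These patterns can be quantified with a simplified transaction model. Assuming 128-byte memory transactions and 4-byte elements (typical figures, but architecture-dependent), each distinct 128-byte segment a warp touches costs one transaction:

```python
def warp_transactions(addresses, segment_bytes=128, elem_bytes=4):
    """Count memory transactions for one warp under a simple model:
    each distinct 128-byte segment touched costs one transaction."""
    segments = {(a * elem_bytes) // segment_bytes for a in addresses}
    return len(segments)

coalesced = warp_transactions(range(32))              # threads 0..31 -> data[0..31]
strided   = warp_transactions(range(0, 32 * 32, 32))  # stride-32 access
print(coalesced, strided)  # 1 32
```

The coalesced pattern needs one transaction for the whole warp; the stride-32 pattern needs one per thread.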
2. GPU Programming Model
Conceptual CUDA-like Programming
import numpy as np
from typing import Callable
class SimulatedGPU:
"""Simulates GPU execution model for learning"""
def __init__(self, num_sms=4, threads_per_block=256):
self.num_sms = num_sms
self.threads_per_block = threads_per_block
def launch_kernel(
self,
kernel: Callable,
grid_dim: tuple,
block_dim: tuple,
*args
):
"""
Launch a kernel with specified grid and block dimensions.
kernel: Function to execute for each thread
grid_dim: Number of blocks (x, y, z)
block_dim: Threads per block (x, y, z)
args: Arguments passed to kernel
"""
total_blocks = np.prod(grid_dim)
total_threads = total_blocks * np.prod(block_dim)
print(f"Launching kernel with {total_blocks} blocks, "
f"{total_threads} total threads")
# Simulate execution (in reality, GPU executes in parallel)
for bx in range(grid_dim[0]):
for by in range(grid_dim[1] if len(grid_dim) > 1 else 1):
for bz in range(grid_dim[2] if len(grid_dim) > 2 else 1):
block_idx = (bx, by, bz)
for tx in range(block_dim[0]):
for ty in range(block_dim[1] if len(block_dim) > 1 else 1):
for tz in range(block_dim[2] if len(block_dim) > 2 else 1):
thread_idx = (tx, ty, tz)
# Execute kernel for this thread
kernel(block_idx, block_dim, thread_idx, *args)
# Example kernel: Vector addition
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
"""
Kernel for c = a + b
Each thread computes one element.
"""
# Calculate global thread index
idx = block_idx[0] * block_dim[0] + thread_idx[0]
# Bounds check (grid might be larger than data)
if idx < n:
c[idx] = a[idx] + b[idx]
def vector_add_example():
"""Demonstrate GPU vector addition"""
n = 1000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)
gpu = SimulatedGPU()
# Calculate grid dimensions
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
    gpu.launch_kernel(
        vector_add_kernel,
        (blocks,),             # grid_dim
        (threads_per_block,),  # block_dim
        a, b, c, n
    )
# Verify
expected = a + b
print(f"Correct: {np.allclose(c, expected)}")
NumPy/CuPy Style GPU Programming
# In practice, use high-level libraries
# CuPy provides NumPy-compatible GPU arrays
"""
# With CuPy (actual GPU library):
import cupy as cp
# Create GPU arrays
a_gpu = cp.array([1, 2, 3, 4, 5])
b_gpu = cp.array([5, 4, 3, 2, 1])
# Operations run on GPU automatically
c_gpu = a_gpu + b_gpu
d_gpu = cp.dot(a_gpu, b_gpu)
e_gpu = cp.sum(a_gpu)
# Transfer back to CPU
c_cpu = cp.asnumpy(c_gpu)
"""
# Simulated high-level GPU operations
class GPUArray:
"""Simulated GPU array for demonstration"""
def __init__(self, data):
if isinstance(data, np.ndarray):
self.data = data.copy()
else:
self.data = np.array(data)
self._on_gpu = True
def __add__(self, other):
"""Element-wise addition (would be parallel on GPU)"""
if isinstance(other, GPUArray):
return GPUArray(self.data + other.data)
return GPUArray(self.data + other)
def __mul__(self, other):
"""Element-wise multiplication"""
if isinstance(other, GPUArray):
return GPUArray(self.data * other.data)
return GPUArray(self.data * other)
def sum(self):
"""Reduction (parallel on GPU)"""
return GPUArray(np.array([self.data.sum()]))
def to_cpu(self):
"""Transfer to CPU memory"""
self._on_gpu = False
return self.data.copy()
@staticmethod
def from_cpu(data):
"""Transfer from CPU to GPU"""
return GPUArray(data)
3. Common GPU Algorithms
Parallel Reduction
class ParallelReduction:
"""
Parallel reduction pattern.
Reduce N elements to 1 using O(log N) parallel steps.
"""
@staticmethod
def naive_reduction(data: list, op=lambda a, b: a + b) -> float:
"""
Naive sequential reduction: O(N)
"""
result = data[0]
for x in data[1:]:
result = op(result, x)
return result
    @staticmethod
    def parallel_reduction_concept(data: list, op=lambda a, b: a + b) -> float:
        """
        Parallel reduction: O(log N) parallel steps.
        Start:  [0, 1, 2, 3, 4, 5, 6, 7]
        Step 1: [0+1, 2+3, 4+5, 6+7] -> [1, 5, 9, 13]
        Step 2: [1+5, 9+13]          -> [6, 22]
        Step 3: [6+22]               -> [28]
        """
        n = len(data)
        result = data.copy()
        stride = 1
        while stride < n:
            # In parallel, each active thread combines result[i] with
            # result[i + stride], leaving partial results in place.
            for i in range(0, n, stride * 2):
                if i + stride < n:
                    result[i] = op(result[i], result[i + stride])
            stride *= 2
        return result[0]
    @staticmethod
    def tree_reduction(data: list, op=lambda a, b: a + b) -> float:
        """
        Tree reduction (better memory access pattern).
        Assumes len(data) is a power of two.
        Thread i computes: result[i] = op(result[i], result[i + n/2])
        Then: result[i] = op(result[i], result[i + n/4])
        etc.
        """
        n = len(data)
        result = data.copy()
        active_threads = n // 2
        while active_threads > 0:
            # Each active thread combines two elements
            for i in range(active_threads):
                result[i] = op(result[i], result[i + active_threads])
            active_threads //= 2
        return result[0]
# Demonstrate reduction
data = [1, 2, 3, 4, 5, 6, 7, 8]
pr = ParallelReduction()
print(f"Naive: {pr.naive_reduction(data)}")
print(f"Parallel concept: {pr.parallel_reduction_concept(data)}")
print(f"Tree: {pr.tree_reduction(data)}")
Parallel Prefix Sum (Scan)
class ParallelScan:
"""
Parallel prefix sum (scan).
Input: [3, 1, 7, 0, 4, 1, 6, 3]
Output: [3, 4, 11, 11, 15, 16, 22, 25] (inclusive)
Or: [0, 3, 4, 11, 11, 15, 16, 22] (exclusive)
"""
@staticmethod
def sequential_scan(data: list) -> list:
"""Sequential prefix sum: O(N)"""
result = []
total = 0
for x in data:
total += x
result.append(total)
return result
@staticmethod
def hillis_steele_scan(data: list) -> list:
"""
Hillis-Steele scan: O(N log N) work, O(log N) steps.
Work-inefficient but highly parallel.
"""
n = len(data)
result = data.copy()
stride = 1
while stride < n:
# Each thread i computes (in parallel):
# new[i] = result[i] + result[i - stride] (if i >= stride)
new_result = result.copy()
for i in range(stride, n):
new_result[i] = result[i] + result[i - stride]
result = new_result
stride *= 2
return result
@staticmethod
def blelloch_scan(data: list) -> list:
"""
Blelloch scan: O(N) work, O(log N) steps.
Work-efficient, two-phase algorithm.
Phase 1 (Up-sweep): Build reduction tree
Phase 2 (Down-sweep): Traverse tree to compute prefix sums
"""
n = len(data)
# Pad to power of 2
import math
padded_n = 2 ** math.ceil(math.log2(n))
result = data + [0] * (padded_n - n)
# Up-sweep (reduce)
stride = 1
while stride < padded_n:
for i in range(stride - 1, padded_n - 1, stride * 2):
result[i + stride] += result[i]
stride *= 2
# Set root to 0 for exclusive scan
result[padded_n - 1] = 0
# Down-sweep
stride = padded_n // 2
while stride > 0:
for i in range(stride - 1, padded_n - 1, stride * 2):
temp = result[i]
result[i] = result[i + stride]
result[i + stride] += temp
stride //= 2
# Convert to inclusive by shifting and adding original
inclusive = [result[i] + data[i] if i < n else 0 for i in range(n)]
return inclusive
# Demonstrate scans
data = [3, 1, 7, 0, 4, 1, 6, 3]
ps = ParallelScan()
print(f"Input: {data}")
print(f"Sequential: {ps.sequential_scan(data)}")
print(f"Hillis-Steele: {ps.hillis_steele_scan(data)}")
print(f"Blelloch: {ps.blelloch_scan(data)}")
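The reference values in the docstring can be cross-checked against NumPy's built-in cumulative sum; the exclusive scan is the inclusive scan shifted by one element:

```python
import numpy as np

data = np.array([3, 1, 7, 0, 4, 1, 6, 3])
inclusive = np.cumsum(data)
exclusive = inclusive - data  # exclusive[i] = inclusive[i] - data[i]
print(inclusive.tolist())  # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive.tolist())  # [0, 3, 4, 11, 11, 15, 16, 22]
```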
Parallel Matrix Multiplication
class ParallelMatrixMultiply:
"""
Matrix multiplication on GPU.
C = A * B where A is MxK and B is KxN
"""
@staticmethod
def naive_kernel(block_idx, block_dim, thread_idx, A, B, C, M, K, N):
"""
Naive: Each thread computes one element of C.
Poor memory access pattern.
"""
row = block_idx[0] * block_dim[0] + thread_idx[0]
col = block_idx[1] * block_dim[1] + thread_idx[1]
if row < M and col < N:
total = 0
for k in range(K):
total += A[row][k] * B[k][col]
C[row][col] = total
    @staticmethod
    def tiled_matmul_concept(A, B, tile_size=16):
        """
        Tiled matrix multiplication.
        Load tiles into shared memory, compute, repeat.
        Much better memory access pattern.
        """
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        C = np.zeros((M, N))
        # Process tiles (edge tiles may be smaller than tile_size)
        for tile_row in range(0, M, tile_size):
            for tile_col in range(0, N, tile_size):
                # Size the tile result to the actual tile extents
                rows = min(tile_size, M - tile_row)
                cols = min(tile_size, N - tile_col)
                tile_c = np.zeros((rows, cols))
                for tile_k in range(0, K, tile_size):
                    # Load tiles into "shared memory"
                    tile_a = A[tile_row:tile_row+rows,
                               tile_k:tile_k+tile_size]
                    tile_b = B[tile_k:tile_k+tile_size,
                               tile_col:tile_col+cols]
                    # Compute partial result
                    tile_c += tile_a @ tile_b
                # Write result back to global memory
                C[tile_row:tile_row+rows,
                  tile_col:tile_col+cols] = tile_c
        return C
# Why tiling helps
"""
Without tiling:
- Each thread reads one row of A (K elements) and one column of B (K elements)
- The same rows and columns are re-read by many different threads
- Total global memory reads: MNK (from A) + MNK (from B) = 2MNK
With tiling (tile size T):
- Each tile is loaded into shared memory once per tile phase
- Every loaded element is reused by T threads
- Total global memory reads: 2MNK/T
For T=16, this is 16x fewer global memory accesses!
"""
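The tiling scheme can be verified against a direct product with a compact standalone sketch (dimensions chosen as multiples of the tile size; on a real GPU each (i, j) tile would be assigned to one thread block):

```python
import numpy as np

def tiled_matmul(A, B, T=16):
    """Accumulate T x T partial products -- the shared-memory tile analogue."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):  # one "tile phase" per iteration
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 32))
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```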
4. GPU Optimization
Memory Coalescing
class MemoryCoalescing:
"""
Memory coalescing: Combining memory accesses.
When threads in a warp access consecutive addresses,
GPU combines into fewer memory transactions.
"""
@staticmethod
def coalesced_example():
"""
GOOD: Threads access consecutive elements.
Thread 0: array[0]
Thread 1: array[1]
...
Thread 31: array[31]
-> 1 memory transaction (128 bytes)
"""
pass
@staticmethod
def strided_example():
"""
BAD: Threads access with stride.
Thread 0: array[0]
Thread 1: array[32]
Thread 2: array[64]
...
-> 32 memory transactions!
"""
pass
@staticmethod
def transpose_for_coalescing(matrix, rows, cols):
"""
Sometimes transpose data to enable coalescing.
Row-major access (C-style): Adjacent cols are adjacent in memory
- Reading row: coalesced
- Reading column: strided
If algorithm needs column access, transpose first!
"""
# Original: row-major
# matrix[row][col] = matrix[row * cols + col]
# Transposed: column becomes row
transposed = [[matrix[r][c] for r in range(rows)] for c in range(cols)]
return transposed
Avoiding Warp Divergence
class WarpDivergence:
"""
Warp divergence: When threads in a warp take different branches.
All 32 threads in a warp execute in lockstep.
If some take branch A and others take branch B:
- First, all execute A (non-A threads are masked)
- Then, all execute B (non-B threads are masked)
- Serialization!
"""
@staticmethod
def divergent_kernel(thread_idx, data, result):
"""
BAD: Threads diverge based on data value.
Different threads likely have different values.
"""
if data[thread_idx] > 0:
result[thread_idx] = data[thread_idx] * 2
else:
result[thread_idx] = data[thread_idx] * -1
@staticmethod
def less_divergent_kernel(thread_idx, data, result):
"""
BETTER: Diverge based on thread ID.
Threads 0-15 all take same branch.
Threads 16-31 all take same branch.
Warp 0 doesn't diverge, warp 1 doesn't diverge.
"""
if thread_idx < 16:
result[thread_idx] = data[thread_idx] * 2
else:
result[thread_idx] = data[thread_idx] * -1
    @staticmethod
    def no_divergence_kernel(thread_idx, data, result):
        """
        BEST: No branching, use math.
        Predicated execution or branchless code.
        """
        # A simple conditional expression compiles to a predicated
        # select, not a branch, so the warp stays converged.
        scale = 2 if data[thread_idx] > 0 else -1
        result[thread_idx] = data[thread_idx] * scale
        # Fully branchless alternative:
        # pos = int(data[thread_idx] > 0)
        # result[thread_idx] = data[thread_idx] * (3 * pos - 1)
Occupancy and Resource Usage
class GPUOccupancy:
"""
Occupancy: Ratio of active warps to maximum possible.
Higher occupancy can hide memory latency.
But sometimes lower occupancy with more registers is better!
"""
def __init__(self, max_threads_per_sm=2048, max_blocks_per_sm=32,
registers_per_sm=65536, shared_memory_per_sm=49152):
self.max_threads_per_sm = max_threads_per_sm
self.max_blocks_per_sm = max_blocks_per_sm
self.registers_per_sm = registers_per_sm
self.shared_memory_per_sm = shared_memory_per_sm
def calculate_occupancy(self, threads_per_block, registers_per_thread,
shared_memory_per_block):
"""Calculate theoretical occupancy"""
# Limit by threads per block
blocks_by_threads = self.max_threads_per_sm // threads_per_block
# Limit by registers
registers_per_block = threads_per_block * registers_per_thread
blocks_by_registers = self.registers_per_sm // registers_per_block
# Limit by shared memory
if shared_memory_per_block > 0:
blocks_by_smem = self.shared_memory_per_sm // shared_memory_per_block
else:
blocks_by_smem = self.max_blocks_per_sm
# Limit by max blocks
blocks_by_limit = self.max_blocks_per_sm
# Actual blocks
blocks = min(blocks_by_threads, blocks_by_registers,
blocks_by_smem, blocks_by_limit)
active_threads = blocks * threads_per_block
occupancy = active_threads / self.max_threads_per_sm
return {
'blocks_per_sm': blocks,
'active_threads': active_threads,
'occupancy': occupancy,
'limited_by': self._find_limiter(
blocks_by_threads, blocks_by_registers,
blocks_by_smem, blocks_by_limit
)
}
def _find_limiter(self, t, r, s, l):
m = min(t, r, s, l)
if m == t: return 'threads'
if m == r: return 'registers'
if m == s: return 'shared_memory'
return 'block_limit'
# Example
occ = GPUOccupancy()
result = occ.calculate_occupancy(
threads_per_block=256,
registers_per_thread=32,
shared_memory_per_block=4096
)
print(f"Occupancy: {result['occupancy']:.1%}")
print(f"Limited by: {result['limited_by']}")
5. When to Use GPUs
Decision Framework
class GPUDecisionFramework:
"""Guidelines for when to use GPU vs CPU"""
@staticmethod
def should_use_gpu(problem):
"""
Factors favoring GPU:
1. Data parallelism (same operation on many elements)
2. Large dataset (amortize transfer overhead)
3. Compute-intensive (arithmetic intensity)
4. Regular memory access patterns
Factors favoring CPU:
1. Sequential/branching logic
2. Small dataset
3. Memory-intensive (low arithmetic intensity)
4. Irregular memory access
5. Need for quick turnaround (no transfer overhead)
"""
score = 0
# Data parallelism
if problem.get('parallelizable_fraction', 0) > 0.9:
score += 3
# Dataset size
size = problem.get('data_size_mb', 0)
if size > 100:
score += 2
elif size < 1:
score -= 2
# Arithmetic intensity (FLOPs per byte)
intensity = problem.get('arithmetic_intensity', 0)
if intensity > 10:
score += 3
elif intensity < 1:
score -= 2
# Memory pattern
if problem.get('memory_pattern') == 'regular':
score += 1
elif problem.get('memory_pattern') == 'random':
score -= 2
return score > 2
# Examples
problems = [
{
'name': 'Matrix multiplication',
'parallelizable_fraction': 0.99,
'data_size_mb': 500,
'arithmetic_intensity': 100, # N FLOPs per element
'memory_pattern': 'regular'
},
{
'name': 'Graph traversal (BFS)',
'parallelizable_fraction': 0.7,
'data_size_mb': 50,
'arithmetic_intensity': 0.5, # Just checking neighbors
'memory_pattern': 'random'
},
{
'name': 'Small sort',
'parallelizable_fraction': 0.8,
'data_size_mb': 0.1,
'arithmetic_intensity': 5,
'memory_pattern': 'irregular'
}
]
df = GPUDecisionFramework()
for p in problems:
use_gpu = df.should_use_gpu(p)
print(f"{p['name']}: {'GPU' if use_gpu else 'CPU'}")
Profiling and Benchmarking
import time
def benchmark_comparison(cpu_func, gpu_func, data, warmup=3, iterations=10):
"""Compare CPU vs GPU performance"""
# Warmup (important for GPU!)
for _ in range(warmup):
cpu_func(data)
gpu_func(data)
# CPU timing
cpu_times = []
for _ in range(iterations):
start = time.perf_counter()
cpu_result = cpu_func(data)
cpu_times.append(time.perf_counter() - start)
    # GPU timing (include transfer time; on a real GPU, synchronize
    # before stopping the timer -- kernel launches are asynchronous)
    gpu_times = []
for _ in range(iterations):
start = time.perf_counter()
gpu_result = gpu_func(data)
gpu_times.append(time.perf_counter() - start)
return {
'cpu_mean': sum(cpu_times) / len(cpu_times),
'gpu_mean': sum(gpu_times) / len(gpu_times),
'speedup': (sum(cpu_times) / len(cpu_times)) /
(sum(gpu_times) / len(gpu_times))
}
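A back-of-envelope model makes the transfer-overhead point concrete. The throughput and bandwidth figures below are rough assumptions for illustration, not measurements:

```python
# All throughput figures are rough assumptions, not measurements.
CPU_FLOPS = 2e11   # sustained CPU throughput, FLOP/s (assumption)
GPU_FLOPS = 1e13   # sustained GPU throughput, FLOP/s (assumption)
PCIE_BW = 16e9     # PCIe bandwidth, bytes/s (~PCIe 4.0 x16, assumption)

def gpu_faster(data_bytes, flops):
    """True if estimated GPU time (compute + transfer) beats CPU time."""
    cpu_s = flops / CPU_FLOPS
    gpu_s = flops / GPU_FLOPS + data_bytes / PCIE_BW  # each byte crosses once
    return gpu_s < cpu_s

n = 4096
# 4096x4096 matmul: ~2n^3 FLOPs on ~200MB of data -> compute-heavy
print(gpu_faster(3 * n * n * 4, 2 * n**3))       # True
# Element-wise add of 1M floats: 1M FLOPs on 12MB -> transfer-dominated
print(gpu_faster(3 * 1_000_000 * 4, 1_000_000))  # False
```

Under these assumptions, the matmul's arithmetic intensity amortizes the transfer, while the element-wise add spends nearly all its time on the bus.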
Exercises
Basic
Calculate the global thread index for a 2D grid with blocks (4, 4) and threads (16, 16).
Explain why adjacent threads should access adjacent memory locations.
Implement a parallel reduction for finding the maximum element.
Intermediate
Implement tiled matrix multiplication and compare memory accesses with naive version.
Analyze a piece of code for warp divergence and optimize it.
Calculate the occupancy for a kernel using 128 threads, 40 registers/thread, and 8KB shared memory.
Advanced
Implement parallel radix sort using prefix sums.
Design a GPU algorithm for sparse matrix-vector multiplication.
Optimize a GPU kernel by analyzing and improving its memory access patterns.
Summary
- GPUs have thousands of simple cores for throughput
- Memory coalescing is critical for performance
- Avoid warp divergence by restructuring conditionals
- Parallel reduction and scan are fundamental building blocks
- Tiling improves data reuse through shared memory
- Use GPUs for data-parallel, compute-intensive workloads
Module Complete
This completes the Parallel Computing module! You've learned fundamental concepts for writing efficient parallel programs.