GPU and Parallel Hardware

Introduction

GPUs (Graphics Processing Units) offer massive parallelism with thousands of cores optimized for data-parallel workloads. This reading covers GPU architecture, programming models, and parallel algorithms for accelerators.

Learning Objectives

By the end of this reading, you will be able to:

  • Understand GPU architecture and memory hierarchy
  • Write data-parallel programs conceptually
  • Apply GPU programming patterns
  • Optimize memory access patterns
  • Understand when to use GPUs vs CPUs

1. GPU Architecture

CPU vs GPU

CPU (Few powerful cores):
┌─────────────┐
│ Core 0      │  - Complex control logic
│ [ALU][Cache]│  - Out-of-order execution
├─────────────┤  - Branch prediction
│ Core 1      │  - Large caches
│ [ALU][Cache]│  - Optimized for latency
├─────────────┤
│ Core 2-7... │
└─────────────┘

GPU (Many simple cores):
┌────────────────────────────────────┐
│ SM 0                               │
│ [C][C][C][C][C][C][C][C]...32 cores│
│ [Shared Memory] [Registers]        │
├────────────────────────────────────┤
│ SM 1                               │  - Simple cores
│ [C][C][C][C][C][C][C][C]...32 cores│  - SIMT execution
├────────────────────────────────────┤  - Small caches
│ SM 2-N...                          │  - Optimized for throughput
└────────────────────────────────────┘
┌────────────────────────────────────┐
│           Global Memory            │
└────────────────────────────────────┘

Key Concepts

"""
GPU Programming Concepts:

1. KERNEL: Function that runs on GPU
   - Launched with many threads
   - Each thread runs same code on different data

2. THREAD: Single execution unit
   - Has unique ID (threadIdx)
   - Can access shared and global memory

3. BLOCK (Thread Block): Group of threads
   - Share fast shared memory
   - Can synchronize within block
   - Has unique ID (blockIdx)

4. GRID: Collection of blocks
   - Organizes all threads for kernel
   - Can be 1D, 2D, or 3D

5. WARP: 32 threads executed in lockstep (NVIDIA)
   - All threads in warp execute same instruction
   - Divergence causes serialization
"""

# Conceptual thread indexing
def get_thread_index_1d(block_idx, block_dim, thread_idx):
    """1D global thread index"""
    return block_idx * block_dim + thread_idx

def get_thread_index_2d(block_idx, block_dim, thread_idx):
    """2D global thread index"""
    global_x = block_idx[0] * block_dim[0] + thread_idx[0]
    global_y = block_idx[1] * block_dim[1] + thread_idx[1]
    return (global_x, global_y)
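As a quick worked example of the 2D mapping above (the launch values are chosen arbitrarily):

```python
# Hypothetical launch: thread (3, 4) in block (1, 2), block dimension (16, 16)
block_idx, block_dim, thread_idx = (1, 2), (16, 16), (3, 4)

global_x = block_idx[0] * block_dim[0] + thread_idx[0]  # 1*16 + 3 = 19
global_y = block_idx[1] * block_dim[1] + thread_idx[1]  # 2*16 + 4 = 36
print((global_x, global_y))  # (19, 36)
```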

Memory Hierarchy

"""
GPU Memory Hierarchy (fastest to slowest):

1. REGISTERS
   - Per-thread private
   - Fastest access (~1 cycle)
   - Limited quantity

2. SHARED MEMORY
   - Per-block, shared by threads in block
   - Fast (~5 cycles)
   - User-managed cache
   - ~48KB-164KB available per SM (architecture-dependent)

3. L1/L2 CACHE
   - Hardware-managed
   - L1 is per-SM, L2 is shared

4. GLOBAL MEMORY (VRAM)
   - Accessible by all threads
   - Slowest (~400-800 cycles)
   - Large (8-80GB)

5. CONSTANT MEMORY
   - Read-only, cached
   - Broadcast to all threads

6. TEXTURE MEMORY
   - Read-only, cached
   - Optimized for spatial locality
"""

class MemoryAccessPattern:
    """Demonstrate memory access patterns"""

    @staticmethod
    def coalesced_access(thread_id, data):
        """
        GOOD: Adjacent threads access adjacent memory.
        GPU can combine into single memory transaction.

        Thread 0: data[0]
        Thread 1: data[1]
        Thread 2: data[2]
        ...
        """
        return data[thread_id]

    @staticmethod
    def strided_access(thread_id, data, stride):
        """
        BAD: Threads access non-adjacent memory.
        Causes multiple memory transactions.

        Thread 0: data[0]
        Thread 1: data[stride]
        Thread 2: data[2*stride]
        ...
        """
        return data[thread_id * stride]

    @staticmethod
    def random_access(thread_id, data, indices):
        """
        WORST: Random access pattern.
        No coalescing possible.

        Thread 0: data[indices[0]]  # could be anywhere
        Thread 1: data[indices[1]]  # could be anywhere
        ...
        """
        return data[indices[thread_id]]
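One way to make these patterns concrete is to count how many memory segments one warp's addresses touch; a real GPU coalesces accesses that fall in the same segment. A simplified sketch (the 128-byte segment size and 4-byte elements are assumptions):

```python
def count_transactions(addresses, segment_size=128):
    """Number of distinct memory segments touched by one warp's loads."""
    return len({addr // segment_size for addr in addresses})

# 32 threads loading 4-byte floats
coalesced = [tid * 4 for tid in range(32)]       # adjacent elements
strided = [tid * 32 * 4 for tid in range(32)]    # stride of 32 elements

print(count_transactions(coalesced))  # 1
print(count_transactions(strided))    # 32
```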

2. GPU Programming Model

Conceptual CUDA-like Programming

import numpy as np
from typing import Callable
import multiprocessing as mp

class SimulatedGPU:
    """Simulates GPU execution model for learning"""

    def __init__(self, num_sms=4, threads_per_block=256):
        self.num_sms = num_sms
        self.threads_per_block = threads_per_block

    def launch_kernel(
        self,
        kernel: Callable,
        grid_dim: tuple,
        block_dim: tuple,
        *args
    ):
        """
        Launch a kernel with specified grid and block dimensions.

        kernel: Function to execute for each thread
        grid_dim: Number of blocks (x, y, z)
        block_dim: Threads per block (x, y, z)
        args: Arguments passed to kernel
        """
        total_blocks = np.prod(grid_dim)
        total_threads = total_blocks * np.prod(block_dim)

        print(f"Launching kernel with {total_blocks} blocks, "
              f"{total_threads} total threads")

        # Simulate execution (in reality, GPU executes in parallel)
        for bx in range(grid_dim[0]):
            for by in range(grid_dim[1] if len(grid_dim) > 1 else 1):
                for bz in range(grid_dim[2] if len(grid_dim) > 2 else 1):
                    block_idx = (bx, by, bz)

                    for tx in range(block_dim[0]):
                        for ty in range(block_dim[1] if len(block_dim) > 1 else 1):
                            for tz in range(block_dim[2] if len(block_dim) > 2 else 1):
                                thread_idx = (tx, ty, tz)

                                # Execute kernel for this thread
                                kernel(block_idx, block_dim, thread_idx, *args)

# Example kernel: Vector addition
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
    """
    Kernel for c = a + b

    Each thread computes one element.
    """
    # Calculate global thread index
    idx = block_idx[0] * block_dim[0] + thread_idx[0]

    # Bounds check (grid might be larger than data)
    if idx < n:
        c[idx] = a[idx] + b[idx]

def vector_add_example():
    """Demonstrate GPU vector addition"""
    n = 1000
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    c = np.zeros(n, dtype=np.float32)

    gpu = SimulatedGPU()

    # Calculate grid dimensions
    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block

    gpu.launch_kernel(
        vector_add_kernel,
        (blocks,),              # grid_dim
        (threads_per_block,),   # block_dim
        a, b, c, n
    )

    # Verify
    expected = a + b
    print(f"Correct: {np.allclose(c, expected)}")

NumPy/CuPy Style GPU Programming

# In practice, use high-level libraries
# CuPy provides NumPy-compatible GPU arrays

"""
# With CuPy (actual GPU library):
import cupy as cp

# Create GPU arrays
a_gpu = cp.array([1, 2, 3, 4, 5])
b_gpu = cp.array([5, 4, 3, 2, 1])

# Operations run on GPU automatically
c_gpu = a_gpu + b_gpu
d_gpu = cp.dot(a_gpu, b_gpu)
e_gpu = cp.sum(a_gpu)

# Transfer back to CPU
c_cpu = cp.asnumpy(c_gpu)
"""

# Simulated high-level GPU operations
class GPUArray:
    """Simulated GPU array for demonstration"""

    def __init__(self, data):
        if isinstance(data, np.ndarray):
            self.data = data.copy()
        else:
            self.data = np.array(data)
        self._on_gpu = True

    def __add__(self, other):
        """Element-wise addition (would be parallel on GPU)"""
        if isinstance(other, GPUArray):
            return GPUArray(self.data + other.data)
        return GPUArray(self.data + other)

    def __mul__(self, other):
        """Element-wise multiplication"""
        if isinstance(other, GPUArray):
            return GPUArray(self.data * other.data)
        return GPUArray(self.data * other)

    def sum(self):
        """Reduction (parallel on GPU)"""
        return GPUArray(np.array([self.data.sum()]))

    def to_cpu(self):
        """Transfer to CPU memory"""
        self._on_gpu = False
        return self.data.copy()

    @staticmethod
    def from_cpu(data):
        """Transfer from CPU to GPU"""
        return GPUArray(data)

3. Common GPU Algorithms

Parallel Reduction

class ParallelReduction:
    """
    Parallel reduction pattern.

    Reduce N elements to 1 using O(log N) parallel steps.
    """

    @staticmethod
    def naive_reduction(data: list, op=lambda a, b: a + b) -> float:
        """
        Naive sequential reduction: O(N)
        """
        result = data[0]
        for x in data[1:]:
            result = op(result, x)
        return result

    @staticmethod
    def parallel_reduction_concept(data: list, op=lambda a, b: a + b) -> float:
        """
        Parallel reduction: O(log N) parallel steps.

        Step 0: [0,1,2,3,4,5,6,7]
        Step 1: [0+1, 2+3, 4+5, 6+7] = [1,5,9,13]
        Step 2: [1+5, 9+13] = [6,22]
        Step 3: [6+22] = [28]
        """
        n = len(data)
        result = data.copy()

        stride = 1
        while stride < n:
            # In parallel, each active thread i (a multiple of 2*stride)
            # computes: result[i] = op(result[i], result[i + stride]).
            # Partial results stay in place at strided positions.
            for i in range(0, n, stride * 2):
                if i + stride < n:
                    result[i] = op(result[i], result[i + stride])
            stride *= 2

        return result[0]

    @staticmethod
    def tree_reduction(data: list, op=lambda a, b: a + b) -> float:
        """
        Tree reduction with sequential addressing (better memory access
        pattern: active threads stay contiguous).

        Thread i computes: result[i] = op(result[i], result[i + n/2])
        Then: result[i] = op(result[i], result[i + n/4])
        etc.

        Assumes len(data) is a power of two.
        """
        n = len(data)
        assert n & (n - 1) == 0, "tree_reduction assumes power-of-2 length"
        result = data.copy()

        active_threads = n // 2
        while active_threads > 0:
            # Each active thread combines two elements
            for i in range(active_threads):
                result[i] = op(result[i], result[i + active_threads])
            active_threads //= 2

        return result[0]

# Demonstrate reduction
data = [1, 2, 3, 4, 5, 6, 7, 8]
pr = ParallelReduction()
print(f"Naive: {pr.naive_reduction(data)}")
print(f"Parallel concept: {pr.parallel_reduction_concept(data)}")
print(f"Tree: {pr.tree_reduction(data)}")

Parallel Prefix Sum (Scan)

class ParallelScan:
    """
    Parallel prefix sum (scan).

    Input:  [3, 1, 7, 0, 4, 1, 6, 3]
    Output: [3, 4, 11, 11, 15, 16, 22, 25] (inclusive)
    Or:     [0, 3, 4, 11, 11, 15, 16, 22] (exclusive)
    """

    @staticmethod
    def sequential_scan(data: list) -> list:
        """Sequential prefix sum: O(N)"""
        result = []
        total = 0
        for x in data:
            total += x
            result.append(total)
        return result

    @staticmethod
    def hillis_steele_scan(data: list) -> list:
        """
        Hillis-Steele scan: O(N log N) work, O(log N) steps.
        Work-inefficient but highly parallel.
        """
        n = len(data)
        result = data.copy()

        stride = 1
        while stride < n:
            # Each thread i computes (in parallel):
            # new[i] = result[i] + result[i - stride] (if i >= stride)
            new_result = result.copy()
            for i in range(stride, n):
                new_result[i] = result[i] + result[i - stride]
            result = new_result
            stride *= 2

        return result

    @staticmethod
    def blelloch_scan(data: list) -> list:
        """
        Blelloch scan: O(N) work, O(log N) steps.
        Work-efficient, two-phase algorithm.

        Phase 1 (Up-sweep): Build reduction tree
        Phase 2 (Down-sweep): Traverse tree to compute prefix sums
        """
        n = len(data)
        # Pad to power of 2
        import math
        padded_n = 2 ** math.ceil(math.log2(n))
        result = data + [0] * (padded_n - n)

        # Up-sweep (reduce)
        stride = 1
        while stride < padded_n:
            for i in range(stride - 1, padded_n - 1, stride * 2):
                result[i + stride] += result[i]
            stride *= 2

        # Set root to 0 for exclusive scan
        result[padded_n - 1] = 0

        # Down-sweep
        stride = padded_n // 2
        while stride > 0:
            for i in range(stride - 1, padded_n - 1, stride * 2):
                temp = result[i]
                result[i] = result[i + stride]
                result[i + stride] += temp
            stride //= 2

        # Convert the exclusive scan to inclusive by adding the input
        inclusive = [result[i] + data[i] for i in range(n)]
        return inclusive

# Demonstrate scans
data = [3, 1, 7, 0, 4, 1, 6, 3]
ps = ParallelScan()
print(f"Input:         {data}")
print(f"Sequential:    {ps.sequential_scan(data)}")
print(f"Hillis-Steele: {ps.hillis_steele_scan(data)}")
print(f"Blelloch:      {ps.blelloch_scan(data)}")
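A classic application of exclusive scan is stream compaction: keeping only the elements that satisfy a predicate while preserving their order. A sequential sketch of the parallel pattern (map, scan, scatter):

```python
def compact(data, keep):
    """Stream compaction: flag -> exclusive scan -> scatter."""
    flags = [1 if keep(x) else 0 for x in data]   # parallel map
    # Exclusive scan of flags gives each kept element its output slot
    positions, total = [], 0
    for f in flags:
        positions.append(total)
        total += f
    out = [None] * total
    for i, x in enumerate(data):                  # parallel scatter
        if flags[i]:
            out[positions[i]] = x
    return out

print(compact([3, -1, 7, 0, -4, 1], lambda x: x > 0))  # [3, 7, 1]
```

On a GPU the map and scatter steps are one thread per element, and the scan is the parallel algorithm above.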

Parallel Matrix Multiplication

class ParallelMatrixMultiply:
    """
    Matrix multiplication on GPU.
    C = A * B where A is MxK and B is KxN
    """

    @staticmethod
    def naive_kernel(block_idx, block_dim, thread_idx, A, B, C, M, K, N):
        """
        Naive: Each thread computes one element of C.
        Poor memory access pattern.
        """
        row = block_idx[0] * block_dim[0] + thread_idx[0]
        col = block_idx[1] * block_dim[1] + thread_idx[1]

        if row < M and col < N:
            total = 0
            for k in range(K):
                total += A[row][k] * B[k][col]
            C[row][col] = total

    @staticmethod
    def tiled_matmul_concept(A, B, tile_size=16):
        """
        Tiled matrix multiplication.

        Load tiles into shared memory, compute, repeat.
        Much better memory access pattern.
        """
        M, K = A.shape
        K2, N = B.shape
        assert K == K2

        C = np.zeros((M, N))

        # Process tiles (edge tiles may be smaller than tile_size)
        for tile_row in range(0, M, tile_size):
            row_end = min(tile_row + tile_size, M)
            for tile_col in range(0, N, tile_size):
                col_end = min(tile_col + tile_size, N)
                # Initialize tile result with the actual tile shape
                tile_c = np.zeros((row_end - tile_row, col_end - tile_col))

                for tile_k in range(0, K, tile_size):
                    # Load tiles into "shared memory"
                    tile_a = A[tile_row:row_end, tile_k:tile_k + tile_size]
                    tile_b = B[tile_k:tile_k + tile_size, tile_col:col_end]

                    # Compute partial result
                    tile_c += tile_a @ tile_b

                # Write result
                C[tile_row:row_end, tile_col:col_end] = tile_c

        return C

# Why tiling helps
"""
Without tiling:
- Each thread loads M*K elements from A (many times)
- Total global memory reads: M*N*K + M*N*K = 2MNK

With tiling (tile size T):
- Load each tile once into shared memory
- Reuse T times
- Total global memory reads: 2MNK/T

For T=16, this is 16x fewer memory accesses!
"""

4. GPU Optimization

Memory Coalescing

class MemoryCoalescing:
    """
    Memory coalescing: Combining memory accesses.

    When threads in a warp access consecutive addresses,
    GPU combines into fewer memory transactions.
    """

    @staticmethod
    def coalesced_example():
        """
        GOOD: Threads access consecutive elements.

        Thread 0: array[0]
        Thread 1: array[1]
        ...
        Thread 31: array[31]

        -> 1 memory transaction (128 bytes)
        """
        pass

    @staticmethod
    def strided_example():
        """
        BAD: Threads access with stride.

        Thread 0: array[0]
        Thread 1: array[32]
        Thread 2: array[64]
        ...

        -> 32 memory transactions!
        """
        pass

    @staticmethod
    def transpose_for_coalescing(matrix, rows, cols):
        """
        Sometimes transpose data to enable coalescing.

        Row-major access (C-style): Adjacent cols are adjacent in memory
        - Reading row: coalesced
        - Reading column: strided

        If algorithm needs column access, transpose first!
        """
        # Original: row-major
        # matrix[row][col] = matrix[row * cols + col]

        # Transposed: column becomes row
        transposed = [[matrix[r][c] for r in range(rows)] for c in range(cols)]
        return transposed

Avoiding Warp Divergence

class WarpDivergence:
    """
    Warp divergence: When threads in a warp take different branches.

    All 32 threads in a warp execute in lockstep.
    If some take branch A and others take branch B:
    - First, all execute A (non-A threads are masked)
    - Then, all execute B (non-B threads are masked)
    - Serialization!
    """

    @staticmethod
    def divergent_kernel(thread_idx, data, result):
        """
        BAD: Threads diverge based on data value.

        Different threads likely have different values.
        """
        if data[thread_idx] > 0:
            result[thread_idx] = data[thread_idx] * 2
        else:
            result[thread_idx] = data[thread_idx] * -1

    @staticmethod
    def less_divergent_kernel(thread_idx, data, result):
        """
        BETTER: Diverge based on thread ID.

        Threads 0-15 all take same branch.
        Threads 16-31 all take same branch.
        Warp 0 doesn't diverge, warp 1 doesn't diverge.
        """
        if thread_idx < 16:
            result[thread_idx] = data[thread_idx] * 2
        else:
            result[thread_idx] = data[thread_idx] * -1

    @staticmethod
    def no_divergence_kernel(thread_idx, data, result):
        """
        BEST: Replace the branch with arithmetic.

        Branchless (predicated) code: every thread executes the
        same instructions regardless of its data.
        """
        pos = int(data[thread_idx] > 0)
        # Multiplier is 2 when positive, -1 otherwise: 3*pos - 1
        result[thread_idx] = data[thread_idx] * (3 * pos - 1)
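The same branchless trick in vectorized NumPy form, applied to a whole array at once (values arbitrary):

```python
import numpy as np

data = np.array([3.0, -2.0, 5.0, -1.0])
pos = (data > 0).astype(data.dtype)   # 1.0 where positive, else 0.0
# 2*x where positive, -x otherwise: multiplier is 3*pos - 1
result = data * (3.0 * pos - 1.0)
print(result)  # [ 6.  2. 10.  1.]
```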

Occupancy and Resource Usage

class GPUOccupancy:
    """
    Occupancy: Ratio of active warps to maximum possible.

    Higher occupancy can hide memory latency.
    But sometimes lower occupancy with more registers is better!
    """

    def __init__(self, max_threads_per_sm=2048, max_blocks_per_sm=32,
                 registers_per_sm=65536, shared_memory_per_sm=49152):
        self.max_threads_per_sm = max_threads_per_sm
        self.max_blocks_per_sm = max_blocks_per_sm
        self.registers_per_sm = registers_per_sm
        self.shared_memory_per_sm = shared_memory_per_sm

    def calculate_occupancy(self, threads_per_block, registers_per_thread,
                           shared_memory_per_block):
        """Calculate theoretical occupancy"""

        # Limit by threads per block
        blocks_by_threads = self.max_threads_per_sm // threads_per_block

        # Limit by registers
        registers_per_block = threads_per_block * registers_per_thread
        blocks_by_registers = self.registers_per_sm // registers_per_block

        # Limit by shared memory
        if shared_memory_per_block > 0:
            blocks_by_smem = self.shared_memory_per_sm // shared_memory_per_block
        else:
            blocks_by_smem = self.max_blocks_per_sm

        # Limit by max blocks
        blocks_by_limit = self.max_blocks_per_sm

        # Actual blocks
        blocks = min(blocks_by_threads, blocks_by_registers,
                    blocks_by_smem, blocks_by_limit)

        active_threads = blocks * threads_per_block
        occupancy = active_threads / self.max_threads_per_sm

        return {
            'blocks_per_sm': blocks,
            'active_threads': active_threads,
            'occupancy': occupancy,
            'limited_by': self._find_limiter(
                blocks_by_threads, blocks_by_registers,
                blocks_by_smem, blocks_by_limit
            )
        }

    def _find_limiter(self, t, r, s, l):
        m = min(t, r, s, l)
        if m == t: return 'threads'
        if m == r: return 'registers'
        if m == s: return 'shared_memory'
        return 'block_limit'

# Example
occ = GPUOccupancy()
result = occ.calculate_occupancy(
    threads_per_block=256,
    registers_per_thread=32,
    shared_memory_per_block=4096
)
print(f"Occupancy: {result['occupancy']:.1%}")
print(f"Limited by: {result['limited_by']}")
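For intuition, here is the register limit in isolation, using the same default SM limits as above (illustrative numbers; real limits vary by GPU generation):

```python
# Register-limited block count for one SM (illustrative values)
registers_per_sm = 65536
max_threads_per_sm = 2048
threads_per_block = 256
registers_per_thread = 64          # a register-hungry kernel

blocks = registers_per_sm // (threads_per_block * registers_per_thread)
occupancy = blocks * threads_per_block / max_threads_per_sm
print(blocks, occupancy)  # 4 0.5
```

Doubling register usage per thread halved the number of resident blocks, capping occupancy at 50% even though nothing else is constrained.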

5. When to Use GPUs

Decision Framework

class GPUDecisionFramework:
    """Guidelines for when to use GPU vs CPU"""

    @staticmethod
    def should_use_gpu(problem):
        """
        Factors favoring GPU:
        1. Data parallelism (same operation on many elements)
        2. Large dataset (amortize transfer overhead)
        3. Compute-intensive (arithmetic intensity)
        4. Regular memory access patterns

        Factors favoring CPU:
        1. Sequential/branching logic
        2. Small dataset
        3. Memory-intensive (low arithmetic intensity)
        4. Irregular memory access
        5. Need for quick turnaround (no transfer overhead)
        """
        score = 0

        # Data parallelism
        if problem.get('parallelizable_fraction', 0) > 0.9:
            score += 3

        # Dataset size
        size = problem.get('data_size_mb', 0)
        if size > 100:
            score += 2
        elif size < 1:
            score -= 2

        # Arithmetic intensity (FLOPs per byte)
        intensity = problem.get('arithmetic_intensity', 0)
        if intensity > 10:
            score += 3
        elif intensity < 1:
            score -= 2

        # Memory pattern
        if problem.get('memory_pattern') == 'regular':
            score += 1
        elif problem.get('memory_pattern') == 'random':
            score -= 2

        return score > 2

# Examples
problems = [
    {
        'name': 'Matrix multiplication',
        'parallelizable_fraction': 0.99,
        'data_size_mb': 500,
        'arithmetic_intensity': 100,  # N FLOPs per element
        'memory_pattern': 'regular'
    },
    {
        'name': 'Graph traversal (BFS)',
        'parallelizable_fraction': 0.7,
        'data_size_mb': 50,
        'arithmetic_intensity': 0.5,  # Just checking neighbors
        'memory_pattern': 'random'
    },
    {
        'name': 'Small sort',
        'parallelizable_fraction': 0.8,
        'data_size_mb': 0.1,
        'arithmetic_intensity': 5,
        'memory_pattern': 'irregular'
    }
]

df = GPUDecisionFramework()
for p in problems:
    use_gpu = df.should_use_gpu(p)
    print(f"{p['name']}: {'GPU' if use_gpu else 'CPU'}")

Profiling and Benchmarking

import time

def benchmark_comparison(cpu_func, gpu_func, data, warmup=3, iterations=10):
    """Compare CPU vs GPU performance"""

    # Warmup (important for GPU!)
    for _ in range(warmup):
        cpu_func(data)
        gpu_func(data)

    # CPU timing
    cpu_times = []
    for _ in range(iterations):
        start = time.perf_counter()
        cpu_result = cpu_func(data)
        cpu_times.append(time.perf_counter() - start)

    # GPU timing: include transfer time; on a real GPU, synchronize the
    # device before reading the clock (kernel launches are asynchronous)
    gpu_times = []
    for _ in range(iterations):
        start = time.perf_counter()
        gpu_result = gpu_func(data)
        gpu_times.append(time.perf_counter() - start)

    return {
        'cpu_mean': sum(cpu_times) / len(cpu_times),
        'gpu_mean': sum(gpu_times) / len(gpu_times),
        'speedup': (sum(cpu_times) / len(cpu_times)) /
                   (sum(gpu_times) / len(gpu_times))
    }

Exercises

Basic

  1. Calculate the global thread index of thread (3, 5) in block (2, 1), for a 2D grid of (4, 4) blocks with (16, 16) threads per block.

  2. Explain why adjacent threads should access adjacent memory locations.

  3. Implement a parallel reduction for finding the maximum element.

Intermediate

  1. Implement tiled matrix multiplication and compare memory accesses with naive version.

  2. Analyze a piece of code for warp divergence and optimize it.

  3. Calculate the occupancy for a kernel using 128 threads, 40 registers/thread, and 8KB shared memory.

Advanced

  1. Implement parallel radix sort using prefix sums.

  2. Design a GPU algorithm for sparse matrix-vector multiplication.

  3. Optimize a GPU kernel by analyzing and improving its memory access patterns.


Summary

  • GPUs have thousands of simple cores for throughput
  • Memory coalescing is critical for performance
  • Avoid warp divergence by restructuring conditionals
  • Parallel reduction and scan are fundamental building blocks
  • Tiling improves data reuse through shared memory
  • Use GPUs for data-parallel, compute-intensive workloads

Module Complete

This completes the Parallel Computing module! You've learned fundamental concepts for writing efficient parallel programs.
