GPU and Parallel Hardware
Introduction
GPUs (Graphics Processing Units) offer massive parallelism with thousands of cores optimized for data-parallel workloads. This reading covers GPU architecture, programming models, and parallel algorithms for accelerators.
Learning Objectives
By the end of this reading, you will be able to:
- Understand GPU architecture and memory hierarchy
- Write data-parallel programs conceptually
- Apply GPU programming patterns
- Optimize memory access patterns
- Understand when to use GPUs vs CPUs
1. GPU Architecture
CPU vs GPU
CPU (Few powerful cores):
┌─────────────┐
│ Core 0 │ - Complex control logic
│ [ALU][Cache]│ - Out-of-order execution
├─────────────┤ - Branch prediction
│ Core 1 │ - Large caches
│ [ALU][Cache]│ - Optimized for latency
├─────────────┤
│ Core 2-7... │
└─────────────┘
GPU (Many simple cores):
┌────────────────────────────────────┐
│ SM 0 │
│ [C][C][C][C][C][C][C][C]...32 cores│
│ [Shared Memory] [Registers] │
├────────────────────────────────────┤
│ SM 1 │ - Simple cores
│ [C][C][C][C][C][C][C][C]...32 cores│ - SIMT execution
├────────────────────────────────────┤ - Small caches
│ SM 2-N... │ - Optimized for throughput
└────────────────────────────────────┘
┌────────────────────────────────────┐
│            Global Memory           │
└────────────────────────────────────┘
Key Concepts
"""
GPU Programming Concepts:
1. KERNEL: Function that runs on GPU
- Launched with many threads
- Each thread runs same code on different data
2. THREAD: Single execution unit
- Has unique ID (threadIdx)
- Can access shared and global memory
3. BLOCK (Thread Block): Group of threads
- Share fast shared memory
- Can synchronize within block
- Has unique ID (blockIdx)
4. GRID: Collection of blocks
- Organizes all threads for kernel
- Can be 1D, 2D, or 3D
5. WARP: 32 threads executed in lockstep (NVIDIA)
- All threads in warp execute same instruction
- Divergence causes serialization
"""
# Conceptual thread indexing
def get_thread_index_1d(block_idx, block_dim, thread_idx):
"""1D global thread index"""
return block_idx * block_dim + thread_idx
def get_thread_index_2d(block_idx, block_dim, thread_idx):
"""2D global thread index"""
global_x = block_idx[0] * block_dim[0] + thread_idx[0]
global_y = block_idx[1] * block_dim[1] + thread_idx[1]
return (global_x, global_y)
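As a quick check of the 1D formula (redefined here so the snippet stands alone), the thread at position 5 inside block 2, with 256 threads per block, gets global index 2 * 256 + 5:

```python
def get_thread_index_1d(block_idx, block_dim, thread_idx):
    """1D global thread index: same formula as above."""
    return block_idx * block_dim + thread_idx

# Block 2 of dimension 256, local thread 5
print(get_thread_index_1d(2, 256, 5))  # 517
```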
Memory Hierarchy
"""
GPU Memory Hierarchy (fastest to slowest):
1. REGISTERS
- Per-thread private
- Fastest access (~1 cycle)
- Limited quantity
2. SHARED MEMORY
- Per-block, shared by threads in block
- Fast (~5 cycles)
- User-managed cache
- ~48KB per block by default (up to ~100-164KB per SM on recent architectures)
3. L1/L2 CACHE
- Hardware-managed
- L1 is per-SM, L2 is shared
4. GLOBAL MEMORY (VRAM)
- Accessible by all threads
- Slowest (~400-800 cycles)
- Large (8-80GB)
5. CONSTANT MEMORY
- Read-only, cached
- Broadcast to all threads
6. TEXTURE MEMORY
- Read-only, cached
- Optimized for spatial locality
"""
class MemoryAccessPattern:
"""Demonstrate memory access patterns"""
@staticmethod
def coalesced_access(thread_id, data):
"""
GOOD: Adjacent threads access adjacent memory.
GPU can combine into single memory transaction.
Thread 0: data[0]
Thread 1: data[1]
Thread 2: data[2]
...
"""
return data[thread_id]
@staticmethod
def strided_access(thread_id, data, stride):
"""
BAD: Threads access non-adjacent memory.
Causes multiple memory transactions.
Thread 0: data[0]
Thread 1: data[stride]
Thread 2: data[2*stride]
...
"""
return data[thread_id * stride]
@staticmethod
def random_access(thread_id, data, indices):
"""
WORST: Random access pattern.
No coalescing possible.
Thread 0: data[indices[0]] # could be anywhere
Thread 1: data[indices[1]] # could be anywhere
...
"""
return data[indices[thread_id]]
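These patterns can be quantified with a simplified transaction model. Assuming 128-byte memory transactions and 4-byte elements (typical figures, but architecture-dependent), each distinct 128-byte segment a warp touches costs one transaction:

```python
def warp_transactions(addresses, segment_bytes=128, elem_bytes=4):
    """Count memory transactions for one warp under a simple model:
    each distinct 128-byte segment touched costs one transaction."""
    segments = {(a * elem_bytes) // segment_bytes for a in addresses}
    return len(segments)

coalesced = warp_transactions(range(32))              # threads 0..31 -> data[0..31]
strided   = warp_transactions(range(0, 32 * 32, 32))  # stride-32 access
print(coalesced, strided)  # 1 32
```

The coalesced pattern needs one transaction for the whole warp; the stride-32 pattern needs one per thread.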
2. GPU Programming Model
Conceptual CUDA-like Programming
import numpy as np
from typing import Callable
class SimulatedGPU:
"""Simulates GPU execution model for learning"""
def __init__(self, num_sms=4, threads_per_block=256):
self.num_sms = num_sms
self.threads_per_block = threads_per_block
def launch_kernel(
self,
kernel: Callable,
grid_dim: tuple,
block_dim: tuple,
*args
):
"""
Launch a kernel with specified grid and block dimensions.
kernel: Function to execute for each thread
grid_dim: Number of blocks (x, y, z)
block_dim: Threads per block (x, y, z)
args: Arguments passed to kernel
"""
total_blocks = np.prod(grid_dim)
total_threads = total_blocks * np.prod(block_dim)
print(f"Launching kernel with {total_blocks} blocks, "
f"{total_threads} total threads")
# Simulate execution (in reality, GPU executes in parallel)
for bx in range(grid_dim[0]):
for by in range(grid_dim[1] if len(grid_dim) > 1 else 1):
for bz in range(grid_dim[2] if len(grid_dim) > 2 else 1):
block_idx = (bx, by, bz)
for tx in range(block_dim[0]):
for ty in range(block_dim[1] if len(block_dim) > 1 else 1):
for tz in range(block_dim[2] if len(block_dim) > 2 else 1):
thread_idx = (tx, ty, tz)
# Execute kernel for this thread
kernel(block_idx, block_dim, thread_idx, *args)
# Example kernel: Vector addition
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
"""
Kernel for c = a + b
Each thread computes one element.
"""
# Calculate global thread index
idx = block_idx[0] * block_dim[0] + thread_idx[0]
# Bounds check (grid might be larger than data)
if idx < n:
c[idx] = a[idx] + b[idx]
def vector_add_example():
"""Demonstrate GPU vector addition"""
n = 1000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)
gpu = SimulatedGPU()
# Calculate grid dimensions
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
    gpu.launch_kernel(
        vector_add_kernel,
        (blocks,),             # grid_dim
        (threads_per_block,),  # block_dim
        a, b, c, n
    )
# Verify
expected = a + b
print(f"Correct: {np.allclose(c, expected)}")
NumPy/CuPy Style GPU Programming
# In practice, use high-level libraries
# CuPy provides NumPy-compatible GPU arrays
"""
# With CuPy (actual GPU library):
import cupy as cp
# Create GPU arrays
a_gpu = cp.array([1, 2, 3, 4, 5])
b_gpu = cp.array([5, 4, 3, 2, 1])
# Operations run on GPU automatically
c_gpu = a_gpu + b_gpu
d_gpu = cp.dot(a_gpu, b_gpu)
e_gpu = cp.sum(a_gpu)
# Transfer back to CPU
c_cpu = cp.asnumpy(c_gpu)
"""
# Simulated high-level GPU operations
class GPUArray:
"""Simulated GPU array for demonstration"""
def __init__(self, data):
if isinstance(data, np.ndarray):
self.data = data.copy()
else:
self.data = np.array(data)
self._on_gpu = True
def __add__(self, other):
"""Element-wise addition (would be parallel on GPU)"""
if isinstance(other, GPUArray):
return GPUArray(self.data + other.data)
return GPUArray(self.data + other)
def __mul__(self, other):
"""Element-wise multiplication"""
if isinstance(other, GPUArray):
return GPUArray(self.data * other.data)
return GPUArray(self.data * other)
def sum(self):
"""Reduction (parallel on GPU)"""
return GPUArray(np.array([self.data.sum()]))
def to_cpu(self):
"""Transfer to CPU memory"""
self._on_gpu = False
return self.data.copy()
@staticmethod
def from_cpu(data):
"""Transfer from CPU to GPU"""
return GPUArray(data)
3. Common GPU Algorithms
Parallel Reduction
class ParallelReduction:
"""
Parallel reduction pattern.
Reduce N elements to 1 using O(log N) parallel steps.
"""
@staticmethod
def naive_reduction(data: list, op=lambda a, b: a + b) -> float:
"""
Naive sequential reduction: O(N)
"""
result = data[0]
for x in data[1:]:
result = op(result, x)
return result
    @staticmethod
    def parallel_reduction_concept(data: list, op=lambda a, b: a + b) -> float:
        """
        Parallel reduction: O(log N) parallel steps.
        Start:  [0, 1, 2, 3, 4, 5, 6, 7]
        Step 1: [0+1, 2+3, 4+5, 6+7] -> [1, 5, 9, 13]
        Step 2: [1+5, 9+13]          -> [6, 22]
        Step 3: [6+22]               -> [28]
        """
        n = len(data)
        result = data.copy()
        stride = 1
        while stride < n:
            # In parallel, each active thread combines result[i] with
            # result[i + stride], leaving partial results in place.
            for i in range(0, n, stride * 2):
                if i + stride < n:
                    result[i] = op(result[i], result[i + stride])
            stride *= 2
        return result[0]
    @staticmethod
    def tree_reduction(data: list, op=lambda a, b: a + b) -> float:
        """
        Tree reduction (better memory access pattern).
        Assumes len(data) is a power of two.
        Thread i computes: result[i] = op(result[i], result[i + n/2])
        Then: result[i] = op(result[i], result[i + n/4])
        etc.
        """
        n = len(data)
        result = data.copy()
        active_threads = n // 2
        while active_threads > 0:
            # Each active thread combines two elements
            for i in range(active_threads):
                result[i] = op(result[i], result[i + active_threads])
            active_threads //= 2
        return result[0]
# Demonstrate reduction
data = [1, 2, 3, 4, 5, 6, 7, 8]
pr = ParallelReduction()
print(f"Naive: {pr.naive_reduction(data)}")
print(f"Parallel concept: {pr.parallel_reduction_concept(data)}")
print(f"Tree: {pr.tree_reduction(data)}")
Parallel Prefix Sum (Scan)
class ParallelScan:
"""
Parallel prefix sum (scan).
Input: [3, 1, 7, 0, 4, 1, 6, 3]
Output: [3, 4, 11, 11, 15, 16, 22, 25] (inclusive)
Or: [0, 3, 4, 11, 11, 15, 16, 22] (exclusive)
"""
@staticmethod
def sequential_scan(data: list) -> list:
"""Sequential prefix sum: O(N)"""
result = []
total = 0
for x in data:
total += x
result.append(total)
return result
@staticmethod
def hillis_steele_scan(data: list) -> list:
"""
Hillis-Steele scan: O(N log N) work, O(log N) steps.
Work-inefficient but highly parallel.
"""
n = len(data)
result = data.copy()
stride = 1
while stride < n:
# Each thread i computes (in parallel):
# new[i] = result[i] + result[i - stride] (if i >= stride)
new_result = result.copy()
for i in range(stride, n):
new_result[i] = result[i] + result[i - stride]
result = new_result
stride *= 2
return result
@staticmethod
def blelloch_scan(data: list) -> list:
"""
Blelloch scan: O(N) work, O(log N) steps.
Work-efficient, two-phase algorithm.
Phase 1 (Up-sweep): Build reduction tree
Phase 2 (Down-sweep): Traverse tree to compute prefix sums
"""
n = len(data)
# Pad to power of 2
import math
padded_n = 2 ** math.ceil(math.log2(n))
result = data + [0] * (padded_n - n)
# Up-sweep (reduce)
stride = 1
while stride < padded_n:
for i in range(stride - 1, padded_n - 1, stride * 2):
result[i + stride] += result[i]
stride *= 2
# Set root to 0 for exclusive scan
result[padded_n - 1] = 0
# Down-sweep
stride = padded_n // 2
while stride > 0:
for i in range(stride - 1, padded_n - 1, stride * 2):
temp = result[i]
result[i] = result[i + stride]
result[i + stride] += temp
stride //= 2
# Convert to inclusive by shifting and adding original
inclusive = [result[i] + data[i] if i < n else 0 for i in range(n)]
return inclusive
# Demonstrate scans
data = [3, 1, 7, 0, 4, 1, 6, 3]
ps = ParallelScan()
print(f"Input: {data}")
print(f"Sequential: {ps.sequential_scan(data)}")
print(f"Hillis-Steele: {ps.hillis_steele_scan(data)}")
print(f"Blelloch: {ps.blelloch_scan(data)}")
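The reference values in the docstring can be cross-checked against NumPy's built-in cumulative sum; the exclusive scan is the inclusive scan shifted by one element:

```python
import numpy as np

data = np.array([3, 1, 7, 0, 4, 1, 6, 3])
inclusive = np.cumsum(data)
exclusive = inclusive - data  # exclusive[i] = inclusive[i] - data[i]
print(inclusive.tolist())  # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive.tolist())  # [0, 3, 4, 11, 11, 15, 16, 22]
```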
Parallel Matrix Multiplication
class ParallelMatrixMultiply:
"""
Matrix multiplication on GPU.
C = A * B where A is MxK and B is KxN
"""
@staticmethod
def naive_kernel(block_idx, block_dim, thread_idx, A, B, C, M, K, N):
"""
Naive: Each thread computes one element of C.
Poor memory access pattern.
"""
row = block_idx[0] * block_dim[0] + thread_idx[0]
col = block_idx[1] * block_dim[1] + thread_idx[1]
if row < M and col < N:
total = 0
for k in range(K):
total += A[row][k] * B[k][col]
C[row][col] = total
    @staticmethod
    def tiled_matmul_concept(A, B, tile_size=16):
        """
        Tiled matrix multiplication.
        Load tiles into shared memory, compute, repeat.
        Much better memory access pattern.
        """
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        C = np.zeros((M, N))
        # Process tiles (edge tiles may be smaller than tile_size)
        for tile_row in range(0, M, tile_size):
            for tile_col in range(0, N, tile_size):
                # Size the tile result to the actual tile extents
                rows = min(tile_size, M - tile_row)
                cols = min(tile_size, N - tile_col)
                tile_c = np.zeros((rows, cols))
                for tile_k in range(0, K, tile_size):
                    # Load tiles into "shared memory"
                    tile_a = A[tile_row:tile_row+rows,
                               tile_k:tile_k+tile_size]
                    tile_b = B[tile_k:tile_k+tile_size,
                               tile_col:tile_col+cols]
                    # Compute partial result
                    tile_c += tile_a @ tile_b
                # Write result back to global memory
                C[tile_row:tile_row+rows,
                  tile_col:tile_col+cols] = tile_c
        return C
# Why tiling helps
"""
Without tiling:
- Each thread reads one row of A (K elements) and one column of B (K elements)
- The same rows and columns are re-read by many different threads
- Total global memory reads: MNK (from A) + MNK (from B) = 2MNK
With tiling (tile size T):
- Each tile is loaded into shared memory once per tile phase
- Every loaded element is reused by T threads
- Total global memory reads: 2MNK/T
For T=16, this is 16x fewer global memory accesses!
"""
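The tiling scheme can be verified against a direct product with a compact standalone sketch (dimensions chosen as multiples of the tile size; on a real GPU each (i, j) tile would be assigned to one thread block):

```python
import numpy as np

def tiled_matmul(A, B, T=16):
    """Accumulate T x T partial products -- the shared-memory tile analogue."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):  # one "tile phase" per iteration
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 32))
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```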
4. GPU Optimization
Memory Coalescing
class MemoryCoalescing:
"""
Memory coalescing: Combining memory accesses.
When threads in a warp access consecutive addresses,
GPU combines into fewer memory transactions.
"""
@staticmethod
def coalesced_example():
"""
GOOD: Threads access consecutive elements.
Thread 0: array[0]
Thread 1: array[1]
...
Thread 31: array[31]
-> 1 memory transaction (128 bytes)
"""
pass
@staticmethod
def strided_example():
"""
BAD: Threads access with stride.
Thread 0: array[0]
Thread 1: array[32]
Thread 2: array[64]
...
-> 32 memory transactions!
"""
pass
@staticmethod
def transpose_for_coalescing(matrix, rows, cols):
"""
Sometimes transpose data to enable coalescing.
Row-major access (C-style): Adjacent cols are adjacent in memory
- Reading row: coalesced
- Reading column: strided
If algorithm needs column access, transpose first!
"""
# Original: row-major
# matrix[row][col] = matrix[row * cols + col]
# Transposed: column becomes row
transposed = [[matrix[r][c] for r in range(rows)] for c in range(cols)]
return transposed
Avoiding Warp Divergence
class WarpDivergence:
"""
Warp divergence: When threads in a warp take different branches.
All 32 threads in a warp execute in lockstep.
If some take branch A and others take branch B:
- First, all execute A (non-A threads are masked)
- Then, all execute B (non-B threads are masked)
- Serialization!
"""
@staticmethod
def divergent_kernel(thread_idx, data, result):
"""
BAD: Threads diverge based on data value.
Different threads likely have different values.
"""
if data[thread_idx] > 0:
result[thread_idx] = data[thread_idx] * 2
else:
result[thread_idx] = data[thread_idx] * -1
@staticmethod
def less_divergent_kernel(thread_idx, data, result):
"""
BETTER: Diverge based on thread ID.
Threads 0-15 all take same branch.
Threads 16-31 all take same branch.
Warp 0 doesn't diverge, warp 1 doesn't diverge.
"""
if thread_idx < 16:
result[thread_idx] = data[thread_idx] * 2
else:
result[thread_idx] = data[thread_idx] * -1
    @staticmethod
    def no_divergence_kernel(thread_idx, data, result):
        """
        BEST: No branching, use math.
        Predicated execution or branchless code.
        """
        # A simple conditional expression compiles to a predicated
        # select, not a branch, so the warp stays converged.
        scale = 2 if data[thread_idx] > 0 else -1
        result[thread_idx] = data[thread_idx] * scale
        # Fully branchless alternative:
        # pos = int(data[thread_idx] > 0)
        # result[thread_idx] = data[thread_idx] * (3 * pos - 1)
Occupancy and Resource Usage
class GPUOccupancy:
"""
Occupancy: Ratio of active warps to maximum possible.
Higher occupancy can hide memory latency.
But sometimes lower occupancy with more registers is better!
"""
def __init__(self, max_threads_per_sm=2048, max_blocks_per_sm=32,
registers_per_sm=65536, shared_memory_per_sm=49152):
self.max_threads_per_sm = max_threads_per_sm
self.max_blocks_per_sm = max_blocks_per_sm
self.registers_per_sm = registers_per_sm
self.shared_memory_per_sm = shared_memory_per_sm
def calculate_occupancy(self, threads_per_block, registers_per_thread,
shared_memory_per_block):
"""Calculate theoretical occupancy"""
# Limit by threads per block
blocks_by_threads = self.max_threads_per_sm // threads_per_block
# Limit by registers
registers_per_block = threads_per_block * registers_per_thread
blocks_by_registers = self.registers_per_sm // registers_per_block
# Limit by shared memory
if shared_memory_per_block > 0:
blocks_by_smem = self.shared_memory_per_sm // shared_memory_per_block
else:
blocks_by_smem = self.max_blocks_per_sm
# Limit by max blocks
blocks_by_limit = self.max_blocks_per_sm
# Actual blocks
blocks = min(blocks_by_threads, blocks_by_registers,
blocks_by_smem, blocks_by_limit)
active_threads = blocks * threads_per_block
occupancy = active_threads / self.max_threads_per_sm
return {
'blocks_per_sm': blocks,
'active_threads': active_threads,
'occupancy': occupancy,
'limited_by': self._find_limiter(
blocks_by_threads, blocks_by_registers,
blocks_by_smem, blocks_by_limit
)
}
def _find_limiter(self, t, r, s, l):
m = min(t, r, s, l)
if m == t: return 'threads'
if m == r: return 'registers'
if m == s: return 'shared_memory'
return 'block_limit'
# Example
occ = GPUOccupancy()
result = occ.calculate_occupancy(
threads_per_block=256,
registers_per_thread=32,
shared_memory_per_block=4096
)
print(f"Occupancy: {result['occupancy']:.1%}")
print(f"Limited by: {result['limited_by']}")
5. When to Use GPUs
Decision Framework
class GPUDecisionFramework:
"""Guidelines for when to use GPU vs CPU"""
@staticmethod
def should_use_gpu(problem):
"""
Factors favoring GPU:
1. Data parallelism (same operation on many elements)
2. Large dataset (amortize transfer overhead)
3. Compute-intensive (arithmetic intensity)
4. Regular memory access patterns
Factors favoring CPU:
1. Sequential/branching logic
2. Small dataset
3. Memory-intensive (low arithmetic intensity)
4. Irregular memory access
5. Need for quick turnaround (no transfer overhead)
"""
score = 0
# Data parallelism
if problem.get('parallelizable_fraction', 0) > 0.9:
score += 3
# Dataset size
size = problem.get('data_size_mb', 0)
if size > 100:
score += 2
elif size < 1:
score -= 2
# Arithmetic intensity (FLOPs per byte)
intensity = problem.get('arithmetic_intensity', 0)
if intensity > 10:
score += 3
elif intensity < 1:
score -= 2
# Memory pattern
if problem.get('memory_pattern') == 'regular':
score += 1
elif problem.get('memory_pattern') == 'random':
score -= 2
return score > 2
# Examples
problems = [
{
'name': 'Matrix multiplication',
'parallelizable_fraction': 0.99,
'data_size_mb': 500,
'arithmetic_intensity': 100, # N FLOPs per element
'memory_pattern': 'regular'
},
{
'name': 'Graph traversal (BFS)',
'parallelizable_fraction': 0.7,
'data_size_mb': 50,
'arithmetic_intensity': 0.5, # Just checking neighbors
'memory_pattern': 'random'
},
{
'name': 'Small sort',
'parallelizable_fraction': 0.8,
'data_size_mb': 0.1,
'arithmetic_intensity': 5,
'memory_pattern': 'irregular'
}
]
df = GPUDecisionFramework()
for p in problems:
use_gpu = df.should_use_gpu(p)
print(f"{p['name']}: {'GPU' if use_gpu else 'CPU'}")
Profiling and Benchmarking
import time
def benchmark_comparison(cpu_func, gpu_func, data, warmup=3, iterations=10):
"""Compare CPU vs GPU performance"""
# Warmup (important for GPU!)
for _ in range(warmup):
cpu_func(data)
gpu_func(data)
# CPU timing
cpu_times = []
for _ in range(iterations):
start = time.perf_counter()
cpu_result = cpu_func(data)
cpu_times.append(time.perf_counter() - start)
    # GPU timing (include transfer time; on a real GPU, synchronize
    # before stopping the timer -- kernel launches are asynchronous)
    gpu_times = []
for _ in range(iterations):
start = time.perf_counter()
gpu_result = gpu_func(data)
gpu_times.append(time.perf_counter() - start)
return {
'cpu_mean': sum(cpu_times) / len(cpu_times),
'gpu_mean': sum(gpu_times) / len(gpu_times),
'speedup': (sum(cpu_times) / len(cpu_times)) /
(sum(gpu_times) / len(gpu_times))
}
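A back-of-envelope model makes the transfer-overhead point concrete. The throughput and bandwidth figures below are rough assumptions for illustration, not measurements:

```python
# All throughput figures are rough assumptions, not measurements.
CPU_FLOPS = 2e11   # sustained CPU throughput, FLOP/s (assumption)
GPU_FLOPS = 1e13   # sustained GPU throughput, FLOP/s (assumption)
PCIE_BW = 16e9     # PCIe bandwidth, bytes/s (~PCIe 4.0 x16, assumption)

def gpu_faster(data_bytes, flops):
    """True if estimated GPU time (compute + transfer) beats CPU time."""
    cpu_s = flops / CPU_FLOPS
    gpu_s = flops / GPU_FLOPS + data_bytes / PCIE_BW  # each byte crosses once
    return gpu_s < cpu_s

n = 4096
# 4096x4096 matmul: ~2n^3 FLOPs on ~200MB of data -> compute-heavy
print(gpu_faster(3 * n * n * 4, 2 * n**3))       # True
# Element-wise add of 1M floats: 1M FLOPs on 12MB -> transfer-dominated
print(gpu_faster(3 * 1_000_000 * 4, 1_000_000))  # False
```

Under these assumptions, the matmul's arithmetic intensity amortizes the transfer, while the element-wise add spends nearly all its time on the bus.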
Exercises
Basic
Calculate the global thread index for a 2D grid with blocks (4, 4) and threads (16, 16).
Explain why adjacent threads should access adjacent memory locations.
Implement a parallel reduction for finding the maximum element.
Intermediate
Implement tiled matrix multiplication and compare memory accesses with naive version.
Analyze a piece of code for warp divergence and optimize it.
Calculate the occupancy for a kernel using 128 threads, 40 registers/thread, and 8KB shared memory.
Advanced
Implement parallel radix sort using prefix sums.
Design a GPU algorithm for sparse matrix-vector multiplication.
Optimize a GPU kernel by analyzing and improving its memory access patterns.
Summary
- GPUs have thousands of simple cores for throughput
- Memory coalescing is critical for performance
- Avoid warp divergence by restructuring conditionals
- Parallel reduction and scan are fundamental building blocks
- Tiling improves data reuse through shared memory
- Use GPUs for data-parallel, compute-intensive workloads
Module Complete
This completes the Parallel Computing module! You've learned fundamental concepts for writing efficient parallel programs.