Hash Tables
Introduction
Hash tables (also called hash maps or dictionaries) provide near-constant time O(1) average case for insertion, deletion, and lookup. They're one of the most important and widely-used data structures in programming.
Learning Objectives
By the end of this reading, you will be able to:
- Understand how hash functions work
- Implement a hash table with collision handling
- Analyze hash table performance
- Choose appropriate hash table configurations
1. What is a Hash Table?
A hash table stores key-value pairs using a hash function to compute an index into an array of buckets.
The Idea
Key → Hash Function → Index → Bucket → Value
"apple" → hash("apple") → 3 → bucket[3] → "red"
Visual Representation
Hash Table with array of size 8:
Index: 0 1 2 3 4 5 6 7
┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ │"cat": │ │"apple"│ │"dog": │ │"bird":│
│ - │ 5 │ - │:"red" │ - │ 3 │ - │ 2 │
└───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
hash("apple") % 8 = 3
hash("cat") % 8 = 1
hash("dog") % 8 = 5
hash("bird") % 8 = 7
2. Hash Functions
A hash function converts a key into an array index.
Properties of Good Hash Functions
- Deterministic: Same input always produces same output
- Uniform Distribution: Spreads keys evenly across buckets
- Fast to Compute: O(1) for practical purposes
- Minimizes Collisions: Different keys should map to different indices
Simple Hash Functions
def hash_string(s, table_size):
"""Simple string hash"""
total = 0
for char in s:
total += ord(char)
return total % table_size
# Problem: "abc" and "cba" hash to same value!
Better Hash Functions
def hash_string_polynomial(s, table_size):
"""Polynomial rolling hash"""
hash_value = 0
base = 31
for i, char in enumerate(s):
hash_value += ord(char) * (base ** i)
return hash_value % table_size
# "abc" ≠ "cba" with this function
Python's Built-in hash()
print(hash("hello")) # Large integer
print(hash(42)) # 42 (integers hash to themselves)
print(hash((1, 2, 3))) # Tuples are hashable
# hash([1, 2, 3]) # Error! Lists are not hashable
Note: Python's hash() can return different values between runs (for security). For hash tables, use hash(key) % table_size.
3. Collision Handling
A collision occurs when two different keys hash to the same index.
Method 1: Chaining (Separate Chaining)
Each bucket stores a linked list (or other collection) of entries.
Index 3: [ ("apple", "red") -> ("grape", "purple") -> None ]
class HashTableChaining:
def __init__(self, capacity=10):
self.capacity = capacity
self.size = 0
self.buckets = [[] for _ in range(capacity)]
def _hash(self, key):
return hash(key) % self.capacity
def put(self, key, value):
index = self._hash(key)
bucket = self.buckets[index]
# Update if key exists
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
# Add new entry
bucket.append((key, value))
self.size += 1
def get(self, key):
index = self._hash(key)
bucket = self.buckets[index]
for k, v in bucket:
if k == key:
return v
raise KeyError(key)
def remove(self, key):
index = self._hash(key)
bucket = self.buckets[index]
for i, (k, v) in enumerate(bucket):
if k == key:
del bucket[i]
self.size -= 1
return
raise KeyError(key)
def __contains__(self, key):
try:
self.get(key)
return True
except KeyError:
return False
Method 2: Open Addressing (Linear Probing)
On collision, find the next empty slot.
Insert "apple" at index 3
Insert "grape" (also hashes to 3) → Try 4, 5, ... until empty
Index: 0 1 2 3 4 5
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ │ │ │"apple"│"grape"│ │
└───────┴───────┴───────┴───────┴───────┴───────┘
class HashTableOpenAddressing:
def __init__(self, capacity=10):
self.capacity = capacity
self.size = 0
self.keys = [None] * capacity
self.values = [None] * capacity
self.DELETED = object() # Tombstone marker
def _hash(self, key):
return hash(key) % self.capacity
def _probe(self, key):
"""Find index for key using linear probing"""
index = self._hash(key)
first_deleted = None
for _ in range(self.capacity):
if self.keys[index] is None:
return first_deleted if first_deleted is not None else index
if self.keys[index] is self.DELETED:
if first_deleted is None:
first_deleted = index
elif self.keys[index] == key:
return index
index = (index + 1) % self.capacity
return first_deleted # Table full but found deleted slot
def put(self, key, value):
if self.size >= self.capacity * 0.7: # Load factor threshold
self._resize()
index = self._probe(key)
if self.keys[index] != key: # New key
self.size += 1
self.keys[index] = key
self.values[index] = value
def get(self, key):
index = self._hash(key)
for _ in range(self.capacity):
if self.keys[index] is None:
raise KeyError(key)
if self.keys[index] == key:
return self.values[index]
index = (index + 1) % self.capacity
raise KeyError(key)
def remove(self, key):
index = self._hash(key)
for _ in range(self.capacity):
if self.keys[index] is None:
raise KeyError(key)
if self.keys[index] == key:
self.keys[index] = self.DELETED
self.values[index] = None
self.size -= 1
return
index = (index + 1) % self.capacity
raise KeyError(key)
def _resize(self):
old_keys = self.keys
old_values = self.values
self.capacity *= 2
self.keys = [None] * self.capacity
self.values = [None] * self.capacity
self.size = 0
for i, key in enumerate(old_keys):
if key is not None and key is not self.DELETED:
self.put(key, old_values[i])
Other Probing Methods
Quadratic Probing: Try index + 1², index + 2², index + 3², ...
# Reduces clustering but can miss empty slots
index = (original_index + i**2) % capacity
Double Hashing: Use a second hash function
# More uniform distribution
step = hash2(key)
index = (original_index + i * step) % capacity
4. Load Factor and Resizing
Load Factor
Load Factor (α) = n / m
- n = number of entries
- m = number of buckets
Higher load factor → more collisions → slower performance
When to Resize
- Chaining: Resize when α > 1 (or some threshold like 0.75)
- Open Addressing: Must resize before full; typically at α = 0.5-0.7
Resizing Process
- Create new, larger array (typically 2× size)
- Rehash all existing entries
- Replace old array with new
def _resize(self):
old_buckets = self.buckets
self.capacity *= 2
self.buckets = [[] for _ in range(self.capacity)]
self.size = 0
for bucket in old_buckets:
for key, value in bucket:
self.put(key, value) # Rehash with new capacity
5. Time Complexity
| Operation | Average | Worst (many collisions) |
|---|---|---|
| Insert | O(1) | O(n) |
| Lookup | O(1) | O(n) |
| Delete | O(1) | O(n) |
With a good hash function and reasonable load factor, worst case is rare.
Space Complexity: O(n) for n entries
6. Hash Tables in Different Languages
Python (dict)
# Create
d = {}
d = {'a': 1, 'b': 2}
d = dict(a=1, b=2)
# Operations
d['c'] = 3 # Insert/Update
value = d['a'] # Access (KeyError if missing)
value = d.get('x', default) # Access with default
'a' in d # Check existence
del d['a'] # Delete
# Iteration
for key in d:
print(key)
for key, value in d.items():
print(key, value)
Java (HashMap)
HashMap<String, Integer> map = new HashMap<>();
map.put("a", 1);
int value = map.get("a"); // null if missing
int value = map.getOrDefault("x", 0);
map.containsKey("a");
map.remove("a");
for (Map.Entry<String, Integer> entry : map.entrySet()) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
JavaScript (Object and Map)
// Object (keys converted to strings)
let obj = { a: 1, b: 2 };
obj['c'] = 3;
obj.d = 4;
// Map (any type as key)
let map = new Map();
map.set('a', 1);
map.get('a');
map.has('a');
map.delete('a');
7. Common Operations and Patterns
Counting Occurrences
def count_chars(s):
counts = {}
for char in s:
counts[char] = counts.get(char, 0) + 1
return counts
# Or using Counter
from collections import Counter
counts = Counter(s)
Grouping by Key
def group_by_length(words):
groups = {}
for word in words:
length = len(word)
if length not in groups:
groups[length] = []
groups[length].append(word)
return groups
# Or using defaultdict
from collections import defaultdict
groups = defaultdict(list)
for word in words:
groups[len(word)].append(word)
Two Sum Problem
def two_sum(nums, target):
"""Find indices of two numbers that add to target"""
seen = {} # value -> index
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return None
# O(n) time, O(n) space
Finding Duplicates
def find_duplicates(nums):
seen = set()
duplicates = set()
for num in nums:
if num in seen:
duplicates.add(num)
seen.add(num)
return list(duplicates)
Anagram Grouping
def group_anagrams(words):
groups = defaultdict(list)
for word in words:
# Sorted letters as key
key = tuple(sorted(word))
groups[key].append(word)
return list(groups.values())
words = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(words))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
8. Hash Sets
A hash set is like a hash table but only stores keys (no values).
# Python set
s = set()
s = {1, 2, 3}
s.add(4) # Insert
s.remove(1) # Delete (KeyError if missing)
s.discard(10) # Delete (no error if missing)
1 in s # Membership test
len(s) # Size
# Set operations
a = {1, 2, 3}
b = {2, 3, 4}
a | b # Union: {1, 2, 3, 4}
a & b # Intersection: {2, 3}
a - b # Difference: {1}
a ^ b # Symmetric difference: {1, 4}
9. Specialized Hash Maps
OrderedDict
Maintains insertion order (Python dict does this by default since 3.7).
from collections import OrderedDict
od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['c'] = 3
# Iteration is in insertion order
defaultdict
Returns default value for missing keys.
from collections import defaultdict
# List as default
dd = defaultdict(list)
dd['key'].append(1) # No KeyError!
# Int as default (useful for counting)
dd = defaultdict(int)
dd['count'] += 1 # Starts at 0
Counter
Specialized for counting.
from collections import Counter
c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(c) # Counter({'a': 3, 'b': 2, 'c': 1})
print(c.most_common(2)) # [('a', 3), ('b', 2)]
print(c['a']) # 3
print(c['x']) # 0 (not KeyError)
10. When to Use Hash Tables
Use Hash Tables When:
- You need O(1) average lookup by key
- You need to count occurrences
- You need to detect duplicates
- You need to group items by some property
- Order doesn't matter
Consider Alternatives When:
- You need ordered data → TreeMap/sorted structure
- Keys are small integers → Direct array indexing
- Memory is very constrained → Array-based alternatives
- Worst-case guarantees needed → Balanced BST
Exercises
Basic
Implement a hash set (keys only, no values).
Given two arrays, find their intersection.
Check if a string has all unique characters using a hash set.
Intermediate
Implement an LRU (Least Recently Used) cache using a hash map and doubly linked list.
Given an array, find the length of the longest consecutive sequence (e.g., [100, 4, 200, 1, 3, 2] → 4, for [1,2,3,4]).
Find all pairs of integers that sum to a given target.
Advanced
Design a hash map that supports O(1) getRandom() in addition to insert, delete, and lookup.
Implement consistent hashing for distributed cache.
Given a string, find the first non-repeating character.
Summary
- Hash tables provide O(1) average case for insert, lookup, delete
- Hash functions convert keys to array indices
- Collisions handled via chaining or open addressing
- Load factor affects performance; resize when needed
- Python: dict, set, defaultdict, Counter
- Use when you need fast key-based access
- Watch for worst-case O(n) with many collisions