Hash Tables

Introduction

Hash tables (also called hash maps or dictionaries) provide near-constant time O(1) average case for insertion, deletion, and lookup. They're one of the most important and widely-used data structures in programming.

Learning Objectives

By the end of this reading, you will be able to:

  • Understand how hash functions work
  • Implement a hash table with collision handling
  • Analyze hash table performance
  • Choose appropriate hash table configurations

1. What is a Hash Table?

A hash table stores key-value pairs using a hash function to compute an index into an array of buckets.

The Idea

Key → Hash Function → Index → Bucket → Value

"apple" → hash("apple") → 3 → bucket[3] → "red"

Visual Representation

Hash Table with array of size 8:

Index:  0       1       2       3       4       5       6       7
      ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
      │       │"cat": │       │"apple"│       │"dog": │       │"bird":│
      │  -    │ 5     │  -    │:"red" │  -    │ 3     │  -    │ 2     │
      └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

hash("apple") % 8 = 3
hash("cat") % 8 = 1
hash("dog") % 8 = 5
hash("bird") % 8 = 7

2. Hash Functions

A hash function converts a key into an array index.

Properties of Good Hash Functions

  1. Deterministic: Same input always produces same output
  2. Uniform Distribution: Spreads keys evenly across buckets
  3. Fast to Compute: O(1) for practical purposes
  4. Minimizes Collisions: Different keys should map to different indices

Simple Hash Functions

def hash_string(s, table_size):
    """Simple string hash"""
    total = 0
    for char in s:
        total += ord(char)
    return total % table_size

# Problem: "abc" and "cba" hash to same value!

Better Hash Functions

def hash_string_polynomial(s, table_size):
    """Polynomial rolling hash"""
    hash_value = 0
    base = 31
    for i, char in enumerate(s):
        hash_value += ord(char) * (base ** i)
    return hash_value % table_size

# "abc" ≠ "cba" with this function

Python's Built-in hash()

print(hash("hello"))     # Large integer
print(hash(42))          # 42 (integers hash to themselves)
print(hash((1, 2, 3)))   # Tuples are hashable
# hash([1, 2, 3])        # Error! Lists are not hashable

Note: Python's hash() can return different values between runs (for security). For hash tables, use hash(key) % table_size.


3. Collision Handling

A collision occurs when two different keys hash to the same index.

Method 1: Chaining (Separate Chaining)

Each bucket stores a linked list (or other collection) of entries.

Index 3: [ ("apple", "red") -> ("grape", "purple") -> None ]
class HashTableChaining:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _hash(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        index = self._hash(key)
        bucket = self.buckets[index]

        # Update if key exists
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return

        # Add new entry
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        index = self._hash(key)
        bucket = self.buckets[index]

        for k, v in bucket:
            if k == key:
                return v

        raise KeyError(key)

    def remove(self, key):
        index = self._hash(key)
        bucket = self.buckets[index]

        for i, (k, v) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return

        raise KeyError(key)

    def __contains__(self, key):
        try:
            self.get(key)
            return True
        except KeyError:
            return False

Method 2: Open Addressing (Linear Probing)

On collision, find the next empty slot.

Insert "apple" at index 3
Insert "grape" (also hashes to 3) → Try 4, 5, ... until empty

Index:  0       1       2       3       4       5
      ┌───────┬───────┬───────┬───────┬───────┬───────┐
      │       │       │       │"apple"│"grape"│       │
      └───────┴───────┴───────┴───────┴───────┴───────┘
class HashTableOpenAddressing:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()  # Tombstone marker

    def _hash(self, key):
        return hash(key) % self.capacity

    def _probe(self, key):
        """Find index for key using linear probing"""
        index = self._hash(key)
        first_deleted = None

        for _ in range(self.capacity):
            if self.keys[index] is None:
                return first_deleted if first_deleted is not None else index
            if self.keys[index] is self.DELETED:
                if first_deleted is None:
                    first_deleted = index
            elif self.keys[index] == key:
                return index
            index = (index + 1) % self.capacity

        return first_deleted  # Table full but found deleted slot

    def put(self, key, value):
        if self.size >= self.capacity * 0.7:  # Load factor threshold
            self._resize()

        index = self._probe(key)
        if self.keys[index] != key:  # New key
            self.size += 1
        self.keys[index] = key
        self.values[index] = value

    def get(self, key):
        index = self._hash(key)

        for _ in range(self.capacity):
            if self.keys[index] is None:
                raise KeyError(key)
            if self.keys[index] == key:
                return self.values[index]
            index = (index + 1) % self.capacity

        raise KeyError(key)

    def remove(self, key):
        index = self._hash(key)

        for _ in range(self.capacity):
            if self.keys[index] is None:
                raise KeyError(key)
            if self.keys[index] == key:
                self.keys[index] = self.DELETED
                self.values[index] = None
                self.size -= 1
                return
            index = (index + 1) % self.capacity

        raise KeyError(key)

    def _resize(self):
        old_keys = self.keys
        old_values = self.values

        self.capacity *= 2
        self.keys = [None] * self.capacity
        self.values = [None] * self.capacity
        self.size = 0

        for i, key in enumerate(old_keys):
            if key is not None and key is not self.DELETED:
                self.put(key, old_values[i])

Other Probing Methods

Quadratic Probing: Try index + 1², index + 2², index + 3², ...

# Reduces clustering but can miss empty slots
index = (original_index + i**2) % capacity

Double Hashing: Use a second hash function

# More uniform distribution
step = hash2(key)
index = (original_index + i * step) % capacity

4. Load Factor and Resizing

Load Factor

Load Factor (α) = n / m

  • n = number of entries
  • m = number of buckets

Higher load factor → more collisions → slower performance

When to Resize

  • Chaining: Resize when α > 1 (or some threshold like 0.75)
  • Open Addressing: Must resize before full; typically at α = 0.5-0.7

Resizing Process

  1. Create new, larger array (typically 2× size)
  2. Rehash all existing entries
  3. Replace old array with new
def _resize(self):
    old_buckets = self.buckets
    self.capacity *= 2
    self.buckets = [[] for _ in range(self.capacity)]
    self.size = 0

    for bucket in old_buckets:
        for key, value in bucket:
            self.put(key, value)  # Rehash with new capacity

5. Time Complexity

OperationAverageWorst (many collisions)
InsertO(1)O(n)
LookupO(1)O(n)
DeleteO(1)O(n)

With a good hash function and reasonable load factor, worst case is rare.

Space Complexity: O(n) for n entries


6. Hash Tables in Different Languages

Python (dict)

# Create
d = {}
d = {'a': 1, 'b': 2}
d = dict(a=1, b=2)

# Operations
d['c'] = 3           # Insert/Update
value = d['a']       # Access (KeyError if missing)
value = d.get('x', default)  # Access with default
'a' in d             # Check existence
del d['a']           # Delete

# Iteration
for key in d:
    print(key)
for key, value in d.items():
    print(key, value)

Java (HashMap)

HashMap<String, Integer> map = new HashMap<>();
map.put("a", 1);
int value = map.get("a");       // null if missing
int value = map.getOrDefault("x", 0);
map.containsKey("a");
map.remove("a");

for (Map.Entry<String, Integer> entry : map.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue());
}

JavaScript (Object and Map)

// Object (keys converted to strings)
let obj = { a: 1, b: 2 };
obj['c'] = 3;
obj.d = 4;

// Map (any type as key)
let map = new Map();
map.set('a', 1);
map.get('a');
map.has('a');
map.delete('a');

7. Common Operations and Patterns

Counting Occurrences

def count_chars(s):
    counts = {}
    for char in s:
        counts[char] = counts.get(char, 0) + 1
    return counts

# Or using Counter
from collections import Counter
counts = Counter(s)

Grouping by Key

def group_by_length(words):
    groups = {}
    for word in words:
        length = len(word)
        if length not in groups:
            groups[length] = []
        groups[length].append(word)
    return groups

# Or using defaultdict
from collections import defaultdict
groups = defaultdict(list)
for word in words:
    groups[len(word)].append(word)

Two Sum Problem

def two_sum(nums, target):
    """Find indices of two numbers that add to target"""
    seen = {}  # value -> index

    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i

    return None

# O(n) time, O(n) space

Finding Duplicates

def find_duplicates(nums):
    seen = set()
    duplicates = set()

    for num in nums:
        if num in seen:
            duplicates.add(num)
        seen.add(num)

    return list(duplicates)

Anagram Grouping

def group_anagrams(words):
    groups = defaultdict(list)

    for word in words:
        # Sorted letters as key
        key = tuple(sorted(word))
        groups[key].append(word)

    return list(groups.values())

words = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(words))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]

8. Hash Sets

A hash set is like a hash table but only stores keys (no values).

# Python set
s = set()
s = {1, 2, 3}

s.add(4)         # Insert
s.remove(1)      # Delete (KeyError if missing)
s.discard(10)    # Delete (no error if missing)
1 in s           # Membership test
len(s)           # Size

# Set operations
a = {1, 2, 3}
b = {2, 3, 4}
a | b            # Union: {1, 2, 3, 4}
a & b            # Intersection: {2, 3}
a - b            # Difference: {1}
a ^ b            # Symmetric difference: {1, 4}

9. Specialized Hash Maps

OrderedDict

Maintains insertion order (Python dict does this by default since 3.7).

from collections import OrderedDict
od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['c'] = 3
# Iteration is in insertion order

defaultdict

Returns default value for missing keys.

from collections import defaultdict

# List as default
dd = defaultdict(list)
dd['key'].append(1)  # No KeyError!

# Int as default (useful for counting)
dd = defaultdict(int)
dd['count'] += 1     # Starts at 0

Counter

Specialized for counting.

from collections import Counter

c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(c)                # Counter({'a': 3, 'b': 2, 'c': 1})
print(c.most_common(2)) # [('a', 3), ('b', 2)]
print(c['a'])           # 3
print(c['x'])           # 0 (not KeyError)

10. When to Use Hash Tables

Use Hash Tables When:

  • You need O(1) average lookup by key
  • You need to count occurrences
  • You need to detect duplicates
  • You need to group items by some property
  • Order doesn't matter

Consider Alternatives When:

  • You need ordered data → TreeMap/sorted structure
  • Keys are small integers → Direct array indexing
  • Memory is very constrained → Array-based alternatives
  • Worst-case guarantees needed → Balanced BST

Exercises

Basic

  1. Implement a hash set (keys only, no values).

  2. Given two arrays, find their intersection.

  3. Check if a string has all unique characters using a hash set.

Intermediate

  1. Implement an LRU (Least Recently Used) cache using a hash map and doubly linked list.

  2. Given an array, find the length of the longest consecutive sequence (e.g., [100, 4, 200, 1, 3, 2] → 4, for [1,2,3,4]).

  3. Find all pairs of integers that sum to a given target.

Advanced

  1. Design a hash map that supports O(1) getRandom() in addition to insert, delete, and lookup.

  2. Implement consistent hashing for distributed cache.

  3. Given a string, find the first non-repeating character.


Summary

  • Hash tables provide O(1) average case for insert, lookup, delete
  • Hash functions convert keys to array indices
  • Collisions handled via chaining or open addressing
  • Load factor affects performance; resize when needed
  • Python: dict, set, defaultdict, Counter
  • Use when you need fast key-based access
  • Watch for worst-case O(n) with many collisions

Next Reading

Trees and Binary Search Trees →