Hash Tables | Computer Science Fundamentals

Hash tables (also called hash maps or dictionaries) provide near-constant time O(1) average case for insertion, deletion, and lookup. They're one of the most important and widely-used data structures in programming.

1. What is a Hash Table?

A hash table stores key-value pairs using a hash function to compute an index into an array of buckets.

The Idea

Key → Hash Function → Index → Bucket → Value

"apple" → hash("apple") → 3 → bucket[3] → "red"

Visual Representation

Hash Table with array of size 8:

Index:  0       1       2       3       4       5       6       7
      ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
      │       │"cat": │       │"apple"│       │"dog": │       │"bird":│
      │  -    │ 5     │  -    │:"red" │  -    │ 3     │  -    │ 2     │
      └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

hash("apple") % 8 = 3
hash("cat") % 8 = 1
hash("dog") % 8 = 5
hash("bird") % 8 = 7

2. Hash Functions

A hash function converts a key into an array index.

Properties of Good Hash Functions

Deterministic: Same input always produces same output
Uniform Distribution: Spreads keys evenly across buckets
Fast to Compute: O(1) for practical purposes
Minimizes Collisions: Different keys should map to different indices

Simple Hash Functions

def hash_string(s, table_size):
    """Simple string hash"""
    total = 0
    for char in s:
        total += ord(char)
    return total % table_size

# Problem: "abc" and "cba" hash to same value!

Better Hash Functions

def hash_string_polynomial(s, table_size):
    """Polynomial rolling hash"""
    hash_value = 0
    base = 31
    for i, char in enumerate(s):
        hash_value += ord(char) * (base ** i)
    return hash_value % table_size

# "abc" ≠ "cba" with this function

Python's Built-in hash()

print(hash("hello"))     # Large integer
print(hash(42))          # 42 (integers hash to themselves)
print(hash((1, 2, 3)))   # Tuples are hashable
# hash([1, 2, 3])        # Error! Lists are not hashable

Note: Python's hash() can return different values between runs (for security). For hash tables, use hash(key) % table_size.

3. Collision Handling

A collision occurs when two different keys hash to the same index.

Method 1: Chaining (Separate Chaining)

Each bucket stores a linked list (or other collection) of entries.

Index 3: [ ("apple", "red") -> ("grape", "purple") -> None ]

class HashTableChaining:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _hash(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        index = self._hash(key)
        bucket = self.buckets[index]

        # Update if key exists
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return

        # Add new entry
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        index = self._hash(key)
        bucket = self.buckets[index]

        for k, v in bucket:
            if k == key:
                return v

        raise KeyError(key)

    def remove(self, key):
        index = self._hash(key)
        bucket = self.buckets[index]

        for i, (k, v) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return

        raise KeyError(key)

    def __contains__(self, key):
        try:
            self.get(key)
            return True
        except KeyError:
            return False

Method 2: Open Addressing (Linear Probing)

On collision, find the next empty slot.

Insert "apple" at index 3
Insert "grape" (also hashes to 3) → Try 4, 5, ... until empty

Index:  0       1       2       3       4       5
      ┌───────┬───────┬───────┬───────┬───────┬───────┐
      │       │       │       │"apple"│"grape"│       │
      └───────┴───────┴───────┴───────┴───────┴───────┘

class HashTableOpenAddressing:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()  # Tombstone marker

    def _hash(self, key):
        return hash(key) % self.capacity

    def _probe(self, key):
        """Find index for key using linear probing"""
        index = self._hash(key)
        first_deleted = None

        for _ in range(self.capacity):
            if self.keys[index] is None:
                return first_deleted if first_deleted is not None else index
            if self.keys[index] is self.DELETED:
                if first_deleted is None:
                    first_deleted = index
            elif self.keys[index] == key:
                return index
            index = (index + 1) % self.capacity

        return first_deleted  # Table full but found deleted slot

    def put(self, key, value):
        if self.size >= self.capacity * 0.7:  # Load factor threshold
            self._resize()

        index = self._probe(key)
        if self.keys[index] != key:  # New key
            self.size += 1
        self.keys[index] = key
        self.values[index] = value

    def get(self, key):
        index = self._hash(key)

        for _ in range(self.capacity):
            if self.keys[index] is None:
                raise KeyError(key)
            if self.keys[index] == key:
                return self.values[index]
            index = (index + 1) % self.capacity

        raise KeyError(key)

    def remove(self, key):
        index = self._hash(key)

        for _ in range(self.capacity):
            if self.keys[index] is None:
                raise KeyError(key)
            if self.keys[index] == key:
                self.keys[index] = self.DELETED
                self.values[index] = None
                self.size -= 1
                return
            index = (index + 1) % self.capacity

        raise KeyError(key)

    def _resize(self):
        old_keys = self.keys
        old_values = self.values

        self.capacity *= 2
        self.keys = [None] * self.capacity
        self.values = [None] * self.capacity
        self.size = 0

        for i, key in enumerate(old_keys):
            if key is not None and key is not self.DELETED:
                self.put(key, old_values[i])

Other Probing Methods

Quadratic Probing: Try index + 1², index + 2², index + 3², ...

# Reduces clustering but can miss empty slots
index = (original_index + i**2) % capacity

Double Hashing: Use a second hash function

# More uniform distribution
step = hash2(key)
index = (original_index + i * step) % capacity

4. Load Factor and Resizing

Load Factor

Load Factor (α) = n / m

n = number of entries
m = number of buckets

Higher load factor → more collisions → slower performance

When to Resize

Chaining: Resize when α > 1 (or some threshold like 0.75)
Open Addressing: Must resize before full; typically at α = 0.5-0.7

Resizing Process

Create new, larger array (typically 2× size)
Rehash all existing entries
Replace old array with new

def _resize(self):
    old_buckets = self.buckets
    self.capacity *= 2
    self.buckets = [[] for _ in range(self.capacity)]
    self.size = 0

    for bucket in old_buckets:
        for key, value in bucket:
            self.put(key, value)  # Rehash with new capacity

5. Time Complexity

Operation	Average	Worst (many collisions)
Insert	O(1)	O(n)
Lookup	O(1)	O(n)
Delete	O(1)	O(n)

With a good hash function and reasonable load factor, worst case is rare.

Space Complexity: O(n) for n entries

6. Hash Tables in Different Languages

Python (dict)

# Create
d = {}
d = {'a': 1, 'b': 2}
d = dict(a=1, b=2)

# Operations
d['c'] = 3           # Insert/Update
value = d['a']       # Access (KeyError if missing)
value = d.get('x', default)  # Access with default
'a' in d             # Check existence
del d['a']           # Delete

# Iteration
for key in d:
    print(key)
for key, value in d.items():
    print(key, value)

Java (HashMap)

HashMap<String, Integer> map = new HashMap<>();
map.put("a", 1);
int value = map.get("a");       // null if missing
int value = map.getOrDefault("x", 0);
map.containsKey("a");
map.remove("a");

for (Map.Entry<String, Integer> entry : map.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue());
}

JavaScript (Object and Map)

// Object (keys converted to strings)
let obj = { a: 1, b: 2 };
obj['c'] = 3;
obj.d = 4;

// Map (any type as key)
let map = new Map();
map.set('a', 1);
map.get('a');
map.has('a');
map.delete('a');

7. Common Operations and Patterns

Counting Occurrences

def count_chars(s):
    counts = {}
    for char in s:
        counts[char] = counts.get(char, 0) + 1
    return counts

# Or using Counter
from collections import Counter
counts = Counter(s)

Grouping by Key

def group_by_length(words):
    groups = {}
    for word in words:
        length = len(word)
        if length not in groups:
            groups[length] = []
        groups[length].append(word)
    return groups

# Or using defaultdict
from collections import defaultdict
groups = defaultdict(list)
for word in words:
    groups[len(word)].append(word)

Two Sum Problem

def two_sum(nums, target):
    """Find indices of two numbers that add to target"""
    seen = {}  # value -> index

    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i

    return None

# O(n) time, O(n) space

Finding Duplicates

def find_duplicates(nums):
    seen = set()
    duplicates = set()

    for num in nums:
        if num in seen:
            duplicates.add(num)
        seen.add(num)

    return list(duplicates)

Anagram Grouping

def group_anagrams(words):
    groups = defaultdict(list)

    for word in words:
        # Sorted letters as key
        key = tuple(sorted(word))
        groups[key].append(word)

    return list(groups.values())

words = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(words))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]

8. Hash Sets

A hash set is like a hash table but only stores keys (no values).

# Python set
s = set()
s = {1, 2, 3}

s.add(4)         # Insert
s.remove(1)      # Delete (KeyError if missing)
s.discard(10)    # Delete (no error if missing)
1 in s           # Membership test
len(s)           # Size

# Set operations
a = {1, 2, 3}
b = {2, 3, 4}
a | b            # Union: {1, 2, 3, 4}
a & b            # Intersection: {2, 3}
a - b            # Difference: {1}
a ^ b            # Symmetric difference: {1, 4}

9. Specialized Hash Maps

OrderedDict

Maintains insertion order (Python dict does this by default since 3.7).

from collections import OrderedDict
od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['c'] = 3
# Iteration is in insertion order

defaultdict

Returns default value for missing keys.

from collections import defaultdict

# List as default
dd = defaultdict(list)
dd['key'].append(1)  # No KeyError!

# Int as default (useful for counting)
dd = defaultdict(int)
dd['count'] += 1     # Starts at 0

Counter

Specialized for counting.

from collections import Counter

c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(c)                # Counter({'a': 3, 'b': 2, 'c': 1})
print(c.most_common(2)) # [('a', 3), ('b', 2)]
print(c['a'])           # 3
print(c['x'])           # 0 (not KeyError)

10. When to Use Hash Tables

Use Hash Tables When:

You need O(1) average lookup by key
You need to count occurrences
You need to detect duplicates
You need to group items by some property
Order doesn't matter

Consider Alternatives When:

You need ordered data → TreeMap/sorted structure
Keys are small integers → Direct array indexing
Memory is very constrained → Array-based alternatives
Worst-case guarantees needed → Balanced BST

Exercises

Basic

Implement a hash set (keys only, no values).
Given two arrays, find their intersection.
Check if a string has all unique characters using a hash set.

Intermediate

Implement an LRU (Least Recently Used) cache using a hash map and doubly linked list.
Given an array, find the length of the longest consecutive sequence (e.g., [100, 4, 200, 1, 3, 2] → 4, for [1,2,3,4]).
Find all pairs of integers that sum to a given target.

Advanced

Design a hash map that supports O(1) getRandom() in addition to insert, delete, and lookup.
Implement consistent hashing for distributed cache.
Given a string, find the first non-repeating character.

Summary

Hash tables provide O(1) average case for insert, lookup, delete
Hash functions convert keys to array indices
Collisions handled via chaining or open addressing
Load factor affects performance; resize when needed
Python: dict, set, defaultdict, Counter
Use when you need fast key-based access
Watch for worst-case O(n) with many collisions

Next Reading

Trees and Binary Search Trees →