Arrays and Hashing: The Everyday Heroes

This chapter covers arrays, hash tables, and the O(1) lookup patterns that solve more problems than you'd guess.

Arrays (Python Lists)

A Python list is a dynamic array: contiguous memory, indexed from 0, with spare capacity held in reserve. When it fills, CPython allocates a larger block and copies (growing by roughly an eighth at a time, not doubling), which keeps appends amortized O(1).

arr = [3, 1, 4, 1, 5, 9, 2, 6]

print(arr[0])       # 3
print(arr[-1])      # 6 (negative indexing from the end)
print(arr[2:5])     # [4, 1, 5] (slice)
arr.append(10)      # amortized O(1)
arr.insert(0, 0)    # O(n): everything shifts right
arr.pop()           # O(1) from the end
arr.pop(0)          # O(n): everything shifts left

Key complexities:

arr[i]             O(1)          index access
arr[i] = x         O(1)          index assignment
len(arr)           O(1)          size is tracked
arr.append(x)      O(1)          amortized; occasional resize
arr.pop()          O(1)
arr.insert(i, x)   O(n)          shift everything right of i
arr.pop(i)         O(n)          shift everything right of i left
x in arr           O(n)          linear scan
sorted(arr)        O(n log n)    Timsort
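You can watch the "amortized; occasional resize" row in action with sys.getsizeof: the reported size stays flat between resizes and jumps only when the list re-allocates. A quick sketch (the exact byte counts vary by Python version and platform):

```python
import sys

arr = []
sizes = []
for i in range(64):
    arr.append(i)
    sizes.append(sys.getsizeof(arr))

# Far fewer distinct sizes than appends: most appends reused
# spare capacity and were true O(1); only a handful resized.
distinct = sorted(set(sizes))
print(f"{len(sizes)} appends, {len(distinct)} distinct sizes")
```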

If you insert at the front a lot, don't use a list. Use a deque from collections:

from collections import deque

dq = deque([1, 2, 3])
dq.appendleft(0)     # O(1)
dq.popleft()         # O(1)

deque is O(1) at both ends. It's O(n) for random index access, so it's a different tool.
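deque also takes a maxlen argument, which turns it into a fixed-size sliding window: once full, each append on the right silently evicts the oldest item on the left. A small illustrative sketch:

```python
from collections import deque

window = deque(maxlen=3)   # keeps only the 3 most recent items
for x in [1, 2, 3, 4, 5]:
    window.append(x)       # when full, the leftmost item is evicted

print(list(window))        # [3, 4, 5]
```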

Why Lists Are Usually Fast

arr[i] is O(1) because the list stores a contiguous array of pointers; the interpreter computes the address of slot i directly (base + i × pointer size) and jumps there. No scanning.

Iteration is cache-friendly: consecutive slots of the pointer array sit next to each other in memory, so scanning them hits CPU cache nicely (the objects they point to may be scattered, but the scan itself is sequential). Linked lists (Chapter 4) don't even have this property.

Hash Tables: Dict and Set

Hash tables are the other pillar. In Python, dict and set.

# dict: key → value
users = {"ada": 36, "grace": 41}
print(users["ada"])            # 36, O(1)
users["alan"] = 42             # O(1)
del users["grace"]             # O(1)
print("ada" in users)          # True, O(1)

# set: unique keys, no values
seen = {1, 2, 3}
seen.add(4)                    # O(1)
seen.discard(2)                # O(1)
print(1 in seen)               # True, O(1)

All primary ops are amortized O(1). Occasionally they're O(n) due to hash collisions or resizing, but across many ops, the average is constant.

How Hash Tables Work (Briefly)

A hash function maps a key to an integer; reducing that integer modulo the table size gives an index into an array, ideally spread uniformly across the slots. Lookups jump straight to that index.

Two keys might hash to the same index (a collision). Implementations handle this with chaining (linked list at each slot) or open addressing (probe other slots). Either way, well-distributed hashes make the average O(1).

Bad hash functions (or adversarial inputs) can degrade to O(n). Python's hash randomization (PYTHONHASHSEED) mitigates most adversarial cases.
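To make chaining concrete, here's a toy map: hash the key, take it modulo the bucket count, and scan the short chain in that bucket. A minimal sketch of the idea only, not how CPython implements dict (CPython uses open addressing):

```python
class ToyMap:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]  # one chain per slot

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key exists: overwrite
                return
        chain.append((key, value))       # collision or new key: extend chain

    def get(self, key, default=None):
        for k, v in self._bucket(key):   # scan only this bucket's chain
            if k == key:
                return v
        return default

m = ToyMap()
m.put("ada", 36)
m.put("grace", 41)
print(m.get("ada"))   # 36
```

With a decent hash, chains stay short (average length = items / buckets), which is where the average O(1) comes from; a real implementation also resizes when the load factor climbs.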

Counting with Dicts

The canonical dict pattern: count occurrences.

text = "mississippi"
counts = {}
for c in text:
    counts[c] = counts.get(c, 0) + 1
print(counts)   # {'m': 1, 'i': 4, 's': 4, 'p': 2}

dict.get(key, default) avoids the "does it exist?" check.

Cleaner: collections.Counter:

from collections import Counter

counts = Counter("mississippi")
print(counts)                 # Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
print(counts.most_common(2))  # [('i', 4), ('s', 4)]

Counter is a dict with counting helpers. Use it whenever you're counting.
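Counter also supports arithmetic: subtracting one Counter from another keeps only the positive counts, which answers "can I spell this word with these letters?" in one line. A small sketch (can_spell is an illustrative helper, not a stdlib function):

```python
from collections import Counter

def can_spell(word, letters):
    # Counter subtraction drops nonpositive counts, so the result
    # is empty exactly when `letters` covers every letter of `word`.
    return not (Counter(word) - Counter(letters))

print(can_spell("miss", "mississippi"))   # True
print(can_spell("misty", "mississippi"))  # False: no 't' or 'y'
```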

Grouping

from collections import defaultdict

by_letter = defaultdict(list)
for word in ["apple", "ant", "banana", "berry"]:
    by_letter[word[0]].append(word)

print(dict(by_letter))
# {'a': ['apple', 'ant'], 'b': ['banana', 'berry']}

defaultdict(list) auto-creates empty lists. Saves the if key not in d: d[key] = [] dance.
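If you'd rather not import anything, dict.setdefault does the same job inline, at a small cost in readability:

```python
by_letter = {}
for word in ["apple", "ant", "banana", "berry"]:
    # setdefault returns the existing list for the key,
    # or inserts (and returns) a fresh empty one.
    by_letter.setdefault(word[0], []).append(word)

print(by_letter)
# {'a': ['apple', 'ant'], 'b': ['banana', 'berry']}
```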

The Two Sum Problem

The interview classic. Given an array and a target, find two indices whose values sum to target.

Brute force: O(n²)

def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None  # no pair sums to target

Hash lookup: O(n)

def two_sum(nums, target):
    seen = {}    # value → index
    for i, x in enumerate(nums):
        need = target - x
        if need in seen:
            return (seen[need], i)
        seen[x] = i
    return None  # no pair sums to target

Same problem, different data structure, a whole complexity class faster: O(n²) down to O(n). The question changes from "what pair sums to target?" to "given x, what value would complete it, and have I seen it?"

This is the core of hash-based patterns: trade O(n) space for an O(n) time solution instead of O(n²).
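The same trade shows up beyond pairs. As one more instance of the pattern (a sketch of a neighboring problem, not discussed above): counting subarrays that sum to a target is O(n²) with nested loops, but O(n) with a dict of prefix sums seen so far.

```python
from collections import defaultdict

def count_subarrays(nums, target):
    # prefix[j] - prefix[i] == target  ⟺  subarray (i, j] sums to target,
    # so at each running total ask: how many earlier prefixes equal
    # (total - target)?
    seen = defaultdict(int)
    seen[0] = 1            # the empty prefix
    total = count = 0
    for x in nums:
        total += x
        count += seen[total - target]
        seen[total] += 1
    return count

print(count_subarrays([1, 2, 3, -3, 3], 3))   # 5
```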

Sets for Uniqueness

Duplicates check:

def has_duplicate(nums):
    return len(nums) != len(set(nums))

Fast, obvious, O(n).

Intersection:

def common(a, b):
    return set(a) & set(b)

Building the two sets costs O(len(a) + len(b)); the & itself averages O(min(len(a), len(b))), so the whole function is O(len(a) + len(b)).

When Hashing Hurts

Hash tables aren't free. They use extra memory: slots are deliberately kept partly empty to hold the load factor down. Hashing large or complex keys takes time proportional to the key's size. And although iteration order is insertion order in Python 3.7+, there's no fast "next smallest" or "range" query; for those you want a tree (Chapter 7).

Don't reach for a dict when:

  • You need sorted iteration: use a sorted list or tree.
  • You need range queries: use a sorted list (with bisect) or tree.
  • Memory is tight and lookup isn't the bottleneck: use an array.
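For the range-query case, bisect on a sorted list is the usual dict-free answer: two binary searches bracket the range, so each query is O(log n). A sketch:

```python
import bisect

def count_in_range(sorted_nums, lo, hi):
    # bisect_left finds the first index with value >= lo;
    # bisect_right finds the first index with value > hi;
    # their difference is the number of values in [lo, hi].
    left = bisect.bisect_left(sorted_nums, lo)
    right = bisect.bisect_right(sorted_nums, hi)
    return right - left

nums = sorted([3, 1, 4, 1, 5, 9, 2, 6])   # [1, 1, 2, 3, 4, 5, 6, 9]
print(count_in_range(nums, 2, 5))          # 4 → the values 2, 3, 4, 5
```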

Common Patterns

Seen-Before

def first_duplicate(nums):
    seen = set()
    for x in nums:
        if x in seen:
            return x
        seen.add(x)
    return None

Frequency

from collections import Counter

def most_common_char(s):
    return Counter(s).most_common(1)[0][0]

Index Lookup (value → where it lives)

def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], i)
        seen[x] = i
    return None

Grouping by Canonical Form

from collections import defaultdict

def group_anagrams(words):
    groups = defaultdict(list)
    for w in words:
        key = tuple(sorted(w))
        groups[key].append(w)
    return list(groups.values())

Canonical form (sorted letters) as the key. Anagram groups fall out.
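Sorting each word makes the key cost O(k log k) per word. For long words, a letter-count tuple is an O(k) canonical form instead (same idea, cheaper key; this variant assumes lowercase a–z input):

```python
from collections import defaultdict

def group_anagrams(words):
    groups = defaultdict(list)
    for w in words:
        counts = [0] * 26                 # assumes lowercase a-z
        for c in w:
            counts[ord(c) - ord("a")] += 1
        groups[tuple(counts)].append(w)   # tuple: lists aren't hashable
    return list(groups.values())

print(group_anagrams(["eat", "tea", "tan", "ate", "nat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat']]
```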

Common Pitfalls

Using a list when you meant a set. x in list is O(n). x in set is O(1). Converting a list to a set before checking membership pays off at any non-trivial size.

Mutable keys. Lists, dicts, and sets aren't hashable (they're mutable). Tuples and frozensets are. Convert first.
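The fix is mechanical: convert the mutable container to its immutable counterpart before using it as a key. A quick illustration:

```python
d = {}

key = [1, 2, 3]
try:
    d[key] = "nope"                 # lists are unhashable
except TypeError as e:
    print(e)                        # unhashable type: 'list'

d[tuple(key)] = "fine"              # tuples hash by content
d[frozenset({1, 2})] = "also fine"  # frozenset for set-like keys
print(d)
```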

Assuming dict iteration order. Python 3.7+ guarantees insertion order, but that's a Python-specific promise; hash maps in most other languages make no ordering guarantee at all.

Hashing complex objects expensively. Each hash call runs your __hash__. If it's expensive (walks the whole object), the "O(1)" becomes "O(size of key)".

Over-indexing. A dict mapping user_id to user object when you only ever iterate all users is wasteful. Simpler structures are fine.

Next Steps

Continue to 04-linked-lists.md for pointer-based structures.