Databases: Choice, Replication, Sharding
This chapter covers picking SQL vs NoSQL, replication topologies, partitioning strategies, and the sharding decisions that shape everything downstream.
Picking a Database
You already have a relational default: PostgreSQL (or MySQL). Start there unless you have a specific reason not to.
Reasons to pick something else:
- Key-value, huge scale, simple access. DynamoDB, Cassandra.
- Search. Elasticsearch, OpenSearch, Meilisearch.
- Time series. TimescaleDB, InfluxDB, Prometheus (for metrics).
- Graph queries with 3+ hops. Neo4j, TigerGraph.
- Analytics. ClickHouse, BigQuery, Snowflake.
- Embedded. SQLite.
Most teams end up with two or three: a primary relational DB for everything, a cache (Redis), and one specialized store (search or analytics). Resist the urge to add a fourth unless there's pain.
See the databases/ tutorial in this repo for PostgreSQL coverage. This chapter focuses on scaling patterns that apply to most database choices.
Replication
Replication keeps copies of the data on multiple machines. The two purposes:
- Availability: if the primary dies, a replica takes over.
- Read scaling: replicas serve read traffic, freeing the primary for writes.
Primary-Replica
The most common topology. One primary accepts writes; replicas follow via replication logs.
         writes
Client ─────────► Primary
                     │
                     │ replication log
                     ▼
                  Replica ◄── reads
                  Replica ◄── reads
                  Replica ◄── reads
Replication is usually asynchronous: the primary acknowledges the write, then pushes to replicas. Fast, but replicas lag.
Options for stronger guarantees:
- Synchronous replication. Primary waits for at least one replica to ack. Slower, but the write survives primary crash.
- Semi-synchronous. Wait for a replica's ack, but fall back to async if the replica doesn't respond within a timeout. A common middle ground.
Replication Lag
Async replicas are behind the primary by milliseconds to seconds (occasionally much more). Practical consequences:
- Reads from a replica can return stale data.
- A user who writes a value, then reads it on a different connection, may get the old value.
Common mitigations:
- Read-your-writes: route a user's reads to the primary for a short time after their write.
- Monotonic reads: make sure a user is never served from a replica that's "behind" one they previously saw.
- Causal consistency (Chapter 7): track causality so session reads are always at or after the previous one.
Chapter 7 covers this formally.
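The read-your-writes mitigation can be sketched in a few lines. This is a sketch under assumptions: `primary` and `replica` stand in for real connection handles, and the pin window (5 seconds here) is a hypothetical value that must exceed your typical replication lag.

```python
import time

class SessionRouter:
    """Route a user's reads to the primary for a short window after
    their last write; otherwise serve from a replica."""

    def __init__(self, primary, replica, pin_seconds=5.0):
        self.primary = primary        # placeholder for a primary connection
        self.replica = replica        # placeholder for a replica connection
        self.pin_seconds = pin_seconds
        self.last_write = {}          # user_id -> timestamp of last write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def connection_for_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.pin_seconds:
            return self.primary  # replicas may still be behind this user's write
        return self.replica
```

In a real app the `last_write` map would live in a session store or cookie, not process memory, so the pin survives across app servers.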
Multi-Primary (Multi-Master)
Multiple primaries accept writes. They replicate to each other.
Problems:
- Write conflicts. A user updates the same row on two primaries simultaneously. Who wins?
- Complexity. Conflict detection, resolution, and monitoring are hard.
Most teams don't need multi-primary. Use it only when one of these is true:
- You need active-active writes across regions and latency demands it.
- You're using a database designed for it (Cassandra, DynamoDB, CRDTs). These handle conflict resolution at the data model level.
Otherwise, stick to primary-replica.
Failover
When the primary dies, you need to promote a replica.
Key questions:
- Who decides? (Automated via Patroni, Orchestrator, cloud-managed failover; or manual.)
- Does the promoted replica have all the writes? (With async replication, no.)
- How do clients learn the new primary's address? (DNS, service discovery, a proxy.)
Managed databases (RDS, Cloud SQL, Neon) handle most of this. If you're running your own, invest in automation; a 3am outage is the wrong time to figure out manual failover.
Partitioning
Replication scales reads. It doesn't scale writes. Every write still goes through the primary.
To scale writes, partition (often called sharding): split the data across multiple primaries, each responsible for a subset of the data.
         ┌──► shard 1 (users A-H)
Client ──┼──► shard 2 (users I-P)
         └──► shard 3 (users Q-Z)
Each shard has its own primary and replicas. Writes for "Ada" go to shard 1; writes for "Ivy" go to shard 2. Total write throughput is the sum of per-shard throughput.
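The routing rule in the diagram is just a lookup on the first letter. A minimal sketch, using the hypothetical A-H / I-P / Q-Z split above:

```python
def shard_for_user(name):
    """Map a username to one of three shards by first letter,
    mirroring the range split in the diagram above."""
    first = name[0].upper()
    if "A" <= first <= "H":
        return 1
    if "I" <= first <= "P":
        return 2
    return 3
```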
Sharding Strategies
Three basic ways to decide which shard a key lives on, plus consistent hashing as a refinement of hash partitioning.
Range Partitioning
Shard 1 holds IDs 0 to 1M, shard 2 holds 1M to 2M, and so on.
Pros: range queries are efficient (WHERE id BETWEEN 500k AND 600k hits one shard).
Cons: hotspots. If IDs are sequential (auto-increment), all new writes go to the shard holding the latest range.
Good fit: time-series data (each day is its own shard).
Hash Partitioning
Shard = hash(key) mod N. Keys are spread evenly across shards.
Pros: no hotspots (hash distributes writes).
Cons: range queries span all shards (WHERE id BETWEEN ... hits everywhere).
Good fit: point lookups where the key is well-distributed.
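A sketch of hash partitioning. Note it uses a stable hash (MD5 here) rather than Python's built-in `hash()`, which is randomized per process and would route the same key to different shards after a restart:

```python
import hashlib

def shard_for_key(key, num_shards):
    """Hash partitioning: shard = hash(key) mod N, with a hash that is
    stable across processes and restarts."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```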
Directory / Lookup
A separate service maps key to shard. Write a user, look up their shard, go there.
Pros: flexible. Can rebalance without changing the hash.
Cons: another hop, another service to run.
Good fit: when shard assignments must be adjustable at runtime (e.g. moving a large tenant to a dedicated shard).
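A directory can be as simple as an override map in front of a default rule. A sketch, with an in-process dict standing in for what would really be a replicated lookup service:

```python
import zlib

class ShardDirectory:
    """Directory-based sharding sketch: explicit overrides win,
    everything else falls through to a stable hash rule."""

    def __init__(self, default_shards=4):
        self.default_shards = default_shards
        self.overrides = {}  # tenant_id -> explicitly assigned shard

    def shard_for(self, tenant_id):
        if tenant_id in self.overrides:
            return self.overrides[tenant_id]
        # crc32 is stable across processes, unlike Python's hash()
        return zlib.crc32(tenant_id.encode()) % self.default_shards

    def move(self, tenant_id, new_shard):
        """Rebalance one tenant, e.g. onto a dedicated shard."""
        self.overrides[tenant_id] = new_shard
```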
Consistent Hashing
The hash ring: hash both keys and shards onto a ring; each key belongs to the next shard clockwise. When a shard is added or removed, only 1/N of keys re-map.
Used by Cassandra, DynamoDB's partition map, and most cache clusters.
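A minimal hash ring, without the virtual nodes that real implementations add per physical node for smoother balance:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring sketch: nodes and keys hash onto the same
    space; a key belongs to the first node clockwise from its hash."""

    def __init__(self, nodes=()):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _pos(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self.ring, (self._pos(node), node))

    def remove(self, node):
        self.ring.remove((self._pos(node), node))

    def node_for(self, key):
        idx = bisect.bisect(self.ring, (self._pos(key), ""))
        if idx == len(self.ring):
            idx = 0  # wrap around the ring
        return self.ring[idx][1]
```

The key property: removing a node only remaps the keys that lived on it; every other key keeps its assignment.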
The Cross-Shard Pain
Once data is sharded, operations that cross shards are expensive or impossible.
Joins
SELECT ... FROM orders JOIN users ... where orders and users are on different shards requires one of:
- Fetching data from both shards and joining in application code.
- Denormalizing (copy the user's name onto the order row).
Most sharded systems denormalize. Joins across shards are painful and slow.
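The application-code option looks like this in practice: two round-trips, then an in-memory stitch. `fetch_orders` and `fetch_users` are hypothetical per-shard query helpers, not a real driver API:

```python
def orders_with_names(fetch_orders, fetch_users, user_ids):
    """Application-side join across shards: fetch rows from each
    shard, then join them in memory."""
    orders = fetch_orders(user_ids)                       # shard(s) holding orders
    users = fetch_users({o["user_id"] for o in orders})   # shard(s) holding users
    name_by_id = {u["id"]: u["name"] for u in users}
    return [{**o, "user_name": name_by_id[o["user_id"]]} for o in orders]
```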
Transactions
A transaction that spans shards is a distributed transaction. Two-phase commit works but is slow and fragile. Most sharded databases don't support them; design around them (e.g. use sagas, Chapter 10).
Foreign Keys
FK constraints across shards don't exist. You enforce referential integrity in application code, or you pick a shard key that keeps related rows together.
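Enforcing the constraint in application code means checking the parent row on its shard before writing the child on another. A sketch with in-memory stand-ins for shard connections; note the check-then-insert is racy without extra coordination (the user could be deleted between the two steps):

```python
class FakeShard:
    """In-memory stand-in for a shard connection (illustration only)."""
    def __init__(self):
        self.tables = {}

    def exists(self, table, row_id):
        return row_id in self.tables.get(table, {})

    def insert(self, table, row_id, row):
        self.tables.setdefault(table, {})[row_id] = row

def create_order(user_shard, order_shard, order_id, user_id, total):
    # Application-enforced "foreign key": verify the parent exists on
    # its shard before writing the child row on a different shard.
    if not user_shard.exists("users", user_id):
        raise ValueError(f"user {user_id} does not exist")
    order_shard.insert("orders", order_id, {"user_id": user_id, "total": total})
```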
Shard Key Choice
Pick a shard key such that the operations you care about are within a single shard.
For an e-commerce app:
- Shard by user_id. All of a user's orders are on one shard. Order listings, user profile, and account dashboards are single-shard queries. Cross-user queries (leaderboards, analytics) are cross-shard and need a different tool.
- Shard by order_id. Order lookups by ID are fast. User-centric queries fan out everywhere.
User-centric sharding wins for most product apps. Pick the dimension your main workload groups by.
Rebalancing
You've outgrown N shards and need N+1. Moving data around is called rebalancing.
Naive approach: change hash(key) mod N to hash(key) mod (N+1). Now every key has a new shard. Catastrophic.
Common approaches:
- Consistent hashing minimizes movement: only 1/N of keys move.
- Fixed partition count. Create 1000 "virtual partitions" up front, map them to physical shards. When you add a shard, move some virtual partitions, not re-hash every key.
- Directory-based. Move tenants individually; the lookup service records the new home.
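The fixed-partition-count approach can be sketched in a few lines: keys hash to one of 1000 partitions forever, and only the partition-to-shard map changes during a rebalance (the shard counts here are hypothetical):

```python
import zlib

NUM_PARTITIONS = 1000  # fixed for the lifetime of the system

def partition_for(key):
    """Key -> virtual partition; this mapping never changes."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Partition -> physical shard; ONLY this map changes when rebalancing.
partition_map = {p: p % 3 for p in range(NUM_PARTITIONS)}  # start with 3 shards

def shard_for(key):
    return partition_map[partition_for(key)]

# Adding a 4th shard: reassign a quarter of the partitions to it.
# Keys are never re-hashed; moved partitions are copied shard-to-shard.
for p in range(0, NUM_PARTITIONS, 4):
    partition_map[p] = 3
```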
Rebalancing takes hours to days for large datasets. Monitor bandwidth. Rate-limit the move. Plan for it to be interruptible.
When Not to Shard
Sharding is not free. A sharded system:
- Is much harder to operate.
- Forces denormalization.
- Loses cross-shard joins and transactions.
- Usually needs a proxy or custom client logic.
Scale vertically first. Use read replicas aggressively. Only shard when one database (with plenty of RAM, SSDs, and tuning) can't handle the write load.
Most applications never need to shard. A well-tuned PostgreSQL on modern hardware handles tens of thousands of writes per second, which covers most businesses.
Popular Databases, Briefly
Quick guide. Each has a tutorial's worth of nuance.
- PostgreSQL: Relational, rich SQL, huge ecosystem. The default. Replicas, partitioning, FDW.
- MySQL: Relational, widely hosted. Slightly less feature-rich than Postgres, easier to run at scale.
- SQLite: Embedded, single-file. Great for mobile, CLI tools, tests, small sites.
- MongoDB: Document store, flexible schema. JSONB in Postgres is often enough.
- Cassandra: Wide-column, multi-primary, linearly scalable for writes. The ops complexity is real.
- DynamoDB: Managed key-value + document store. Predictable performance, pay-per-use. AWS lock-in.
- Redis: In-memory data structures. Cache + queue + pub/sub + rich data types. Not primary storage.
- Elasticsearch: Full-text search, analytics. Not a primary DB; pair it with something durable.
- ClickHouse: Column store, blazing-fast analytics. Sits next to your OLTP database, not instead of it.
- TimescaleDB: Time series on Postgres. The first choice if you already run Postgres.
- BigQuery: Serverless columnar warehouse. Analytics at scale, pay-per-query.
Common Pitfalls
- Sharding before you need to. Every downstream decision gets harder. Scale vertically and add read replicas first.
- Picking the wrong shard key. Hard to change later. Model your main queries before committing.
- Ignoring replication lag. Users see their own write, then don't on the next page load. Route post-write reads to the primary, or use bounded-staleness reads.
- No failover plan. The primary dies, and someone has to decide what to do in the first 5 minutes. Write the runbook before the outage.
- Too many connections. Every app server keeps a large connection pool, so scaling the app tier saturates the DB. Use PgBouncer or an equivalent pooler.
Next Steps
Continue to 06-queues-and-events.md to handle work that doesn't fit in a request/response.