Consistency and Consensus: The Theory That Bites
This chapter covers CAP, PACELC, quorum-based replication, the basics of Raft, and the practical flavors of consistency you'll pick between.
What "Consistency" Even Means
The word "consistency" is overloaded. Three distinct meanings you'll encounter:
- Database consistency (ACID's C). Constraints hold after a transaction. Foreign keys, CHECK constraints. Not what this chapter is about.
- Replica consistency. Multiple copies of data agree with each other. Core topic here.
- Consistency models. The ordering guarantees a system provides across operations. The rest of this chapter.
This chapter uses "consistency" to mean the last two.
CAP, Briefly and Honestly
The CAP theorem, stated carelessly: "You can only have two of Consistency, Availability, Partition tolerance."
This is technically true and practically misleading. You don't choose between partition tolerance and the other two. Partitions happen in any real distributed system. The real choice is: during a partition, do I give up consistency or availability?
- CP systems: during a partition, reject some operations to preserve consistency. Writes might fail; reads might be stale or unavailable.
- AP systems: during a partition, keep serving everyone, accept that they might see different data.
When there isn't a partition, most systems give you both. CAP is only a useful frame during a failure.
Examples:
- CP: Spanner, etcd, ZooKeeper, HBase, single-primary SQL (during failover)
- AP: Cassandra, DynamoDB (default), CouchDB, Dynamo-style stores
Many real systems are tunable. DynamoDB has "strongly consistent reads" (CP-ish). Cassandra has QUORUM (CP-ish) and ONE (AP). Pick per operation, not per database.
PACELC: The Refinement
Partitions are rare. Most of the time, the interesting trade-off is about latency.
PACELC: "If there's a Partition, choose Availability or Consistency; Else, choose Latency or Consistency."
| System | During a partition | Else |
|---|---|---|
| Spanner | C | C (consistent, higher latency) |
| Dynamo | A | L (available, lower latency) |
| Cassandra | A | L (or C with QUORUM) |
| MongoDB | C | C (primary reads) or L (secondary reads) |
The real-world choice you make most often is L vs C: wait for a global quorum (high latency, strong consistency) or serve from the nearest replica (low latency, possibly stale).
Consistency Models, Ranked
From strongest to weakest.
Linearizability (Strong Consistency)
Operations appear to happen instantaneously, in some sequential order that matches the real-time order of non-concurrent operations.
In plain language: every reader sees the latest write immediately, as if there's only one copy of the data.
Expensive. Requires coordination. Examples: Spanner, etcd, a single-primary PostgreSQL.
Sequential Consistency
All operations happen in some total order, and each process sees its own operations in program order. But the global order doesn't have to match real time.
Weaker than linearizability, rarely useful in practice.
Causal Consistency
If operation A causes operation B (for example, a process reads A and then writes B), every process sees A before B. Concurrent operations can be seen in any order.
More permissive than linearizability; still rules out the "I see my reply before your message" class of bug. Pragmatic for social apps and collaborative tools.
Eventual Consistency
If no new writes happen, all replicas eventually converge. Anything goes in the meantime.
Weakest useful model. Fine for analytics, counters, status indicators. Dangerous for anything where staleness matters (billing, auth).
Session Guarantees
A middle ground. Within a user's session, you get stronger guarantees:
- Read-your-writes: after you write X, you read X.
- Monotonic reads: once you've seen a newer version, you don't see an older one.
- Monotonic writes: your writes appear in the order you made them.
- Writes follow reads: if you read X and then write, your write is ordered after the write that produced X.
Most useful UX-level consistency lives here. Full linearizability is overkill for most features.
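One of these guarantees, monotonic reads, can be sketched with version tokens. The `Replica` class and version numbers here are hypothetical stand-ins, not a real client API: the client remembers the highest version it has seen, and a read is served only by a replica that has caught up at least that far.

```python
class Replica:
    def __init__(self, version, value):
        self.version = version  # how far this replica has replicated
        self.value = value

def monotonic_read(replicas, seen_version):
    """Return (version, value) from a replica at least as fresh as
    what the client has already seen, or None if none qualifies
    (a real system would fall back to the primary)."""
    for r in replicas:
        if r.version >= seen_version:
            return r.version, r.value
    return None
```

The client then carries `seen_version` forward with each response, which is exactly what session tokens in systems like Cosmos DB or MongoDB's causal sessions do under the hood.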
Read-Your-Writes in Practice
Most replica-based systems give eventual consistency by default. Users write, then immediately read, and see the old value. Angry ticket incoming.
Two mitigations:
Route Post-Write Reads to Primary
After a user writes, their next N reads go to the primary (not replicas). Simple, often enough.
```python
def read(user_id):
    # Pin reads to the primary for a short window after this
    # user's last write, so they always read their own writes.
    if recently_wrote(user_id):
        return primary.query(...)
    return replica.query(...)
```
Bounded Staleness
Replicas expose their replication lag. Read from any replica whose lag is under a threshold (say 100ms). If none is fresh enough, read from the primary.
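A sketch of that routing logic, assuming each replica exposes some lag metric — the `replication_lag()` method here is a hypothetical stand-in for whatever your database reports (e.g. seconds behind the primary):

```python
MAX_LAG_SECONDS = 0.1  # the 100 ms threshold from the text

def read_bounded(replicas, primary, run_query):
    # Use the first replica whose reported lag is under the
    # threshold; if none is fresh enough, fall back to the primary.
    for replica in replicas:
        if replica.replication_lag() <= MAX_LAG_SECONDS:
            return run_query(replica)
    return run_query(primary)
```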
Quorum
For systems with N replicas, define:
- W: replicas that must ack a write.
- R: replicas that must ack a read.
If R + W > N, every read quorum overlaps every write quorum, so a read contacts at least one replica that has the latest write. That's the quorum condition.
Common settings for N=3:

- W=1, R=1: Low latency, eventual consistency. Dynamo default.
- W=2, R=2: Quorum. Linearizable-ish, tolerates one failure.
- W=3, R=1: Reads are fast, writes are slow. Read-heavy systems.
- W=1, R=3: Writes are fast, reads are slow. Write-heavy systems (rare).
- W=3, R=3: All three replicas on every operation. No failure tolerance, highest consistency.
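The overlap condition is easy to check mechanically. A minimal sketch:

```python
# Quorum overlap check: with N replicas, a write acked by W nodes
# and a read from R nodes must share at least one node whenever
# R + W > N, so every read quorum sees the latest committed write.

def quorums_overlap(n, w, r):
    return r + w > n

# The N=3 settings from the list above:
assert quorums_overlap(3, 2, 2)      # W=2, R=2: reads see writes
assert not quorums_overlap(3, 1, 1)  # W=1, R=1: a read can miss a write
```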
Cassandra lets you choose per operation. DynamoDB exposes "strongly consistent reads" (quorum) vs "eventually consistent reads" (any replica).
Consensus: Getting Replicas to Agree
When you need a group of replicas to agree on a sequence of decisions (order of writes, leader election, config changes), you need a consensus algorithm.
Two matter in practice: Paxos and Raft. Raft was designed to be understandable. It is.
Raft, Skeletal
Three roles:
- Leader: accepts writes, replicates them.
- Follower: replicates from leader.
- Candidate: running for leader after detecting leader failure.
Two phases:
- Leader election. When the leader is silent for too long (heartbeat timeout), a follower becomes a candidate and requests votes. If a majority grants, it's the new leader.
- Log replication. Leader receives a write, appends it to its log, replicates to followers. When a majority has the entry, it's committed and applied.
Why it works: every elected leader has a log at least as up-to-date as the majority. Entries committed by a leader stay committed across leader changes. The algorithm is short; the proof is longer.
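The commit rule from the log replication phase can be written down directly. This is an illustration of the majority rule only, not a Raft implementation:

```python
# An entry is committed once it is stored on a majority of nodes.
# The leader tracks, per node, the highest log index known to be
# replicated there (its own log counts as one node).

def majority(n):
    return n // 2 + 1

def committed_index(match_index, n):
    """Highest log index stored on a majority of the n nodes."""
    return sorted(match_index, reverse=True)[majority(n) - 1]

# 5-node cluster: leader's log reaches index 7; followers have
# replicated up to 7, 5, 3, and 2. Index 5 is on three nodes
# (a majority), so everything through index 5 is committed.
print(committed_index([7, 7, 5, 3, 2], 5))  # → 5
```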
What Consensus Gets You
With a Raft cluster of N nodes, you can tolerate floor((N-1)/2) failures. Three nodes tolerate one; five tolerate two. The coordinated cluster presents a linearizable view of state.
Cost: writes require a round trip to a majority. In a single datacenter, ~1ms. Across regions, 50+ms. Consensus doesn't come free.
Systems built on Raft: etcd, Consul, CockroachDB, TiKV, parts of Kafka's newer metadata layer. Paxos powers Spanner and Chubby.
You Probably Don't Implement Consensus
Almost every system that needs consensus uses an existing implementation (etcd, Consul, ZooKeeper) or an embedded library instead of rolling its own. Those are battle-tested; rolling your own is a years-long project.
Use consensus when:
- You need leader election.
- You need distributed locks that actually work.
- You need a consistent config store.
- You're building a distributed database.
Don't use consensus when:
- A relational DB with leader failover suffices (most apps).
- You just need "eventually consistent" coordination (counters, caches).
Real Systems, Real Trade-offs
Quick tour of how common systems handle consistency.
| System | Consistency |
|---|---|
| PostgreSQL (primary) | Linearizable writes; reads from the primary are linearizable. |
| PostgreSQL (replica) | Asynchronous replication by default; replica reads can be stale (staleness can be bounded with lag monitoring). |
| DynamoDB (default reads) | Eventually consistent. |
| DynamoDB (strong reads) | Linearizable; costs ~2x the RCUs. |
| Cassandra (ONE) | Eventually consistent. |
| Cassandra (QUORUM) | Overlapping quorums when R+W > N; near-strong, but not fully linearizable (lightweight transactions provide that). |
| MongoDB (primary) | Linearizable, with majority write concern and linearizable read concern. |
| MongoDB (secondary) | Eventually consistent; can be tuned. |
| Spanner | Globally linearizable, via TrueTime. |
| etcd / Consul | Linearizable (Raft). |
| S3 | Strong read-after-write consistency since December 2020. |
| Redis (single instance) | Linearizable (single-threaded command execution). |
| Redis Cluster | Eventually consistent during failover; acknowledged writes can be lost. |
| Kafka | Ordered within a partition; messages can duplicate on failure. |
Common Pitfalls
Picking strong consistency everywhere. Latency goes up, availability goes down, and most of your operations don't need it. Relax where you can.
Picking eventual consistency everywhere. You'll ship bugs where a user writes and immediately reads the old value. Session guarantees are cheap; apply them.
Assuming the network is reliable. Partitions happen. Retries happen. Duplicates happen. Design for all three.
Trusting "exactly-once" without reading the fine print. Usually at-least-once with idempotency. Build idempotent consumers.
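An idempotent consumer is mostly deduplication by message ID. A sketch, with a plain set standing in for what would be a durable store in a real system:

```python
# At-least-once delivery plus dedup by message ID gives
# effectively-once processing.

processed_ids = set()

def handle(message_id, payload, apply_effect):
    if message_id in processed_ids:
        return False  # duplicate delivery: effect already applied
    apply_effect(payload)
    processed_ids.add(message_id)
    return True
```

Note the window between applying the effect and recording the ID: a crash there re-applies the effect on redelivery. That's why the dedup check and the side effect should share one transaction when the store allows it.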
Using a distributed lock where a database transaction suffices. Distributed locks are hard (fencing tokens, liveness, lease expiry). A unique constraint or row lock solves most problems.
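A sketch of the unique-constraint approach, using in-memory SQLite so it's self-contained (the `claims` table and job IDs are illustrative). Two workers race to claim the same job; the primary key lets exactly one INSERT succeed, with no lock service involved:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE claims (job_id TEXT PRIMARY KEY, worker TEXT)")

def try_claim(job_id, worker):
    try:
        db.execute("INSERT INTO claims VALUES (?, ?)", (job_id, worker))
        db.commit()
        return True   # this worker won the race
    except sqlite3.IntegrityError:
        return False  # someone already claimed the job

print(try_claim("job-1", "worker-a"))  # → True
print(try_claim("job-1", "worker-b"))  # → False
```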
Thinking CAP gives you a binary choice. Most systems are tunable per operation. Use that.
Next Steps
Continue to 08-reliability.md to keep the system up when things fail.