Consistency and Consensus: The Theory That Bites
This chapter covers CAP, PACELC, quorum-based replication, the basics of Raft, and the practical flavors of consistency you'll pick between.
What "Consistency" Even Means
The word "consistency" is overloaded. Three distinct meanings you'll encounter:
- Database consistency (ACID's C). Constraints hold after a transaction. Foreign keys, CHECK constraints. Not what this chapter is about.
- Replica consistency. Multiple copies of data agree with each other. Core topic here.
- Consistency models. The ordering guarantees a system provides across operations. The rest of this chapter.
This chapter uses "consistency" to mean the last two.
CAP, Briefly and Honestly
The CAP theorem, stated carelessly: "You can only have two of Consistency, Availability, Partition tolerance."
This is technically true and practically misleading. You don't choose between partition tolerance and the other two. Partitions happen in any real distributed system. The real choice is: during a partition, do I give up consistency or availability?
- CP systems: during a partition, reject some operations to preserve consistency. Writes might fail; reads might be stale or unavailable.
- AP systems: during a partition, keep serving everyone, accept that they might see different data.
When there isn't a partition, most systems give you both. CAP is only a useful frame during a failure.
Examples:
- CP: Spanner, etcd, ZooKeeper, HBase, single-primary SQL (during failover)
- AP: Cassandra, DynamoDB (default), CouchDB, Dynamo-style stores
Many real systems are tunable. DynamoDB has "strongly consistent reads" (CP-ish). Cassandra has QUORUM (CP-ish) and ONE (AP). Pick per operation, not per database.
PACELC: The Refinement
Partitions are rare. Most of the time, the interesting trade-off is about latency.
PACELC: "If there's a Partition, choose Availability or Consistency; Else, choose Latency or Consistency."
| System | During a partition | Else |
|---|---|---|
| Spanner | C | C (consistent, higher latency) |
| Dynamo | A | L (available, lower latency) |
| Cassandra | A | L (or C with QUORUM) |
| MongoDB | C | C (primary reads) or L (secondary reads) |
The real-world choice you make most often is L vs C: wait for a global quorum (high latency, strong consistency) or serve from the nearest replica (low latency, possibly stale).
Consistency Models, Ranked
From strongest to weakest.
Linearizability (Strong Consistency)
Operations appear to happen instantaneously, in some sequential order that matches the real-time order of non-concurrent operations.
In plain language: every reader sees the latest write immediately, as if there's only one copy of the data.
Expensive. Requires coordination. Examples: Spanner, etcd, a single-primary PostgreSQL.
Sequential Consistency
All operations happen in some total order, and each process sees its own operations in program order. But the global order doesn't have to match real time.
Weaker than linearizability, rarely useful in practice.
Causal Consistency
If operation A causes operation B (for example, a process reads A and then writes B), every process sees A before B. Concurrent operations can be seen in any order.
More permissive than linearizability; still rules out the "I see my reply before your message" class of bug. Pragmatic for social apps and collaborative tools.
Eventual Consistency
If no new writes happen, all replicas eventually converge. Anything goes in the meantime.
Weakest useful model. Fine for analytics, counters, status indicators. Dangerous for anything where staleness matters (billing, auth).
Session Guarantees
A middle ground. Within a user's session, you get stronger guarantees:
- Read-your-writes: after you write X, you read X.
- Monotonic reads: once you've seen a newer version, you don't see an older one.
- Monotonic writes: your writes appear in the order you made them.
- Writes follow reads: if you read X and then write, your write is ordered after the write that produced X.
Most useful UX-level consistency lives here. Full linearizability is overkill for most features.
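One of these guarantees, monotonic reads, can be sketched with version tokens. The `Replica` class and version numbers here are hypothetical stand-ins, not a real client API: the client remembers the highest version it has seen, and a read is served only by a replica that has caught up at least that far.

```python
class Replica:
    def __init__(self, version, value):
        self.version = version  # how far this replica has replicated
        self.value = value

def monotonic_read(replicas, seen_version):
    """Return (version, value) from a replica at least as fresh as
    what the client has already seen, or None if none qualifies
    (a real system would fall back to the primary)."""
    for r in replicas:
        if r.version >= seen_version:
            return r.version, r.value
    return None
```

The client then carries `seen_version` forward with each response, which is exactly what session tokens in systems like Cosmos DB or MongoDB's causal sessions do under the hood.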
Read-Your-Writes in Practice
Most replica-based systems give eventual consistency by default. Users write, then immediately read, and see the old value. Angry ticket incoming.
Two mitigations:
Route Post-Write Reads to Primary
After a user writes, their next N reads go to the primary (not replicas). Simple, often enough.
```python
def read(user_id):
    # Pin reads to the primary for a short window after this
    # user's last write, so they always read their own writes.
    if recently_wrote(user_id):
        return primary.query(...)
    return replica.query(...)
```
Bounded Staleness
Replicas expose their replication lag. Read from any replica whose lag is under a threshold (say 100ms). If none is fresh enough, read from the primary.
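A sketch of that routing logic, assuming each replica exposes some lag metric — the `replication_lag()` method here is a hypothetical stand-in for whatever your database reports (e.g. seconds behind the primary):

```python
MAX_LAG_SECONDS = 0.1  # the 100 ms threshold from the text

def read_bounded(replicas, primary, run_query):
    # Use the first replica whose reported lag is under the
    # threshold; if none is fresh enough, fall back to the primary.
    for replica in replicas:
        if replica.replication_lag() <= MAX_LAG_SECONDS:
            return run_query(replica)
    return run_query(primary)
```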
Quorum
For systems with N replicas, define:
- W: replicas that must ack a write.
- R: replicas that must ack a read.
If R + W > N, every read quorum overlaps every write quorum, so a read contacts at least one replica that has the latest write. That's the quorum condition.
Common settings for N=3:

- W=1, R=1: Low latency, eventual consistency. Dynamo default.
- W=2, R=2: Quorum. Linearizable-ish, tolerates one failure.
- W=3, R=1: Reads are fast, writes are slow. Read-heavy systems.
- W=1, R=3: Writes are fast, reads are slow. Write-heavy systems (rare).
- W=3, R=3: All three replicas on every operation. No failure tolerance, highest consistency.
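The overlap condition is easy to check mechanically. A minimal sketch:

```python
# Quorum overlap check: with N replicas, a write acked by W nodes
# and a read from R nodes must share at least one node whenever
# R + W > N, so every read quorum sees the latest committed write.

def quorums_overlap(n, w, r):
    return r + w > n

# The N=3 settings from the list above:
assert quorums_overlap(3, 2, 2)      # W=2, R=2: reads see writes
assert not quorums_overlap(3, 1, 1)  # W=1, R=1: a read can miss a write
```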
Cassandra lets you choose per operation. DynamoDB exposes "strongly consistent reads" (quorum) vs "eventually consistent reads" (any replica).
Consensus: Getting Replicas to Agree
When you need a group of replicas to agree on a sequence of decisions (order of writes, leader election, config changes), you need a consensus algorithm.
Two matter in practice: Paxos and Raft. Raft was designed to be understandable. It is.
Raft, Skeletal
Three roles:
- Leader: accepts writes, replicates them.
- Follower: replicates from leader.
- Candidate: running for leader after detecting leader failure.
Two phases:
- Leader election. When the leader is silent for too long (heartbeat timeout), a follower becomes a candidate and requests votes. If a majority grants, it's the new leader.
- Log replication. Leader receives a write, appends it to its log, replicates to followers. When a majority has the entry, it's committed and applied.
Why it works: every elected leader has a log at least as up-to-date as the majority. Entries committed by a leader stay committed across leader changes. The algorithm is short; the proof is longer.
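The commit rule from the log replication phase can be written down directly. This is an illustration of the majority rule only, not a Raft implementation:

```python
# An entry is committed once it is stored on a majority of nodes.
# The leader tracks, per node, the highest log index known to be
# replicated there (its own log counts as one node).

def majority(n):
    return n // 2 + 1

def committed_index(match_index, n):
    """Highest log index stored on a majority of the n nodes."""
    return sorted(match_index, reverse=True)[majority(n) - 1]

# 5-node cluster: leader's log reaches index 7; followers have
# replicated up to 7, 5, 3, and 2. Index 5 is on three nodes
# (a majority), so everything through index 5 is committed.
print(committed_index([7, 7, 5, 3, 2], 5))  # → 5
```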
What Consensus Gets You
With a Raft cluster of N nodes, you can tolerate floor((N-1)/2) failures. Three nodes tolerate one; five tolerate two. The coordinated cluster presents a linearizable view of state.
Cost: writes require a round trip to a majority. In a single datacenter, ~1ms. Across regions, 50+ms. Consensus doesn't come free.
Systems built on Raft: etcd, Consul, CockroachDB, TiKV, parts of Kafka's newer metadata layer. Paxos powers Spanner and Chubby.
You Probably Don't Implement Consensus
Almost every system that needs consensus uses an existing implementation (etcd, Consul, ZooKeeper) or an embedded library instead of rolling its own. Those are battle-tested; rolling your own is a years-long project.
Use consensus when:
- You need leader election.
- You need distributed locks that actually work.
- You need a consistent config store.
- You're building a distributed database.
Don't use consensus when:
- A relational DB with leader failover suffices (most apps).
- You just need "eventually consistent" coordination (counters, caches).
Real Systems, Real Trade-offs
Quick tour of how common systems handle consistency.
| System | Consistency |
|---|---|
| PostgreSQL (primary) | Linearizable writes; reads from the primary are linearizable. |
| PostgreSQL (replica) | Asynchronous replication by default; replica reads can be stale (staleness can be bounded with lag monitoring). |
| DynamoDB (default reads) | Eventually consistent. |
| DynamoDB (strong reads) | Linearizable; costs ~2x the RCUs. |
| Cassandra (ONE) | Eventually consistent. |
| Cassandra (QUORUM) | Overlapping quorums when R+W > N; near-strong, but not fully linearizable (lightweight transactions provide that). |
| MongoDB (primary) | Linearizable, with majority write concern and linearizable read concern. |
| MongoDB (secondary) | Eventually consistent; can be tuned. |
| Spanner | Globally linearizable, via TrueTime. |
| etcd / Consul | Linearizable (Raft). |
| S3 | Strong read-after-write consistency since December 2020. |
| Redis (single instance) | Linearizable (single-threaded command execution). |
| Redis Cluster | Eventually consistent during failover; acknowledged writes can be lost. |
| Kafka | Ordered within a partition; messages can duplicate on failure. |
Common Pitfalls
Picking strong consistency everywhere. Latency goes up, availability goes down, and most of your operations don't need it. Relax where you can.
Picking eventual consistency everywhere. You'll ship bugs where a user writes and immediately reads the old value. Session guarantees are cheap; apply them.
Assuming the network is reliable. Partitions happen. Retries happen. Duplicates happen. Design for all three.
Trusting "exactly-once" without reading the fine print. Usually at-least-once with idempotency. Build idempotent consumers.
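An idempotent consumer is mostly deduplication by message ID. A sketch, with a plain set standing in for what would be a durable store in a real system:

```python
# At-least-once delivery plus dedup by message ID gives
# effectively-once processing.

processed_ids = set()

def handle(message_id, payload, apply_effect):
    if message_id in processed_ids:
        return False  # duplicate delivery: effect already applied
    apply_effect(payload)
    processed_ids.add(message_id)
    return True
```

Note the window between applying the effect and recording the ID: a crash there re-applies the effect on redelivery. That's why the dedup check and the side effect should share one transaction when the store allows it.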
Using a distributed lock where a database transaction suffices. Distributed locks are hard (fencing tokens, liveness, lease expiry). A unique constraint or row lock solves most problems.
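A sketch of the unique-constraint approach, using in-memory SQLite so it's self-contained (the `claims` table and job IDs are illustrative). Two workers race to claim the same job; the primary key lets exactly one INSERT succeed, with no lock service involved:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE claims (job_id TEXT PRIMARY KEY, worker TEXT)")

def try_claim(job_id, worker):
    try:
        db.execute("INSERT INTO claims VALUES (?, ?)", (job_id, worker))
        db.commit()
        return True   # this worker won the race
    except sqlite3.IntegrityError:
        return False  # someone already claimed the job

print(try_claim("job-1", "worker-a"))  # → True
print(try_claim("job-1", "worker-b"))  # → False
```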
Thinking CAP gives you a binary choice. Most systems are tunable per operation. Use that.
Next Steps
Continue to 08-reliability.md to keep the system up when things fail.