Reliability: When Things Fail

This chapter covers failure modes, retries, timeouts, circuit breakers, idempotency, bulkheads, and graceful degradation.

Failure Is Normal

In any distributed system, something is always failing somewhere. A disk filled up. A VM got evicted. A certificate expired. A network link flapped. A garbage collector paused the process for 3 seconds. The user's phone lost signal.

Reliable systems don't avoid failure; they absorb it. The question isn't "how do I prevent X from breaking" but "how do I keep the system useful when X breaks".

Failure Modes

Not all failures look alike.

Crash failure: the node stops. Easy to detect (no response, dead connection). Easy-ish to handle (retry, failover).

Omission failure: the node misses a message. Harder: did it arrive and get dropped? Did the ack get lost?

Timing failure: the node responds, but too slowly. The worst kind. A slow node looks alive to health checks but kills latency end-to-end.

Byzantine failure: the node returns wrong answers (corruption, bug, malicious). Expensive to defend against; most systems assume honest failures only.

The day-to-day is crash and timing. Most reliability techniques target these two.

Timeouts

Every network call needs a timeout. Without one, a downstream hang propagates: the caller waits forever, its caller waits, eventually the whole system is frozen.

Typical budgets:

Database query      100 ms to 5 s
Cache lookup         1 ms to 50 ms
Internal HTTP       100 ms to 2 s
External API         1 s to 30 s
Background job       30 s to hours

Set the timeout slightly below the client's own timeout, so you fail before your caller does. If the client gives you 5 seconds, call downstream with 4.5 seconds. That leaves budget for retries and overhead.
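A minimal sketch, assuming the requests library and the 5-second budget above; the URL is illustrative:

import requests

# Our caller allows 5 s; give the downstream call 4.5 s and keep the remaining
# 0.5 s for retries, serialization, and sending our own response.
def fetch_items():
    return requests.get("https://downstream.internal/items", timeout=4.5)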

Deadlines Over Timeouts

Better than a per-call timeout is a deadline propagated through the call chain. Each service subtracts its own processing time from the remaining budget, and a request that arrives with the budget already spent fails fast instead of doing doomed work.

gRPC and some HTTP frameworks support deadlines natively. Otherwise pass a budget header.
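A sketch of passing a budget header, again assuming the requests library; the "X-Budget-Ms" header name and the local_work parameter are illustrative, not a standard:

import time
import requests

def call_downstream(url, budget_ms, local_work):
    # local_work: whatever this hop does itself before calling downstream.
    start = time.monotonic()
    local_work()
    remaining_ms = budget_ms - (time.monotonic() - start) * 1000
    if remaining_ms <= 0:
        raise TimeoutError("budget exhausted before the downstream call")
    # Forward the remaining budget and use it as this call's own timeout.
    return requests.get(url,
                        headers={"X-Budget-Ms": str(int(remaining_ms))},
                        timeout=remaining_ms / 1000)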

Retries

A call failed. Try again?

Yes, usually. Transient failures (a dropped packet, a momentary unavailability) resolve themselves. Retries are free reliability.

Rules for retries:

Exponential Backoff

Don't retry immediately. Wait. Then wait longer.

import time

MAX, BASE, CAP = 5, 0.1, 10.0   # attempt limit, first backoff (s), max single wait (s)

def call_with_backoff():        # call() and TransientError stand in for your operation
    attempts = 0
    while True:
        try:
            return call()
        except TransientError:
            if attempts + 1 >= MAX:
                raise                                  # out of attempts: surface the error
            wait = min(CAP, BASE * 2 ** attempts)      # 0.1 s, 0.2 s, 0.4 s, ... capped at CAP
            time.sleep(wait)
            attempts += 1

Without backoff, a spike of failures generates a spike of retries, which generates a new spike of failures. Retry storms.

Add Jitter

If every client retries at the same backoff, they all retry at the same moment. Add random jitter so retries spread out.

wait = random.uniform(0, min(CAP, BASE * 2 ** attempts))

Full jitter (the example above) is often better than "base + small jitter" under heavy load.

Don't Retry Everything

Retry transient failures. Don't retry:

  • 4xx errors (caller's fault; retry won't help).
  • Non-idempotent operations, or ones whose idempotency you can't guarantee.
  • Anything past a deadline.

Cap Retries

Retries can't be infinite. Typical: 3 to 5 attempts, total budget bounded by the deadline.

Circuit Breakers

A downstream service is unhealthy. Retrying makes it worse. Circuit breakers stop calling a failed service to give it room to recover.

Three states:

Closed      Normal. Requests flow. Count failures.
Open        Broken. Fail fast without calling. After a cooldown, half-open.
Half-open   Probe. Allow one request. Success closes the circuit. Failure reopens.

Implementations: Hystrix (Java, now in maintenance), resilience4j, Polly (.NET), built-ins in Envoy and Istio.

Set thresholds by observed behavior. "Open after 20 failures in 60 seconds" is a starting point; tune to the service.
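A minimal sketch of the three-state machine, using the thresholds above as defaults; the names and numbers are illustrative:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=20, window=60.0, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.window = window              # seconds over which failures are counted
        self.cooldown = cooldown          # how long to stay open before probing
        self.failures = []                # timestamps of recent failures
        self.opened_at = None             # None = closed; a timestamp = open

    def call(self, fn):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open: failing fast")   # open: don't call at all
        # Either closed, or half-open (cooldown elapsed, this request is the probe).
        try:
            result = fn()
        except Exception:
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if self.opened_at is not None or len(self.failures) >= self.failure_threshold:
                self.opened_at = now      # trip open, or re-open after a failed probe
            raise
        self.failures.clear()
        self.opened_at = None             # success closes the circuit
        return result

A production implementation also limits the half-open state to a small number of in-flight probes and adds thread safety and metrics; the libraries above do this for you.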

Bulkheads

A ship's bulkheads isolate flooding. In software, bulkheads isolate failures.

Examples:

  • Separate thread pools per downstream. A slow payment service can't consume all your worker threads and starve the rest.
  • Per-tenant quotas. One noisy customer can't exhaust a shared rate limit.
  • Separate connection pools. Analytics queries can't drain the connection pool that serves user traffic.
  • Separate services. Payments in one service, reports in another. A hiccup in one doesn't take down the other.

Bulkheads trade total capacity for fault isolation. You might run a little hotter under normal load, but you don't lose everything to one misbehaving caller.
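A sketch of the thread-pool flavor, assuming each downstream gets its own bounded pool; the pool names and sizes are illustrative:

from concurrent.futures import ThreadPoolExecutor

# One pool per downstream: a hung payment service can tie up at most its own
# 10 workers, leaving search and email untouched.
pools = {
    "payments": ThreadPoolExecutor(max_workers=10),
    "search":   ThreadPoolExecutor(max_workers=10),
    "email":    ThreadPoolExecutor(max_workers=5),
}

def call_downstream(name, fn, *args):
    future = pools[name].submit(fn, *args)
    return future.result(timeout=2.0)     # bulkhead plus timeout: don't wait forever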

Idempotency

An operation is idempotent if calling it twice has the same effect as calling it once.

Idempotent:

SET balance = 100
PUT /users/42 { "name": "Ada" }
DELETE /orders/42

Not idempotent:

INCREMENT balance
POST /orders (new ID generated each time)

Why it matters: retries and duplicate deliveries happen all the time. An idempotent operation is safe under them; a non-idempotent one silently duplicates.

Making Operations Idempotent

Three patterns:

Idempotency keys. The caller generates a unique ID per logical operation; the server records it with the result.

POST /payments
Idempotency-Key: client-generated-uuid-abc123

{ "amount": 99.00, "source": "card_xyz" }

Server: if the key exists, return the stored result. If not, process and store. Retries with the same key return the same result and don't re-charge.
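A minimal sketch of the server side, assuming an in-memory dict stands in for a durable table keyed by the idempotency key; charge_card is a placeholder parameter:

store = {}   # in production: a durable table with a unique constraint on the key

def handle_payment(idempotency_key, request, charge_card):
    # charge_card: the real, non-idempotent side effect (placeholder).
    if idempotency_key in store:
        return store[idempotency_key]    # replay: return the stored result, no second charge
    result = charge_card(request)
    store[idempotency_key] = result      # record the result under the client's key
    return result

In a real implementation the check and the store must be atomic (for example, a unique constraint plus a transaction); otherwise two concurrent retries can both charge.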

State-based, not delta-based. SET balance = 100 is safe; ADD 50 to balance is not.

Upsert at the sink. Database INSERT ... ON CONFLICT DO UPDATE swallows duplicates harmlessly.
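For example, with SQLite (3.24 or later) from Python, replaying the same order id updates the row in place instead of inserting a duplicate:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")

def record_order(order_id, status):
    # Running this twice with the same id leaves exactly one row.
    conn.execute(
        "INSERT INTO orders (id, status) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (order_id, status),
    )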

Every write endpoint that can be retried should support idempotency. Most real APIs don't. Yours should.

Graceful Degradation

When a dependency fails, what can you still do?

Options, in order of preference:

  1. Cache the answer. If the cache has a stale copy, use it.
  2. Return a default. Show the user a sensible placeholder.
  3. Hide the broken part. Skip the comments section; the article still renders.
  4. Fail the feature, not the app. A broken "recommendations" widget shouldn't take down the product page.
  5. Error message. Last resort. Tell the user what's broken and what to do.

"All or nothing" is rarely the right design. Core flows should work even when half the services are down.

Failover

When a component dies, automatically route traffic elsewhere.

Flavors:

  • Active-passive. Primary handles everything; standby is idle. On failure, standby takes over. Simple. Wastes the standby.
  • Active-active. Multiple replicas handle traffic simultaneously. On failure, survivors absorb the load. More efficient, more complex.
  • Multi-region. Active-active across regions. Expensive and worth it at scale.

Key question: is failover automatic or manual? Automatic is fast but can misfire (flap). Manual is reliable but slow and needs someone awake.

Most production database setups automate failover (RDS Multi-AZ, Patroni-managed PostgreSQL coordinated through etcd or Consul). Automated failover needs good health checks and careful thresholds.

Chaos Engineering

Don't wait for real failures to find out what breaks. Inject failures in production deliberately.

  • Kill a random instance every day.
  • Inject network latency.
  • Drop packets at the load balancer.
  • Cut a zone off from others.

Tools: Netflix's Chaos Monkey, Gremlin, Litmus. On Kubernetes, Chaos Mesh is popular.

Start small: a staging environment, one failure mode, with someone watching. Build up to "it runs in production on a schedule, and nobody notices".

Observability Before Reliability

You can't fix what you can't see. Before investing heavily in reliability patterns, make sure you can answer:

  • Which service is failing right now?
  • What's the error rate per endpoint?
  • What's the p99 latency per call?
  • Which dependency is slow?

Chapter 9 covers this in depth. Reliability without observability is fumbling in the dark.

Common Pitfalls

No timeouts. The single most common reliability bug. Add timeouts everywhere.

Retries without backoff. Spikes of failure generate spikes of retries. Exponential backoff plus jitter, always.

Same pool for everything. A hiccup in one caller drains resources; everyone suffers. Use bulkheads.

Not testing failure paths. The exception handler has never run in production. It's probably broken. Test it.

Trusting the happy path in reviews. "What happens when the database is slow?" is the question that exposes real design flaws.

Alerting on retries, not on failures. A noisy downstream causes retry alerts to fire all night. Alert on user-visible outcomes.

Next Steps

Continue to 09-observability.md to see what the system is actually doing.