Scaling: More Traffic Than One Box
This chapter covers vertical vs horizontal scaling, why stateless services matter, and the load balancing strategies that actually route traffic.
Vertical vs Horizontal
Two ways to handle more load.
Vertical scaling: put it on a bigger box. More CPU, more RAM, bigger disk. Same software, same architecture.
Horizontal scaling: put it on more boxes. Same software, same architecture, but now N copies running behind a load balancer.
Vertical Wins When
- The single-box limit is far away.
- Your workload fits in RAM.
- You'd rather not deal with distributed coordination.
- The problem is CPU-bound per request (a heavy computation per call, not many tiny calls).
Modern cloud VMs reach hundreds of vCPUs and multiple terabytes of RAM. That's a lot. A well-tuned PostgreSQL on a beefy box handles most companies' workloads for years. Vertical scaling is the fastest path to "good enough" for most services.
Horizontal Wins When
- You've outgrown any single box.
- You need fault tolerance (if one box dies, traffic goes to others).
- Your workload fans out across many independent requests.
- Traffic is spiky and you want to add and remove capacity fast.
Horizontal is more complex (you need load balancers, coordination, shared state), but unbounded.
The Honest Recommendation
Scale vertically until you can't. Horizontal is strictly more complex; adopt the complexity when vertical runs out or when you need fault tolerance.
Most services spend years on a single database and a handful of application replicas. The "add a thousand nodes" internet-scale narrative is rare. Don't design for it unless you have evidence you need it.
Stateless Services
A service is stateless when any instance can handle any request. No instance keeps local state that the next request depends on. Session data, cached lookups, in-memory counters all live elsewhere (Redis, a database, a cache layer).
Why It Matters
Stateless services let you:
- Add and remove instances freely (traffic just re-routes).
- Roll out new versions by replacing instances.
- Survive instance death without losing user state.
- Use simple load balancing strategies.
A stateful service needs a story for every operation:
- How do I promote a replica when the primary dies?
- How do I migrate a user's session when their instance scales down?
- How do I upgrade? Drain? Take a snapshot?
Push state down, not up. Services become caches or compute; persistence and session state live in dedicated stores (databases, caches, object storage).
What "State" Actually Means
Not all state is bad. Every process has some local state: in-flight request data, connection pools, metrics counters. What makes a service stateful (in the problematic sense) is state that must survive across requests and must be read by a later request on a possibly different instance.
Examples of problematic local state:
- Session data kept in memory → move to Redis/DB
- Upload in progress, buffered locally → stream to object storage
- Rate-limit counters per user → move to Redis with atomic ops
- "Last seen" user status → push to a shared store
- In-memory cache of hot rows → fine, as long as it's not authoritative
The last one is subtle: an in-memory cache is state, but it's OK if it's just a cache (the source of truth is elsewhere). Slightly stale or inconsistent caches across instances are usually acceptable.
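To make the distinction concrete, here is a minimal read-through cache sketch. The `db` dict stands in for a real database (a hypothetical stand-in, not a specific API): the instance can die and lose its local dict without losing anything, because the dict is never authoritative.

```python
class ReadThroughCache:
    """Per-instance cache over an authoritative store.

    `db` is the source of truth (here a plain dict standing in for a
    real database). `local` is just a cache: losing it costs a few
    extra reads, never data.
    """

    def __init__(self, db):
        self.db = db      # authoritative
        self.local = {}   # disposable per-instance cache

    def get(self, key):
        if key not in self.local:
            self.local[key] = self.db[key]  # cache miss: read through
        return self.local[key]


db = {"user:1": "alice"}
cache = ReadThroughCache(db)
value = cache.get("user:1")
cache.local.clear()          # simulate instance restart
value_again = cache.get("user:1")
```

Clearing `local` simulates an instance being replaced; the next read simply repopulates from the source of truth.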
Load Balancing Algorithms
Chapter 2 listed them. Here's when to use each.
Round Robin
Each backend takes turns. Simple, fair when backends are identical and requests are uniform.
Problem: it doesn't account for backend load. If one request is 100x heavier than average, round robin sends it to whichever backend's turn it is, regardless.
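A minimal round-robin sketch (backend names like `app-1` are made up for illustration):

```python
from itertools import cycle


class RoundRobin:
    """Hand out backends in strict rotation, ignoring their load."""

    def __init__(self, backends):
        self._cycle = cycle(backends)

    def pick(self):
        return next(self._cycle)


lb = RoundRobin(["app-1", "app-2", "app-3"])
picks = [lb.pick() for _ in range(6)]
# Rotates: app-1, app-2, app-3, then wraps around again.
```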
Least Connections
Send to the backend with the fewest active connections. Good for workloads with long-lived connections (websockets, long-polling) or variable request durations.
Problem: "fewest connections" doesn't always mean "least loaded". A backend with 10 fast requests can be less busy than one with 5 slow ones.
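A sketch of the idea, assuming the load balancer is told when each connection opens and closes (real load balancers track this internally):

```python
class LeastConnections:
    """Track active connections per backend; route to the minimum."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1   # connection opens
        return backend

    def release(self, backend):
        self.active[backend] -= 1   # connection closes


lb = LeastConnections(["a", "b"])
first = lb.pick()    # both idle; "a" wins the tie
second = lb.pick()   # "b" now has fewer active connections
lb.release("a")
third = lb.pick()    # "a" is idle again
```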
Weighted
Each backend has a weight. Bigger weight, more requests. Useful when backends are heterogeneous (mixed instance types during a migration).
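Weighted selection can be sketched as weighted random choice (the weights below are invented: imagine an old small instance serving alongside a new large one during a migration):

```python
import random
from collections import Counter


def weighted_pick(weights, rng=random):
    """weights: {backend: weight}. Bigger weight, proportionally more traffic."""
    backends = list(weights)
    return rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]


rng = random.Random(42)  # seeded for a repeatable demo
counts = Counter(
    weighted_pick({"large": 3, "small": 1}, rng) for _ in range(4000)
)
# "large" receives roughly 3x the traffic of "small".
```

Production balancers typically use deterministic smooth weighted round robin rather than random sampling, but the proportionality is the same.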
Consistent Hashing
Hash a key from the request (user ID, customer ID, cache key) and map it to a backend. Same key always goes to the same backend, until the backend pool changes.
Why bother? Two reasons:
- Cache locality. If you hash user ID to an app server that has their session or data cached in memory, hit rates are high.
- Stable state. If the backend maintains per-key state (a cache shard, a websocket connection), consistent hashing keeps keys at the same backend across requests.
When a backend is added or removed, only 1/N of keys re-map (not all of them, which is the naive hash mod N failure). Chapter 4 returns to consistent hashing in the context of Redis and cache clusters.
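Here is a minimal consistent-hash ring sketch with virtual nodes (the replica count of 100 is an arbitrary illustrative choice; MD5 is used only as a cheap stable hash, not for security):

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: keys map to the next backend clockwise on a ring."""

    def __init__(self, backends, replicas=100):
        self.replicas = replicas      # virtual nodes per backend
        self._ring = []               # sorted list of (hash, backend)
        for b in backends:
            self.add(b)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, backend):
        for i in range(self.replicas):
            self._ring.append((self._hash(f"{backend}#{i}"), backend))
        self._ring.sort()

    def remove(self, backend):
        self._ring = [(h, b) for h, b in self._ring if b != backend]

    def pick(self, key):
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["app-1", "app-2", "app-3"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.pick(k) for k in keys}
ring.remove("app-3")
after = {k: ring.pick(k) for k in keys}
# Only keys that lived on app-3 move; everything else stays put.
```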
Random
Pick a random backend. Surprisingly competitive with round robin when backends are uniform. Trivial to implement.
Power of Two Choices
Pick two random backends; send the request to the less-loaded one. Near-optimal in practice, much simpler than maintaining perfect global state. Used by Envoy's least-request load balancing.
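The algorithm fits in a few lines. This sketch assumes the balancer can see an active-request count per backend:

```python
import random


def pick_p2c(load, rng=random):
    """Power of two choices: sample two backends, take the less loaded."""
    a, b = rng.sample(list(load), 2)
    return a if load[a] <= load[b] else b


# With only two backends, both are always sampled, so the idle one
# always wins; with many backends this avoids herding onto one minimum.
choice = pick_p2c({"busy": 10, "idle": 0})
```

The appeal is that each decision needs only two load lookups, and stale load data hurts far less than it does for a global "pick the minimum" strategy.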
Health Checks
A load balancer must know which backends are alive.
Passive: notice failed requests, stop sending traffic.
Active: periodically hit a /health endpoint, remove backends that don't respond.
Endpoints worth exposing:
- /livez: am I alive? (Restart if not.)
- /readyz: am I ready to serve? (Remove from LB if not.)
The distinction matters during startup and shutdown. A pod that's alive but still warming a cache should be "live" but "not ready". The LB skips it; Kubernetes doesn't restart it.
Keep health checks cheap. A check that hits the database makes the database the weak link.
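A sketch of the liveness/readiness split, assuming a service that warms a cache at startup (the `Health` class and its fields are illustrative, not a specific framework's API):

```python
class Health:
    """Separate liveness from readiness.

    Liveness: the process works at all (restart it if not).
    Readiness: it is safe to send traffic (skip it in the LB if not).
    """

    def __init__(self):
        self.cache_warmed = False   # flipped once startup warmup finishes

    def livez(self):
        # Cheap and local: if we can run this code, we're alive.
        return 200

    def readyz(self):
        # Also cheap and local. Deliberately does NOT query the database:
        # a DB blip should not mark every backend unhealthy at once.
        return 200 if self.cache_warmed else 503


h = Health()
starting = (h.livez(), h.readyz())   # alive but not yet ready
h.cache_warmed = True
serving = (h.livez(), h.readyz())    # alive and ready
```

During the warmup window the load balancer skips the instance (503 from /readyz) while the orchestrator leaves it alone (200 from /livez), which is exactly the behavior described above.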
Sticky Sessions
Sticky sessions route a user's requests to the same backend (via cookie, IP hash, or consistent hashing on a session ID).
When they help:
- Legacy apps that keep user state in process memory.
- Websocket connections (must pin to one backend).
- Cache locality, when hit rates matter more than fair distribution.
When they hurt:
- Backend dies; user loses their session.
- One user's traffic pins to one backend that then gets overloaded.
- Scale-down drains a backend; sticky-mapped users all move at once.
Prefer stateless. Add stickiness when you truly need it; treat it as a compromise, not a default.
Auto-Scaling
Add capacity when load is high, remove it when load is low.
Two triggers:
Reactive: scale on current metrics (CPU, requests/sec, queue depth). Simple but lags spikes.
Predictive: scale on forecasts (time of day, known traffic patterns, calendar). Faster for predictable spikes, more complicated to tune.
Rules of thumb:
- Scale up fast, scale down slow. Taking 10 minutes to remove a box is fine. Taking 10 minutes to add one isn't.
- Have a minimum. The cost of one extra instance is usually less than the pain of cold starts.
- Set a maximum. Auto-scaling bugs have broken budgets.
- Watch for the bottleneck. Autoscaling your app tier doesn't help if the database is the limit.
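The rules above can be sketched as a reactive scaling decision. The thresholds (60% CPU target, scale down below half the target) and the floor/ceiling values are illustrative assumptions, not recommendations:

```python
import math


def desired_replicas(current, cpu_pct, target=60.0, minimum=2, maximum=20):
    """Scale up fast, scale down slow, always within [minimum, maximum]."""
    if cpu_pct > target:
        # Up fast: jump straight to the size the observed load implies.
        want = math.ceil(current * cpu_pct / target)
    elif cpu_pct < target / 2:
        # Down slow: shed one replica at a time.
        want = current - 1
    else:
        want = current  # within the comfort band: do nothing
    return max(minimum, min(maximum, want))


# 4 replicas at 90% CPU -> jump to 6 in one step.
up = desired_replicas(4, 90)
# 4 replicas at 20% CPU -> step down to 3.
down = desired_replicas(4, 20)
# Runaway load still respects the ceiling.
capped = desired_replicas(10, 600)
```

Real autoscalers add cooldowns and smoothing on top, but the asymmetry (multiplicative up, one-at-a-time down) is the core idea.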
Connection Pools at Scale
Scaling horizontally adds more service instances; each instance has its own connection pools to downstream services (databases, Redis, other services).
A gotcha: if every instance keeps 20 database connections and you scale from 10 to 50 instances, you just went from 200 to 1000 DB connections. Most databases start hurting well before 1000 connections.
Two fixes:
- Reduce the per-instance pool. Often 5 to 10 is enough.
- Put a connection pooler in front of the DB. PgBouncer for PostgreSQL, RDS Proxy, and friends.
Chapter 5 covers this in depth.
Common Pitfalls
Scaling the wrong tier. The app tier is rarely the bottleneck; it's usually the database or a downstream service. Measure first.
Auto-scaling with no ceiling. A bug can turn a spike into a bill you won't forget.
Sticky sessions as a default. Makes scale-in painful and failure recovery awkward.
Health checks that hit the database. Now the DB decides which backends are healthy. If the DB blips, the whole pool goes dark.
Assuming "more replicas" solves "database is slow". It doesn't. You still share the database. Scale reads with replicas, scale writes with partitioning (Chapter 5), reduce load on the DB with caching (Chapter 4).
Next Steps
Continue to 04-caching.md to move reads off your database.