Designing a System: A Worked Example
This chapter walks through designing a URL shortener end to end, with back-of-envelope math, trade-offs made explicit, and the questions a reviewer would ask.
The Problem
"Build a URL shortener." (Think bit.ly, tinyurl, goo.gl.)
That's the whole prompt. The first step of any design is making it concrete.
Step 1: Clarify Requirements
Ask. Always. The answers shape the rest.
Functional
- Shorten a long URL to a short one.
- Redirect from short URL to long URL.
- Let users see analytics (click counts, referrers).
- Custom aliases (optional).
- Expiration (optional).
Non-Functional
- Peak request rate?
- Read/write ratio?
- Latency for the redirect path?
- Availability target?
- Durability: can a shortened URL ever be lost?
A reasonable set of answers for an interview or a real product:
- 100M new URLs per month.
- Read:write ratio of 100:1.
- Redirect p99 under 50ms.
- 99.99% availability.
- URLs must not be lost.
These are the numbers we design against.
Step 2: Back-of-Envelope
The numbers drive the design.
Scale
100M new URLs per month
= 100M / (30 × 86400 s) ≈ 40 writes/sec average
= 40 × (5x peak factor) ≈ 200 writes/sec peak
Reads: 100:1 ratio → 4000 reads/sec average, 20,000 reads/sec peak
200 writes/sec and 20,000 reads/sec peak. Both are modest; one database can handle the writes, caching can handle the reads.
Storage
Per URL record:
short_code 7 bytes (base62)
long_url 200 bytes average
user_id 8 bytes
created_at 8 bytes
expires_at 8 bytes
click_count 8 bytes
metadata ~50 bytes
~300 bytes total
100M/month × 300 bytes = 30 GB/month
× 120 months (10 years) ≈ 3.6 TB
3.6 TB fits on one beefy PostgreSQL instance. Sharding isn't needed for a long time.
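The arithmetic in this step is worth sanity-checking; a throwaway sketch (the 5x peak factor, 100:1 ratio, and 300-byte record size are the assumptions stated above):

```python
# Back-of-envelope check for the traffic and storage numbers above.
URLS_PER_MONTH = 100_000_000
SECONDS_PER_MONTH = 30 * 86_400
PEAK_FACTOR = 5
READ_WRITE_RATIO = 100
BYTES_PER_RECORD = 300

writes_avg = URLS_PER_MONTH / SECONDS_PER_MONTH   # ≈ 39/sec, rounds to 40
writes_peak = writes_avg * PEAK_FACTOR            # ≈ 190/sec, rounds to 200
reads_peak = writes_peak * READ_WRITE_RATIO       # ≈ 19,000/sec, rounds to 20k

gb_per_month = URLS_PER_MONTH * BYTES_PER_RECORD / 1e9  # 30 GB/month
tb_10_years = gb_per_month * 120 / 1000                 # 3.6 TB over 10 years
```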
Analytics Storage
Click events track reads. At the 4,000/sec average:
4,000 × 86,400 ≈ 350M events/day
× 100 bytes ≈ 35 GB/day
× 365 ≈ 13 TB/year
(Sizing from the 20k/sec peak would give ~170 GB/day, but nobody sustains peak around the clock; the average is the right planning number.)
Analytics is a different beast. Write to PostgreSQL for billing and SLA reporting; stream to ClickHouse or a data warehouse for analytics. We'll keep analytics simple below and mention the split.
Step 3: API Sketch
Define the surface before the guts.
POST /shorten
Body: { "long_url": "https://example.com/very/long/path", "alias": "optional", "expires_at": "optional" }
Returns: { "short_url": "https://sho.rt/aB3xR7q", "code": "aB3xR7q", ... }
GET /{code}
Returns: 302 redirect to long_url. Async: record a click event.
GET /stats/{code}
Returns: { "click_count": 1234, "last_clicked": "...", "referrers": {...} }
Three endpoints. The second one does 99% of the traffic.
Step 4: High-Level Design
Start with the simplest thing that could work.
Client
↓ HTTPS
Load balancer
↓
App servers (stateless)
↓
Cache (Redis) ← first for reads
↓ (miss)
Database (PostgreSQL)
↓ (async click event)
Kafka / queue
↓
Analytics store
For the hot path (redirect), each request is:
- Cache lookup by short code.
- On hit, redirect.
- On miss, DB lookup, populate cache, redirect.
- Fire-and-forget: emit click event.
The reasoning behind each decision follows.
Step 5: Short Code Generation
Three options:
Hash
MD5(long_url), take first N chars, base62. Deterministic; same URL → same code.
Problem: collisions. Two different URLs could share the first N chars of a hash.
Random
Generate a random 7-character base62 string. 62^7 ≈ 3.5 trillion codes.
Problem: still collisions at scale. Check before insert.
Counter + Encode
Global counter; base62-encode the next integer.
Problem: predictable codes (people can enumerate). Usually undesirable but not always a blocker.
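For the counter option, base62 encoding is only a few lines (a sketch; the alphabet ordering here is a convention, not a standard):

```python
import string

# 62 symbols: digits, then lowercase, then uppercase (ordering is arbitrary).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))
```

The predictability problem is visible immediately: consecutive counter values produce codes that differ only in the trailing character, so anyone can enumerate the keyspace in insertion order.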
Picked: Random Base62, Check on Insert
62^7 gives us lots of headroom. At 100M new URLs per month, 10 years is ~12B codes: 12B / 3.5T ≈ 0.3% of the keyspace. Collisions are rare.
On insert:
INSERT INTO urls (code, long_url, ...) VALUES (?, ?, ...)
ON CONFLICT (code) DO NOTHING
RETURNING code;
If nothing returns, retry with a new random code.
Custom aliases live in the same table; the primary key on code rejects conflicts.
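The picked approach in miniature (a sketch: secrets gives unpredictable codes, and try_insert stands in for the ON CONFLICT query above, returning True on success):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters

def random_code(length: int = 7) -> str:
    """Generate an unpredictable base62 code."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def shorten(long_url: str, try_insert, max_attempts: int = 5) -> str:
    """try_insert(code, long_url) -> bool is assumed to run the
    INSERT ... ON CONFLICT DO NOTHING above and report whether it won."""
    for _ in range(max_attempts):
        code = random_code()
        if try_insert(code, long_url):
            return code
    raise RuntimeError("no free code found")  # astronomically unlikely at 0.3% keyspace use
```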
Step 6: Data Model
CREATE TABLE urls (
code CHAR(7) PRIMARY KEY,
long_url TEXT NOT NULL,
user_id BIGINT REFERENCES users(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ,
click_count BIGINT NOT NULL DEFAULT 0
);
CREATE INDEX ON urls (user_id, created_at); -- for "my URLs" listing
CREATE TABLE click_events (
id BIGSERIAL PRIMARY KEY,
code CHAR(7) NOT NULL,
ts TIMESTAMPTZ NOT NULL DEFAULT NOW(),
referrer TEXT,
ip_country CHAR(2),
user_agent TEXT
);
CREATE INDEX ON click_events (code, ts DESC);
urls is the hot table. code as PRIMARY KEY gives O(log N) lookups. click_events is append-only and huge; consider partitioning by day (Chapter 5).
Step 7: Caching Strategy
Read hit rate will be very skewed. A small fraction of URLs get most of the clicks. Perfect fit for a cache.
- Cache layer: Redis.
- Pattern: cache-aside (Chapter 4).
- Key: url:{code}.
- Value: the long URL string (that's all redirects need).
- TTL: 1 hour. Cache "not found" results too, as a tombstone with a short TTL (10 seconds), so repeated lookups for nonexistent codes don't all reach the database.
Expected hit rate: 95%+. Database load for the hot path becomes minimal.
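The cache-aside hot path in miniature (a sketch: plain dicts stand in for Redis and PostgreSQL, and real code would attach the TTLs described above):

```python
NOT_FOUND = object()  # tombstone for negative caching

def resolve(code, cache, db, events):
    """Cache-aside lookup for a redirect. Returns the long URL, or None for a 404."""
    key = f"url:{code}"
    value = cache.get(key)
    if value is None:                  # cache miss
        value = db.get(code, NOT_FOUND)
        cache[key] = value             # populate; Redis would set TTL=1h (10s for tombstones)
    if value is NOT_FOUND:
        return None
    events.append({"code": code})      # fire-and-forget click event
    return value
```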
Cache Warming
After a deploy, the cache is empty. For a few minutes, all requests hit the DB. Options:
- Warm from top 10,000 hot codes before switching traffic over.
- Accept the cold start. 200 writes/sec vs 20k reads/sec: the DB handles it for a few minutes.
For a real product: warm. For the interview answer: acknowledge it and pick one.
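Warming is just a bulk prefill (a sketch; hot_rows would come from something like the top 10,000 codes ordered by click_count):

```python
def warm_cache(cache, hot_rows):
    """Prefill the cache from (code, long_url) pairs for the hottest codes,
    before the node starts taking traffic."""
    for code, long_url in hot_rows:
        cache[f"url:{code}"] = long_url
```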
Step 8: Click Analytics
Incrementing click_count on every redirect adds a write per read. Bad.
Three options:
Fire-and-Forget to a Queue
On each redirect, emit an event to Kafka. Consumers batch-update the database.
Redirect → enqueue {code, timestamp, ip, referrer}
→ Kafka
↓ batch every 1s or 10k events
→ consumer
→ UPDATE urls SET click_count = click_count + N WHERE code = ?
→ INSERT INTO click_events ...
Pros: the hot path is one cache hit + one HTTP response. Analytics is async.
Cons: eventual consistency. Stats are a few seconds behind. Acceptable.
Probabilistic Counting
Increment a Redis counter on every hit. Flush to DB every minute. Fast, slightly lossy.
Count in the Data Warehouse
Don't store counts in PostgreSQL at all. Click events go to ClickHouse/BigQuery; counts come from queries there.
Picked: Fire-and-Forget to Kafka
It matches the traffic pattern, doesn't block reads, and gives a clean boundary for the analytics system.
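The consumer side of the picked option reduces to aggregation (a sketch; each resulting (code, N) pair becomes one click_count UPDATE, alongside the raw click_events inserts):

```python
from collections import Counter

def flush_batch(events):
    """Collapse a batch of click events into one increment per code.
    A second's worth of traffic becomes a handful of UPDATEs instead of thousands."""
    increments = Counter(e["code"] for e in events)
    return increments  # {code: N} -> UPDATE urls SET click_count = click_count + N
```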
Step 9: Scaling Up
The design as drawn handles the target traffic. What breaks when the traffic grows 10x or 100x?
10x (200k reads/sec)
- Cache is still fine (Redis handles 200k ops/sec easily on one node).
- Database is fine for writes (2k/sec); reads are still cache-served.
- App servers scale horizontally behind the load balancer.
- Kafka scales by adding partitions.
No structural change needed.
100x (2M reads/sec)
- Cache needs a cluster or client-side local cache.
- App servers: fine with more replicas.
- Database writes still manageable (20k/sec); use a connection pooler.
- Network bandwidth at the edge becomes real. Consider a CDN: URLs and their destinations are cacheable (especially for short expiration windows).
At this scale, you also want geographic distribution. A user in Sydney shouldn't make a round trip to us-east-1 for a redirect. Options:
- Regional deployments. Full stack per region; database replicated.
- Edge computing. Cloudflare Workers or similar, reading from a KV store replicated globally.
Edge is elegant for this workload. The redirect is stateless (input: code; output: long URL). It's a great fit for edge serverless.
Step 10: Durability and Availability
Availability target: 99.99%. 52 minutes of downtime per year.
- Database: primary + one sync replica + one async replica. Automated failover via RDS Multi-AZ or Patroni.
- Cache: Redis with replication (failover automatic).
- App tier: stateless, run at least 3 replicas across zones.
- Kafka: three brokers, replication factor 3.
Durability: URLs must not be lost. Sync replica + daily snapshots + point-in-time recovery via WAL archiving mean losing a committed URL requires multiple simultaneous failures; that's achievable on a managed PostgreSQL.
Step 11: Security and Abuse
A URL shortener is a spam magnet. Things to think about:
- URL scanning. Integrate with Google Safe Browsing, PhishTank. Refuse known-bad URLs; blocklist on detection.
- Rate limiting. Per IP, per user. Bots try to create millions of URLs.
- Abuse reporting. Let users report; triage queue; disable with a takedown page.
- Auth. Anonymous shortening allowed? For a public service, yes, with tight rate limits.
These aren't afterthoughts. Half the work of a URL shortener is keeping it from becoming a phishing distribution channel.
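Per-key rate limiting is commonly a token bucket (a minimal single-process sketch; in production the buckets would live in Redis so all app servers share state):

```python
import time

class TokenBucket:
    """Allows `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per IP and one per user ID; anonymous traffic gets the tightest limits.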
What a Reviewer Would Ask
Prepare answers:
- "What happens if the cache dies?" DB load spikes. App servers handle it with increased latency; no outage. After 1-2 minutes, the cache warms back up.
- "What if two clients insert the same code simultaneously?" UNIQUE constraint, ON CONFLICT; retry with a new random code.
- "How do you handle expired URLs?" Check expires_at on redirect; return 410 Gone. Run a nightly job to clean up expired rows.
- "What if analytics falls behind?" It catches up; nothing in the hot path depends on it.
- "How do you prevent enumeration?" Random codes, not sequential. Rate limits on crawling-like patterns.
- "How do you handle a 10x sudden spike?" Auto-scale app servers; cache absorbs reads; database is protected by the cache.
What's Different in a Real Design
A real design has constraints this walkthrough glossed over:
- Existing infrastructure. Your company already uses AWS, GCP, or bare metal; pick tools that fit.
- Team capacity. A one-engineer side project vs a 50-engineer team changes every decision.
- Budget. Not everyone can afford multi-region active-active.
- Integration. If this slots into an existing product, the auth and database are probably decided.
An interview design can be abstract. A real design is always shaped by context.
Next Steps
Continue to 12-best-practices.md for the habits that make designs survive contact with reality.