Designing a System: A Worked Example
This chapter walks through designing a URL shortener end to end, with back-of-envelope math, trade-offs made explicit, and the questions a reviewer would ask.
The Problem
"Build a URL shortener." (Think bit.ly, tinyurl, goo.gl.)
That's the whole prompt. The first step of any design is making it concrete.
Step 1: Clarify Requirements
Ask. Always. The answers shape the rest.
Functional
- Shorten a long URL to a short one.
- Redirect from short URL to long URL.
- Let users see analytics (click counts, referrers).
- Custom aliases (optional).
- Expiration (optional).
Non-Functional
- Peak request rate?
- Read/write ratio?
- Latency for the redirect path?
- Availability target?
- Durability: can a shortened URL ever be lost?
A reasonable set of answers for an interview or a real product:
- 100M new URLs per month.
- Read:write ratio of 100:1.
- Redirect p99 under 50ms.
- 99.99% availability.
- URLs must not be lost.
These are the numbers we design against.
Step 2: Back-of-Envelope
The numbers drive the design.
Scale
100M new URLs per month
= 100M / (30 × 86400 s) ≈ 40 writes/sec average
= 40 × (5x peak factor) ≈ 200 writes/sec peak
Reads: 100:1 ratio → 4000 reads/sec average, 20,000 reads/sec peak
200 writes/sec and 20,000 reads/sec peak. Both are modest; one database can handle the writes, caching can handle the reads.
Storage
Per URL record:
short_code 7 bytes (base62)
long_url 200 bytes average
user_id 8 bytes
created_at 8 bytes
expires_at 8 bytes
click_count 8 bytes
metadata ~50 bytes
~300 bytes total
100M/month × 300 bytes = 30 GB/month
× 120 months (10 years) ≈ 3.6 TB
3.6 TB fits on one beefy PostgreSQL instance. Sharding isn't needed for a long time.
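The arithmetic in this step is worth sanity-checking; a throwaway sketch (the 5x peak factor, 100:1 ratio, and 300-byte record size are the assumptions stated above):

```python
# Back-of-envelope check for the traffic and storage numbers above.
URLS_PER_MONTH = 100_000_000
SECONDS_PER_MONTH = 30 * 86_400
PEAK_FACTOR = 5
READ_WRITE_RATIO = 100
BYTES_PER_RECORD = 300

writes_avg = URLS_PER_MONTH / SECONDS_PER_MONTH   # ≈ 39/sec, rounds to 40
writes_peak = writes_avg * PEAK_FACTOR            # ≈ 190/sec, rounds to 200
reads_peak = writes_peak * READ_WRITE_RATIO       # ≈ 19,000/sec, rounds to 20k

gb_per_month = URLS_PER_MONTH * BYTES_PER_RECORD / 1e9  # 30 GB/month
tb_10_years = gb_per_month * 120 / 1000                 # 3.6 TB over 10 years
```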
Analytics Storage
Click events track reads. At the 4,000/sec average:
4,000 × 86,400 ≈ 350M events/day
× 100 bytes ≈ 35 GB/day
× 365 ≈ 13 TB/year
(Sizing from the 20k/sec peak would give ~170 GB/day, but nobody sustains peak around the clock; the average is the right planning number.)
Analytics is a different beast. Write to PostgreSQL for billing and SLA reporting; stream to ClickHouse or a data warehouse for analytics. We'll keep analytics simple below and mention the split.
Step 3: API Sketch
Define the surface before the guts.
POST /shorten
Body: { "long_url": "https://example.com/very/long/path", "alias": "optional", "expires_at": "optional" }
Returns: { "short_url": "https://sho.rt/aB3xR7q", "code": "aB3xR7q", ... }
GET /{code}
Returns: 302 redirect to long_url. Async: record a click event.
GET /stats/{code}
Returns: { "click_count": 1234, "last_clicked": "...", "referrers": {...} }
Three endpoints. The second one does 99% of the traffic.
Step 4: High-Level Design
Start with the simplest thing that could work.
Client
↓ HTTPS
Load balancer
↓
App servers (stateless)
↓
Cache (Redis) ← first for reads
↓ (miss)
Database (PostgreSQL)
↓ (async click event)
Kafka / queue
↓
Analytics store
For the hot path (redirect), each request is:
- Cache lookup by short code.
- On hit, redirect.
- On miss, DB lookup, populate cache, redirect.
- Fire-and-forget: emit click event.
The reasoning behind each decision follows.
Step 5: Short Code Generation
Three options:
Hash
MD5(long_url), take first N chars, base62. Deterministic; same URL → same code.
Problem: collisions. Two different URLs could share the first N chars of a hash.
Random
Generate a random 7-character base62 string. 62^7 ≈ 3.5 trillion codes.
Problem: still collisions at scale. Check before insert.
Counter + Encode
Global counter; base62-encode the next integer.
Problem: predictable codes (people can enumerate). Usually undesirable but not always a blocker.
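For the counter option, base62 encoding is only a few lines (a sketch; the alphabet ordering here is a convention, not a standard):

```python
import string

# 62 symbols: digits, then lowercase, then uppercase (ordering is arbitrary).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))
```

The predictability problem is visible immediately: consecutive counter values produce codes that differ only in the trailing character, so anyone can enumerate the keyspace in insertion order.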
Picked: Random Base62, Check on Insert
62^7 gives us lots of headroom. At 100M new URLs per month, 10 years is ~12B codes: 12B / 3.5T ≈ 0.3% of the keyspace. Collisions are rare.
On insert:
INSERT INTO urls (code, long_url, ...) VALUES (?, ?, ...)
ON CONFLICT (code) DO NOTHING
RETURNING code;
If nothing returns, retry with a new random code.
Custom aliases live in the same table; the primary key on code rejects conflicts.
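The picked approach in miniature (a sketch: secrets gives unpredictable codes, and try_insert stands in for the ON CONFLICT query above, returning True on success):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters

def random_code(length: int = 7) -> str:
    """Generate an unpredictable base62 code."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def shorten(long_url: str, try_insert, max_attempts: int = 5) -> str:
    """try_insert(code, long_url) -> bool is assumed to run the
    INSERT ... ON CONFLICT DO NOTHING above and report whether it won."""
    for _ in range(max_attempts):
        code = random_code()
        if try_insert(code, long_url):
            return code
    raise RuntimeError("no free code found")  # astronomically unlikely at 0.3% keyspace use
```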
Step 6: Data Model
CREATE TABLE urls (
code CHAR(7) PRIMARY KEY,
long_url TEXT NOT NULL,
user_id BIGINT REFERENCES users(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ,
click_count BIGINT NOT NULL DEFAULT 0
);
CREATE INDEX ON urls (user_id, created_at); -- for "my URLs" listing
CREATE TABLE click_events (
id BIGSERIAL PRIMARY KEY,
code CHAR(7) NOT NULL,
ts TIMESTAMPTZ NOT NULL DEFAULT NOW(),
referrer TEXT,
ip_country CHAR(2),
user_agent TEXT
);
CREATE INDEX ON click_events (code, ts DESC);
urls is the hot table. code as PRIMARY KEY gives O(log N) lookups. click_events is append-only and huge; consider partitioning by day (Chapter 5).
Step 7: Caching Strategy
Read hit rate will be very skewed. A small fraction of URLs get most of the clicks. Perfect fit for a cache.
- Cache layer: Redis.
- Pattern: cache-aside (Chapter 4).
- Key: url:{code}.
- Value: the long URL string (that's all redirects need).
- TTL: 1 hour. Cache "not found" results too, as a tombstone with a short TTL (10 seconds), so repeated lookups for nonexistent codes don't all reach the database.
Expected hit rate: 95%+. Database load for the hot path becomes minimal.
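The cache-aside hot path in miniature (a sketch: plain dicts stand in for Redis and PostgreSQL, and real code would attach the TTLs described above):

```python
NOT_FOUND = object()  # tombstone for negative caching

def resolve(code, cache, db, events):
    """Cache-aside lookup for a redirect. Returns the long URL, or None for a 404."""
    key = f"url:{code}"
    value = cache.get(key)
    if value is None:                  # cache miss
        value = db.get(code, NOT_FOUND)
        cache[key] = value             # populate; Redis would set TTL=1h (10s for tombstones)
    if value is NOT_FOUND:
        return None
    events.append({"code": code})      # fire-and-forget click event
    return value
```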
Cache Warming
After a deploy, the cache is empty. For a few minutes, all requests hit the DB. Options:
- Warm from top 10,000 hot codes before switching traffic over.
- Accept the cold start. 200 writes/sec vs 20k reads/sec: the DB handles it for a few minutes.
For a real product: warm. For the interview answer: acknowledge it and pick one.
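Warming is just a bulk prefill (a sketch; hot_rows would come from something like the top 10,000 codes ordered by click_count):

```python
def warm_cache(cache, hot_rows):
    """Prefill the cache from (code, long_url) pairs for the hottest codes,
    before the node starts taking traffic."""
    for code, long_url in hot_rows:
        cache[f"url:{code}"] = long_url
```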
Step 8: Click Analytics
Incrementing click_count on every redirect adds a write per read. Bad.
Three options:
Fire-and-Forget to a Queue
On each redirect, emit an event to Kafka. Consumers batch-update the database.
Redirect → enqueue {code, timestamp, ip, referrer}
→ Kafka
↓ batch every 1s or 10k events
→ consumer
→ UPDATE urls SET click_count = click_count + N WHERE code = ?
→ INSERT INTO click_events ...
Pros: the hot path is one cache hit + one HTTP response. Analytics is async.
Cons: eventual consistency. Stats are a few seconds behind. Acceptable.
Probabilistic Counting
Increment a Redis counter on every hit. Flush to DB every minute. Fast, slightly lossy.
Count in the Data Warehouse
Don't store counts in PostgreSQL at all. Click events go to ClickHouse/BigQuery; counts come from queries there.
Picked: Fire-and-Forget to Kafka
It matches the traffic pattern, doesn't block reads, and gives a clean boundary for the analytics system.
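The consumer side of the picked option reduces to aggregation (a sketch; each resulting (code, N) pair becomes one click_count UPDATE, alongside the raw click_events inserts):

```python
from collections import Counter

def flush_batch(events):
    """Collapse a batch of click events into one increment per code.
    A second's worth of traffic becomes a handful of UPDATEs instead of thousands."""
    increments = Counter(e["code"] for e in events)
    return increments  # {code: N} -> UPDATE urls SET click_count = click_count + N
```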
Step 9: Scaling Up
The design as drawn handles the target traffic. What breaks when the traffic grows 10x or 100x?
10x (200k reads/sec)
- Cache is still fine (Redis handles 200k ops/sec easily on one node).
- Database is fine for writes (2k/sec); reads are still cache-served.
- App servers scale horizontally behind the load balancer.
- Kafka scales by adding partitions.
No structural change needed.
100x (2M reads/sec)
- Cache needs a cluster or client-side local cache.
- App servers: fine with more replicas.
- Database writes still manageable (20k/sec); use a connection pooler.
- Network bandwidth at the edge becomes real. Consider a CDN: URLs and their destinations are cacheable (especially for short expiration windows).
At this scale, you also want geographic distribution. A user in Sydney shouldn't make a round trip to us-east-1 for a redirect. Options:
- Regional deployments. Full stack per region; database replicated.
- Edge computing. Cloudflare Workers or similar, reading from a KV store replicated globally.
Edge is elegant for this workload. The redirect is stateless (input: code; output: long URL). It's a great fit for edge serverless.
Step 10: Durability and Availability
Availability target: 99.99%. 52 minutes of downtime per year.
- Database: primary + one sync replica + one async replica. Automated failover via RDS Multi-AZ or Patroni.
- Cache: Redis with replication (failover automatic).
- App tier: stateless, run at least 3 replicas across zones.
- Kafka: three brokers, replication factor 3.
Durability: URLs must not be lost. Sync replica + daily snapshots + point-in-time recovery via WAL archiving mean losing a committed URL requires multiple simultaneous failures; that's achievable on a managed PostgreSQL.
Step 11: Security and Abuse
A URL shortener is a spam magnet. Things to think about:
- URL scanning. Integrate with Google Safe Browsing, PhishTank. Refuse known-bad URLs; blocklist on detection.
- Rate limiting. Per IP, per user. Bots try to create millions of URLs.
- Abuse reporting. Let users report; triage queue; disable with a takedown page.
- Auth. Anonymous shortening allowed? For a public service, yes, with tight rate limits.
These aren't afterthoughts. Half the work of a URL shortener is keeping it from becoming a phishing distribution channel.
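Per-key rate limiting is commonly a token bucket (a minimal single-process sketch; in production the buckets would live in Redis so all app servers share state):

```python
import time

class TokenBucket:
    """Allows `rate` requests/sec with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per IP and one per user ID; anonymous traffic gets the tightest limits.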
What a Reviewer Would Ask
Prepare answers:
- "What happens if the cache dies?" DB load spikes. App servers handle it with increased latency; no outage. After 1-2 minutes, the cache warms back up.
- "What if two clients insert the same code simultaneously?" UNIQUE constraint, ON CONFLICT; retry with a new random code.
- "How do you handle expired URLs?" Check expires_at on redirect; return 410 Gone. Run a nightly job to clean up expired rows.
- "What if analytics falls behind?" It catches up; nothing in the hot path depends on it.
- "How do you prevent enumeration?" Random codes, not sequential. Rate limits on crawling-like patterns.
- "How do you handle a 10x sudden spike?" Auto-scale app servers; cache absorbs reads; database is protected by the cache.
What's Different in a Real Design
A real design has constraints this walkthrough glossed over:
- Existing infrastructure. Your company already uses AWS, GCP, or bare metal; pick tools that fit.
- Team capacity. A one-engineer side project vs a 50-engineer team changes every decision.
- Budget. Not everyone can afford multi-region active-active.
- Integration. If this slots into an existing product, the auth and database are probably decided.
An interview design can be abstract. A real design is always shaped by context.
Next Steps
Continue to 12-best-practices.md for the habits that make designs survive contact with reality.