Best Practices: Habits and Anti-patterns

This chapter collects the evaluation patterns, common traps, and anti-patterns that separate designs that ship from designs that crumble in review.

Start with Requirements

The first question to ask of any design is "what are we actually building?" Every design review that goes badly skipped this step.

  • Functional: what does the system do?
  • Non-functional: how fast, how reliable, how much?
  • Constraints: what do we have to work with? Cloud provider, budget, team, deadline.
  • Success criteria: what does "done" look like?

You can't make trade-offs without knowing what you're trading off. A design that reads "it uses Kubernetes and Kafka and a microservice architecture" without grounding in requirements is a résumé, not a design.

Think in Numbers

"It's going to be huge" is not a design input. "20,000 requests per second at peak, 50GB of data per year, p99 latency of 100ms" is.

Every conversation should have at least three numbers:

  • Request rate.
  • Data volume.
  • Latency requirement.

More numbers help: cost, failure rate, user concurrency, growth rate.

Numbers make decisions defensible. "We picked PostgreSQL because it can handle 10k writes/sec and we need 200" is a real argument. "We picked PostgreSQL because I like it" isn't.
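
Back-of-envelope arithmetic is all this takes. A sketch in Python, where every input is an invented assumption, not data from any real system:

```python
# Back-of-envelope sizing. Every input below is an illustrative assumption.
daily_active_users = 1_000_000
requests_per_user_per_day = 50
peak_to_average = 3                         # traffic bunches into peak hours

avg_rps = daily_active_users * requests_per_user_per_day / 86_400
peak_rps = avg_rps * peak_to_average

writes_per_day = daily_active_users * 2     # assume 2 stored writes/user/day
bytes_per_record = 1_000                    # ~1 KB per record
storage_gb_per_year = writes_per_day * bytes_per_record * 365 / 1e9

print(f"average RPS:  {avg_rps:.0f}")       # 579
print(f"peak RPS:     {peak_rps:.0f}")      # 1736
print(f"storage/year: {storage_gb_per_year:.0f} GB")  # 730 GB
```

Five minutes of this turns "it's going to be huge" into three defensible numbers.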

Make Trade-offs Explicit

Every design is a set of trade-offs. A good design explains them out loud. A bad design pretends there aren't any.

  • "We chose strong consistency here; this adds latency."
  • "We accept eventual consistency in analytics; stats lag by minutes."
  • "We use a single region; a region outage is a full outage."
  • "We shard by user ID; cross-user queries are expensive."

Making these explicit does three things:

  1. Reviewers can challenge them.
  2. You can revisit them when requirements change.
  3. It flags to future engineers what the design assumes.

Prefer Boring Technology

Every piece of novel tech has a learning curve, unknown failure modes, a small community, and possibly no future. Every piece of boring tech (PostgreSQL, Redis, NGINX, Kafka) has been run in production by many teams with many failure modes documented and many people who know it.

Dan McKinley's rule: a company has about three "innovation tokens". Spend them carefully. If PostgreSQL can do the job, don't bring in a new distributed key-value store because it's interesting.

Boring is a compliment. Tools that work reliably in the background are undervalued.

Monitor Before Scaling

The cure for "it's slow" is not "make it faster". It's "find out why".

Before adding a cache, adding replicas, sharding, or switching databases:

  1. Know the current p99 latency.
  2. Know which query or call is slow.
  3. Know how often it's slow.
  4. Know what the bottleneck resource is (CPU, disk, memory, network, lock).

Most performance problems have a cheap fix when you know their cause. Missing index, N+1 query, sync call where async would do. Don't skip straight to "add a Redis cluster".
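
Knowing the p99 starts with raw latency samples. A minimal sketch, with invented sample data, showing why a mean hides what a percentile reveals:

```python
import math

# Invented latency samples: a steady 11-16 ms path plus a slow 250 ms tail.
samples_ms = ([11] * 10 + [12] * 20 + [13] * 20 + [14] * 20 +
              [15] * 10 + [16] * 10 + [250] * 10)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

mean = sum(samples_ms) / len(samples_ms)
print(f"mean: {mean:.0f} ms")                    # 37 ms -- looks fine
print(f"p50:  {percentile(samples_ms, 50)} ms")  # 13 ms
print(f"p99:  {percentile(samples_ms, 99)} ms")  # 250 ms -- the real story
```

One user in ten here hits a 250 ms call; the mean never shows it. Find that call before reaching for a cache.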

Design for Failure

The most revealing design review questions begin with "what happens when X fails?"

A good mental checklist:

  • Database primary dies.
  • Cache layer dies.
  • A downstream service is slow.
  • A downstream service returns wrong data.
  • Network between zones partitions.
  • Full region outage.
  • Bad deploy.
  • Noisy tenant saturates shared resources.

For each, the design should have a story. "We fail over" or "we fail gracefully" or "we accept a few minutes of downtime" are all fine answers. "We didn't think about it" is not.
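
For the "downstream service is slow" row, one standard story is a circuit breaker: after repeated failures, fail fast for a cool-down window instead of piling requests onto a struggling dependency. A toy sketch, with arbitrary thresholds:

```python
import time

class CircuitBreaker:
    """Fail fast after N consecutive failures; retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # any success closes the circuit
        return result
```

A production breaker adds metrics and per-dependency tuning, but the story it gives the review is the same: slowness downstream doesn't become slowness everywhere.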

Run a tabletop exercise: walk through a real incident scenario. Every team should do this quarterly for critical systems.

Design Reviews: What to Look For

A design review is a conversation, not a presentation. The reviewers' job is to surface weaknesses; the author's job is to have answers.

Good questions to ask of any design:

  1. What are the three numbers that matter?
  2. What breaks when load is 10x the plan?
  3. What happens if component X fails?
  4. Which consistency model is this assuming?
  5. Where's the single point of failure?
  6. What's the migration story if we're wrong?
  7. What's the on-call burden going to be?
  8. Who owns this after the project wraps?

If the author can answer these well, the design is probably sound. If not, there's homework.

Incremental Over Big Bang

Every large design should ship in pieces.

  • MVP first: the narrowest slice that delivers value.
  • Iterate with real traffic feedback.
  • Strangler fig when replacing.
  • Feature flags for every risky path.

"We'll ship version 1 in six months" almost always turns into "we'll ship version 0.7 in 12 months". Make each milestone a shippable thing.

Write It Down

Every non-trivial design should have a written document: context, goals, proposed design, alternatives considered, trade-offs accepted, open questions.

Names for this: RFC, Design Doc, ADR (Architecture Decision Record), tech spec. They all work.

Why bother:

  • Forces clarity. Writing a thing exposes holes in the thinking.
  • Lets reviewers disagree. Comments on a doc are cheaper than rework.
  • Preserves context. Future engineers see why decisions were made.
  • Preserves rejected alternatives. "We rejected X because of Y" helps the next team avoid retreading Y.

Keep them short. A 4-page doc beats a 40-page doc. The point is clarity, not exhaustiveness.

The Principles, Condensed

A short list worth posting near your desk:

  • You probably don't need microservices yet.
  • Scale vertically until you can't.
  • Cache everything you can afford to be slightly stale.
  • Every network call needs a timeout.
  • Every write endpoint should be idempotent.
  • Log structured. Alert on user impact. Keep dashboards readable.
  • Design for failure. Test the failure paths.
  • Measure before optimizing.
  • Prefer boring tools.
  • Ship in pieces.

These don't make you right about any particular design. They make most of your designs less wrong.

Anti-Patterns

Patterns to catch in yourself and others.

"We'll Add a Cache Later"

Without a plan for invalidation. Caches are easy to add; correct caches are not. Design invalidation at the same time you design the cache.
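
What "designing invalidation with the cache" looks like in miniature: the update path drops the cache entry in the same function as the database write, so there is no window where the code forgets. Plain dicts stand in for Redis and the database:

```python
# Toy cache-aside with invalidation built into the write path.
db = {"user:1": {"name": "Ada"}}
cache = {}

def get_user(key):
    if key not in cache:
        cache[key] = db[key]      # miss: load from the source of truth
    return cache[key]

def update_user(key, value):
    db[key] = value
    cache.pop(key, None)          # invalidate alongside the write, always

get_user("user:1")                            # populates the cache
update_user("user:1", {"name": "Grace"})      # write + invalidate together
print(get_user("user:1"))                     # {'name': 'Grace'} -- no stale read
```

Delete the `cache.pop` line and the code still "works" in the demo, then serves stale users in production. That line is the plan.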

"Microservices from Day One"

You don't know the boundaries yet. You'll draw them wrong. Start monolithic, extract when you know what to extract.

"Shared Database Across Services"

You just built a distributed monolith. Every schema change is a cross-team coordination. Either share a module, or split the database too.

"No Timeouts"

The single most common reliability bug. Add them.
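
With most clients a timeout is a single parameter on the call. When a library offers no such parameter, you can bound the wait yourself; a standard-library sketch, with a sleep standing in for a stalled downstream call:

```python
import concurrent.futures
import time

def slow_dependency():
    time.sleep(0.3)               # simulates a stalled downstream service
    return "fresh data"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_dependency)
try:
    result = future.result(timeout=0.1)   # wait at most 100 ms
except concurrent.futures.TimeoutError:
    result = "cached fallback"    # fail fast, serve something degraded
pool.shutdown(wait=False)
print(result)                     # cached fallback
```

The point is that the bound exists, not where it lives: every call that crosses a network needs one, and "wait forever" is never the right default.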

"Retry Storms"

Retries without backoff amplify failures. Exponential backoff plus jitter, always.
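
What backoff-plus-jitter looks like in the "full jitter" variant: the delay is drawn uniformly from zero up to a capped exponential, so synchronized clients spread out instead of retrying in lockstep. Base and cap values here are arbitrary:

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError:                      # retry only transient errors
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface it
            time.sleep(backoff_delay(attempt))
```

Note the exception filter: retrying a 400-class error or a bug is just repeating the failure. Retry what might succeed next time; give up on what won't.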

"Exactly-Once Delivery"

Sold as a feature, rarely true. At-least-once + idempotency is the working implementation.
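
The working combination in miniature: redeliveries carry the same message ID, and the consumer skips IDs it has already applied, so the side effect lands exactly once even though delivery is at-least-once. An in-memory set stands in for what must be a durable store in production:

```python
# Toy at-least-once consumer with idempotent message handling.
processed = set()
balance = 0

def handle(message):
    global balance
    if message["id"] in processed:
        return                       # duplicate delivery: skip the side effect
    balance += message["amount"]     # apply the effect once
    processed.add(message["id"])

# The broker redelivers m1 -- a normal event under at-least-once delivery.
for msg in [{"id": "m1", "amount": 100},
            {"id": "m1", "amount": 100},
            {"id": "m2", "amount": 50}]:
    handle(msg)
print(balance)                       # 150, not 250
```

The dedup store and the effect should commit atomically (same transaction), or a crash between them reintroduces the duplicate.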

"Strong Consistency Everywhere"

Expensive and often unnecessary. Most UX problems need session guarantees (read-your-writes), not linearizability.
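
Read-your-writes in miniature: a session tracks the version of its last write and reads from the replica only once the replica has caught up to it. The classes are toy stand-ins for a primary/replica pair:

```python
class Node:
    """Toy datastore node with a monotonically increasing version."""
    def __init__(self):
        self.data, self.version = {}, 0

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

class Session:
    """Routes reads to the replica unless it lags the session's last write."""
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_write = 0

    def write(self, key, value):
        self.last_write = self.primary.write(key, value)

    def read(self, key):
        if self.replica.version >= self.last_write:
            return self.replica.data.get(key)   # replica caught up
        return self.primary.data.get(key)       # else read your own write

primary, replica = Node(), Node()
s = Session(primary, replica)
s.write("bio", "hello")
print(s.read("bio"))    # 'hello' -- served from the primary while replica lags
```

Every other session can happily read stale data from the replica; only the writer needs the guarantee, which is why this is so much cheaper than linearizing everything.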

"Ignoring the Database Until It's a Problem"

The database is almost always the bottleneck. Index work and query review pay for themselves forever.

"Copy-Paste Architecture"

The last system you built had these components, so the new one does too. But the new system's needs may differ. Design from first principles.

"Big Bang Rewrite"

Two years, then a cutover weekend nobody sleeps through. Strangler fig. Every time.

"No Owner"

A service with no team responsible for it rots until it breaks in production. Every system needs a team on the hook for it.

Where to Go From Here

You have a vocabulary, a set of patterns, a worked example, and a checklist of bad habits to avoid. The next level is depth:

  • Designing Data-Intensive Applications by Kleppmann: the one canonical book. Read it twice.
  • The Google SRE Book and SRE Workbook: how reliability is actually operated at scale.
  • Release It! by Michael Nygard: reliability patterns with war stories.
  • AWS Builders' Library: short, specific, well-written field notes.
  • Papers We Love: the underlying papers, readable and thought-provoking.

Beyond reading: build something real, operate it under load, watch it fail, fix it, and write about what you learned. That's where design stops being a whiteboard exercise and becomes a craft.