Observability: Logs, Metrics, Traces

This chapter covers the three pillars of observability, SLOs, dashboards, and alerting rules that don't train you to ignore them.

What Observability Means

Monitoring is "is the system up?". Observability is "why is it broken?".

The distinction: monitoring tells you something is wrong. Observability lets you figure out what. In production, you get paged; you need to understand a system you may not have seen before and fix a problem you didn't predict.

Three pillars support this: logs, metrics, traces. Each answers different questions.

Logs

A log is a timestamped record of an event. "User 42 placed order 987 at 10:15:32." "Database connection timed out."

Unstructured vs Structured

Unstructured:

2026-04-19 10:15:32 INFO user 42 placed order 987 for $99.50

Good for humans reading one event. Useless for machines aggregating across events.

Structured:

{"time":"2026-04-19T10:15:32Z","level":"info","event":"order_placed","user_id":42,"order_id":987,"amount":99.50}

Good for humans, great for machines. You can query "all orders over $100 from user 42 today" instantly.

Always log in structured JSON in production. The tooling to grep "level=error order_id=987 across 50 services" only works if the logs are consistent.
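A structured logger can be sketched in a few lines with Python's standard library. The `JsonFormatter` class and the `fields` convention here are illustrative, not any particular library's API; real services usually use a library like structlog or their framework's JSON formatter.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Extra fields attached to the record become top-level keys,
        # so they are queryable ("all orders over $100 from user 42").
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_placed",
            extra={"fields": {"user_id": 42, "order_id": 987, "amount": 99.50}})
```

The point is the shape of the output, not the plumbing: one JSON object per event, consistent key names across every service.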

Log Levels

Four useful levels:

DEBUG    Diagnostics. Off in production unless you turned it on.
INFO     Normal events worth recording.
WARN     Unexpected but recoverable.
ERROR    Something failed. Action probably needed.

Classical logging frameworks define five (DEBUG, INFO, WARN, ERROR, FATAL), but FATAL is rarely useful in services: the process is about to die anyway.

Err on the side of WARN and ERROR telling you something actionable. If every log line is WARN, nothing is.

Log Volume

Logs are the most expensive pillar. A busy service can generate terabytes per day.

Mitigations:

  • Sample non-essential events (keep 1 in 100 successful requests).
  • Don't log healthy traffic. A 200 response with zero anomalies doesn't need a log line.
  • Centralize and retain selectively. Hot retention of a day or two, cold storage for a month.
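The sampling rule above can be made deterministic by hashing the request ID, so that either all log lines for a request are kept or none are. This is a sketch; the function name and rate are illustrative.

```python
import zlib


def should_log(level: str, request_id: str, sample_rate: float = 0.01) -> bool:
    """Keep every WARN/ERROR line; sample routine success noise."""
    if level in ("warn", "error"):
        return True  # warnings and errors are the signal, never drop them
    # Hash the request ID into 10,000 buckets so the decision is
    # deterministic: a sampled request keeps all of its log lines.
    bucket = zlib.crc32(request_id.encode()) % 10_000
    return bucket < sample_rate * 10_000
```

Deterministic sampling matters once a request touches several services: hashing a shared request ID means they all make the same keep/drop decision.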

Common log backends: Loki, OpenSearch, Datadog, Splunk, CloudWatch Logs.

Metrics

A metric is a numerical measurement over time. "Requests per second", "error rate", "p99 latency", "active connections".

Metric Types

Four basic types:

Counter      Monotonically increasing number. Requests served, errors seen.
Gauge        Value that can go up or down. Memory used, queue depth.
Histogram    Distribution of values. Request latency across percentiles.
Summary      Like histogram, but quantiles are computed client-side and can't be aggregated across instances. Less common.

Histograms are how you measure latency correctly. Don't compute p99 over 10-second buckets; use a histogram that preserves the distribution.
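What a histogram metric type does under the hood can be shown in pure Python: observations land in fixed buckets, and quantiles are estimated from the bucket counts. This is a simplified sketch of the idea (roughly what Prometheus's `histogram_quantile` does), not a production implementation.

```python
import bisect


class Histogram:
    """Fixed-bucket latency histogram (bucket upper bounds in ms)."""

    def __init__(self, bounds=(5, 10, 25, 50, 100, 250, 500, 1000, 2500)):
        self.bounds = list(bounds)
        self.counts = [0] * (len(bounds) + 1)  # last bucket is +Inf
        self.total = 0

    def observe(self, value_ms):
        # Find the first bucket whose upper bound covers the value.
        self.counts[bisect.bisect_left(self.bounds, value_ms)] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate a quantile: the upper bound of the bucket that
        contains the q-th observation. Coarse, but aggregatable."""
        target = q * self.total
        seen = 0
        for bound, count in zip(self.bounds, self.counts):
            seen += count
            if seen >= target:
                return bound
        return float("inf")
```

Because only bucket counts are stored, histograms from many instances can be summed server-side and then queried for p99, which is exactly what averaging pre-computed percentiles cannot do.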

RED Method

For request-driven services, three metrics:

  • Rate: requests per second.
  • Errors: error rate (percent or per-second).
  • Duration: request latency (percentiles).

Three metrics per endpoint. If RED is healthy, the service is probably healthy.
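Computed over a window of observed requests, RED looks like this. A hypothetical sketch: real systems derive these from counters and histograms, not an in-memory list, and the p99 here uses a simplified nearest-rank index.

```python
def red_summary(requests, window_seconds):
    """RED for one endpoint. `requests` is a list of
    (status_code, latency_ms) pairs seen during the window."""
    n = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    latencies = sorted(ms for _, ms in requests)
    p99 = latencies[min(n - 1, int(0.99 * n))] if latencies else 0.0
    return {
        "rate": n / window_seconds,            # R: requests per second
        "error_rate": errors / n if n else 0,  # E: fraction of 5xx
        "p99_ms": p99,                         # D: 99th-percentile latency
    }
```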

USE Method

For resources (CPU, memory, disks, network):

  • Utilization: how busy the resource is (CPU%, memory%).
  • Saturation: queue/backlog size (load average, run queue).
  • Errors: error events (failed disk ops, network drops).

Use USE on infrastructure; use RED on services.

Cardinality Matters

A metric with label user_id creates one series per user. A million users, a million series. Your metrics backend will explode.

Rules:

  • Labels should be bounded. Endpoint path, status code, region: yes. User ID, request ID: no.
  • Dimensions that grow unboundedly belong in logs or traces. Not metrics.
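The arithmetic behind the explosion is just a product: every combination of label values is its own time series. A small illustration with made-up cardinalities:

```python
from math import prod


def series_count(label_cardinalities):
    """Time series one metric creates: the product of the number of
    distinct values each label can take."""
    return prod(label_cardinalities.values())


# Bounded labels: 30 endpoints x 8 status codes x 4 regions = 960 series.
bounded = series_count({"endpoint": 30, "status": 8, "region": 4})

# Swap one bounded label for user_id and the same metric needs
# 240 million series.
unbounded = series_count({"endpoint": 30, "status": 8, "user_id": 1_000_000})
```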

Common backends: Prometheus, Mimir, Datadog, CloudWatch.

Traces

A trace follows a single request across services. Each span in the trace is a piece of work (an HTTP call, a database query); spans are linked in a parent/child tree.

order_service: POST /orders                                   (1200 ms)
├── auth_service: validate_token                              (20 ms)
├── inventory_service: check_stock                            (85 ms)
│   └── postgresql: SELECT * FROM inventory ...               (60 ms)
├── payment_service: charge                                   (1050 ms)
│   └── stripe_api: POST /charges                             (1020 ms)   ← the slow bit
└── publish: OrderPlaced → kafka                              (15 ms)

The Stripe call is the tail-latency culprit. You wouldn't know that from logs alone; you'd know it from the trace.
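The data model behind a trace tree like the one above is small: every span carries a shared trace ID, its own span ID, and a pointer to its parent. This toy version (field names are illustrative, not OpenTelemetry's) shows how "find the slow bit" is a query over spans:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work. Spans in a trace share trace_id and link
    to their parent, forming the tree."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000


def slowest_leaf(spans):
    """The slow bit: the longest-running span with no children."""
    parents = {s.parent_id for s in spans}
    leaves = [s for s in spans if s.span_id not in parents]
    return max(leaves, key=lambda s: s.duration_ms)
```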

OpenTelemetry

The industry standard for instrumenting code. Libraries for every major language, exporters for every major backend. Use the OpenTelemetry SDK; swap backends (Jaeger, Tempo, Datadog, Honeycomb) without changing code.

Sampling

Tracing every request is expensive. Typical sampling rates:

  • 100% for errors and slow requests (head-based sampling can't catch these reliably, because it decides before the outcome is known; this is what tail sampling is for).
  • 1% to 10% for normal traffic.

Good tracing backends support "tail sampling": keep the trace if any span was an error or slow. You get the problems without the volume.
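The tail-sampling decision is simple once the whole trace has been collected. A sketch, with span shape and thresholds assumed for illustration:

```python
import random


def keep_trace(spans, slow_ms=1000.0, base_rate=0.01):
    """Decide after the trace completes: always keep traces with an
    error or a slow span, plus a small random slice of the rest."""
    if any(s["error"] or s["duration_ms"] > slow_ms for s in spans):
        return True
    return random.random() < base_rate
```

You get every problem trace plus a representative sample of healthy ones, at a few percent of the full volume.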

SLIs, SLOs, and Alerts That Don't Cry Wolf

SLI (Service Level Indicator)

A measurement you care about. "The fraction of requests served successfully under 200ms over the last 5 minutes."

SLIs are quantitative and user-facing. "CPU utilization" is a metric, not an SLI.
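The SLI quoted above is a ratio of good events to total events. As a sketch (the 200ms threshold and the "status below 500 counts as success" rule are assumptions for illustration):

```python
def availability_sli(requests, threshold_ms=200.0):
    """Fraction of requests that both succeeded and finished under
    the latency threshold. `requests` is (status_code, latency_ms)."""
    if not requests:
        return 1.0  # no traffic means no failed requests
    good = sum(1 for status, ms in requests
               if status < 500 and ms < threshold_ms)
    return good / len(requests)
```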

SLO (Service Level Objective)

The target value. "99.9% of requests served successfully under 200ms."

An SLO is a product decision. It encodes "how good does this need to be for users to not complain?"

SLA (Service Level Agreement)

A contractual commitment to customers, usually with penalties. "We refund credits if SLO drops below X."

Internal: you care about SLOs. External: SLAs. Most teams never sign SLAs; most teams should have SLOs.

Error Budget

If your SLO is 99.9%, you have a "budget" of 0.1% failure. Over a 30-day month, that's about 43 minutes of downtime.

The budget is your permission to ship. As long as you're within budget, feature work is fine. When you burn through the budget, reliability work takes priority.

This aligns engineering with users. You don't over-engineer reliability past what users need, and you don't starve reliability work when users are suffering.
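The budget arithmetic is worth making concrete (a 30-day window is assumed here):

```python
def error_budget_minutes(slo, days=30):
    """Downtime the SLO permits per window.
    0.1% of 30 days is 43.2 minutes."""
    return (1 - slo) * days * 24 * 60


def budget_remaining(slo, bad_minutes, days=30):
    """Fraction of the budget left after `bad_minutes` of failure;
    0.0 means the budget is exhausted and reliability work wins."""
    budget = error_budget_minutes(slo, days)
    return max(0.0, 1 - bad_minutes / budget)
```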

Alerting That Works

Bad alerting trains engineers to ignore pages. By the time something real happens, everyone has tuned out.

Good alerting has three properties:

Actionable

Every alert describes a problem, what to do about it, and why it was significant enough to wake someone up. Link to a runbook. Include a dashboard.

User-Impact-Based

Alert on SLO burn, not on CPU or disk being "high". A 95% CPU that's not affecting users is not an alert.

Multi-window, multi-burn-rate alerts catch issues reliably without flapping. The SRE Workbook has a detailed recipe.
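The core of that recipe fits in a few lines. Burn rate is how fast you're spending the error budget relative to the SLO; requiring both a long and a short window to exceed the threshold means the alert fires only on sustained, still-ongoing impact. The 14.4 threshold and window pairing below follow the SRE Workbook's example for paging, but treat the exact numbers as tunable assumptions.

```python
def burn_rate(error_rate, slo):
    """Budget spend speed: 1.0 means exactly on budget; 14.4 burns
    a 30-day budget in about 2 days."""
    return error_rate / (1 - slo)


def should_page(err_1h, err_5m, slo=0.999, threshold=14.4):
    """Multi-window check: the 1h window proves the impact is
    sustained, the 5m window proves it is still happening, so the
    alert doesn't flap and resolves quickly after recovery."""
    return (burn_rate(err_1h, slo) > threshold
            and burn_rate(err_5m, slo) > threshold)
```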

Bounded in Volume

No team can respond to 100 alerts a day. If you're getting that many, the thresholds are wrong or the system is broken. Cut noise aggressively.

A sane limit: every page should be actionable by the on-call; no more than 1 or 2 a week in steady state.

Two Tiers

Page      Wake someone up. User-facing, urgent, unrecoverable without intervention.
Ticket    Non-urgent problems that should be fixed in business hours.

Mix them up and pages lose meaning. A full disk is often a ticket: it can be handled tomorrow morning. Except when it's the primary database. Make the priority decision deliberate.

Dashboards That Tell Stories

A dashboard is a narrative, not a data dump.

Good dashboards:

  • Start with the "is it working?" overview at the top (SLOs, error rate, latency).
  • Drill into subsystems below.
  • Name charts in plain language ("Checkout error rate", not "http_5xx_rate").
  • Use consistent time ranges across charts.
  • Highlight anomalies visually (thresholds, colored bands).

A bad dashboard has 30 charts, all the same size, no story. A good dashboard makes the question "is anything wrong right now?" answerable in 5 seconds.

Common Pitfalls

Logging PII. You just built a GDPR violation. Structured logs make search easy, which means accidental exposure is easy.

Alerting on symptoms, not user impact. "CPU is high" is not an alert. "p99 latency is above SLO" is.

Metrics with user-ID labels. Cardinality explodes. Use logs or traces for per-user context.

"We'll add observability later." Later is never. Instrument as you build; retrofitting is much more expensive.

Ignoring log retention. A terabyte a day adds up. Decide retention up front.

Dashboards nobody reads. Delete ruthlessly. A repo full of abandoned dashboards is worse than a few good ones.

Next Steps

Continue to 10-architecture-patterns.md to arrange the pieces into a whole.