Best Practices: Databases in Production

This chapter collects the habits, patterns, and tooling that keep databases healthy once real users are hitting them.

Migrations

Production schemas change. Every change needs to be reviewable, reversible, and reproducible across environments. That's what a migration tool does.

Pick one for your stack:

  • Flyway and Liquibase are language-agnostic and battle-tested.
  • Alembic for Python / SQLAlchemy.
  • Rails Active Record migrations for Ruby.
  • golang-migrate, or the migration tools in sqlc / atlas for Go.
  • Prisma Migrate or Knex migrations for Node.

What a migration tool gives you:

  1. Numbered migration files checked into source control.
  2. A schema_migrations table tracking which migrations have run.
  3. A command (flyway migrate, alembic upgrade head) that applies pending migrations.

A minimal shape (Flyway-style):

V001__create_users.sql
V002__create_products.sql
V003__create_orders.sql
V004__add_users_phone.sql

Each file is immutable once shipped. You never edit V002 after it's been applied in production; you write V005__fix_products.sql instead.
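The bookkeeping a migration tool does is simple enough to sketch. A toy runner in Python, using sqlite3 as a stand-in for the production database; the version names mirror the Flyway-style files above, and a real tool would read the SQL from files on disk rather than an in-code list:

```python
import sqlite3

# In a real tool these are files in a migrations/ directory, sorted by version.
MIGRATIONS = [
    ("V001__create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"),
    ("V002__add_users_phone", "ALTER TABLE users ADD COLUMN phone TEXT"),
]

def migrate(conn):
    # The tracking table: which versions have already run.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for version, sql in MIGRATIONS:
        if version not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # running again is a no-op: already-applied versions are skipped
```

The whole trick is the tracking table: the command is idempotent, so every environment converges on the same schema no matter how many migrations it is behind.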

Online Migrations

On a busy table, some operations lock. The cardinal rules:

  • ALTER TABLE ... ADD COLUMN col TYPE with no default: instant.
  • ALTER TABLE ... ADD COLUMN col TYPE DEFAULT constant: instant in PostgreSQL 11+.
  • ALTER TABLE ... ADD COLUMN col TYPE NOT NULL DEFAULT constant: instant in PostgreSQL 11+.
  • ALTER TABLE ... ADD CONSTRAINT ... CHECK (...): scans the table. Use NOT VALID, then VALIDATE CONSTRAINT separately.
  • CREATE INDEX: locks writes. Use CREATE INDEX CONCURRENTLY.
  • ALTER TABLE ... DROP COLUMN: fast (column is marked dropped, physically removed later).

A safe index creation on a live table:

CREATE INDEX CONCURRENTLY idx_orders_user_created ON orders (user_id, created_at DESC);

CONCURRENTLY doesn't block writes. It's slower, and it can't run inside a transaction block, but it's safe. Use it by default unless you're in a maintenance window.
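The CHECK-constraint rule above follows the same two-step shape. A sketch, with the table and constraint names illustrative:

-- Step 1: add the constraint without scanning existing rows. Brief lock only;
-- new writes are checked from here on.
ALTER TABLE orders ADD CONSTRAINT orders_amount_positive
    CHECK (amount_cents > 0) NOT VALID;

-- Step 2: scan and validate existing rows separately. Takes a weaker lock,
-- so normal writes continue while it runs.
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;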

Backups

Two forms of backup, and you want both.

Logical backups with pg_dump: a SQL-level dump, portable across PostgreSQL versions.

pg_dump --format=custom learning > learning.dump
pg_restore -d new_db learning.dump

Good for: schema snapshots, copying to staging, disaster recovery when the version changes.

Physical backups with pg_basebackup or a tool like pgBackRest / barman: file-level copy plus WAL archiving, enabling point-in-time recovery.

pg_basebackup -D /backup/base -Ft -z -P

Good for: "restore the DB to 14:02 yesterday, just before the bad deploy."
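Point-in-time recovery itself is driven by a handful of settings. A sketch for PostgreSQL 12+, with the paths and target timestamp as placeholders:

# postgresql.conf, on a server started from the restored base backup
restore_command = 'cp /backup/wal/%f "%p"'
recovery_target_time = '2024-05-01 14:02:00'
recovery_target_action = 'promote'

# Then create an empty recovery.signal file in the data directory and start
# the server. It replays archived WAL up to the target time, then promotes.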

Whatever you choose, the iron rule: a backup is not a backup until you've restored from it. Test restores on a schedule. The thing you don't want to discover at 2am is that your backup has been silently corrupt for three months.

Connection Pooling

Each PostgreSQL connection is a forked process that consumes a few megabytes and a backend slot. Opening connections is slow; idle connections are expensive at scale.

Application-side pools (HikariCP, pgx, SQLAlchemy) handle this for a single app. When you have many apps, or serverless functions, put a pooler in front:

  • PgBouncer is the classic: tiny, fast, transaction-level pooling.
  • Pgpool-II does more (load balancing, connection pooling) but is heavier.

Transaction-level pooling (PgBouncer's default sweet spot) works for most web apps. Session-level features (prepared statements, SET that must persist, advisory locks across calls) need session pooling or careful handling.

Start with max_connections = 100 in Postgres and PgBouncer fanning out to thousands of clients. Don't raise max_connections to solve "too many connections"; put a pooler in the middle.
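A minimal pgbouncer.ini along those lines; the host, database name, and pool sizes are illustrative:

[databases]
app = host=127.0.0.1 port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

Thousands of clients connect to port 6432; PgBouncer multiplexes them onto at most 20 real Postgres connections per database/user pair.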

Security

SQL Injection

Always use parameterized queries. Never concatenate user input into SQL.

# Wrong. SQL injection.
cur.execute("SELECT * FROM users WHERE email = '" + email + "'")

# Right. Parameterized.
cur.execute("SELECT * FROM users WHERE email = %s", (email,))

Every mainstream language library supports parameters. Use them unconditionally. ORM query builders do this by default; with raw SQL it's on you.
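Parameterization in action, sketched with sqlite3 (which uses ? where psycopg2 uses %s); the table and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice@example.com')")

# A classic injection payload, passed as a parameter.
evil = "' OR '1'='1"
rows = conn.execute("SELECT * FROM users WHERE email = ?", (evil,)).fetchall()
# rows == []: the payload was treated as a literal string, never parsed as SQL.

# A legitimate lookup works as expected.
ok = conn.execute("SELECT * FROM users WHERE email = ?",
                  ("alice@example.com",)).fetchall()
```

Had the payload been concatenated into the query text instead, the WHERE clause would have become a tautology and returned every row.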

Least Privilege

Create a database role per service with only the permissions it needs.

CREATE ROLE api_service WITH LOGIN PASSWORD 'redacted';
GRANT CONNECT ON DATABASE app TO api_service;
GRANT USAGE ON SCHEMA public TO api_service;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO api_service;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO api_service;

-- A separate, read-only role for analytics
CREATE ROLE analytics_read WITH LOGIN PASSWORD 'redacted';
GRANT CONNECT ON DATABASE app TO analytics_read;
GRANT USAGE ON SCHEMA public TO analytics_read;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_read;

The API service can't drop tables. The analytics role can't write at all. A compromised credential or an injected query is much less catastrophic when the role is scoped.
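One caveat: GRANT ... ON ALL TABLES only covers tables that exist at the time you run it. To cover tables created by future migrations, also set default privileges, running as the role that will create those tables (typically the migration user):

ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO api_service;

ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT ON TABLES TO analytics_read;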

Secrets

The DB password belongs in a secrets manager (Vault, AWS Secrets Manager, Doppler), read at runtime. Not in source control, not in environment variables baked into Docker images.

Rotate credentials on a schedule and when people leave the team. Automate it.

TLS

Connect over TLS. Postgres supports it out of the box. In the connection string:

postgresql://user:pass@host:5432/db?sslmode=require

Use sslmode=verify-full in production; it validates the server certificate. require without verification is a half-measure.

Monitoring

You want to know:

  • Connections. Current vs max. Close to max means the app will fail soon.
  • Replication lag (if you have replicas). Measured in bytes or seconds behind primary.
  • Query latency. p50, p95, p99. Slow queries show up here first.
  • Table bloat. Dead tuples vs live tuples per table.
  • Disk space. Both data and WAL. Running out of disk is a recoverable disaster if you see it coming; a full outage if you don't.

Tools:

  • pg_stat_activity for current queries.
  • pg_stat_statements for aggregated query stats.
  • pg_stat_user_tables and pg_stat_user_indexes for table activity.
  • Prometheus + postgres_exporter + Grafana is the common open-source stack.
  • Cloud-managed (RDS, Cloud SQL, Neon) ship dashboards; still look at them.

Set up alerts on replication lag, connection saturation, and disk usage. Everything else is nice-to-have.
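A few queries behind those numbers. pg_stat_statements must be loaded via shared_preload_libraries; mean_exec_time is the PostgreSQL 13+ column name (earlier versions call it mean_time):

-- Connection saturation
SELECT count(*) AS used,
       current_setting('max_connections')::int AS max
FROM pg_stat_activity;

-- Replication lag in wall-clock terms; run on a replica
SELECT now() - pg_last_xact_replay_timestamp() AS lag;

-- Slowest queries by mean execution time
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Dead vs live tuples per table (bloat candidates)
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;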

Query Discipline

A handful of habits that keep a codebase from rotting:

  • Name your SQL in code. A constant called FETCH_ORDERS_BY_USER is searchable; an inline literal is not.
  • Prefer views and functions for complex queries. The logic lives in one place in the database, and the application code stays readable.
  • Use explicit column lists in writes. INSERT INTO users (...) survives ALTER TABLE. INSERT INTO users VALUES (...) breaks the moment someone adds a column.
  • Keep transactions short. Long transactions block vacuum and hold locks. Do network calls outside the transaction.
  • Set a statement timeout. Per-role: ALTER ROLE api_service SET statement_timeout = '30s';. Prevents a runaway query from eating the DB. Pair with retries and clear error handling.

Idempotency and Retries

Networks drop. Clients retry. Design operations so retries are safe.

  • Unique constraints on natural business keys (order number, payment intent ID).
  • ON CONFLICT DO NOTHING for create-if-missing.
  • Client-provided idempotency keys on write endpoints. Store the key with the row; reject duplicates.

CREATE TABLE payments (
    id            SERIAL PRIMARY KEY,
    idem_key      TEXT UNIQUE NOT NULL,
    user_id       INTEGER NOT NULL REFERENCES users(id),
    amount_cents  INTEGER NOT NULL CHECK (amount_cents > 0),
    status        TEXT NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

INSERT INTO payments (idem_key, user_id, amount_cents, status)
VALUES ($1, $2, $3, 'pending')
ON CONFLICT (idem_key) DO NOTHING
RETURNING id;

A duplicate call returns zero rows. The client then looks up the existing payment by its idempotency key instead of creating a second charge.
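The pattern end to end, sketched in Python with sqlite3 as a stand-in; INSERT OR IGNORE plays the role of ON CONFLICT (idem_key) DO NOTHING, and the key and amount are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id           INTEGER PRIMARY KEY,
        idem_key     TEXT UNIQUE NOT NULL,
        amount_cents INTEGER NOT NULL CHECK (amount_cents > 0),
        status       TEXT NOT NULL
    )
""")

def create_payment(idem_key, amount_cents):
    # The unique constraint on idem_key makes this safe to retry.
    cur = conn.execute(
        "INSERT OR IGNORE INTO payments (idem_key, amount_cents, status) "
        "VALUES (?, ?, 'pending')",
        (idem_key, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1  # True only if a new row was actually created

first = create_payment("key-123", 500)  # creates the payment
retry = create_payment("key-123", 500)  # dropped connection, client retried: no-op
```

Exactly one payment row exists afterward, no matter how many times the client retried.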

Testing Against a Real Database

Mocks lie. An integration test that hits a real PostgreSQL catches:

  • Typo'd column names and constraints you didn't expect.
  • Query plans that work on 10 rows but not 10 million (run a small load test, not just unit-sized fixtures).
  • Migration ordering bugs.
  • Role and permission issues.

testcontainers and similar libraries spin up a Postgres per test run in Docker. Worth the few seconds of setup.

A Checklist for Going Live

Before you ship a new database-backed service:

  1. Schema migrations in source control, with a rollback plan per migration.
  2. Backups configured and a restore tested.
  3. Monitoring dashboards exist. Alerts fire on connections, disk, replication lag.
  4. Connection pooler in front of Postgres (PgBouncer or equivalent).
  5. Secrets in a secrets manager; no passwords in code or images.
  6. Least-privilege roles per service.
  7. Statement timeout set per role.
  8. Indexes on every foreign key.
  9. Every hot query has been run through EXPLAIN ANALYZE on realistic data.
  10. SQL injection review: every query is parameterized.

This is not thrilling work. It is the work. Skip it and you learn all of it at once, at the worst possible time.

Where to Go From Here

You have the toolkit. The next level is depth, and depth comes from use.

Build something. A task tracker, a blog engine, a small store. Migrate the schema three times as requirements change. Load a hundred thousand rows and find the slow query. Write the EXPLAIN ANALYZE you didn't think you'd need. That's where databases stop being theory and start being a tool you reach for without thinking.