Normalization: Where to Put What

This chapter covers the normal forms you actually need in practice, and the cases where breaking them on purpose is the right call.

The Goal

Normalization is a set of rules for organizing data so that each fact is stored in exactly one place. When facts are duplicated, updates can go wrong: you update five copies, miss the sixth, and now the data lies about itself.

Normal forms get a scary reputation because people teach them in order: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, DKNF. In practice, you need 1NF, 2NF, and 3NF. Most schemas land on 3NF without anyone formally working through the rules.

A Bad Schema to Fix

Start with something obviously wrong:

CREATE TABLE orders_bad (
    id            SERIAL PRIMARY KEY,
    user_email    TEXT,
    user_name     TEXT,
    user_address  TEXT,
    product_skus  TEXT,                -- "SKU-001, SKU-002, SKU-003"
    product_names TEXT,
    total         NUMERIC(10, 2),
    status        TEXT
);

Problems:

  • product_skus is a comma-separated list in a single column.
  • user_email, user_name, user_address are duplicated across every order the user places.
  • If the user's address changes, you update it everywhere. Or you forget to.

1NF: One Value Per Cell

First normal form: every cell holds a single value; there are no repeating groups.

product_skus = "SKU-001, SKU-002" violates 1NF. The fix is a separate row per product:

CREATE TABLE orders (
    id      SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,
    total   NUMERIC(10, 2),
    status  TEXT
);

CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL,
    price      NUMERIC(10, 2) NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

Now querying "all orders containing SKU-001" is an index lookup, not a LIKE '%SKU-001%'.

A PostgreSQL-specific nuance: arrays and JSONB technically let you store multiple values in one cell. That doesn't automatically violate 1NF. Use them when you treat the whole array as the unit of access (e.g. a list of tags you always fetch together). Avoid them when you need to query individual elements; a separate table is usually cheaper.

2NF: Non-Key Columns Depend on the Whole Key

Second normal form applies to tables with composite keys. Every non-key column must depend on the full key, not just part of it.

Bad:

CREATE TABLE order_items_bad (
    order_id     INTEGER NOT NULL,
    product_id   INTEGER NOT NULL,
    quantity     INTEGER NOT NULL,
    price        NUMERIC(10, 2) NOT NULL,
    product_name TEXT NOT NULL,          -- depends on product_id only
    PRIMARY KEY (order_id, product_id)
);

product_name is a property of product_id, not of the (order_id, product_id) pair. Store it in products and join:

CREATE TABLE products (
    id    SERIAL PRIMARY KEY,
    sku   TEXT UNIQUE NOT NULL,
    name  TEXT NOT NULL,
    price NUMERIC(10, 2) NOT NULL
);

CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity   INTEGER NOT NULL,
    price      NUMERIC(10, 2) NOT NULL,     -- price AT ORDER TIME (see below)
    PRIMARY KEY (order_id, product_id)
);

3NF: No Transitive Dependencies

Third normal form: non-key columns don't depend on other non-key columns.

Bad:

CREATE TABLE users_bad (
    id           SERIAL PRIMARY KEY,
    email        TEXT,
    country_code TEXT,
    country_name TEXT     -- depends on country_code, not id
);

If someone corrects a country name, you update every user. Move countries into their own table:

CREATE TABLE countries (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE users (
    id           SERIAL PRIMARY KEY,
    email        TEXT UNIQUE NOT NULL,
    country_code TEXT REFERENCES countries(code)
);

Now the country name is stored once. Queries that need it join.

Denormalization: Breaking the Rules Deliberately

Normalization is a default, not a law. You denormalize when you have a good reason and you know what you're giving up.

Two common reasons:

Historical Snapshots

Prices change. Currency rates change. A user's address changes. When the fact "what did this order cost at the time?" matters, store the value on the order, not a reference to the live product.

CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity   INTEGER NOT NULL,
    price      NUMERIC(10, 2) NOT NULL,     -- frozen at order time
    PRIMARY KEY (order_id, product_id)
);

This looks like a 2NF violation. It isn't. The stored price is a different fact from products.price: it's the historical price, not the current one.

Performance

Denormalized read models trade write complexity for read speed. A pre-computed user_stats table with total_orders, lifetime_revenue, last_order_at updated by triggers or background jobs can be 100x faster to query than aggregating live.

Trade-offs:

  • Writes have to update multiple places (easy to get wrong).
  • There's a window where the denormalized value is stale.
  • Rebuilding from source is harder.

Normalize first. Denormalize the specific queries that actually need it, measured. A common mistake: denormalize "just in case" and end up with inconsistent data nobody can trust.

Putting It Together

A normalized schema for our e-commerce example:

Entities
  users       (id, email, name, ...)
  products    (id, sku, name, price, stock, ...)
  orders      (id, user_id, status, total, ...)

Relationships
  order_items (order_id, product_id, quantity, price)

Every fact lives somewhere. The user's email lives on users. The product's current price lives on products. The price at purchase lives on order_items. Updating any one doesn't affect the others.

Common Pitfalls

Over-normalizing for small data. A countries table for three countries is fine but not required. CHECK (country IN ('US', 'CA', 'UK')) on a users.country column is often good enough.

Mixing the live value and the snapshot. Storing products.price on the order and never copying it to order_items.price means every report on historical revenue changes when you change prices today. Decide which you mean and be consistent.

Keeping redundant columns "for convenience". users.name plus users.first_name and users.last_name. They drift. Pick one source of truth.

Treating JSONB as a free pass to skip normalization. JSONB is a great tool for semi-structured data (event payloads, config). It's a bad tool for your entire data model. You lose constraints, foreign keys, and the planner's ability to reason about the shape.

Next Steps

Continue to 10-query-performance.md to make slow queries fast.