Normalization: Where to Put What
This chapter covers the normal forms you actually need in practice, and the cases where breaking them on purpose is the right call.
The Goal
Normalization is a set of rules for organizing data so that each fact is stored in exactly one place. When facts are duplicated, updates can go wrong: you update five copies, miss the sixth, and now the data lies about itself.
Normal forms get a scary reputation because people teach them in order: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, DKNF. In practice, you need 1NF, 2NF, and 3NF. Most schemas land on 3NF without anyone formally working through the rules.
A Bad Schema to Fix
Start with something obviously wrong:
CREATE TABLE orders_bad (
id SERIAL PRIMARY KEY,
user_email TEXT,
user_name TEXT,
user_address TEXT,
product_skus TEXT, -- "SKU-001, SKU-002, SKU-003"
product_names TEXT,
total NUMERIC(10, 2),
status TEXT
);
Problems:
product_skusis a comma-separated list in a single column.user_email,user_name,user_addressare duplicated across every order the user places.- If the user's address changes, you update it everywhere. Or you forget to.
1NF: One Value Per Cell
First normal form: every cell holds a single value; there are no repeating groups.
product_skus = "SKU-001, SKU-002" violates 1NF. The fix is a separate row per product:
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL,
total NUMERIC(10, 2),
status TEXT
);
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL,
quantity INTEGER NOT NULL,
price NUMERIC(10, 2) NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Now querying "all orders containing SKU-001" is an index lookup, not a LIKE '%SKU-001%'.
A PostgreSQL-specific nuance: arrays and JSONB technically let you store multiple values in one cell. That doesn't automatically violate 1NF. Use them when you treat the whole array as the unit of access (e.g. a list of tags you always fetch together). Avoid them when you need to query individual elements; a separate table is usually cheaper.
2NF: Non-Key Columns Depend on the Whole Key
Second normal form applies to tables with composite keys. Every non-key column must depend on the full key, not just part of it.
Bad:
CREATE TABLE order_items_bad (
order_id INTEGER NOT NULL,
product_id INTEGER NOT NULL,
quantity INTEGER NOT NULL,
price NUMERIC(10, 2) NOT NULL,
product_name TEXT NOT NULL, -- depends on product_id only
PRIMARY KEY (order_id, product_id)
);
product_name is a property of product_id, not of the (order_id, product_id) pair. Store it in products and join:
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku TEXT UNIQUE NOT NULL,
name TEXT NOT NULL,
price NUMERIC(10, 2) NOT NULL
);
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL,
price NUMERIC(10, 2) NOT NULL, -- price AT ORDER TIME (see below)
PRIMARY KEY (order_id, product_id)
);
3NF: No Transitive Dependencies
Third normal form: non-key columns don't depend on other non-key columns.
Bad:
CREATE TABLE users_bad (
id SERIAL PRIMARY KEY,
email TEXT,
country_code TEXT,
country_name TEXT -- depends on country_code, not id
);
If someone corrects a country name, you update every user. Move countries into their own table:
CREATE TABLE countries (
code TEXT PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
country_code TEXT REFERENCES countries(code)
);
Now the country name is stored once. Queries that need it join.
Denormalization: Breaking the Rules Deliberately
Normalization is a default, not a law. You denormalize when you have a good reason and you know what you're giving up.
Two common reasons:
Historical Snapshots
Prices change. Currency rates change. A user's address changes. When the fact "what did this order cost at the time?" matters, store the value on the order, not a reference to the live product.
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL,
price NUMERIC(10, 2) NOT NULL, -- frozen at order time
PRIMARY KEY (order_id, product_id)
);
This looks like a 2NF violation. It isn't. The stored price is a different fact from products.price: it's the historical price, not the current one.
Performance
Denormalized read models trade write complexity for read speed. A pre-computed user_stats table with total_orders, lifetime_revenue, last_order_at updated by triggers or background jobs can be 100x faster to query than aggregating live.
Trade-offs:
- Writes have to update multiple places (easy to get wrong).
- There's a window where the denormalized value is stale.
- Rebuilding from source is harder.
Normalize first. Denormalize the specific queries that actually need it, measured. A common mistake: denormalize "just in case" and end up with inconsistent data nobody can trust.
Putting It Together
A normalized schema for our e-commerce example:
Entities
users (id, email, name, ...)
products (id, sku, name, price, stock, ...)
orders (id, user_id, status, total, ...)
Relationships
order_items (order_id, product_id, quantity, price)
Every fact lives somewhere. The user's email lives on users. The product's current price lives on products. The price at purchase lives on order_items. Updating any one doesn't affect the others.
Common Pitfalls
Over-normalizing for small data. A countries table for three countries is fine but not required. CHECK (country IN ('US', 'CA', 'UK')) on a users.country column is often good enough.
Mixing the live value and the snapshot. Storing products.price on the order and never copying it to order_items.price means every report on historical revenue changes when you change prices today. Decide which you mean and be consistent.
Keeping redundant columns "for convenience". users.name plus users.first_name and users.last_name. They drift. Pick one source of truth.
Treating JSONB as a free pass to skip normalization. JSONB is a great tool for semi-structured data (event payloads, config). It's a bad tool for your entire data model. You lose constraints, foreign keys, and the planner's ability to reason about the shape.
Next Steps
Continue to 10-query-performance.md to make slow queries fast.