Joins: Combining Tables

This chapter equips you to pull data from multiple tables correctly, including the joins nobody explains properly.

Why Joins Exist

Relational schemas split data across tables to avoid duplication. A user has many orders. An order has many items. An item references a product. When you want "email and order total for every paid order last month", that data lives in three tables. A join is how you reassemble it.

Setup

Using the schema from Chapter 3. Seed data:

INSERT INTO users (email, name) VALUES
    ('ada@example.com',   'Ada'),
    ('grace@example.com', 'Grace'),
    ('alan@example.com',  'Alan');

INSERT INTO products (sku, name, price, stock) VALUES
    ('KB-01', 'Keyboard',  79.00, 50),
    ('MS-01', 'Mouse',     29.00, 80),
    ('SC-01', 'Screen',   399.00, 12);

INSERT INTO orders (user_id, status, total) VALUES
    (1, 'paid',    108.00),
    (1, 'pending',  29.00),
    (2, 'paid',   399.00);

INSERT INTO order_items (order_id, product_id, quantity, price) VALUES
    (1, 1, 1,  79.00),
    (1, 2, 1,  29.00),
    (2, 2, 1,  29.00),
    (3, 3, 1, 399.00);

INNER JOIN

An INNER JOIN returns rows where the join condition is true on both sides. Rows without a match drop out.

SELECT u.name, o.id AS order_id, o.total
FROM orders o
INNER JOIN users u ON u.id = o.user_id
WHERE o.status = 'paid';

Output:

 name  | order_id | total  
-------+----------+--------
 Ada   |        1 | 108.00
 Grace |        3 | 399.00

INNER is the default; you can write JOIN and mean the same thing. Use table aliases (orders o, users u) to keep things readable.

LEFT JOIN

A LEFT JOIN returns every row from the left table, matched to a row from the right table where possible, with NULL where there's no match.

"Every user, with their order count (including zero)":

SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name
ORDER BY order_count DESC;

Output:

 name  | order_count 
-------+-------------
 Ada   |           2
 Grace |           1
 Alan  |           0

Alan has no orders but still appears with a count of zero. Use COUNT(o.id), not COUNT(*), or the zero becomes a one: COUNT(*) counts the outer row, which is present.

RIGHT JOIN and FULL JOIN

RIGHT JOIN is a LEFT JOIN with the tables swapped. Most people never write it; they just flip the FROM order.

FULL JOIN returns every row from both sides, with NULL on whichever side has no match. Useful for reconciling two tables:

SELECT a.id AS left_id, b.id AS right_id
FROM table_a a
FULL JOIN table_b b ON a.id = b.id
WHERE a.id IS NULL OR b.id IS NULL;

That's every row that's in one table but not the other.

Multi-Table Joins

You can keep joining. Real queries often touch four or five tables.

"Email, order ID, product name, and line total for every paid order":

SELECT u.email,
       o.id                    AS order_id,
       p.name                  AS product,
       oi.quantity,
       oi.quantity * oi.price  AS line_total
FROM orders o
JOIN users u        ON u.id = o.user_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p     ON p.id = oi.product_id
WHERE o.status = 'paid'
ORDER BY o.id, p.name;

Rule of thumb: write the joins in the order a person would read the data ("start with orders, bring in the user, bring in each item, bring in each product"). Readers will thank you.

Self Joins

A self join joins a table to itself, usually via an alias. Classic use: hierarchies.

CREATE TABLE employees (
    id         SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    manager_id INTEGER REFERENCES employees(id)
);

-- Each employee with their manager's name
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.id = e.manager_id;

The aliases e and m are how you tell PostgreSQL which copy of the table you mean in each column reference.

Subqueries

A subquery is a SELECT inside another query. Three common shapes:

Scalar subquery (returns one value):

SELECT name, (SELECT COUNT(*) FROM orders) AS total_orders
FROM users;

IN subquery (returns a list of values):

SELECT name
FROM products
WHERE id IN (
    SELECT product_id FROM order_items WHERE quantity > 5
);

EXISTS subquery (returns true/false):

SELECT u.name
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id AND o.status = 'paid'
);

EXISTS is often faster than IN for the "has at least one" question, because it can stop scanning as soon as it finds a match.

CTEs (Common Table Expressions)

A CTE names a subquery so you can reference it like a table. Use them when nesting gets deep.

WITH paid_orders AS (
    SELECT * FROM orders WHERE status = 'paid'
),
order_totals AS (
    SELECT user_id, SUM(total) AS revenue
    FROM paid_orders
    GROUP BY user_id
)
SELECT u.name, t.revenue
FROM users u
JOIN order_totals t ON t.user_id = u.id
ORDER BY t.revenue DESC;

CTEs read top-to-bottom. The query above is "define paid orders, sum them by user, then join back to names." Much cleaner than the equivalent nested subquery.

In PostgreSQL 12+, CTEs are inlined by default (same performance as subqueries). Older versions materialized them, which sometimes hurt plans.

Common Pitfalls

Cartesian product from a missing ON. Forget ON user_id = id and you'll get every row from A paired with every row from B. If your query returns millions of rows it shouldn't, this is the first thing to check.

COUNT(*) with LEFT JOIN gives the wrong zero. Use COUNT(column_from_right_table).

NULL on the joined column filters unexpectedly. WHERE o.status = 'paid' on a LEFT JOIN drops the rows with no orders (where o.status IS NULL). If you want to keep them, move the condition into the ON clause: LEFT JOIN orders o ON o.user_id = u.id AND o.status = 'paid'.

Joining on the wrong column. ON u.id = o.id instead of ON u.id = o.user_id. Syntactically valid, returns only rows where the user ID coincidentally matches the order ID. Read the ON carefully.

Next Steps

Continue to 06-aggregation.md to summarize data.