Aggregation: Summaries and Windows

This chapter shows you how to summarize data with GROUP BY and slice it with window functions.

Aggregate Functions

An aggregate function collapses a set of rows into one value.

SELECT COUNT(*)        AS user_count,
       MIN(created_at) AS oldest,
       MAX(created_at) AS newest
FROM users;

The common aggregates:

COUNT(*)               how many rows
COUNT(col)             how many non-null values of col
COUNT(DISTINCT col)    how many distinct non-null values
SUM(col)               total
AVG(col)               mean
MIN(col) / MAX(col)    minimum, maximum
STRING_AGG(col, ',')   concatenate (with separator)
ARRAY_AGG(col)         collect into an array

GROUP BY

Aggregates over the whole table are rarely what you want. You want "per user", "per status", "per day". That's what GROUP BY is for.

SELECT status, COUNT(*) AS orders, SUM(total) AS revenue
FROM orders
GROUP BY status;

Output:

  status  | orders | revenue 
----------+--------+---------
 paid     |      2 |  507.00
 pending  |      1 |   29.00

The rule: every column in the SELECT must either be in the GROUP BY or inside an aggregate function. "Show me the user ID and the sum" works; "show me the user name and the sum, grouped by ID" doesn't, unless you include name in the GROUP BY too (fine when name is functionally dependent on the ID, which in users it is).

SELECT u.id, u.name, COUNT(o.id) AS order_count, SUM(o.total) AS revenue
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name
ORDER BY revenue DESC NULLS LAST;

HAVING

HAVING filters groups, the way WHERE filters rows. Use it for conditions that reference aggregates.

SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;

WHERE runs before aggregation; HAVING runs after. Filter with WHERE where you can (it's cheaper), and reserve HAVING for post-aggregation conditions.

SELECT user_id, COUNT(*) AS paid_orders
FROM orders
WHERE status = 'paid'           -- filter rows
GROUP BY user_id
HAVING COUNT(*) >= 3;            -- filter groups

Multiple Grouping Keys

You can group by more than one column:

SELECT
    DATE_TRUNC('day', created_at) AS day,
    status,
    COUNT(*)   AS orders,
    SUM(total) AS revenue
FROM orders
GROUP BY day, status
ORDER BY day, status;

DATE_TRUNC('day', ts) rounds a timestamp down to the start of the day. Also useful: 'hour', 'week', 'month', 'year'.

ROLLUP, CUBE, GROUPING SETS

Three extensions of GROUP BY for multi-level summaries. You see them in reporting.

GROUPING SETS lets you list multiple groupings in one query:

SELECT status,
       DATE_TRUNC('month', created_at) AS month,
       SUM(total) AS revenue
FROM orders
GROUP BY GROUPING SETS ((status, month), (status), (month), ())
ORDER BY status NULLS LAST, month NULLS LAST;

That returns revenue by status-and-month, by status only, by month only, and grand total, all in one query.

ROLLUP(a, b, c) is shorthand for GROUPING SETS ((a, b, c), (a, b), (a), ()). CUBE(a, b) is every combination.

Use these when a BI tool asks for subtotals. Otherwise it's often simpler to run the queries separately.

Window Functions

Aggregate functions collapse rows. Window functions don't. They compute a value for each row using a "window" of other rows.

This is the most useful feature of modern SQL. Readers who haven't seen it end up writing the same query five times with self-joins.

Running Totals

SELECT
    id,
    created_at,
    total,
    SUM(total) OVER (ORDER BY created_at) AS running_total
FROM orders;

Every row now has the running sum up to and including that row. No self-join.

Per-Group Calculations with PARTITION BY

SELECT
    user_id,
    id AS order_id,
    total,
    SUM(total) OVER (PARTITION BY user_id ORDER BY created_at) AS user_running_total
FROM orders;

PARTITION BY user_id resets the running total for each user.

Ranking

SELECT
    name,
    price,
    RANK()       OVER (ORDER BY price DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY price DESC) AS dense_rank,
    ROW_NUMBER() OVER (ORDER BY price DESC) AS row_num
FROM products;

Three different ways to number rows:

  • ROW_NUMBER: always 1, 2, 3, ... even if values tie.
  • RANK: ties share a rank, next rank skips (1, 1, 3).
  • DENSE_RANK: ties share a rank, next rank continues (1, 1, 2).

LEAD and LAG

Pull a value from the next or previous row:

SELECT
    id,
    created_at,
    total,
    LAG(total)  OVER (ORDER BY created_at) AS previous_total,
    LEAD(total) OVER (ORDER BY created_at) AS next_total
FROM orders;

Useful for deltas, retention, "was this event preceded by X".

First Value Per Group

"Most recent order per user":

SELECT DISTINCT ON (user_id) user_id, id, created_at, total
FROM orders
ORDER BY user_id, created_at DESC;

PostgreSQL's DISTINCT ON is the shortcut. The portable version uses ROW_NUMBER:

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM orders
)
SELECT * FROM ranked WHERE rn = 1;

Both read the table once. Both are fine.

Common Pitfalls

Forgetting a column in GROUP BY. "column must appear in the GROUP BY clause or be used in an aggregate function." This error always means the same thing: add the column to GROUP BY or wrap it in MAX() / MIN() / etc.

Using COUNT(col) when you meant COUNT(*). COUNT(col) skips NULLs. If col is sometimes null, your counts will be low.

WHERE on an aggregate. WHERE COUNT(*) > 5 is a syntax error. You want HAVING.

Sorting by a window function column that doesn't make sense. RANK() OVER (ORDER BY price) without ORDER BY rank at the statement level returns ranked rows in arbitrary order. Sort the outer query too.

Next Steps

Continue to 07-indexing.md to make these queries fast.