Lately in couple of places I recommended people that they can solve their problem with queries using LATERAL. In some cases recipient of such suggestion indicated that they had no idea what LATERAL is. Which made me think that it might be good idea to write more about them (lateral queries)…
Also – I know that some of the examples I shown in here can be done differently, I just wanted to show how one can use LATERAL, and am terrible with coming up with better usecases.
Lateral queries are with us for quite some time now – specifically since Pg 9.3 – that's around 10 years.
So, what are these?
Broadly speaking – lateral subquery (sometimes also called lateral join) is a way for developer to make PostgreSQL to generate many rows based on single row of data.
Simplest example: let's assume table that contains some events as two columns (I know I could used range datatype, but I want to keep it simple): event_start and event_end. Like this:
=$ CREATE TABLE events ( id int8 GENERATED ALWAYS AS IDENTITY PRIMARY KEY, event_start DATE NOT NULL, event_end DATE NOT NULL, CHECK (event_end >= event_start) );
Now, let's add there some events:
=$ WITH event_starts AS ( SELECT now() - '2 weeks'::INTERVAL * random() AS START FROM generate_series(1,5) i ) INSERT INTO events (event_start, event_end) SELECT START, START + '3 days'::INTERVAL + random() * '4 days'::INTERVAL FROM event_starts;
This gave me some nice events I can work with:
=$ SELECT * FROM events; id | event_start | event_end ----+-------------+------------ 1 | 2022-09-15 | 2022-09-22 2 | 2022-09-06 | 2022-09-11 3 | 2022-09-06 | 2022-09-10 4 | 2022-09-12 | 2022-09-18 5 | 2022-09-05 | 2022-09-10 (5 ROWS)
Now, let's assume I'd want to get list of all dates that have some events on them, with count of how many events there are on each of these days.
I can start with simple select * from events, and then use lateral to get list of all days. Let's see:
=$ SELECT e.*, l.* FROM events e, lateral ( SELECT x::DATE FROM generate_series(e.event_start, e.event_end, '1 day'::INTERVAL) AS x ) AS l id | event_start | event_end | x ----+-------------+------------+------------ 1 | 2022-09-15 | 2022-09-22 | 2022-09-15 1 | 2022-09-15 | 2022-09-22 | 2022-09-16 1 | 2022-09-15 | 2022-09-22 | 2022-09-17 1 | 2022-09-15 | 2022-09-22 | 2022-09-18 1 | 2022-09-15 | 2022-09-22 | 2022-09-19 1 | 2022-09-15 | 2022-09-22 | 2022-09-20 1 | 2022-09-15 | 2022-09-22 | 2022-09-21 1 | 2022-09-15 | 2022-09-22 | 2022-09-22 2 | 2022-09-06 | 2022-09-11 | 2022-09-06 2 | 2022-09-06 | 2022-09-11 | 2022-09-07 2 | 2022-09-06 | 2022-09-11 | 2022-09-08 2 | 2022-09-06 | 2022-09-11 | 2022-09-09 2 | 2022-09-06 | 2022-09-11 | 2022-09-10 2 | 2022-09-06 | 2022-09-11 | 2022-09-11 3 | 2022-09-06 | 2022-09-10 | 2022-09-06 3 | 2022-09-06 | 2022-09-10 | 2022-09-07 3 | 2022-09-06 | 2022-09-10 | 2022-09-08 3 | 2022-09-06 | 2022-09-10 | 2022-09-09 3 | 2022-09-06 | 2022-09-10 | 2022-09-10 4 | 2022-09-12 | 2022-09-18 | 2022-09-12 4 | 2022-09-12 | 2022-09-18 | 2022-09-13 4 | 2022-09-12 | 2022-09-18 | 2022-09-14 4 | 2022-09-12 | 2022-09-18 | 2022-09-15 4 | 2022-09-12 | 2022-09-18 | 2022-09-16 4 | 2022-09-12 | 2022-09-18 | 2022-09-17 4 | 2022-09-12 | 2022-09-18 | 2022-09-18 5 | 2022-09-05 | 2022-09-10 | 2022-09-05 5 | 2022-09-05 | 2022-09-10 | 2022-09-06 5 | 2022-09-05 | 2022-09-10 | 2022-09-07 5 | 2022-09-05 | 2022-09-10 | 2022-09-08 5 | 2022-09-05 | 2022-09-10 | 2022-09-09 5 | 2022-09-05 | 2022-09-10 | 2022-09-10 (32 ROWS)
Please note that data in first 3 columns is repeated – because it's the same row in events table, just the 4th column is added thanks to lateral magic.
Lateral, in this example, called generate_series() function, which generated set of timestamps, which then got cast to date datatype, so that we'll get only dates.
Now, to get all days and their counts, I just need to group by l.x, and get count:
=$ SELECT l.x, COUNT(*) FROM events e, lateral ( SELECT x::DATE FROM generate_series(e.event_start, e.event_end, '1 day'::INTERVAL) AS x ) AS l GROUP BY l.x ORDER BY l.x x | COUNT ------------+------- 2022-09-05 | 1 2022-09-06 | 3 2022-09-07 | 3 2022-09-08 | 3 2022-09-09 | 3 2022-09-10 | 3 2022-09-11 | 1 2022-09-12 | 1 2022-09-13 | 1 2022-09-14 | 1 2022-09-15 | 2 2022-09-16 | 2 2022-09-17 | 2 2022-09-18 | 2 2022-09-19 | 1 2022-09-20 | 1 2022-09-21 | 1 2022-09-22 | 1 (18 ROWS)
The important part is that for every row in the source (events) the query inside lateral is called with full access to normal logic (including where clauses, functions, grouping, aggregates, ordering, limiting, anything), and resulting recordset is available to our query.
This can be used, for example, to get things like, for example, get top five employee, per department, sorted by salary:
SELECT d.*, le.* FROM departments d, lateral ( SELECT * FROM employees e WHERE e.dept_id = d.id ORDER BY e.salary DESC LIMIT 5 ) AS le
Generally, lateral is godsend whenever you need to extract information from complex data (json?) or make non-obvious modification to joined dataset (like limiting number of rows in the example above).
There is one more thing I'd like to show. Specifically – while I don't particularly like calling lateral queries as “lateral joins", it is a fact that you can use join syntax. Which can be useful in cases where lateral query doesn't return anything.
Given the events table, let's try to get number of events for each day in September.
To get all dates in September I can simply:
=$ SELECT d::DATE AS DAY FROM generate_series('2022-09-01', '2022-09-30', '1 day'::INTERVAL) d; DAY ------------ 2022-09-01 2022-09-02 2022-09-03 2022-09-04 ... 2022-09-30 (30 ROWS)
Now, I can add lateral subquery to get events that were happening on this day:
=$ SELECT d::DATE AS DAY, l.* FROM generate_series('2022-09-01', '2022-09-30', '1 day'::INTERVAL) d, lateral ( SELECT * FROM events e WHERE d::DATE BETWEEN e.event_start AND e.event_end ) AS l; DAY | id | event_start | event_end ------------+----+-------------+------------ 2022-09-05 | 5 | 2022-09-05 | 2022-09-10 2022-09-06 | 2 | 2022-09-06 | 2022-09-11 2022-09-06 | 3 | 2022-09-06 | 2022-09-10 2022-09-06 | 5 | 2022-09-05 | 2022-09-10 2022-09-07 | 2 | 2022-09-06 | 2022-09-11 2022-09-07 | 3 | 2022-09-06 | 2022-09-10 2022-09-07 | 5 | 2022-09-05 | 2022-09-10 2022-09-08 | 2 | 2022-09-06 | 2022-09-11 2022-09-08 | 3 | 2022-09-06 | 2022-09-10 2022-09-08 | 5 | 2022-09-05 | 2022-09-10 2022-09-09 | 2 | 2022-09-06 | 2022-09-11 2022-09-09 | 3 | 2022-09-06 | 2022-09-10 2022-09-09 | 5 | 2022-09-05 | 2022-09-10 2022-09-10 | 2 | 2022-09-06 | 2022-09-11 2022-09-10 | 3 | 2022-09-06 | 2022-09-10 2022-09-10 | 5 | 2022-09-05 | 2022-09-10 2022-09-11 | 2 | 2022-09-06 | 2022-09-11 2022-09-12 | 4 | 2022-09-12 | 2022-09-18 2022-09-13 | 4 | 2022-09-12 | 2022-09-18 2022-09-14 | 4 | 2022-09-12 | 2022-09-18 2022-09-15 | 1 | 2022-09-15 | 2022-09-22 2022-09-15 | 4 | 2022-09-12 | 2022-09-18 2022-09-16 | 1 | 2022-09-15 | 2022-09-22 2022-09-16 | 4 | 2022-09-12 | 2022-09-18 2022-09-17 | 1 | 2022-09-15 | 2022-09-22 2022-09-17 | 4 | 2022-09-12 | 2022-09-18 2022-09-18 | 1 | 2022-09-15 | 2022-09-22 2022-09-18 | 4 | 2022-09-12 | 2022-09-18 2022-09-19 | 1 | 2022-09-15 | 2022-09-22 2022-09-20 | 1 | 2022-09-15 | 2022-09-22 2022-09-21 | 1 | 2022-09-15 | 2022-09-22 2022-09-22 | 1 | 2022-09-15 | 2022-09-22 (32 ROWS)
This worked, obviously, but if I'd get counts now, I would be missing some days:
=$ SELECT d::DATE AS DAY, COUNT(l.id) AS events FROM generate_series('2022-09-01', '2022-09-30', '1 day'::INTERVAL) d, lateral ( SELECT * FROM events e WHERE d::DATE BETWEEN e.event_start AND e.event_end ) AS l GROUP BY d.date ORDER BY d.date DAY | events ------------+-------- 2022-09-05 | 1 2022-09-06 | 3 2022-09-07 | 3 2022-09-08 | 3 2022-09-09 | 3 2022-09-10 | 3 2022-09-11 | 1 2022-09-12 | 1 2022-09-13 | 1 2022-09-14 | 1 2022-09-15 | 2 2022-09-16 | 2 2022-09-17 | 2 2022-09-18 | 2 2022-09-19 | 1 2022-09-20 | 1 2022-09-21 | 1 2022-09-22 | 1 (18 ROWS)
Specifically – I have no data for days where there were 0 rows. This is because lateral for these days returned 0 rows, and this “cancelled" row from generate_series.
We can work around it thanks to doing left join to lateral:
=$ SELECT d::DATE AS DAY, COUNT(l.id) AS events FROM generate_series('2022-09-01', '2022-09-30', '1 day'::INTERVAL) d LEFT JOIN lateral ( SELECT * FROM events e WHERE d::DATE BETWEEN e.event_start AND e.event_end ) AS l ON (TRUE) GROUP BY d.date ORDER BY d.date DAY | events ------------+-------- 2022-09-01 | 0 2022-09-02 | 0 2022-09-03 | 0 2022-09-04 | 0 2022-09-05 | 1 2022-09-06 | 3 2022-09-07 | 3 2022-09-08 | 3 2022-09-09 | 3 2022-09-10 | 3 2022-09-11 | 1 2022-09-12 | 1 2022-09-13 | 1 2022-09-14 | 1 2022-09-15 | 2 2022-09-16 | 2 2022-09-17 | 2 2022-09-18 | 2 2022-09-19 | 1 2022-09-20 | 1 2022-09-21 | 1 2022-09-22 | 1 2022-09-23 | 0 2022-09-24 | 0 2022-09-25 | 0 2022-09-26 | 0 2022-09-27 | 0 2022-09-28 | 0 2022-09-29 | 0 2022-09-30 | 0 (30 ROWS)
This brought back 0 counts, and this is because all columns returned by lateral, for days without events, were NULL. Which means that count(l.id) was not incrementing the count (count(…) increments the count only for non-null values).
For the left join I used weird join condition: on (true). This is because JOINs need join condition. In our case the real condition is built inside the lateral query (where d::date …), but still, syntax requires something to join on. Since any row returned by lateral will be “ok to join", then we're joining on condition that is always true.
Hope it was useful to at least one person, and if something isn't clear, or you'd like to know more, please ask 🙂