In PostgreSQL 8.4 we got CTEs – Common Table Expressions. Since then we have had this great tool available, but apparently for some people it's still black magic. CuTE, but still magic. I'll try to make it a bit less magical, and more understandable.
The most basic way you can use a CTE is with a query like:
WITH base_info AS (
    SELECT
        oid::regclass AS table_name, pg_table_size(oid) AS size
    FROM
        pg_class
    WHERE
        relkind = 'r'
)
SELECT *
FROM base_info
ORDER BY size DESC LIMIT 10;
Result of the query looks like this:
   table_name   |  size
----------------+--------
 pg_proc        | 548864
 pg_rewrite     | 507904
 pg_depend      | 417792
 test           | 393216
 pg_attribute   | 376832
 pg_description | 286720
 pg_statistic   | 221184
 pg_operator    | 147456
 pg_class       |  98304
 pg_type        |  98304
(10 rows)
Let's analyze what's inside the query.
- Line #1 – starts with the word “WITH" – this is the tell-tale sign of CTEs. Every query that uses them has to start with WITH
- Still line #1 – “base_info" – name of the CTE, which will be defined after “AS", within parentheses. Any name is acceptable – basically it is just like a table or view name
- Lines 2-7 – definition of the CTE. This is a query that can return any number of rows – you can think of it as an inlined VIEW (though there are differences)
- Line 8 – closing parenthesis that finishes the current CTE. We could have more than one in a given query, using syntax like: with a as (…), b as (…), c as (…), but for now a single CTE is enough
- Lines 9-11 – the actual query, which uses the CTE just as it would any table/view – simply by providing its name in the FROM clause.
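As a side note on the “more than one" syntax – later CTEs can reference earlier ones, which is what makes the block-by-block style possible. A minimal sketch (the CTE names here are made up for illustration):

```sql
WITH table_sizes AS (
    SELECT oid::regclass AS table_name, pg_table_size(oid) AS size
    FROM pg_class
    WHERE relkind = 'r'
), totals AS (
    -- second CTE refers to the first one, just like a table
    SELECT sum(size) AS total_size FROM table_sizes
)
SELECT ts.table_name, ts.size, t.total_size
FROM table_sizes ts, totals t
ORDER BY ts.size DESC LIMIT 10;
```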
Hope that's not scary.
Now you could ask: well, OK. But what is the point of using CTE when you can simply have VIEWs?
The answer has two parts. First – because it can be simpler. Instead of going the whole route of thinking about a view, creating it, granting privileges, and so on – you just embed the view in your query, and you're done.
But the other, much more important (in my opinion) thing is that CTEs, unlike VIEWs, are treated as separate entities, and behave like an “optimization wall". (Note for readers on newer versions: starting with PostgreSQL 12 the planner can inline simple CTEs, so to get this wall behavior there you have to write AS MATERIALIZED.)
What wall? Let me show you simple example:
DROP TABLE IF EXISTS test;
CREATE TABLE test (
    id INT4 NOT NULL,
    some_val INT4 NOT NULL
);
INSERT INTO test (id, some_val)
    SELECT i, CASE WHEN random() < 0.5 THEN 1 ELSE 2 END
    FROM generate_series(1,1000) i;
INSERT INTO test (id, some_val)
    SELECT i, CAST(2 + random() * 100 AS INT4)
    FROM generate_series(1001, 10000) i;
ALTER TABLE test ADD PRIMARY KEY (id);
CREATE INDEX i1 ON test (some_val);
ANALYZE test;
This is a relatively simple table, with 10000 rows. Each row has 2 columns: id – primary key, integer from 1 to 10000, and some_val, which is basically random.
What do I mean by “basically"? Well, the values range from 1 to 102, but values 1 and 2 are much more common (~5 times as common as the rest), and 1 happens only in the first 1000 records.
Now, let's see simple query that is looking for some_val = 1:
EXPLAIN ANALYZE SELECT * FROM test WHERE some_val = 1;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=11.97..62.94 rows=478 width=8) (actual time=0.113..0.208 rows=478 loops=1)
   Recheck Cond: (some_val = 1)
   ->  Bitmap Index Scan on i1  (cost=0.00..11.85 rows=478 width=0) (actual time=0.089..0.089 rows=478 loops=1)
         Index Cond: (some_val = 1)
 Total runtime: 0.286 ms
(5 rows)
Works OK – it did use index on some_val column, fetched all rows (478 of them).
What would happen if I'd want just 5 newest rows with some_val = 1?
EXPLAIN ANALYZE SELECT * FROM test WHERE some_val = 1 ORDER BY id DESC LIMIT 5;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.59 rows=5 width=8) (actual time=3.117..3.121 rows=5 loops=1)
   ->  Index Scan Backward using test_pkey on test  (cost=0.00..343.26 rows=478 width=8) (actual time=3.115..3.117 rows=5 loops=1)
         Filter: (some_val = 1)
         Rows Removed by Filter: 9001
 Total runtime: 3.163 ms
(5 rows)
Please note that the planner did something rather stupid here: instead of getting the 478 rows that match the where condition, it scanned almost the whole table backwards by id, discarded the first 9001 rows it found, and then, finally, returned the 5 newest rows with some_val = 1.
( side note: this example is one of the reasons I'm firm believer that we should have planner HINTs, despite all the opposition from some developers )
Why did it happen? Well, without going into too much detail – PostgreSQL assumed some things that are not true, and ran the query as if they were. So it was slower than it could be.
Let's see how I can make PostgreSQL behave, using the CTE optimization barrier to fight the stupid decision:
EXPLAIN ANALYZE
WITH i_know_what_im_doing AS (
    SELECT * FROM test WHERE some_val = 1
)
SELECT * FROM i_know_what_im_doing ORDER BY id DESC LIMIT 5;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=80.44..80.45 rows=5 width=8) (actual time=0.526..0.527 rows=5 loops=1)
   CTE i_know_what_im_doing
     ->  Bitmap Heap Scan on test  (cost=11.97..62.94 rows=478 width=8) (actual time=0.101..0.206 rows=478 loops=1)
           Recheck Cond: (some_val = 1)
           ->  Bitmap Index Scan on i1  (cost=0.00..11.85 rows=478 width=0) (actual time=0.080..0.080 rows=478 loops=1)
                 Index Cond: (some_val = 1)
   ->  Sort  (cost=17.50..18.69 rows=478 width=8) (actual time=0.523..0.523 rows=5 loops=1)
         Sort Key: i_know_what_im_doing.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on i_know_what_im_doing  (cost=0.00..9.56 rows=478 width=8) (actual time=0.106..0.360 rows=478 loops=1)
 Total runtime: 0.596 ms
(11 rows)
Now. Please note that PostgreSQL did use the i1 index to find matching rows. Afterwards, these rows were sorted using a normal top-N heapsort. It's faster.
How would a view work in here?
CREATE VIEW im_lost AS SELECT * FROM test WHERE some_val = 1;
EXPLAIN ANALYZE SELECT * FROM im_lost ORDER BY id DESC LIMIT 5;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.59 rows=5 width=8) (actual time=2.946..2.948 rows=5 loops=1)
   ->  Index Scan Backward using test_pkey on test  (cost=0.00..343.26 rows=478 width=8) (actual time=2.943..2.945 rows=5 loops=1)
         Filter: (some_val = 1)
         Rows Removed by Filter: 9001
 Total runtime: 2.997 ms
(5 rows)
As you can see – adding the view made Pg use the bad plan again. That's because views are internally rewrite rules – the view's definition gets merged into the query before planning.
Another benefit is that CTE data is calculated once.
What does it mean?
Let's make a test function that takes a long time to calculate something. In our case it will “calculate" a random number, in the range 0-1000:
CREATE FUNCTION hard_work() RETURNS int4 AS $$
BEGIN
    PERFORM pg_sleep(0.1);
    RETURN floor( random() * 1000 );
END;
$$ LANGUAGE plpgsql;
Of course this function doesn't make any sense; it's just here to show you some facts in a simple, numeric way.
Now, let's assume that we want to call it 10 times:
select hard_work() from generate_series(1,10);
But let's also assume we don't know what it returns, so we want to get the sum of the returned values as well.
Let's start with a view approach:
CREATE VIEW lots_of_work AS SELECT hard_work() AS h FROM generate_series(1,10);
Now, I can write simple query:
SELECT *, (SELECT sum(h) FROM lots_of_work) AS sum FROM lots_of_work;
  h  | sum
-----+------
 519 | 4111
 729 | 4111
 140 | 4111
 945 | 4111
  69 | 4111
 765 | 4111
 179 | 4111
 751 | 4111
 714 | 4111
 462 | 4111
(10 rows)

Time: 2007.982 ms
Before I analyze it, let's see how it works in the case of a CTE:
WITH less_work AS (
    SELECT hard_work() AS h FROM generate_series(1,10)
)
SELECT h, (SELECT sum(h) FROM less_work) AS sum FROM less_work;
  h  | sum
-----+------
 319 | 3510
  43 | 3510
  26 | 3510
 484 | 3510
 455 | 3510
  75 | 3510
 473 | 3510
  69 | 3510
 851 | 3510
 715 | 3510
(10 rows)

Time: 1005.168 ms
The values are of course different (random, after all), but that's not the point. First of all – please notice that the runtime of the CTE version is half that of the VIEW version.
And also – please note that while sum in case of CTE is correct ( 319 + 43 + 26 + 484 + 455 + 75 + 473 + 69 + 851 + 715 = 3510 ), in case of VIEW – the sum is incorrect ( 519 + 729 + 140 + 945 + 69 + 765 + 179 + 751 + 714 + 462 = 5273 ).
Why is that so? Very simple – a view rewrites the query, while a CTE is “materialized". This means that in the VIEW version our slow function has to be called 20 times (10 times to return records, and 10 more times to produce the values being summed). In the CTE case – the function is called 10 times, and the returned values are available for reuse when summing.
This can be seen in these explain analyze plans:
EXPLAIN ANALYZE SELECT *, (SELECT sum(h) FROM lots_of_work) AS sum FROM lots_of_work;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on lots_of_work  (cost=272.51..542.51 rows=1000 width=4) (actual time=1104.304..2006.659 rows=10 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=272.50..272.51 rows=1 width=4) (actual time=1002.646..1002.646 rows=1 loops=1)
           ->  Function Scan on generate_series generate_series_1  (cost=0.00..260.00 rows=1000 width=0) (actual time=100.268..1002.633 rows=10 loops=1)
   ->  Function Scan on generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=101.654..1004.003 rows=10 loops=1)
 Total runtime: 2006.778 ms
(6 rows)
You can see now that there are two separate calls to generate_series(). With CTE, just one:
EXPLAIN ANALYZE
WITH less_work AS (
    SELECT hard_work() AS h FROM generate_series(1,10)
)
SELECT h, (SELECT sum(h) FROM less_work) AS sum FROM less_work;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on less_work  (cost=282.51..302.51 rows=1000 width=4) (actual time=1004.037..1004.041 rows=10 loops=1)
   CTE less_work
     ->  Function Scan on generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=101.640..1003.990 rows=10 loops=1)
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=22.50..22.51 rows=1 width=4) (actual time=902.390..902.390 rows=1 loops=1)
           ->  CTE Scan on less_work less_work_1  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.001..902.377 rows=10 loops=1)
 Total runtime: 1004.142 ms
(7 rows)
Does that mean that you should always use CTEs and never views? Well, no.
Let's see how many rows there are for every some_val in our test table. First, I'll do it by defining a view:
CREATE VIEW smart_view AS SELECT some_val, COUNT(*) FROM test GROUP BY some_val;
and afterwards, I can use it in a query:
EXPLAIN ANALYZE SELECT * FROM smart_view WHERE some_val = 3;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=4.94..51.49 rows=1 width=4) (actual time=0.288..0.288 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=4.94..51.04 rows=88 width=4) (actual time=0.132..0.271 rows=81 loops=1)
         Recheck Cond: (some_val = 3)
         ->  Bitmap Index Scan on i1  (cost=0.00..4.92 rows=88 width=0) (actual time=0.116..0.116 rows=81 loops=1)
               Index Cond: (some_val = 3)
 Total runtime: 0.396 ms
(6 rows)
Now, let's see how it would work with view definition as CTE:
EXPLAIN ANALYZE
WITH not_so_smart_cte AS (
    SELECT some_val, COUNT(*) FROM test GROUP BY some_val
)
SELECT * FROM not_so_smart_cte WHERE some_val = 3;
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 CTE Scan on not_so_smart_cte  (cost=196.02..198.31 rows=1 width=12) (actual time=6.165..6.232 rows=1 loops=1)
   Filter: (some_val = 3)
   Rows Removed by Filter: 101
   CTE not_so_smart_cte
     ->  HashAggregate  (cost=195.00..196.02 rows=102 width=4) (actual time=6.145..6.164 rows=102 loops=1)
           ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.014..1.827 rows=10000 loops=1)
 Total runtime: 6.334 ms
(7 rows)
The difference should be clearly visible – the view approach was able to use the “some_val = 3" condition to fetch fewer rows from the table, and then count just those.
The CTE approach is different – first it takes all records (rows=10000), then it aggregates them, and then, from the aggregate, it takes a single row.
As you can see – there is no silver bullet. There are cases where CTEs are better, and there are cases when CTEs are worse. I like to use them, because it lets me write my query in a sort of “block-by-block" way. Like: first, I need some rows from here, so I'll build CTE that does it. Then, based on these rows I need something else, so I add another CTE, and so on.
All I have written up to now is about simple CTEs – just “views in disguise". But CTEs have one big trick up their sleeve: recursive cuteness.
We all know that to understand recursion you have to first understand recursion. How can I explain it, then?
First, let's see some simple recursive CTE query:
WITH RECURSIVE i AS (
    SELECT 1 AS x
    UNION ALL
    SELECT x+1 FROM i WHERE x < 10
)
SELECT * FROM i;
 x
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)
That's it. The key things to see/understand:
- a recursive CTE needs to be marked using “WITH RECURSIVE"
- a recursive CTE consists of two queries joined with UNION or UNION ALL. In my case these are “SELECT 1 as x" and “SELECT x+1 FROM i WHERE x < 10"
You might ask: what if I have multiple CTEs in a single query, and one of them – not the first one – is recursive, while the others are not? It's simple – as long as any of the CTEs is recursive, you have to use “WITH RECURSIVE".
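For example, this is valid – RECURSIVE is written once, after WITH, even though only the second CTE actually recurses (a toy sketch, names made up):

```sql
WITH RECURSIVE limits AS (
    -- plain, non-recursive CTE
    SELECT 5 AS max_x
), i AS (
    -- recursive CTE; it may also reference the earlier CTE
    SELECT 1 AS x
    UNION ALL
    SELECT x+1 FROM i, limits WHERE x < limits.max_x
)
SELECT * FROM i;
```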
Let's get a bit more complicated query, so we can see what's happening and how.
For my tests, I will be using a copy of the DMOZ category structure that I downloaded a long time ago. You can get the data in sql-dump format from here so you can follow along with my examples.
To have rather small, but interesting sample data, I chose these categories:
  id   | parent_id |                         label
-------+-----------+--------------------------------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
 92992 |     92991 | Arts_and_Sciences
 92993 |     92991 | Economics_and_Administrative_Sciences
 92994 |     92991 | Education
 92995 |     92991 | Engineering
 92997 |     92996 | Kandilli_Observatory_Earthquake_and_Research_Institute
 92999 |     92998 | Applied_Disciplines
 93000 |     92998 | Foreign_Languages
(15 rows)
When you create paths, based on the id/parent_id relations, you'll get:
  id   | parent_id |                                             path
-------+-----------+----------------------------------------------------------------------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Turkey/Bilkent_University
 92989 |     92988 | Turkey/Bilkent_University/Departments
 92990 |     92987 | Turkey/Bogazici_University
 92991 |     92990 | Turkey/Bogazici_University/Faculties
 92992 |     92991 | Turkey/Bogazici_University/Faculties/Arts_and_Sciences
 92993 |     92991 | Turkey/Bogazici_University/Faculties/Economics_and_Administrative_Sciences
 92994 |     92991 | Turkey/Bogazici_University/Faculties/Education
 92995 |     92991 | Turkey/Bogazici_University/Faculties/Engineering
 92996 |     92990 | Turkey/Bogazici_University/Institutes
 92997 |     92996 | Turkey/Bogazici_University/Institutes/Kandilli_Observatory_Earthquake_and_Research_Institute
 92998 |     92990 | Turkey/Bogazici_University/Schools
 92999 |     92998 | Turkey/Bogazici_University/Schools/Applied_Disciplines
 93000 |     92998 | Turkey/Bogazici_University/Schools/Foreign_Languages
 93001 |     92987 | Turkey/Middle_East_Technical_University
(15 rows)
The exercise is to get all kids for id = 92987 using recursive CTE.
We start by getting the first element – the one that will be the start of our recursion. So, clearly it should be:
$ SELECT * FROM dmoz WHERE id = 92987;
  id   | parent_id | label
-------+-----------+--------
 92987 |     92960 | Turkey
(1 row)
So far, so good. Now, we need to add kids. First, think about just getting direct subcategories, which can obviously be done with:
$ SELECT * FROM dmoz WHERE parent_id = 92987;
  id   | parent_id |              label
-------+-----------+----------------------------------
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
(3 rows)
So far, so good. We could then continue, without using CTE, by doing something like:
$ SELECT * FROM dmoz WHERE parent_id IN ( SELECT id FROM dmoz WHERE parent_id = 92987 );
  id   | parent_id |    label
-------+-----------+-------------
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
(4 rows)
But this gets tedious pretty soon. And, there is also another problem. Let's assume you write the above 3 queries as one, using UNION ALL:
$ SELECT * FROM dmoz WHERE id = 92987
UNION ALL
SELECT * FROM dmoz WHERE parent_id = 92987
UNION ALL
SELECT * FROM dmoz WHERE parent_id IN ( SELECT id FROM dmoz WHERE parent_id = 92987 );
  id   | parent_id |              label
-------+-----------+----------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
(8 rows)
Of course, it works, but please note that we would have to add yet another query just to get the 4 levels that we need in our case – and technically, we would have to keep adding queries if we didn't know how deep the structure goes.
Recursive CTE is a great solution, because you can rewrite a potentially infinite query (UNION-ALL style, like above) into something way simpler:
$ WITH RECURSIVE all_kids AS (
SELECT * FROM dmoz WHERE id = 92987
UNION ALL
SELECT d.*
FROM dmoz d
JOIN all_kids ak ON d.parent_id = ak.id
)
SELECT * FROM all_kids;
id | parent_id | label
-------+-----------+--------------------------------------------------------
92987 | 92960 | Turkey
92988 | 92987 | Bilkent_University
92990 | 92987 | Bogazici_University
93001 | 92987 | Middle_East_Technical_University
92989 | 92988 | Departments
92991 | 92990 | Faculties
92996 | 92990 | Institutes
92998 | 92990 | Schools
92992 | 92991 | Arts_and_Sciences
92993 | 92991 | Economics_and_Administrative_Sciences
92994 | 92991 | Education
92995 | 92991 | Engineering
92997 | 92996 | Kandilli_Observatory_Earthquake_and_Research_Institute
92999 | 92998 | Applied_Disciplines
93000 | 92998 | Foreign_Languages
(15 ROWS)
That was fast. In writing at least 🙂
Let's see how it works:
- Line #1 – marks that we will be doing some recursion (WITH RECURSIVE), and starts the “all_kids" CTE.
- Line #2 – the initial row for recursion – the row that we know how to get by id
- Line #3 – UNION ALL – as I said, a recursive CTE is two queries with UNION or UNION ALL in between
- Lines #4-#5 – we will be returning all columns from table dmoz
- Line #6 – rows to return should have a parent_id that is listed in the all_kids CTE
- Line #7 – ends the CTE
- Line #8 – returns rows from all_kids
You might wonder: on the first iteration, the 2nd query (after UNION ALL) is equivalent to select * from dmoz where parent_id = 92987, because that's the only row in all_kids so far. But what happens on subsequent iterations? Which rows are taken into consideration – just the newly added ones, or is 92987 (the top level) still included? The answer: on each run only the newly added rows are “visible" to the recursive part. We can see it by adding a simple window function:
$ WITH RECURSIVE all_kids AS (
    SELECT *, '{}'::int4[] AS parents FROM dmoz WHERE id = 92987
    UNION ALL
    SELECT d.*, array_agg(ak.id) OVER ()
    FROM dmoz d
    JOIN all_kids ak ON d.parent_id = ak.id
)
SELECT id, parent_id, label, array( SELECT DISTINCT unnest( parents ) ) AS parents
FROM all_kids;
  id   | parent_id |                         label                          |       parents
-------+-----------+--------------------------------------------------------+---------------------
 92987 |     92960 | Turkey                                                 | {}
 92988 |     92987 | Bilkent_University                                     | {92987}
 92990 |     92987 | Bogazici_University                                    | {92987}
 93001 |     92987 | Middle_East_Technical_University                       | {92987}
 92989 |     92988 | Departments                                            | {92988,92990}
 92991 |     92990 | Faculties                                              | {92988,92990}
 92996 |     92990 | Institutes                                             | {92988,92990}
 92998 |     92990 | Schools                                                | {92988,92990}
 92992 |     92991 | Arts_and_Sciences                                      | {92998,92996,92991}
 92993 |     92991 | Economics_and_Administrative_Sciences                  | {92998,92996,92991}
 92994 |     92991 | Education                                              | {92998,92996,92991}
 92995 |     92991 | Engineering                                            | {92998,92996,92991}
 92997 |     92996 | Kandilli_Observatory_Earthquake_and_Research_Institute | {92998,92996,92991}
 92999 |     92998 | Applied_Disciplines                                    | {92998,92996,92991}
 93000 |     92998 | Foreign_Languages                                      | {92998,92996,92991}
(15 rows)
(The additional logic – array_agg() in the CTE, and then the array() call – is there to build the list of parents and make it distinct)
As you can see, each subsequent iteration sees in all_kids just the rows that were added there by the previous run.
Of course recursive CTE is not limited to working on trees. It can be used to do many things – solving the travelling salesman problem, finding cheapest paths in a graph, and many others.
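As a small sketch of the non-tree case – assuming a hypothetical edges(src, dst, cost) table describing a graph – you can enumerate all routes from a starting node, carrying an array of visited nodes to guard against cycles:

```sql
WITH RECURSIVE paths AS (
    -- start from all edges leaving node 1
    SELECT src, dst, cost, ARRAY[src, dst] AS route
    FROM edges
    WHERE src = 1
    UNION ALL
    -- extend each path by one edge, skipping nodes already on the route
    SELECT p.src, e.dst, p.cost + e.cost, p.route || e.dst
    FROM paths p
    JOIN edges e ON e.src = p.dst
    WHERE e.dst <> ALL(p.route)
)
SELECT * FROM paths ORDER BY dst, cost;
```

Picking the cheapest route per destination is then just a matter of DISTINCT ON (dst) or a window function on top of this.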
Finally, since 9.1 we have writable CTEs. This is basically a normal CTE, but where the source of rows is not a SELECT, but rather some write operation (INSERT, UPDATE, DELETE) with a RETURNING clause.
This is rather simple, we can do stuff like:
$ WITH ids_to_delete AS (
    SELECT id FROM dmoz ORDER BY id DESC LIMIT 5
), delete_it AS (
    DELETE FROM dmoz WHERE id IN (SELECT id FROM ids_to_delete )
    RETURNING *
)
SELECT string_agg(label, ', ') FROM delete_it;
                        string_agg
----------------------------------------------------------
 Vietnamese, Vinodam, Ukrainian, Vaarthaa_Patrikalu, Thai
(1 row)
Here, I first used a normal CTE (ids_to_delete) to get some ids of rows to delete, then I used delete_it as a writable CTE – that is, it removed the rows, and the delete_it CTE contained the deleted rows. And finally – I got the labels of all deleted rows as a single column.
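Writable CTEs also make it easy to move rows between tables in a single statement – for example, assuming a hypothetical dmoz_archive table with the same structure as dmoz:

```sql
WITH deleted AS (
    -- remove the rows and hand them over via RETURNING
    DELETE FROM dmoz
    WHERE id IN ( SELECT id FROM dmoz ORDER BY id DESC LIMIT 5 )
    RETURNING *
)
INSERT INTO dmoz_archive SELECT * FROM deleted;
```

This happens atomically – either the rows are both deleted and archived, or neither.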
I hope this will make this CuTE feature of PostgreSQL simpler to understand and use. As always – if anything is not clear, please let me know so I can fix it.
I'd just like to note that this “optimization wall" thing has frequently hit me from the opposite side of the equation. I often like to use CTEs to make my query clearer and more organized – but because of this effect, it can result in an unworkable query plan. (As in: going from less than a second to potentially multiple hours. Not just getting a little less efficient.)
That kind of situation, where the structure of the query must be changed in order to influence the query plan, is exactly what I never want in my DBMS. So sign me up for “hints would be a very good thing to have, particularly if they don’t have to be in-line in the body of the query”, and a desire that the query planner also do a better job of making a larger variety of similar queries behave the same way.
That is so cute