Yesterday on IRC someone asked:
Hi, how do I get top 5 values from a column group by another column??
From further discussion, I learned that:
total rows in table is 2 million. It'll have unique words of less than 1 million.. (approx count)
I didn't have time yesterday, but decided to write a solution, or two, to the problem.
First, let's think about solutions. Before 8.4, when you had neither window functions nor common table expressions, the only way was to use a subselect that counts the number of rows “before” a given one.
Something along the lines of:
SELECT u.username, u.some_ts, u.random_value
FROM test u
WHERE 4 > (
        SELECT COUNT(*)
        FROM test su
        WHERE su.username = u.username
            AND su.some_ts > u.some_ts
    )
ORDER BY u.username ASC, u.some_ts DESC
The problem with this approach is that it doesn't scale nicely. EXPLAIN shows:
                                                      QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=10000000835.91..10000000835.99 rows=33 width=20) (actual time=1.460..1.463 rows=44 loops=1)
   Sort Key: u.username, u.some_ts
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on test u  (cost=10000000000.00..10000000835.08 rows=33 width=20) (actual time=0.058..1.289 rows=44 loops=1)
         Filter: (4 > (SubPlan 1))
         Rows Removed by Filter: 56
         SubPlan 1
           ->  Aggregate  (cost=8.32..8.33 rows=1 width=0) (actual time=0.011..0.012 rows=1 loops=100)
                 ->  Index Only Scan using i on test su  (cost=0.00..8.31 rows=3 width=0) (actual time=0.008..0.010 rows=5 loops=100)
                       Index Cond: ((username = u.username) AND (some_ts > u.some_ts))
                       Heap Fetches: 451
 Total runtime: 1.575 ms
Costs over 10000000000 appear because I forcibly disabled seq scans – in my example the table has only 100 rows, and without disabling them Pg chose this plan:
                                                      QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=254.83..254.91 rows=33 width=20) (actual time=4.108..4.112 rows=44 loops=1)
   Sort Key: u.username, u.some_ts
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on test u  (cost=0.00..254.00 rows=33 width=20) (actual time=0.069..3.881 rows=44 loops=1)
         Filter: (4 > (SubPlan 1))
         Rows Removed by Filter: 56
         SubPlan 1
           ->  Aggregate  (cost=2.51..2.52 rows=1 width=0) (actual time=0.037..0.037 rows=1 loops=100)
                 ->  Seq Scan on test su  (cost=0.00..2.50 rows=3 width=0) (actual time=0.013..0.035 rows=5 loops=100)
                       Filter: ((some_ts > u.some_ts) AND (username = u.username))
                       Rows Removed by Filter: 95
 Total runtime: 4.215 ms
(12 rows)
Which is more or less the same, but instead of doing 100 Index Only Scans, it does 100 Seq Scans.
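By the way, in case you want to reproduce this – the “forcible disabling” of seq scans that produces those huge cost estimates is just one session-local setting:

-- discourage the planner from choosing sequential scans in this session
SET enable_seqscan = false;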
Anyway, the cost of running such a query is:
- a Seq Scan over the whole table
- an Index Only Scan (or Seq Scan) for every row
- finally, an ORDER BY over the result – in this case 44 rows (4 rows for every user, 11 users)
That's a lot.
Can it be done faster? Yes. But how depends on your case.
There are two very different cases:
- You have lots of “groups” (users in my example) – so many that you're fetching a large portion of the table anyway
- You have relatively few groups, and you're just getting a small subset of the table
The situation the guy on IRC had is clearly the first one, so I will write something for it first.
Of course, I need some “playground”:
CREATE TABLE test (
    username     TEXT,
    some_ts      TIMESTAMPTZ,
    random_value INT4
);
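One note: the plans above (and the comparison later on) use an index named i on (username, some_ts) – the same index I create explicitly in the second playground below – so it has to exist here as well:

CREATE INDEX i ON test (username, some_ts);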
Now some data in there:
INSERT INTO test (username, some_ts, random_value)
WITH users AS (
    SELECT
        'user #' || i AS username,
        CAST(1 + random() * random() * random() * 10 AS INT4) AS row_count
    FROM generate_series(1,1000000) i
)
SELECT
    u.username,
    now() - '1 year'::INTERVAL * random(),
    CAST(random() * 100000000 AS INT4)
FROM users u, generate_series(1,10) i(c)
WHERE i.c <= u.row_count;
And let's check whether the data matches his (the guy from IRC's) description:
SELECT COUNT(DISTINCT username), COUNT(*) FROM test;

  count  │  count
─────────┼─────────
 1000000 │ 2206611
(1 row)
Looks about right. I have slightly more rows, but it's not that big of a difference.
Last thing – do I even have any users with more than 5 rows?
WITH x AS (
    SELECT username, COUNT(*) AS rows_per_user
    FROM test
    GROUP BY username
)
SELECT rows_per_user, COUNT(*) AS different_users
FROM x
GROUP BY rows_per_user
ORDER BY rows_per_user;

 rows_per_user │ different_users
───────────────┼─────────────────
             1 │          424296
             2 │          280339
             3 │          132580
             4 │           72984
             5 │           43071
             6 │           23940
             7 │           13186
             8 │            6406
             9 │            2564
            10 │             634
(10 rows)
Most of the users have 1, 2 or 3 rows, but we even have some users with 10 rows. All as planned.
As a sanity check, I can calculate how many rows I should be getting, which is:
- 1 * 424296 +
- 2 * 280339 +
- 3 * 132580 +
- 4 * 72984 +
- 5 * 43071 +
- 5 * 23940 +
- 5 * 13186 +
- 5 * 6406 +
- 5 * 2564 +
- 5 * 634
for a total of 2123655 rows. Just 82956 rows will not be included.
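The same arithmetic can of course be done in SQL – a quick sanity-check query (just a convenience, capping each user at 5 rows with LEAST()):

SELECT SUM(LEAST(rows_per_user, 5)) AS expected_rows
FROM (
    SELECT COUNT(*) AS rows_per_user
    FROM test
    GROUP BY username
) x;
-- expected_rows should come out as 2123655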
And now, with the playground ready, I can finally write the query:
EXPLAIN ANALYZE
WITH numbered AS (
    SELECT *, row_number() OVER (PARTITION BY username ORDER BY some_ts) AS rowno
    FROM test
)
SELECT *
FROM numbered
WHERE rowno <= 5;

                                                      QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 CTE Scan on numbered  (cost=450693.32..500342.07 rows=735537 width=52) (actual time=17803.640..21914.244 rows=2123655 loops=1)
   Filter: (rowno <= 5)
   Rows Removed by Filter: 82956
   CTE numbered
     ->  WindowAgg  (cost=406561.10..450693.32 rows=2206611 width=24) (actual time=17803.629..20773.791 rows=2206611 loops=1)
           ->  Sort  (cost=406561.10..412077.63 rows=2206611 width=24) (actual time=17803.613..19666.841 rows=2206611 loops=1)
                 Sort Key: test.username, test.some_ts
                 Sort Method: external merge  Disk: 81960kB
                 ->  Seq Scan on test  (cost=0.00..38292.11 rows=2206611 width=24) (actual time=0.007..208.733 rows=2206611 loops=1)
 Total runtime: 22044.126 ms
(10 rows)
This explain is uploaded to explain.depesz.com.
A couple of notes:
- There is only one table scan – a Seq Scan
- Most of the time is spent sorting the data – but this can be “optimized” by simply adding more work_mem (see the snippet after this list). For example, after upping work_mem to '1GB', I got “Sort Method: quicksort Memory: 270696kB” and the query was ~5 seconds faster
- While we should add an explicit ORDER BY at the end of the query, it's not needed now (in current 9.3devel), because the data is already sorted correctly for the row_number() window function
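For reference, the work_mem change from the second note is session-local and trivial:

-- bump the per-sort memory limit for this session only;
-- pick a value that your RAM can actually afford
SET work_mem = '1GB';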
How does it compare to the query I tested earlier on the small dataset?
Explain for this query (the earlier count-based one, with the constant bumped to 5) is:
                                                      QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=28078918.96..28080757.81 rows=735537 width=24) (actual time=43410.638..45760.144 rows=2123655 loops=1)
   Sort Key: u.username, u.some_ts
   Sort Method: external merge  Disk: 78864kB
   ->  Seq Scan on test u  (cost=0.00..27977076.63 rows=735537 width=24) (actual time=0.111..25967.798 rows=2123655 loops=1)
         Filter: (5 > (SubPlan 1))
         Rows Removed by Filter: 82956
         SubPlan 1
           ->  Aggregate  (cost=12.65..12.66 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=2206611)
                 ->  Index Only Scan using i on test su  (cost=0.00..12.65 rows=1 width=0) (actual time=0.010..0.010 rows=1 loops=2206611)
                       Index Cond: ((username = u.username) AND (some_ts > u.some_ts))
                       Heap Fetches: 2482901
 Total runtime: 45850.787 ms
(12 rows)
As you can see (mostly on the explain.depesz.com page), all those index scans (2.2 million of them) actually take more time than the Sort. And we still need the sort for correct ordering of the output rows.
This solves the first problem, described yesterday on IRC, but what about the case with a low number of groups?
Some time ago I wrote about getting a list of unique elements, in a fast way, using a function that implemented a scan technique which doesn't exist in PostgreSQL – Skip Scan/Loose Indexscan.
Now, thanks to the existence of recursive CTEs, it should be possible to do it without functions.
But first, I need to tidy up my playground, and load a different set of data. This time it will be easier to write:
DROP TABLE test;
DROP TABLE

CREATE TABLE test (
    username     TEXT,
    some_ts      TIMESTAMPTZ,
    random_value INT4
);
CREATE TABLE

INSERT INTO test (username, some_ts, random_value)
    SELECT
        'user #' || CAST(FLOOR(random() * 10) AS INT4),
        now() - '1 year'::INTERVAL * random(),
        CAST(random() * 100000000 AS INT4)
    FROM generate_series(1,2000000);
INSERT 0 2000000

CREATE INDEX i ON test (username, some_ts);
CREATE INDEX

ANALYZE test;
ANALYZE

SELECT username, COUNT(*) FROM test GROUP BY username ORDER BY username;
 username │ count
──────────┼────────
 user #0  │ 199871
 user #1  │ 199939
 user #2  │ 200388
 user #3  │ 199849
 user #4  │ 200329
 user #5  │ 199504
 user #6  │ 199903
 user #7  │ 200799
 user #8  │ 199487
 user #9  │ 199931
(10 rows)
I have 10 users, each with ~ 200,000 rows. I will be selecting 5 rows for every user, so I should be getting 50 rows in output.
Since it now actually matters which 5 rows we get, let's assume I will be getting the newest 5 rows (ORDER BY some_ts DESC) for each user.
WITH RECURSIVE skip AS (
( SELECT t.username FROM test AS t ORDER BY t.username LIMIT 1 )
UNION ALL
(
SELECT
(
SELECT MIN( t2.username )
FROM test t2
WHERE t2.username > s.username
)
FROM skip s
WHERE s.username IS NOT NULL
)
),
with_data AS (
SELECT array(
SELECT t
FROM test t
WHERE t.username = s.username
ORDER BY t.some_ts DESC LIMIT 5
) AS ROWS
FROM skip s
WHERE s.username IS NOT NULL
)
SELECT (unnest( ROWS )).* FROM with_data
Did you just think “my eyes, they hurt”?
Well, it might be ugly, but it's fast. EXPLAIN shows that it took just 2.3 ms to get the results.
But how does it work?
- Lines 1-14 generate the list of unique usernames – just the usernames. This is done using a recursive CTE:
- Line 2 gets the first (smallest) username
- Lines 5-12 are called recursively, and fetch the next username on each iteration
The only issue is that we will get 11 rows – the final row will contain NULL (you can see this for yourself in the standalone example after this list). But it can be filtered out later on.
- Lines 15-23 get the actual data for all the rows
- Lines 16-21 get an array of rows (because we can't generate more rows than we got from the “skip” CTE). Each array contains the 5 newest rows for a given user. I don't have to SELECT username separately, because it's part of the row that is being compacted into the array.
- Line 23 removes the additional NULL row that I mentioned above
- Line 25 generates the final resultset. It gets 10 rows from with_data, each containing an array. Each array has 5 elements, and each of these elements is a row from the original test table. Now, we just need to:
- Unnest the arrays – this generates 50 rows, each with a single value, which is a row representation
- Unpack the rows – by using the (row_variable).* syntax
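Both tricks can be checked in isolation. Running the “skip” CTE standalone (this is the exact CTE from the query above, with a plain SELECT at the end) shows the NULL terminator:

WITH RECURSIVE skip AS (
    ( SELECT t.username FROM test AS t ORDER BY t.username LIMIT 1 )
    UNION ALL
    (
        SELECT ( SELECT MIN(t2.username) FROM test t2 WHERE t2.username > s.username )
        FROM skip s
        WHERE s.username IS NOT NULL
    )
)
SELECT * FROM skip;
-- returns the 10 usernames, plus one final row containing NULL (11 rows total)

And the array-packing/unnesting trick can be tried on its own – a minimal sketch, unrelated to the top-5 logic:

SELECT (unnest(ARRAY(SELECT t FROM test t LIMIT 3))).*;
-- packs 3 whole rows into an array of composites, then expands them back into normal rows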
Final result of the full query:
 username │            some_ts            │ random_value
──────────┼───────────────────────────────┼──────────────
 user #0  │ 2012-10-05 14:02:40.850461+02 │     38340423
 user #0  │ 2012-10-05 13:59:36.991261+02 │     47539361
 user #0  │ 2012-10-05 13:58:13.788061+02 │     90133298
 user #0  │ 2012-10-05 13:57:43.461661+02 │     39151654
 user #0  │ 2012-10-05 13:55:19.519261+02 │     23971279
 user #1  │ 2012-10-05 14:09:48.184861+02 │     68225136
 user #1  │ 2012-10-05 14:07:50.853661+02 │     23783242
 user #1  │ 2012-10-05 14:02:39.036061+02 │     45015932
 user #1  │ 2012-10-05 13:58:33.400861+02 │       856787
 user #1  │ 2012-10-05 13:57:44.066461+02 │     97875281
 user #2  │ 2012-10-05 14:06:49.941661+02 │     44353630
 user #2  │ 2012-10-05 14:05:41.426461+02 │     56720972
 user #2  │ 2012-10-05 14:02:29.100061+02 │     47825604
 user #2  │ 2012-10-05 14:00:09.391261+02 │     75955113
 user #2  │ 2012-10-05 13:59:13.663261+02 │     14216525
 user #3  │ 2012-10-05 14:01:00.367261+02 │     31505609
 user #3  │ 2012-10-05 13:59:47.445661+02 │     60049729
 user #3  │ 2012-10-05 13:55:49.240861+02 │     89598870
 user #3  │ 2012-10-05 13:53:57.266461+02 │     33846849
 user #3  │ 2012-10-05 13:53:33.852061+02 │     82449747
 user #4  │ 2012-10-05 14:06:02.853661+02 │     63597247
 user #4  │ 2012-10-05 14:05:51.708061+02 │     54494125
 user #4  │ 2012-10-05 14:02:59.599261+02 │     29407125
 user #4  │ 2012-10-05 14:01:06.415261+02 │     17218964
 user #4  │ 2012-10-05 14:00:17.080861+02 │     62982517
 user #5  │ 2012-10-05 14:07:27.612061+02 │     32170514
 user #5  │ 2012-10-05 14:01:53.589661+02 │      7295343
 user #5  │ 2012-10-05 14:01:11.512861+02 │     27237756
 user #5  │ 2012-10-05 14:00:58.639261+02 │     26827360
 user #5  │ 2012-10-05 14:00:55.096861+02 │     41067140
 user #6  │ 2012-10-05 14:09:33.324061+02 │     95999924
 user #6  │ 2012-10-05 14:07:10.072861+02 │        72678
 user #6  │ 2012-10-05 14:06:41.474461+02 │     37179557
 user #6  │ 2012-10-05 13:54:30.444061+02 │     54959828
 user #6  │ 2012-10-05 13:49:53.532061+02 │     90795817
 user #7  │ 2012-10-05 14:08:57.900061+02 │     83603366
 user #7  │ 2012-10-05 14:07:47.656861+02 │     16851893
 user #7  │ 2012-10-05 14:04:29.109661+02 │     45867068
 user #7  │ 2012-10-05 14:01:53.762461+02 │     89712017
 user #7  │ 2012-10-05 13:56:15.074461+02 │     86761399
 user #8  │ 2012-10-05 14:04:05.436061+02 │     46245277
 user #8  │ 2012-10-05 14:01:53.330461+02 │     48946212
 user #8  │ 2012-10-05 14:00:02.306461+02 │     77759370
 user #8  │ 2012-10-05 13:52:22.831261+02 │     23588197
 user #8  │ 2012-10-05 13:51:34.533661+02 │     88615369
 user #9  │ 2012-10-05 14:05:37.192861+02 │     66530600
 user #9  │ 2012-10-05 14:01:28.360861+02 │     68077186
 user #9  │ 2012-10-05 14:00:46.024861+02 │     30766930
 user #9  │ 2012-10-05 14:00:12.328861+02 │     34347000
 user #9  │ 2012-10-05 13:59:38.892061+02 │     59314056
(50 rows)
There are some issues with it, though. If the returned rows were very wide, using arrays could become problematic.
In such a case you can/should use the solution by RhodiumToad (the 2nd example).
That would be it. I mean – you can still use the function approach, and in 9.3 you will be able to use LATERAL for such cases:
WITH RECURSIVE skip AS (
    ( SELECT t.username FROM test AS t ORDER BY t.username LIMIT 1 )
    UNION ALL
    (
        SELECT (
                SELECT MIN( t2.username )
                FROM test t2
                WHERE t2.username > s.username
            )
        FROM skip s
        WHERE s.username IS NOT NULL
    )
)
SELECT x.*
FROM skip s,
    LATERAL (
        SELECT t.*
        FROM test t
        WHERE t.username = s.username
        ORDER BY t.some_ts DESC
        LIMIT 5
    ) AS x
As a final thought, I would like to thank RhodiumToad for his help, and for writing very interesting wiki pages.
And now, I'm off to analyze what he wrote there, and try to understand it.