Yesterday on IRC someone asked:
Hi, how do I get top 5 values from a column group by another column??
From further discussion, I learned that:
total rows in table is 2 million. It'll have unique words of less than 1 million.. (approx count)
I didn't have time yesterday, but decided to write a solution, or two, to the problem.
First, let's think about solutions. Before 8.4, when you had neither window functions nor common table expressions, the only way was to use a subselect that counts the number of rows “before” a given one.
Something along the lines of:
SELECT u.username, u.some_ts, u.random_value
FROM test u
WHERE 4 > (
        SELECT COUNT(*)
        FROM test su
        WHERE su.username = u.username
            AND su.some_ts > u.some_ts
    )
ORDER BY u.username ASC, u.some_ts DESC
The problem with this approach is that it doesn't scale nicely. EXPLAIN shows:
                                                      QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=10000000835.91..10000000835.99 rows=33 width=20) (actual time=1.460..1.463 rows=44 loops=1)
   Sort Key: u.username, u.some_ts
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on test u  (cost=10000000000.00..10000000835.08 rows=33 width=20) (actual time=0.058..1.289 rows=44 loops=1)
         Filter: (4 > (SubPlan 1))
         Rows Removed by Filter: 56
         SubPlan 1
           ->  Aggregate  (cost=8.32..8.33 rows=1 width=0) (actual time=0.011..0.012 rows=1 loops=100)
                 ->  Index Only Scan using i on test su  (cost=0.00..8.31 rows=3 width=0) (actual time=0.008..0.010 rows=5 loops=100)
                       Index Cond: ((username = u.username) AND (some_ts > u.some_ts))
                       Heap Fetches: 451
 Total runtime: 1.575 ms
Costs over 10000000000 appear because I forcibly disabled seq scans – in my example the table has only 100 rows, and without disabling them Pg chose this plan:
                                                      QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=254.83..254.91 rows=33 width=20) (actual time=4.108..4.112 rows=44 loops=1)
   Sort Key: u.username, u.some_ts
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on test u  (cost=0.00..254.00 rows=33 width=20) (actual time=0.069..3.881 rows=44 loops=1)
         Filter: (4 > (SubPlan 1))
         Rows Removed by Filter: 56
         SubPlan 1
           ->  Aggregate  (cost=2.51..2.52 rows=1 width=0) (actual time=0.037..0.037 rows=1 loops=100)
                 ->  Seq Scan on test su  (cost=0.00..2.50 rows=3 width=0) (actual time=0.013..0.035 rows=5 loops=100)
                       Filter: ((some_ts > u.some_ts) AND (username = u.username))
                       Rows Removed by Filter: 95
 Total runtime: 4.215 ms
(12 rows)
Which is more or less the same, but instead of doing 100 Index Only Scans, it does 100 Seq Scans.
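By the way, in case you want to reproduce this – the “forcible disabling” of seq scans that produces those huge cost estimates is just one session-local setting:

-- discourage the planner from choosing sequential scans in this session
SET enable_seqscan = false;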
Anyway, the cost of running such a query is:
- a Seq Scan over the whole table
- an Index Only Scan (or Seq Scan) for every row
- finally, an ORDER BY over the result – in this case 44 rows (4 rows for every user, 11 users)
That's a lot.
Can it be done faster? Yes. But how depends on your case.
There are two very different cases:
- You have lots of “groups” (users in my example) – so many that you're fetching a large portion of the table anyway
- You have relatively few groups, and you're just getting a small subset of the table
The situation the guy on IRC had is clearly the first one, so I will write something for it first.
Of course, I need some “playground”:
CREATE TABLE test (
    username     TEXT,
    some_ts      TIMESTAMPTZ,
    random_value INT4
);
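One note: the plans above (and the comparison later on) use an index named i on (username, some_ts) – the same index I create explicitly in the second playground below – so it has to exist here as well:

CREATE INDEX i ON test (username, some_ts);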
Now some data in there:
INSERT INTO test (username, some_ts, random_value)
WITH users AS (
    SELECT
        'user #' || i AS username,
        CAST(1 + random() * random() * random() * 10 AS INT4) AS row_count
    FROM generate_series(1,1000000) i
)
SELECT
    u.username,
    now() - '1 year'::INTERVAL * random(),
    CAST(random() * 100000000 AS INT4)
FROM users u, generate_series(1,10) i(c)
WHERE i.c <= u.row_count;
And let's check whether the data matches his (the guy from IRC's) description:
SELECT COUNT(DISTINCT username), COUNT(*) FROM test;

  count  │  count
─────────┼─────────
 1000000 │ 2206611
(1 row)
Looks about right. I have slightly more rows, but it's not that big of a difference.
Last thing – do I even have any users with more than 5 rows?
WITH x AS (
    SELECT username, COUNT(*) AS rows_per_user
    FROM test
    GROUP BY username
)
SELECT rows_per_user, COUNT(*) AS different_users
FROM x
GROUP BY rows_per_user
ORDER BY rows_per_user;

 rows_per_user │ different_users
───────────────┼─────────────────
             1 │          424296
             2 │          280339
             3 │          132580
             4 │           72984
             5 │           43071
             6 │           23940
             7 │           13186
             8 │            6406
             9 │            2564
            10 │             634
(10 rows)
Most of the users have 1, 2 or 3 rows, but we even have some users with 10 rows. All as planned.
As a sanity check, I can calculate how many rows I should be getting, which is:
- 1 * 424296 +
- 2 * 280339 +
- 3 * 132580 +
- 4 * 72984 +
- 5 * 43071 +
- 5 * 23940 +
- 5 * 13186 +
- 5 * 6406 +
- 5 * 2564 +
- 5 * 634
for a total of 2123655 rows. Just 82956 rows will not be included.
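The same arithmetic can of course be done in SQL – a quick sanity-check query (just a convenience, capping each user at 5 rows with LEAST()):

SELECT SUM(LEAST(rows_per_user, 5)) AS expected_rows
FROM (
    SELECT COUNT(*) AS rows_per_user
    FROM test
    GROUP BY username
) x;
-- expected_rows should come out as 2123655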
And now, with the playground ready, I can finally write the query:
EXPLAIN ANALYZE
WITH numbered AS (
    SELECT *, row_number() OVER (PARTITION BY username ORDER BY some_ts) AS rowno
    FROM test
)
SELECT *
FROM numbered
WHERE rowno <= 5;

                                                      QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 CTE Scan on numbered  (cost=450693.32..500342.07 rows=735537 width=52) (actual time=17803.640..21914.244 rows=2123655 loops=1)
   Filter: (rowno <= 5)
   Rows Removed by Filter: 82956
   CTE numbered
     ->  WindowAgg  (cost=406561.10..450693.32 rows=2206611 width=24) (actual time=17803.629..20773.791 rows=2206611 loops=1)
           ->  Sort  (cost=406561.10..412077.63 rows=2206611 width=24) (actual time=17803.613..19666.841 rows=2206611 loops=1)
                 Sort Key: test.username, test.some_ts
                 Sort Method: external merge  Disk: 81960kB
                 ->  Seq Scan on test  (cost=0.00..38292.11 rows=2206611 width=24) (actual time=0.007..208.733 rows=2206611 loops=1)
 Total runtime: 22044.126 ms
(10 rows)
This explain is uploaded to explain.depesz.com.
A couple of notes:
- There is only one table scan – a Seq Scan
- Most of the time is spent sorting the data – but this can be “optimized” by simply adding more work_mem (see the snippet after this list). For example, after upping work_mem to '1GB', I got “Sort Method: quicksort Memory: 270696kB” and the query was ~5 seconds faster
- While we should add an explicit ORDER BY at the end of the query, it's not needed now (in current 9.3devel), because the data is already sorted correctly for the row_number() window function
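For reference, the work_mem change from the second note is session-local and trivial:

-- bump the per-sort memory limit for this session only;
-- pick a value that your RAM can actually afford
SET work_mem = '1GB';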
How does it compare to the query I tested earlier on the small dataset?
Explain for this query (the earlier count-based one, with the constant bumped to 5) is:
                                                      QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=28078918.96..28080757.81 rows=735537 width=24) (actual time=43410.638..45760.144 rows=2123655 loops=1)
   Sort Key: u.username, u.some_ts
   Sort Method: external merge  Disk: 78864kB
   ->  Seq Scan on test u  (cost=0.00..27977076.63 rows=735537 width=24) (actual time=0.111..25967.798 rows=2123655 loops=1)
         Filter: (5 > (SubPlan 1))
         Rows Removed by Filter: 82956
         SubPlan 1
           ->  Aggregate  (cost=12.65..12.66 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=2206611)
                 ->  Index Only Scan using i on test su  (cost=0.00..12.65 rows=1 width=0) (actual time=0.010..0.010 rows=1 loops=2206611)
                       Index Cond: ((username = u.username) AND (some_ts > u.some_ts))
                       Heap Fetches: 2482901
 Total runtime: 45850.787 ms
(12 rows)
As you can see (mostly on the explain.depesz.com page), all those index scans (2.2 million of them) actually take more time than the Sort. And we still need the sort for correct ordering of the output rows.
This solves the first problem, described yesterday on IRC, but what about the case with a low number of groups?
Some time ago I wrote about getting a list of unique elements, in a fast way, using a function that implemented a scan technique which doesn't exist in PostgreSQL – Skip Scan/Loose Indexscan.
Now, thanks to the existence of recursive CTEs, it should be possible to do it without functions.
But first, I need to tidy up my playground, and load a different set of data. This time it will be easier to write:
DROP TABLE test;
DROP TABLE

CREATE TABLE test (
    username     TEXT,
    some_ts      TIMESTAMPTZ,
    random_value INT4
);
CREATE TABLE

INSERT INTO test (username, some_ts, random_value)
    SELECT
        'user #' || CAST(FLOOR(random() * 10) AS INT4),
        now() - '1 year'::INTERVAL * random(),
        CAST(random() * 100000000 AS INT4)
    FROM generate_series(1,2000000);
INSERT 0 2000000

CREATE INDEX i ON test (username, some_ts);
CREATE INDEX

ANALYZE test;
ANALYZE

SELECT username, COUNT(*) FROM test GROUP BY username ORDER BY username;
 username │ count
──────────┼────────
 user #0  │ 199871
 user #1  │ 199939
 user #2  │ 200388
 user #3  │ 199849
 user #4  │ 200329
 user #5  │ 199504
 user #6  │ 199903
 user #7  │ 200799
 user #8  │ 199487
 user #9  │ 199931
(10 rows)
I have 10 users, each with ~ 200,000 rows. I will be selecting 5 rows for every user, so I should be getting 50 rows in output.
Since it now actually matters which 5 rows we get, let's assume I will be getting the newest 5 rows (ORDER BY some_ts DESC) for each user.
WITH RECURSIVE skip AS (
( SELECT t.username FROM test AS t ORDER BY t.username LIMIT 1 )
UNION ALL
(
SELECT
(
SELECT MIN( t2.username )
FROM test t2
WHERE t2.username > s.username
)
FROM skip s
WHERE s.username IS NOT NULL
)
),
with_data AS (
SELECT array(
SELECT t
FROM test t
WHERE t.username = s.username
ORDER BY t.some_ts DESC LIMIT 5
) AS ROWS
FROM skip s
WHERE s.username IS NOT NULL
)
SELECT (unnest( ROWS )).* FROM with_data
Did you just think “my eyes, they hurt”?
Well, it might be ugly, but it's fast. EXPLAIN shows that it took just 2.3 ms to get the results.
But how does it work?
- Lines 1-14 generate the list of unique usernames – just the usernames. This is done using a recursive CTE:
- Line 2 gets the first (smallest) username
- Lines 5-12 are called recursively, and fetch the next username on each iteration
The only issue is that we will get 11 rows – the final row will contain NULL (you can see this for yourself in the standalone example after this list). But it can be filtered out later on.
- Lines 15-23 get the actual data for all the rows
- Lines 16-21 get an array of rows (because we can't generate more rows than we got from the “skip” CTE). Each array contains the 5 newest rows for a given user. I don't have to SELECT username separately, because it's part of the row that is being compacted into the array.
- Line 23 removes the additional NULL row that I mentioned above
- Line 25 generates the final resultset. It gets 10 rows from with_data, each containing an array. Each array has 5 elements, and each of these elements is a row from the original test table. Now, we just need to:
- Unnest the arrays – this generates 50 rows, each with a single value, which is a row representation
- Unpack the rows – by using the (row_variable).* syntax
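Both tricks can be checked in isolation. Running the “skip” CTE standalone (this is the exact CTE from the query above, with a plain SELECT at the end) shows the NULL terminator:

WITH RECURSIVE skip AS (
    ( SELECT t.username FROM test AS t ORDER BY t.username LIMIT 1 )
    UNION ALL
    (
        SELECT ( SELECT MIN(t2.username) FROM test t2 WHERE t2.username > s.username )
        FROM skip s
        WHERE s.username IS NOT NULL
    )
)
SELECT * FROM skip;
-- returns the 10 usernames, plus one final row containing NULL (11 rows total)

And the array-packing/unnesting trick can be tried on its own – a minimal sketch, unrelated to the top-5 logic:

SELECT (unnest(ARRAY(SELECT t FROM test t LIMIT 3))).*;
-- packs 3 whole rows into an array of composites, then expands them back into normal rows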
Final result of the full query:
 username │            some_ts            │ random_value
──────────┼───────────────────────────────┼──────────────
 user #0  │ 2012-10-05 14:02:40.850461+02 │     38340423
 user #0  │ 2012-10-05 13:59:36.991261+02 │     47539361
 user #0  │ 2012-10-05 13:58:13.788061+02 │     90133298
 user #0  │ 2012-10-05 13:57:43.461661+02 │     39151654
 user #0  │ 2012-10-05 13:55:19.519261+02 │     23971279
 user #1  │ 2012-10-05 14:09:48.184861+02 │     68225136
 user #1  │ 2012-10-05 14:07:50.853661+02 │     23783242
 user #1  │ 2012-10-05 14:02:39.036061+02 │     45015932
 user #1  │ 2012-10-05 13:58:33.400861+02 │       856787
 user #1  │ 2012-10-05 13:57:44.066461+02 │     97875281
 user #2  │ 2012-10-05 14:06:49.941661+02 │     44353630
 user #2  │ 2012-10-05 14:05:41.426461+02 │     56720972
 user #2  │ 2012-10-05 14:02:29.100061+02 │     47825604
 user #2  │ 2012-10-05 14:00:09.391261+02 │     75955113
 user #2  │ 2012-10-05 13:59:13.663261+02 │     14216525
 user #3  │ 2012-10-05 14:01:00.367261+02 │     31505609
 user #3  │ 2012-10-05 13:59:47.445661+02 │     60049729
 user #3  │ 2012-10-05 13:55:49.240861+02 │     89598870
 user #3  │ 2012-10-05 13:53:57.266461+02 │     33846849
 user #3  │ 2012-10-05 13:53:33.852061+02 │     82449747
 user #4  │ 2012-10-05 14:06:02.853661+02 │     63597247
 user #4  │ 2012-10-05 14:05:51.708061+02 │     54494125
 user #4  │ 2012-10-05 14:02:59.599261+02 │     29407125
 user #4  │ 2012-10-05 14:01:06.415261+02 │     17218964
 user #4  │ 2012-10-05 14:00:17.080861+02 │     62982517
 user #5  │ 2012-10-05 14:07:27.612061+02 │     32170514
 user #5  │ 2012-10-05 14:01:53.589661+02 │      7295343
 user #5  │ 2012-10-05 14:01:11.512861+02 │     27237756
 user #5  │ 2012-10-05 14:00:58.639261+02 │     26827360
 user #5  │ 2012-10-05 14:00:55.096861+02 │     41067140
 user #6  │ 2012-10-05 14:09:33.324061+02 │     95999924
 user #6  │ 2012-10-05 14:07:10.072861+02 │        72678
 user #6  │ 2012-10-05 14:06:41.474461+02 │     37179557
 user #6  │ 2012-10-05 13:54:30.444061+02 │     54959828
 user #6  │ 2012-10-05 13:49:53.532061+02 │     90795817
 user #7  │ 2012-10-05 14:08:57.900061+02 │     83603366
 user #7  │ 2012-10-05 14:07:47.656861+02 │     16851893
 user #7  │ 2012-10-05 14:04:29.109661+02 │     45867068
 user #7  │ 2012-10-05 14:01:53.762461+02 │     89712017
 user #7  │ 2012-10-05 13:56:15.074461+02 │     86761399
 user #8  │ 2012-10-05 14:04:05.436061+02 │     46245277
 user #8  │ 2012-10-05 14:01:53.330461+02 │     48946212
 user #8  │ 2012-10-05 14:00:02.306461+02 │     77759370
 user #8  │ 2012-10-05 13:52:22.831261+02 │     23588197
 user #8  │ 2012-10-05 13:51:34.533661+02 │     88615369
 user #9  │ 2012-10-05 14:05:37.192861+02 │     66530600
 user #9  │ 2012-10-05 14:01:28.360861+02 │     68077186
 user #9  │ 2012-10-05 14:00:46.024861+02 │     30766930
 user #9  │ 2012-10-05 14:00:12.328861+02 │     34347000
 user #9  │ 2012-10-05 13:59:38.892061+02 │     59314056
(50 rows)
There are some issues with it, though. If the returned rows were very wide, using arrays could become problematic.
In such a case you can/should use the solution by RhodiumToad (the 2nd example).
That would be it. I mean – you can still use the function approach, and in 9.3 you will be able to use LATERAL for such cases:
WITH RECURSIVE skip AS (
    ( SELECT t.username FROM test AS t ORDER BY t.username LIMIT 1 )
    UNION ALL
    (
        SELECT (
                SELECT MIN( t2.username )
                FROM test t2
                WHERE t2.username > s.username
            )
        FROM skip s
        WHERE s.username IS NOT NULL
    )
)
SELECT x.*
FROM skip s,
    LATERAL (
        SELECT t.*
        FROM test t
        WHERE t.username = s.username
        ORDER BY t.some_ts DESC
        LIMIT 5
    ) AS x
As a final thought, I would like to thank RhodiumToad for his help, and for writing very interesting wiki pages.
And now, I'm off to analyze what he wrote there, and try to understand it.