Yesterday someone on irc asked:
i've a query that returns sequential numbers with gaps (generate_series + join) and my question is: can is somehow construct ranges out of the returned values? sort of range_agg or something?
There was no further discussion, aside from me saying
sure you can. not trivial task, but possible.
you'd need window functions.
but it got me thinking …
First, it would be good to have test data. To avoid retyping it for every query let's make a table with some integers:
=$ CREATE TABLE test (i int4); CREATE TABLE =$ INSERT INTO test (i) VALUES (1), (2), (5), (6), (7), (8), (9), (10), (20), (30); INSERT 0 10
This gives us nice table input data:
=$ SELECT i FROM test ORDER BY i; i ---- 1 2 5 6 7 8 9 10 20 30 (10 ROWS)
Based on data above, we'd like to get ranges:
- 1 to 2
- 5 to 10
- 20
- 30
First, we need a way to tell if given row starts new range. This can be trivially done by checking whether previous value of i smaller by 1:
=$ SELECT i, i - lag(i) OVER (ORDER BY i) AS diff FROM test ORDER BY i; i | diff ----+-------- 1 | [NULL] 2 | 1 5 | 3 6 | 1 7 | 1 8 | 1 9 | 1 10 | 1 20 | 10 30 | 10 (10 ROWS)
Nice. If diff is null or > 1 then current i starts new range.
Now, let's make it so that all rows in given range will have the same group id. To do so, I'll assign group_id (being just copy of value from i) for first rows in range, and in other cases, I'll make it null:
=$ WITH with_group_id AS ( SELECT i, CASE WHEN i - lag(i) OVER (ORDER BY 1) < 2 THEN NULL ELSE i END AS group_id FROM test ) SELECT * FROM with_group_id ORDER BY i; i | group_id ----+---------- 1 | 1 2 | [NULL] 5 | 5 6 | [NULL] 7 | [NULL] 8 | [NULL] 9 | [NULL] 10 | [NULL] 20 | 20 30 | 30 (10 ROWS)
Great. Now, let's fill group_id for non-first rows:
=$ WITH with_group_id AS ( SELECT i, CASE WHEN i - lag(i) OVER (ORDER BY 1) < 2 THEN NULL ELSE i END AS group_id FROM test ) SELECT i, MAX(group_id) OVER (ORDER BY i) AS group_id FROM with_group_id ORDER BY i; i | group_id ----+---------- 1 | 1 2 | 1 5 | 5 6 | 5 7 | 5 8 | 5 9 | 5 10 | 5 20 | 20 30 | 30 (10 ROWS)
Great. Now, all I need is to get ranges using normal grouping:
=$ WITH with_group_id AS ( SELECT i, CASE WHEN i - lag(i) OVER (ORDER BY 1) < 2 THEN NULL ELSE i END AS group_id FROM test ), full_group_id AS ( SELECT i, MAX(group_id) OVER (ORDER BY i) AS group_id FROM with_group_id ) SELECT MIN(i) AS range_from, MAX(i) AS range_to FROM full_group_id GROUP BY group_id ORDER BY group_id; range_from | range_to ------------+---------- 1 | 2 5 | 10 20 | 20 30 | 30 (4 ROWS)
This can be also used to generate proper PostgreSQL ranges:
=$ WITH with_group_id AS ( SELECT i, CASE WHEN i - lag(i) OVER (ORDER BY 1) < 2 THEN NULL ELSE i END AS group_id FROM test ), full_group_id AS ( SELECT i, MAX(group_id) OVER (ORDER BY i) AS group_id FROM with_group_id ) SELECT int4range( MIN(i), MAX(i), '[]' ) AS range FROM full_group_id GROUP BY group_id ORDER BY group_id; range --------- [1,3) [5,11) [20,21) [30,31) (4 ROWS)
Pretty cool, isn't it?
Long time no write.
I had to solve exactly this problem in a work related task. At the time, the approach I took was direct SQL, no CTEs at all. I must admit the CTEs definitely make it easier to explain the solution here.
The approach I took was (using your table):
This has the nice side effect of reducing the amount of work to be done when there are large contiguous ranges.
I did a similar thing a while back, but for dates. The var `dates` was a `date[]` for this code:
I believe that this solution is simplest (sorry, no idea where you are getting your nice formatting):
@Dan:
This is beautiful. Abuses certain facts about data, but works, and is much simpler. Thanks.
Thanks. 🙂
As I look at it again, it appears that I really could have dropped the second “ORDER BY i” inside the subselect. After all, the aggregate functions don’t care what order the rows of t appear in. I probably put it there to aid with verifying the approach, as I was working on it. So, if I am correct, it could be made slightly simpler, still.