Yesterday someone on irc asked:
i've a query that returns sequential numbers with gaps (generate_series + join) and my question is: can is somehow construct ranges out of the returned values? sort of range_agg or something?
There was no further discussion, aside from me saying
sure you can. not trivial task, but possible.
you'd need window functions.
but it got me thinking …
First, it would be good to have test data. To avoid retyping it for every query let's make a table with some integers:
=$ create table test (i int4); CREATE TABLE =$ insert into test (i) values (1), (2), (5), (6), (7), (8), (9), (10), (20), (30); INSERT 0 10
This gives us nice table input data:
=$ select i from test order by i; i ---- 1 2 5 6 7 8 9 10 20 30 (10 rows)
Based on data above, we'd like to get ranges:
- 1 to 2
- 5 to 10
- 20
- 30
First, we need a way to tell if given row starts new range. This can be trivially done by checking whether previous value of i smaller by 1:
=$ select i, i - lag(i) over (order by i) as diff from test order by i; i | diff ----+-------- 1 | [null] 2 | 1 5 | 3 6 | 1 7 | 1 8 | 1 9 | 1 10 | 1 20 | 10 30 | 10 (10 rows)
Nice. If diff is null or > 1 then current i starts new range.
Now, let's make it so that all rows in given range will have the same group id. To do so, I'll assign group_id (being just copy of value from i) for first rows in range, and in other cases, I'll make it null:
=$ with with_group_id as ( select i, case when i - lag(i) over (order by 1) < 2 then null else i end as group_id from test ) select * from with_group_id order by i; i | group_id ----+---------- 1 | 1 2 | [null] 5 | 5 6 | [null] 7 | [null] 8 | [null] 9 | [null] 10 | [null] 20 | 20 30 | 30 (10 rows)
Great. Now, let's fill group_id for non-first rows:
=$ with with_group_id as ( select i, case when i - lag(i) over (order by 1) < 2 then null else i end as group_id from test ) select i, max(group_id) over (order by i) as group_id from with_group_id order by i; i | group_id ----+---------- 1 | 1 2 | 1 5 | 5 6 | 5 7 | 5 8 | 5 9 | 5 10 | 5 20 | 20 30 | 30 (10 rows)
Great. Now, all I need is to get ranges using normal grouping:
=$ with with_group_id as ( select i, case when i - lag(i) over (order by 1) < 2 then null else i end as group_id from test ), full_group_id as ( select i, max(group_id) over (order by i) as group_id from with_group_id ) select min(i) as range_from, max(i) as range_to from full_group_id group by group_id order by group_id; range_from | range_to ------------+---------- 1 | 2 5 | 10 20 | 20 30 | 30 (4 rows)
This can be also used to generate proper PostgreSQL ranges:
=$ with with_group_id as ( select i, case when i - lag(i) over (order by 1) < 2 then null else i end as group_id from test ), full_group_id as ( select i, max(group_id) over (order by i) as group_id from with_group_id ) select int4range( min(i), max(i), '[]' ) as range from full_group_id group by group_id order by group_id; range --------- [1,3) [5,11) [20,21) [30,31) (4 rows)
Pretty cool, isn't it?
Long time no write.
I had to solve exactly this problem in a work related task. At the time, the approach I took was direct SQL, no CTEs at all. I must admit the CTEs definitely make it easier to explain the solution here.
The approach I took was (using your table):
This has the nice side effect of reducing the amount of work to be done when there are large contiguous ranges.
I did a similar thing a while back, but for dates. The var `dates` was a `date[]` for this code:
I believe that this solution is simplest (sorry, no idea where you are getting your nice formatting):
@Dan:
This is beautiful. Abuses certain facts about data, but works, and is much simpler. Thanks.
Thanks. 🙂
As I look at it again, it appears that I really could have dropped the second “ORDER BY i” inside the subselect. After all, the aggregate functions don’t care what order the rows of t appear in. I probably put it there to aid with verifying the approach, as I was working on it. So, if I am correct, it could be made slightly simpler, still.