Someone asked today on irc about grouping data, that contains timestamps, into “partitions".
Usually when someone wants something like this, you can do grouping by date_trunc(), but this time, this person, wanted to group data that all timestamps are within given interval from each other.
I'm not sure I understood him/her right, but I think he/she wanted something like this:
TIMESTAMP | group_no -------------------------------+---------- 2015-12-08 16:30:09.294074+01 | 1 2015-12-08 16:32:03.022866+01 | 1 2015-12-08 16:34:38.478861+01 | 1 2015-12-08 16:35:40.820158+01 | 2 2015-12-08 16:36:54.450505+01 | 2 2015-12-08 16:45:31.501734+01 | 3 2015-12-08 16:49:20.073042+01 | 3 2015-12-08 16:51:18.36115+01 | 4 2015-12-08 16:55:16.186156+01 | 4
That is – I start with first value, and assign it number “1". Then I increment once I will depart more than 5 minutes from start of current group.
Let's see if I can do it.
First, I need some repeatable data to test my queries, so:
$ CREATE TABLE test AS SELECT now() - random() * '1 hour'::INTERVAL AS ts FROM generate_series(1,20); SELECT 20
Data looks like:
$ SELECT ts FROM test ORDER BY ts; ts ------------------------------- 2015-12-08 16:31:25.61225+01 2015-12-08 16:31:50.029059+01 2015-12-08 16:35:58.050782+01 2015-12-08 16:38:20.358847+01 2015-12-08 16:49:08.231987+01 2015-12-08 16:56:22.783962+01 2015-12-08 16:58:01.790472+01 2015-12-08 17:00:14.5963+01 2015-12-08 17:06:18.879723+01 2015-12-08 17:07:02.803871+01 2015-12-08 17:08:43.971144+01 2015-12-08 17:09:43.666333+01 2015-12-08 17:10:57.011821+01 2015-12-08 17:14:51.061448+01 2015-12-08 17:18:13.671139+01 2015-12-08 17:20:43.678185+01 2015-12-08 17:23:01.854288+01 2015-12-08 17:25:10.01336+01 2015-12-08 17:26:14.63925+01 2015-12-08 17:26:22.252227+01 (20 ROWS)
Assigning group 1 for first row is trivial:
$ SELECT ts, CASE WHEN lag(ts) OVER (ORDER BY ts) IS NULL THEN 1 END AS group_no FROM test ORDER BY ts; ts | group_no -------------------------------+---------- 2015-12-08 16:31:25.61225+01 | 1 2015-12-08 16:31:50.029059+01 | [NULL] 2015-12-08 16:35:58.050782+01 | [NULL] 2015-12-08 16:38:20.358847+01 | [NULL] 2015-12-08 16:49:08.231987+01 | [NULL] 2015-12-08 16:56:22.783962+01 | [NULL] 2015-12-08 16:58:01.790472+01 | [NULL] 2015-12-08 17:00:14.5963+01 | [NULL] 2015-12-08 17:06:18.879723+01 | [NULL] 2015-12-08 17:07:02.803871+01 | [NULL] 2015-12-08 17:08:43.971144+01 | [NULL] 2015-12-08 17:09:43.666333+01 | [NULL] 2015-12-08 17:10:57.011821+01 | [NULL] 2015-12-08 17:14:51.061448+01 | [NULL] 2015-12-08 17:18:13.671139+01 | [NULL] 2015-12-08 17:20:43.678185+01 | [NULL] 2015-12-08 17:23:01.854288+01 | [NULL] 2015-12-08 17:25:10.01336+01 | [NULL] 2015-12-08 17:26:14.63925+01 | [NULL] 2015-12-08 17:26:22.252227+01 | [NULL] (20 ROWS)
So, now I just need to check “previous" group_no. But how? I somehow feel that there should be some cool trick with “unbound preceeding", but I can't figure it out.
But, this kind of problems can be easily solved by recursive queries.
As first step, let's find rows that will start new groups – that is, 1st row in the dataset, and then first that has ts over 5 minutes “after" previous group starter:
$ WITH recursive group_starters AS ( SELECT MIN(ts) AS ts, 1 AS group_no FROM test UNION ALL ( SELECT t.ts, gs.group_no + 1 FROM test t JOIN group_starters gs ON t.ts > gs.ts + '5 minutes'::INTERVAL ORDER BY t.ts ASC LIMIT 1 ) ) SELECT * FROM group_starters; ts | group_no -------------------------------+---------- 2015-12-08 16:31:25.61225+01 | 1 2015-12-08 16:38:20.358847+01 | 2 2015-12-08 16:49:08.231987+01 | 3 2015-12-08 16:56:22.783962+01 | 4 2015-12-08 17:06:18.879723+01 | 5 2015-12-08 17:14:51.061448+01 | 6 2015-12-08 17:20:43.678185+01 | 7 2015-12-08 17:26:14.63925+01 | 8 (8 ROWS)
This looks promising.
Now, this data can be joined with test again, to have all rows, not just starters:
$ WITH recursive group_starters AS ( SELECT MIN(ts) AS ts, 1 AS group_no FROM test UNION ALL ( SELECT t.ts, gs.group_no + 1 FROM test t JOIN group_starters gs ON t.ts > gs.ts + '5 minutes'::INTERVAL ORDER BY t.ts ASC LIMIT 1 ) ) SELECT t.ts, gs.group_no FROM test t LEFT OUTER JOIN group_starters gs ON t.ts = gs.ts ORDER BY t.ts; ts | group_no -------------------------------+---------- 2015-12-08 16:31:25.61225+01 | 1 2015-12-08 16:31:50.029059+01 | [NULL] 2015-12-08 16:35:58.050782+01 | [NULL] 2015-12-08 16:38:20.358847+01 | 2 2015-12-08 16:49:08.231987+01 | 3 2015-12-08 16:56:22.783962+01 | 4 2015-12-08 16:58:01.790472+01 | [NULL] 2015-12-08 17:00:14.5963+01 | [NULL] 2015-12-08 17:06:18.879723+01 | 5 2015-12-08 17:07:02.803871+01 | [NULL] 2015-12-08 17:08:43.971144+01 | [NULL] 2015-12-08 17:09:43.666333+01 | [NULL] 2015-12-08 17:10:57.011821+01 | [NULL] 2015-12-08 17:14:51.061448+01 | 6 2015-12-08 17:18:13.671139+01 | [NULL] 2015-12-08 17:20:43.678185+01 | 7 2015-12-08 17:23:01.854288+01 | [NULL] 2015-12-08 17:25:10.01336+01 | [NULL] 2015-12-08 17:26:14.63925+01 | 8 2015-12-08 17:26:22.252227+01 | [NULL] (20 ROWS)
And now, the final touch, is to assign group no for remaining rows:
$ WITH recursive group_starters AS ( SELECT MIN(ts) AS ts, 1 AS group_no FROM test UNION ALL ( SELECT t.ts, gs.group_no + 1 FROM test t JOIN group_starters gs ON t.ts > gs.ts + '5 minutes'::INTERVAL ORDER BY t.ts ASC LIMIT 1 ) ) SELECT t.ts, MAX(gs.group_no) OVER (ORDER BY t.ts) AS group_no FROM test t LEFT OUTER JOIN group_starters gs ON t.ts = gs.ts ORDER BY t.ts; ts | group_no -------------------------------+---------- 2015-12-08 16:31:25.61225+01 | 1 2015-12-08 16:31:50.029059+01 | 1 2015-12-08 16:35:58.050782+01 | 1 2015-12-08 16:38:20.358847+01 | 2 2015-12-08 16:49:08.231987+01 | 3 2015-12-08 16:56:22.783962+01 | 4 2015-12-08 16:58:01.790472+01 | 4 2015-12-08 17:00:14.5963+01 | 4 2015-12-08 17:06:18.879723+01 | 5 2015-12-08 17:07:02.803871+01 | 5 2015-12-08 17:08:43.971144+01 | 5 2015-12-08 17:09:43.666333+01 | 5 2015-12-08 17:10:57.011821+01 | 5 2015-12-08 17:14:51.061448+01 | 6 2015-12-08 17:18:13.671139+01 | 6 2015-12-08 17:20:43.678185+01 | 7 2015-12-08 17:23:01.854288+01 | 7 2015-12-08 17:25:10.01336+01 | 7 2015-12-08 17:26:14.63925+01 | 8 2015-12-08 17:26:22.252227+01 | 8 (20 ROWS)
And that's it.
I can't shake the feeling that it should be possible to do easier, but as of now, I can't really think of another solution, so figured I'll post this one, and maybe someone in comments will show me better approach.
Side note – this can be also done using function, like:
$ CREATE OR REPLACE FUNCTION ts_grouped( OUT ts timestamptz, OUT group_no int4 ) RETURNS setof record AS $$ DECLARE out_ts alias FOR $1; out_group_no alias FOR $2; v_start timestamptz := NULL; BEGIN out_group_no := 1; FOR out_ts IN SELECT t.ts FROM test t ORDER BY t.ts LOOP IF v_start IS NULL THEN v_start := out_ts; END IF; IF out_ts > v_start + '5 minutes'::INTERVAL THEN out_group_no := out_group_no + 1; v_start := out_ts; END IF; RETURN NEXT; END loop; RETURN; END; $$ LANGUAGE plpgsql; CREATE FUNCTION $ SELECT * FROM ts_grouped() ORDER BY ts; ts | group_no -------------------------------+---------- 2015-12-08 16:31:25.61225+01 | 1 2015-12-08 16:31:50.029059+01 | 1 2015-12-08 16:35:58.050782+01 | 1 2015-12-08 16:38:20.358847+01 | 2 2015-12-08 16:49:08.231987+01 | 3 2015-12-08 16:56:22.783962+01 | 4 2015-12-08 16:58:01.790472+01 | 4 2015-12-08 17:00:14.5963+01 | 4 2015-12-08 17:06:18.879723+01 | 5 2015-12-08 17:07:02.803871+01 | 5 2015-12-08 17:08:43.971144+01 | 5 2015-12-08 17:09:43.666333+01 | 5 2015-12-08 17:10:57.011821+01 | 5 2015-12-08 17:14:51.061448+01 | 6 2015-12-08 17:18:13.671139+01 | 6 2015-12-08 17:20:43.678185+01 | 7 2015-12-08 17:23:01.854288+01 | 7 2015-12-08 17:25:10.01336+01 | 7 2015-12-08 17:26:14.63925+01 | 8 2015-12-08 17:26:22.252227+01 | 8 (20 ROWS)
I checked performance, and for 1 million rows, function call used ~ 1.7s, while the recursive query .. well, I killed it after 1 minute.
Pherhaps not exactly the same, but approximates it pretty well I think:
SELECT
ts,
DENSE_RANK() OVER (ORDER BY EXTRACT(epoch FROM ts)::bigint – (EXTRACT(epoch FROM ts)::bigint % EXTRACT(epoch FROM ‘5 minutes’::interval)::int)) AS rank
FROM
(SELECT now() – random() * ‘1 hour’::interval AS ts FROM generate_series(1,1000000)) t;
It orders them in “buckets” of interval time since epoch. This makes the result much more stable (since it’s not relative to the timestamps of the current set of starters).
@Justin:
yes, but that’s *exactly* what we’re not after.
Doing this is, as you can see, pretty simple. doing division into “x minutes” frames starting at some point not related to absolute beginning, is more complicated.
This non-recursive query meets a slightly different requirement: Change the group_no bucket when there is a 5+ minute gap between successive timestamps:
…
with source as ( select ts, case when lag(ts) over (order by ts) is null then 1 when (lag(ts) over (order by ts)) < ts – interval '5 minutes' then 1 else 0 end as add_me from test order by ts ) select ts, sum(add_me) over (order by ts) as group_no;
Another solution and perhaps faster :
select ts,extract(epoch from (ts-(select min(ts) from test)))::int/60/5+1 from test order by 1;
What do you think of it ?
Bruno
@Bruno, @RobJ: nice, but not doing what the blogpost described.
Hi,
I believe that solution can be much easier:
SELECT ts, dense_rank() OVER (ORDER BY CEIL(ROUND(extract(EPOCH from TS))/60/5)) from test order by 1;
What do you think?
Łukasz
@Łukasz:
it beautifully solves a problem. But different problem than the one blogpost is about.
Guys, really – 4 suggestions, all brilliant, ans simple, and all of them solving different problem than described.
Do I really suck that much at describing the thing?
No, you described problem in two different paragraphs 🙂
Problem is in fact very hard to solve with one query. In could be done by aggregate function with restart or with simple set_var/get_var functions. e.g.
SELECT ts, dense_rank() OVER (ORDER BY ts) from (
SELECT ts,
CASE
WHEN ts ‘5 minute’::interval THEN set_var(‘asd’::text,ts::timestamp)
ELSE get_var(‘asd’) END
FROM (
select ts from test order by ts
) as xxx
) as xxx1
ORDER BY ts;
It is much worse than yours but it can be done. I love your brain teasers 🙂