Every so often someone needs solution to getting first (or couple of first) values for given column. Or last.
For some cases (when there is not many groups) you can use recursive queries. But it's not always the best choice.
Let's try to implement first() and last() aggregates, so these could be easily used by anybody.
Let's start with simplest possible version – returning just one value.
First, we need a function. Technically functions require parameter datatypes, which would make it pretty long, but luckily, we can use “ANYELEMENT" pseudo datatype:
$ CREATE OR REPLACE FUNCTION agg_first( IN p_state anyelement, IN p_new_element anyelement ) RETURNS anyelement LANGUAGE SQL AS $$ SELECT COALESCE( p_state, p_new_element ); $$; CREATE FUNCTION
This functions looks really simple, but that's because it should be simple. Now, let's make the aggregate:
$ CREATE AGGREGATE FIRST(anyelement) ( sfunc = agg_first, stype = anyelement ); CREATE AGGREGATE
Now, we'll need some sample data, so let's create simple table with random data:
$ CREATE TABLE test ( group_id int4, random_int int4, random_ts timestamptz, random_text text ); CREATE TABLE $ INSERT INTO test (group_id, random_int, random_ts, random_text) SELECT FLOOR(random() * 3), FLOOR(random() * 1000000), now() - '2 years'::INTERVAL * random(), md5(random()::text) FROM generate_series(1,20); INSERT 0 20
Data is pretty simple:
$ SELECT * FROM test ORDER BY group_id; group_id | random_int | random_ts | random_text ----------+------------+-------------------------------+---------------------------------- 0 | 493798 | 2015-12-19 17:41:12.110158+01 | 4b9ac4c4ad6c2a806c0db801e9a03373 0 | 424242 | 2015-10-07 21:35:20.318158+02 | 58d3a010b6fcd2200dd078b445132281 0 | 202836 | 2015-06-11 05:07:16.526158+02 | 6e99221b0cce929ba0ca41bc8dfaaf27 0 | 635961 | 2016-12-05 06:18:10.257358+01 | f18d9774a1b74f4422116ec1ceec8b6a 0 | 722122 | 2015-09-13 20:07:48.580558+02 | 17192f91bf1e94f2e8af7c9bd2c15d07 0 | 484963 | 2015-07-03 13:17:44.366158+02 | d17fcf48649fe6be0bf639c888bd9c5b 0 | 510350 | 2016-09-21 13:37:46.708558+02 | 7560516244c289dc84bcb30f04ed35a8 0 | 895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f 1 | 58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120 1 | 780823 | 2015-05-04 07:26:56.807758+02 | 2e6bc5ce1c1d2965f00be96373638e59 1 | 485881 | 2015-11-26 03:23:53.092558+01 | 2a09584e0035379e795cfa70f6615eec 1 | 609786 | 2015-08-01 23:00:41.505358+02 | 16d0845557d4d246cd431b74356422f6 1 | 682945 | 2015-05-02 10:07:26.260558+02 | 776c602308bb93fad0d9753b125d8727 1 | 82612 | 2016-11-08 18:29:11.822158+01 | ab90846f963d2d24980f9ce9d8a658bb 1 | 668048 | 2015-06-25 23:31:54.398158+02 | bf6bafca69ffe00dcc5f701889bcdc97 2 | 53543 | 2016-10-16 15:31:52.180558+02 | 6771fa4d76e866ea7f0cd062e5608f65 2 | 276975 | 2015-07-02 22:18:54.263758+02 | 5582f058c2f800a5c74ff6621afea48e 2 | 619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd 2 | 292590 | 2016-02-01 13:36:33.959758+01 | cf4a398869f666acc0399c37c68e4c17 2 | 60572 | 2015-07-23 05:26:34.199758+02 | 019a33e73a3de44198ed8e9c55b38092 (20 ROWS)
So, let's see if we can get “first" of anything:
$ SELECT group_id, FIRST(random_int) AS first_int, FIRST(random_ts) AS first_ts, FIRST(random_text) AS first_text FROM test GROUP BY group_id ORDER BY group_id; group_id | first_int | first_ts | first_text ----------+-----------+-------------------------------+---------------------------------- 0 | 895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f 1 | 58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120 2 | 619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd (3 ROWS)
Because of PostgreSQL power, we can also do something smarter:
$ SELECT group_id, FIRST(random_int ORDER BY random_int) AS first_int, FIRST(random_ts ORDER BY random_int) AS first_ts, FIRST(random_text ORDER BY random_int) AS first_text FROM test GROUP BY group_id ORDER BY group_id; group_id | first_int | first_ts | first_text ----------+-----------+-------------------------------+---------------------------------- 0 | 202836 | 2015-06-11 05:07:16.526158+02 | 6e99221b0cce929ba0ca41bc8dfaaf27 1 | 58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120 2 | 53543 | 2016-10-16 15:31:52.180558+02 | 6771fa4d76e866ea7f0cd062e5608f65 (3 ROWS)
In this case, I got values for first row, in a group, if it was sorted by random_int.
Now, that we have this in place, let's write last():
$ CREATE OR REPLACE FUNCTION agg_last( IN p_state anyelement, IN p_new_element anyelement ) RETURNS anyelement LANGUAGE SQL AS $$ SELECT COALESCE( p_new_element, p_state ); $$; CREATE FUNCTION $ CREATE AGGREGATE LAST(anyelement) ( sfunc = agg_last, stype = anyelement ); CREATE AGGREGATE
Let's try it:
$ SELECT group_id, LAST(random_int ORDER BY random_int) AS last_int, LAST(random_ts ORDER BY random_int) AS last_ts, LAST(random_text ORDER BY random_int) AS last_text FROM test GROUP BY group_id ORDER BY group_id; group_id | last_int | last_ts | last_text ----------+----------+-------------------------------+---------------------------------- 0 | 895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f 1 | 780823 | 2015-05-04 07:26:56.807758+02 | 2e6bc5ce1c1d2965f00be96373638e59 2 | 619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd (3 ROWS)
Cool. Looks like working. Now, how about getting couple of first/last elements? This will be slightly more complicated:
$ CREATE OR REPLACE FUNCTION agg_first( IN p_state anyarray, IN p_new_element anyelement, IN p_limit int4 ) RETURNS anyarray LANGUAGE SQL AS $$ SELECT CASE WHEN COALESCE( array_length( p_state, 1 ), 0 ) < p_limit THEN p_state || p_new_element ELSE p_state END; $$; CREATE FUNCTION $ CREATE AGGREGATE FIRST(anyelement, int4) ( sfunc = agg_first, stype = anyarray, initcond = '{}' ); CREATE AGGREGATE
And quick test:
$ SELECT group_id, FIRST(random_int, 2 ORDER BY random_int) AS first_ints, FIRST(random_ts, 2 ORDER BY random_ts) AS first_tses, FIRST(random_text, 2 ORDER BY random_text) AS first_texts FROM test GROUP BY group_id ORDER BY group_id; -[ RECORD 1 ]-------------------------------------------------------------------- group_id | 0 first_ints | {202836,424242} first_tses | {"2015-06-11 05:07:16.526158+02","2015-07-03 13:17:44.366158+02"} first_texts | {17192f91bf1e94f2e8af7c9bd2c15d07,4b9ac4c4ad6c2a806c0db801e9a03373} -[ RECORD 2 ]-------------------------------------------------------------------- group_id | 1 first_ints | {58672,82612} first_tses | {"2015-05-02 10:07:26.260558+02","2015-05-04 07:26:56.807758+02"} first_texts | {16d0845557d4d246cd431b74356422f6,2a09584e0035379e795cfa70f6615eec} -[ RECORD 3 ]-------------------------------------------------------------------- group_id | 2 first_ints | {53543,60572} first_tses | {"2015-07-02 22:18:54.263758+02","2015-07-23 05:26:34.199758+02"} first_texts | {019a33e73a3de44198ed8e9c55b38092,5582f058c2f800a5c74ff6621afea48e}
last function will be slightly more complex
$ CREATE OR REPLACE FUNCTION agg_last( IN p_state anyarray, IN p_new_element anyelement, IN p_limit int4 ) RETURNS anyarray LANGUAGE SQL AS $$ SELECT ( CASE WHEN array_length( p_state, 1 ) >= p_limit THEN p_state[(array_upper(p_state, 1) - p_limit + 2):] ELSE p_state END ) || p_new_element; $$; CREATE FUNCTION $ CREATE AGGREGATE LAST(anyelement, int4) ( sfunc = agg_last, stype = anyarray, initcond = '{}' ); CREATE AGGREGATE
So, let's see if it really works:
$ SELECT group_id, LAST(random_int, 2 ORDER BY random_int) AS last_ints, LAST(random_ts, 2 ORDER BY random_ts) AS last_tses, LAST(random_text, 2 ORDER BY random_text) AS last_texts FROM test GROUP BY group_id ORDER BY group_id; -[ RECORD 1 ]------------------------------------------------------------------- group_id | 0 last_ints | {722122,895509} last_tses | {"2016-09-21 13:37:46.708558+02","2016-12-05 06:18:10.257358+01"} last_texts | {f18d9774a1b74f4422116ec1ceec8b6a,f56fbaeb6119963a4bdb6758608d2b8f} -[ RECORD 2 ]------------------------------------------------------------------- group_id | 1 last_ints | {682945,780823} last_tses | {"2016-11-08 18:29:11.822158+01","2017-02-02 08:53:27.978958+01"} last_texts | {bf6bafca69ffe00dcc5f701889bcdc97,f58ddb16c746175c57ed7859274ba120} -[ RECORD 3 ]------------------------------------------------------------------- group_id | 2 last_ints | {292590,619981} last_tses | {"2016-02-07 09:22:35.678158+01","2016-10-16 15:31:52.180558+02"} last_texts | {cf4a398869f666acc0399c37c68e4c17,d3177fd39ce9566bc568dab3607622cd}
Comparing it with larger dump above can be complicated, so let's generate the same results using window functions, and compare:
$ WITH numbered_rows AS ( SELECT group_id, ROW_NUMBER() OVER (partition BY group_id ORDER BY random_int DESC) AS row_no, random_int, random_ts, random_text FROM test ) SELECT group_id, array_agg(random_int ORDER BY random_int) AS first_ints, array_agg(random_ts ORDER BY random_ts) AS first_tses, array_agg(random_text ORDER BY random_text) AS first_texts FROM numbered_rows WHERE row_no < 3 GROUP BY group_id ORDER BY group_id; -[ RECORD 1 ]-------------------------------------------------------------------- group_id | 0 first_ints | {722122,895509} first_tses | {"2015-09-13 20:07:48.580558+02","2016-08-08 07:07:43.713358+02"} first_texts | {17192f91bf1e94f2e8af7c9bd2c15d07,f56fbaeb6119963a4bdb6758608d2b8f} -[ RECORD 2 ]-------------------------------------------------------------------- group_id | 1 first_ints | {682945,780823} first_tses | {"2015-05-02 10:07:26.260558+02","2015-05-04 07:26:56.807758+02"} first_texts | {2e6bc5ce1c1d2965f00be96373638e59,776c602308bb93fad0d9753b125d8727} -[ RECORD 3 ]-------------------------------------------------------------------- group_id | 2 first_ints | {292590,619981} first_tses | {"2016-02-01 13:36:33.959758+01","2016-02-07 09:22:35.678158+01"} first_texts | {cf4a398869f666acc0399c37c68e4c17,d3177fd39ce9566bc568dab3607622cd}
Data looks the same to me. And query with first/last aggregates seem to be much simpler to write, read, and understand.
Just like with short_ids lately, there is no point in copy/pasting the code from blog – you can download it as usable extension. Use either GitHub or PGXN.
For simple cases, one can use ‘DISTINCT ON’ (beware, this is error prone). I’ve been using row_number() approach before, now there is first_value() function which helps a lot (trying to avoid last_value() as it’s not straightforward to use).
The aggregate functions provided by https://github.com/bashtanov/argm extension can do the same faster, as they allows hash aggregation, and are written in C.