one of the nice (for some things) features of oracle is rownum() function.
i assume you know what it does – if you don't – i think a quick google will show it.
today i will show how to make rownum in postgresql. if you're not interested in rownum itself – please continue reading anyway, as some functions shown here have other uses as well 🙂
ok, to add rownum, one usually is told to create temp sequence, and add “nextval()" to your query. this is how it works:
first, our test dataset:
# \d test TABLE "public.test" COLUMN | TYPE | Modifiers -----------+------+----------- some_word | text | NOT NULL Indexes: "test_pkey" PRIMARY KEY, btree (some_word) # SELECT * FROM test; some_word ----------- a AND at been feast great have languages OF scraps stolen the they (13 ROWS)
so, let's add temp sequence and test the approach:
# CREATE temp SEQUENCE q; CREATE SEQUENCE # SELECT NEXTVAL('q'), some_word FROM test; NEXTVAL | some_word ---------+----------- 1 | a 2 | AND 3 | at 4 | been 5 | feast 6 | great 7 | have 8 | languages 9 | OF 10 | scraps 11 | stolen 12 | the 13 | they (13 ROWS)
nice. it works great. does it? let's try again, but with some non-obvious order by:
# SELECT NEXTVAL('q'), some_word FROM test ORDER BY LENGTH(some_word) DESC; NEXTVAL | some_word ---------+----------- 21 | languages 23 | scraps 24 | stolen 19 | great 18 | feast 26 | they 17 | been 20 | have 15 | AND 25 | the 22 | OF 16 | at 14 | a (13 ROWS)
oops. now we have a problem. or actually – 2 problems:
- sequence has to be reset after each statement! otherwise it will continue the count. which is not really useful
- order of the nextval is somehow random
order can be easily fixed:
# SELECT NEXTVAL('q'), * FROM ( SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x; NEXTVAL | some_word ---------+----------- 27 | languages 28 | scraps 29 | stolen 30 | great 31 | feast 32 | they 33 | been 34 | have 35 | AND 36 | the 37 | OF 38 | at 39 | a (13 ROWS)
but resetting counter after each usage seems to be … not nice.
any better options?
basically we could write some plperl function that would show us the counter and store current_value in %_SHARED, but it would also not help us with the issue of resetting the counter.
the best option would be if we could somehow get “trigger before select". which we can't.
but, there is an option to “know" when we start new statement – we should just check transaction (and statement) id numbers.
so, let's do it.
<24 hours later>
hmm, there is nothing that really works like statement id!.
but, as tom lane suggested i can use statement timestamp. it is perfectly fine for me for 2 reasons:
- i will use it “per-backend", so concurrent queries on another backends do not worry me
- it is pretty precise – up to 0.000001 second. so – as long as you're not running 1 milion of selects per second – you should be fine 🙂
so, there goes some .c code:
#include "postgres.h" #include "access/xact.h" #include "fmgr.h" #include "utils/timestamp.h" PG_MODULE_MAGIC; Datum get_statement_timestamp(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(get_statement_timestamp); Datum get_statement_timestamp(PG_FUNCTION_ARGS) { PG_RETURN_TIMESTAMPTZ( GetCurrentStatementStartTimestamp() ); }
i compiled it to .so, and saved as /home/pgdba/statementid.so.
then i can create sql-accessible function for it:
CREATE FUNCTION get_statement_timestamp() RETURNS timestamptz AS '/home/pgdba/statementid.so', 'get_statement_timestamp' LANGUAGE C RETURNS NULL ON NULL INPUT VOLATILE;
now, that i have a way to tell whether previous call was for different statement i can finally write my rownum() code:
CREATE OR REPLACE FUNCTION rownum() RETURNS INT4 LANGUAGE plpgsql AS $BODY$ DECLARE current_id TEXT; current_rownum INT4; settings_id TEXT; BEGIN current_id := get_statement_timestamp()::TEXT; settings_id := current_setting('depesz.rownum_name'); IF settings_id IS DISTINCT FROM current_id THEN PERFORM set_config('depesz.rownum_name', current_id::TEXT, FALSE); current_rownum := 0; ELSE current_rownum := current_setting('depesz.rownum_count')::INT4; END IF; current_rownum := current_rownum + 1; PERFORM set_config('depesz.rownum_count', current_rownum::TEXT, FALSE); RETURN current_rownum; END; $BODY$;
as you can see it's pretty simple. one important thing is that i use custom variable class – you have to define it in your postgresql.conf. as for how – check my previous posting.
so, let's check how it works:
# SELECT rownum(), * FROM (SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x; rownum | some_word --------+----------- 1 | languages 2 | scraps 3 | stolen 4 | great 5 | feast 6 | they 7 | been 8 | have 9 | AND 10 | the 11 | OF 12 | at 13 | a (13 ROWS)
second call within the same session? works nicely as well (starts numbering from 1, not from 14).
hmmm .. but what if i'd like to have multiple rownums per statement? for example – i would like the rownum to restart for each length of some_word?
let's do something like this:
CREATE OR REPLACE FUNCTION rownum() RETURNS INT4 LANGUAGE plpgsql AS $BODY$ DECLARE BEGIN RETURN rownum(''); END; $BODY$;
this function works instead of previous rownum(), but it calls rownum(TEXT). which looks like:
CREATE OR REPLACE FUNCTION rownum(in_code TEXT) RETURNS INT4 LANGUAGE plpgsql AS $BODY$ DECLARE current_id TEXT; current_rownum INT4; settings_id TEXT; BEGIN current_id := get_statement_timestamp()::TEXT || in_code; settings_id := current_setting('depesz.rownum_name'); IF settings_id IS DISTINCT FROM current_id THEN PERFORM set_config('depesz.rownum_name', current_id::TEXT, FALSE); current_rownum := 0; ELSE current_rownum := current_setting('depesz.rownum_count')::INT4; END IF; current_rownum := current_rownum + 1; PERFORM set_config('depesz.rownum_count', current_rownum::TEXT, FALSE); RETURN current_rownum; END; $BODY$;
so, what good it is for?
first, let's check standard call:
# SELECT rownum(), * FROM (SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x; rownum | some_word --------+----------- 1 | languages 2 | scraps 3 | stolen 4 | great 5 | feast 6 | they 7 | been 8 | have 9 | AND 10 | the 11 | OF 12 | at 13 | a (13 ROWS)
ok, so it works. and now, let's change invocation, by explicitly calling rownum(TEXT):
# SELECT rownum(LENGTH(x.some_word)::text), * FROM (SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x; rownum | some_word --------+----------- 1 | languages 1 | scraps 2 | stolen 1 | great 2 | feast 1 | they 2 | been 3 | have 1 | AND 2 | the 1 | OF 2 | at 1 | a (13 ROWS)
now rownum restarts whenever i change rownum(TEXT) argument – length of some_word.
as i mentioned in the beginning – this trick has more uses.
i once wrote an article about cumulative sum. then i wrote another one with jan wiecks method.
all methods described there required some extra queries to make the output (preparation or sum clearer). but now, since we know how to differentiate function calls between statements, we can do it better.
let's use again the table that was described there:
# SELECT * FROM test; id | user_id | VALUE ----+---------+------- 1 | 2222 | 120 2 | 2222 | 1566 3 | 1111 | 10 4 | 2222 | 123 5 | 3333 | 1567 6 | 2222 | 1234 7 | 1111 | 4 8 | 1111 | 1 9 | 1111 | 3 10 | 1111 | 5 11 | 1111 | 2 12 | 3333 | 1588 (12 ROWS)
now, let's write cumulative_sum function using statement_timestamp:
CREATE OR REPLACE FUNCTION cumulative_sum(in_code TEXT, VALUE INT4) RETURNS INT4 LANGUAGE plpgsql AS $BODY$ DECLARE current_id TEXT; current_sum INT4; settings_id TEXT; BEGIN current_id := get_statement_timestamp()::TEXT || in_code; settings_id := current_setting('depesz.cumulative_sum_name'); IF settings_id IS DISTINCT FROM current_id THEN PERFORM set_config('depesz.cumulative_sum_name', current_id::TEXT, FALSE); current_sum := 0; ELSE current_sum := current_setting('depesz.cumulative_sum_count')::INT4; END IF; current_sum := current_sum + COALESCE(VALUE, 0); PERFORM set_config('depesz.cumulative_sum_count', current_sum::TEXT, FALSE); RETURN current_sum; END; $BODY$;
and let's test if it works:
# SELECT x.*, cumulative_sum(user_id::text, VALUE) FROM (SELECT * FROM test ORDER BY user_id ) x; id | user_id | VALUE | cumulative_sum ----+---------+-------+---------------- 7 | 1111 | 4 | 4 8 | 1111 | 1 | 5 9 | 1111 | 3 | 8 10 | 1111 | 5 | 13 11 | 1111 | 2 | 15 3 | 1111 | 10 | 25 1 | 2222 | 120 | 120 2 | 2222 | 1566 | 1686 4 | 2222 | 123 | 1809 6 | 2222 | 1234 | 3043 5 | 3333 | 1567 | 1567 12 | 3333 | 1588 | 3155 (12 ROWS)
great 🙂 i think that there are literally hundreds of other uses, but finding and describing them all would be too much 🙂 do some experiments yourself. have fun.
Maybe we should have sth like pg_statement_timestamp() in next postgres release?
There’s statement_timestamp() in 8.2.
@alec:
yes, i just learned 🙂
by the way – thanks pjmodos for hint.
It looks like the beginning of
the implementation of OLAP extensions such as CUBE/ROLLUP
for GROUP BY?
That would be great:-)
@Fabien Coelho:
i’m not familia with olap. can you direct me to wherever i can read about it? let’s start with the extension that you find the simplest.
These are extensions to keep intermediate sums when grouping stuff. It is not what you are doing here, but some issues are similar as one may(?) have to keep additionnal counters alive to store partial results. See for instance:
http://www.research.ibm.com/journal/sj/414/colossi.html
Or Oracle doc:
http://download.oracle.com/docs/cd/B12037_01/server.101/b10736/aggreg.htm
I do not think that it is simple to implement in Pg, as it may have to impact the actual aggregation implementation and possibly the query plan.
Nice read.
Having worked with Oracle’s rownum I have to say I hate it. The fact that the rownum is effectively random unless you use a subquery is just plain painful. You have to use it if you want to do the Oracle equivelant of the LIMIT/OFFSET query of PostgreSQL.
There have been a few posts recently that have reminded me of the work I did using arrays to do related things (in my case I wanted a moving set of the last X rows for stock marketet moving averages and the like)
There is a lot that can be done just using pgPL and SQL functions – though of course C is going to be fast enough.
My articles on moving data sets using arrays is here:
http://frakkle.com/archives/archive_2006-m04.php
Very cool ! Satisfied by results and in postgres 8.2.* I don’t need C part as it has statement_timestamp() by default !
I´ve failed to compile in 8.1.4… 🙁
There is no GetCurrentStatementStartTimestamp()
Now i see… its 8.2 related.
http://www.postgresql.org/docs/8.2/static/release-8-2.html
You can also use self JOIN to portably do this (row number and running total).
How could I do a select on the result?
What I’m trying to do is to create a report of registrations for a particular month.
It shows the number of registrations per day and the cumulated registrations up to that day (since ever).
So I use your function to compute the running total, so I get
Date ———— Registrations ——– Cumulated
2008-04-28 —– 5 ——————– 5
2008-04-29 —– 1 ——————– 6
2008-04-30 —– 2 ——————– 8
2008-05-01 —– 6 ——————– 14
2008-05-02 —– 3 ——————– 17
And then I want to select on the values from May, so I put a select around (wrapping) my previous query and want to get
Date ———— Registrations ——– Cumulated
2008-05-01 —– 6 ——————– 14
2008-05-02 —– 3 ——————– 17
Instead I get
Date ———— Registrations ——– Cumulated
2008-05-01 —– 6 ——————– 6
2008-05-02 —– 3 ——————– 9
Do you see any way of solving this?
Thanks,
Peter
@Peter:
can you show me both queries and their explain analyze output?
hi depesz,
i had a similar problem and read your both articles, this one and the one you posted some time before. in your first article about cumulative sums you used
a perl function, but with the problem of no auto reset the counter for the next query. here you provided another way to get cumulative sum using get_statement_timestamp. hm, the plperl-function is 2 times faster then this written in plpgsql, so i want to use the faster one ;-).
i have found no way in plperl to access statement_timestamp() so i used the following workaround giving the current statements timestamp as param:
select x.*, cumulative_sum(user_id::text, value, statement_timestamp()) from (select * from test order by user_id) x;
CREATE OR REPLACE FUNCTION cumulative_sum(in_code text, in_value bigint, in_ts timestamp with time zone)
RETURNS bigint AS
$BODY$
my ($code, $value, $ts) = @_;
if ($code ne $_SHARED{‘cumulative_sum_code’} or $ts ne $_SHARED{‘ts’}) {
$_SHARED{‘cumulative_sum_value’} = 0;
$_SHARED{‘cumulative_sum_code’} = $code;
$_SHARED{‘ts’} = $ts;
}
$_SHARED{‘cumulative_sum_value’} += $value;
return $_SHARED{‘cumulative_sum_value’};
$BODY$
LANGUAGE ‘plperl’ VOLATILE
COST 100;
thank you for sharing your expiriences! i want to give something back 🙂
jan
Hi Depesz,
My apologies for not getting back on the previous question.
I was doing something like
select *
from
(select mydate, cumulative_sum(‘x’, registrations) from mytable) subQ
where mydate > now() – interval ’10 days’
and the optimizer kept optimizing it to
select mydate, cumulative_sum(‘x’, registrations) from mytable where mydate > now() – interval ’10 days’
which ofcourse gives different values for the cumulative sum.
I had posted the same on the postgres list and almost immediately got the response that I could tell the optimizer not to touch the inner query by putting an ‘offset 0’ at the end of the subquery.
Now I recently upgraded my DB to 8.3 (mainly to be able to use FTS) and just found out the cumulative_sum() function doesn’t work any more because of the changed way current_setting() behaves (as you’ve debated already on one of the lists).
A fix for this is simple and easy.
Add the following line at the end of postgresql.conf (after custom_class_variables).
depesz.cumulative_sum_name = ‘initial_value’
That takes care of it.
Regards,
Peter
I added custom_variable_classes = ‘depesz’ and restarted the DB,
but when I run I get:
ERROR: unrecognized configuration parameter “depesz.cumulative_sum_name”
CONTEXT: PL/pgSQL function “cumulative_sum” line 7 at assignment
********** Error **********
ERROR: unrecognized configuration parameter “depesz.cumulative_sum_name”
SQL state: 42704
Context: PL/pgSQL function “cumulative_sum” line 7 at assignment
@Pat Mills:
Did you read the comments? Check the one directly above yours.
Sorry, got it.
Thanks
Hi,
Can we use this as a transaction id?
SELECT l.virtualxid
FROM pg_lock_status() l(locktype text, database oid, relation oid, page integer, tuple smallint, virtualxid text, transactionid xid, classid oid, objid oid, objsubid smallint, virtualtransaction text, pid integer, mode text, granted boolean);
May be it is not actual because of 8.4 has window-functions
@AVN:
Not sure about virtualxid, but I guess it would be slow – locks checking is not really super fast.
And from 8.4 on we have window functions, as you said, so there is no point in making your own hacks.
Unfortunately, 8.3 -> 8.4 migration in some cases is not a simple process, for example because of complex third-party patches for PostreSQL sources.