lately i ran into quite an interesting problem: how to calculate a cumulative sum across some dataset.
if you search for cumulative sum, you will find some answers; most of them revolve around using a subselect or a join.
so, how to calculate it?
first, let's assume a very simple table:
# CREATE TABLE test (id serial PRIMARY KEY, user_id int4, value int4);
nothing fancy.
now, let's put some test data into the table:
INSERT INTO test (user_id, value) VALUES (2222, 120);
INSERT INTO test (user_id, value) VALUES (2222, 1566);
INSERT INTO test (user_id, value) VALUES (1111, 10);
INSERT INTO test (user_id, value) VALUES (2222, 123);
INSERT INTO test (user_id, value) VALUES (3333, 1567);
INSERT INTO test (user_id, value) VALUES (2222, 1234);
INSERT INTO test (user_id, value) VALUES (1111, 4);
INSERT INTO test (user_id, value) VALUES (1111, 1);
INSERT INTO test (user_id, value) VALUES (1111, 3);
INSERT INTO test (user_id, value) VALUES (1111, 5);
INSERT INTO test (user_id, value) VALUES (1111, 2);
INSERT INTO test (user_id, value) VALUES (3333, 1588);
this data is purposely not ordered.
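(as a side note – on postgresql 8.2 or newer the same data could be loaded with a single multi-row INSERT; a sketch:

INSERT INTO test (user_id, value) VALUES
    (2222, 120), (2222, 1566), (1111, 10), (2222, 123),
    (3333, 1567), (2222, 1234), (1111, 4), (1111, 1),
    (1111, 3), (1111, 5), (1111, 2), (3333, 1588);
)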
now, let's say that i would like to get the rows ordered like this:
# SELECT * FROM test ORDER BY user_id, id;
which means that records will be “grouped” by user_id (not in terms of sql “GROUP BY”, but in terms of row order), and rows for any particular user will be sorted by id – which can be understood as oldest first.
let's check how it will look:
# SELECT * FROM test ORDER BY user_id, id;
 id | user_id | value
----+---------+-------
  3 |    1111 |    10
  7 |    1111 |     4
  8 |    1111 |     1
  9 |    1111 |     3
 10 |    1111 |     5
 11 |    1111 |     2
  1 |    2222 |   120
  2 |    2222 |  1566
  4 |    2222 |   123
  6 |    2222 |  1234
  5 |    3333 |  1567
 12 |    3333 |  1588
(12 rows)
ok, so how do we add a cumulative sum?
the basic approach is like this:
SELECT t1.*, SUM(t2.value)
FROM test t1 JOIN test t2 ON t2.user_id = t1.user_id AND t2.id <= t1.id
GROUP BY t1.user_id, t1.id, t1.value
ORDER BY user_id, id;
and it actually works quite nicely:
 id | user_id | value | sum
----+---------+-------+------
  3 |    1111 |    10 |   10
  7 |    1111 |     4 |   14
  8 |    1111 |     1 |   15
  9 |    1111 |     3 |   18
 10 |    1111 |     5 |   23
 11 |    1111 |     2 |   25
  1 |    2222 |   120 |  120
  2 |    2222 |  1566 | 1686
  4 |    2222 |   123 | 1809
  6 |    2222 |  1234 | 3043
  5 |    3333 |  1567 | 1567
 12 |    3333 |  1588 | 3155
(12 rows)
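as a side note – the subselect variant mentioned at the beginning does the same thing with a correlated subquery instead of a join. a sketch (equivalent results, not something i benchmarked here):

SELECT t1.*,
       (SELECT SUM(t2.value)
        FROM test t2
        WHERE t2.user_id = t1.user_id AND t2.id <= t1.id) AS sum
FROM test t1
ORDER BY user_id, id;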
but, is it the best approach?
let's find out. first, i'll write a simple function in pl/perl:
CREATE OR REPLACE FUNCTION cumulative_sum(TEXT, INT8) RETURNS INT8 LANGUAGE plperl AS $BODY$
my ($code, $value) = @_;
# keep a separate running counter for every code, in the session-wide %_SHARED hash
$_SHARED{'cumulative_sum_' . $code} += $value;
return $_SHARED{'cumulative_sum_' . $code};
$BODY$;
the code seems simple enough that i will not dig into the details of how it works – if you don't understand it, i think any kind of perl tutorial will be handy. the one pl/perl-specific bit is %_SHARED – a hash that persists for the lifetime of the database connection, which is what lets the counters survive between function calls.
so, now, let's check if it will work:
# SELECT id, user_id, value, cumulative_sum(user_id::text, value) FROM test ORDER BY user_id;
 id | user_id | value | cumulative_sum
----+---------+-------+----------------
  7 |    1111 |     4 |             14
  8 |    1111 |     1 |             15
  9 |    1111 |     3 |             18
 10 |    1111 |     5 |             23
 11 |    1111 |     2 |             25
  3 |    1111 |    10 |             10
  1 |    2222 |   120 |            120
  2 |    2222 |  1566 |           1686
  4 |    2222 |   123 |           1809
  6 |    2222 |  1234 |           3043
  5 |    3333 |  1567 |           1567
 12 |    3333 |  1588 |           3155
(12 rows)
hmm… it shows some numbers, but they are not correct. why? it's obvious once you look at the data flow: postgresql first calculated cumulative_sum() for each row, and only then sorted – so the function saw the rows in mangled order. any way to fix it?
sure:
SELECT *, cumulative_sum(user_id::text, value)
FROM (SELECT * FROM test ORDER BY user_id) x;
the results are strange – the returned numbers are too high. why? of course – the counters from the previous query were not removed, so the new values were added on top of them.
let's just disconnect, reconnect and re-run the query:
 id | user_id | value | cumulative_sum
----+---------+-------+----------------
  7 |    1111 |     4 |              4
  8 |    1111 |     1 |              5
  9 |    1111 |     3 |              8
 10 |    1111 |     5 |             13
 11 |    1111 |     2 |             15
  3 |    1111 |    10 |             25
  1 |    2222 |   120 |            120
  2 |    2222 |  1566 |           1686
  4 |    2222 |   123 |           1809
  6 |    2222 |  1234 |           3043
  5 |    3333 |  1567 |           1567
 12 |    3333 |  1588 |           3155
(12 rows)
now everything's fine, but there are 2 problems:
1. if i have 100000 users in the table, it will allocate 100000 counters.
2. it is next-to-impossible to cleanly remove all the allocated counters, so a second cumulative_sum() query cannot be run correctly in the same connection.
can we do anything about it? sure – here's a new version of the function:
CREATE OR REPLACE FUNCTION cumulative_sum(TEXT, INT8) RETURNS INT8 LANGUAGE plperl AS $BODY$
my ($code, $value) = @_;
# a new code means we started a new group of rows - reset the single shared counter
if ($code ne $_SHARED{'cumulative_sum_code'}) {
    $_SHARED{'cumulative_sum_value'} = 0;
    $_SHARED{'cumulative_sum_code'} = $code;
}
$_SHARED{'cumulative_sum_value'} += $value;
return $_SHARED{'cumulative_sum_value'};
$BODY$;
a quick test to check that it works:
SELECT *, cumulative_sum(user_id::text, value)
FROM (SELECT * FROM test ORDER BY user_id, id) x;
shows that everything's fine – the sums match the ones from the join approach.
and how about counters being kept between calls?
2 separate calls to our query:

# SELECT *, cumulative_sum(user_id::text, value)
  FROM (SELECT * FROM test ORDER BY user_id, id) x;

show that it looks ok, but if we want to output data for only one user, the results get damaged. like here:
# SELECT *, cumulative_sum(user_id::text, value)
  FROM (SELECT * FROM test WHERE user_id = 1111 ORDER BY user_id, id) x;
 id | user_id | value | cumulative_sum
----+---------+-------+----------------
  3 |    1111 |    10 |             10
  7 |    1111 |     4 |             14
  8 |    1111 |     1 |             15
  9 |    1111 |     3 |             18
 10 |    1111 |     5 |             23
 11 |    1111 |     2 |             25
(6 rows)

# SELECT *, cumulative_sum(user_id::text, value)
  FROM (SELECT * FROM test WHERE user_id = 1111 ORDER BY user_id, id) x;
 id | user_id | value | cumulative_sum
----+---------+-------+----------------
  3 |    1111 |    10 |             35
  7 |    1111 |     4 |             39
  8 |    1111 |     1 |             40
  9 |    1111 |     3 |             43
 10 |    1111 |     5 |             48
 11 |    1111 |     2 |             50
(6 rows)
is there a way to fix it?
yes, there is. unfortunately i don't know how to make it automatic, but if you call this query:
# SELECT cumulative_sum('a', 0) + cumulative_sum('b', 0) + cumulative_sum('a', 0);
it will clear the counters (if you don't understand why – take a look at the “if” in the function code).
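another option – instead of relying on that side effect, we could add an explicit reset function. this is just my sketch (the name cumulative_sum_reset is made up, it was not part of the original setup):

CREATE OR REPLACE FUNCTION cumulative_sum_reset() RETURNS VOID LANGUAGE plperl AS $BODY$
# overwrite the remembered code with a value that no real code will match,
# so the next cumulative_sum() call will start counting from zero
$_SHARED{'cumulative_sum_code'} = '';
$_SHARED{'cumulative_sum_value'} = 0;
return;
$BODY$;

calling SELECT cumulative_sum_reset(); between queries achieves the same thing as the trick above.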
what's good – this version uses memory only for one counter and one textual code.
and what about speed?
let's test it:
DROP TABLE test;
CREATE TABLE test AS
    SELECT i AS id,
           CAST(random() * 9 AS INT4) + 1 AS user_id,
           CAST(random() * 1000 AS INT4) AS value
    FROM generate_series(1, 10000) i;
ALTER TABLE test ADD PRIMARY KEY (id);
CREATE UNIQUE INDEX q ON test (user_id, id);
now – we have 10000 rows, with 10 different “users”.
let's check the standard (join) approach:
EXPLAIN ANALYZE
SELECT t1.*, SUM(t2.value)
FROM test t1 JOIN test t2 ON t2.user_id = t1.user_id AND t2.id <= t1.id
GROUP BY t1.user_id, t1.id, t1.value
ORDER BY user_id, id;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=11388.06..11413.06 rows=10000 width=16) (actual time=43634.695..43652.221 rows=10000 loops=1)
   Sort Key: t1.user_id, t1.id
   Sort Method:  quicksort  Memory: 647kB
   ->  HashAggregate  (cost=10598.67..10723.67 rows=10000 width=16) (actual time=43581.713..43604.311 rows=10000 loops=1)
         ->  Nested Loop  (cost=0.00..8932.00 rows=166667 width=16) (actual time=0.080..30977.919 rows=5257451 loops=1)
               ->  Seq Scan on test t1  (cost=0.00..150.00 rows=10000 width=12) (actual time=0.013..19.832 rows=10000 loops=1)
               ->  Index Scan using q on test t2  (cost=0.00..0.62 rows=17 width=12) (actual time=0.021..1.116 rows=526 loops=10000)
                     Index Cond: ((t2.user_id = t1.user_id) AND (t2.id <= t1.id))
 Total runtime: 43669.583 ms
(9 rows)
not bad for 10000 rows. will a second call change anything? or a third? not really – total runtimes were 43960.617 ms and 42754.347 ms.
ok, so how about the function approach?
EXPLAIN ANALYZE
SELECT *, cumulative_sum(user_id::text, value)
FROM (SELECT * FROM test ORDER BY user_id, id) x;
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Subquery Scan x  (cost=0.00..3144.08 rows=10000 width=12) (actual time=0.161..287.213 rows=10000 loops=1)
   ->  Index Scan using q on test  (cost=0.00..469.08 rows=10000 width=12) (actual time=0.042..26.680 rows=10000 loops=1)
 Total runtime: 308.241 ms
(3 rows)
the next calls gave me times of 317.467 ms and 316.621 ms.
pretty nice – down from 43 seconds to 0.3 seconds.
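as a final note – on postgresql 8.4 or newer the whole problem goes away, because window functions can compute a running sum in plain sql, with no shared state to worry about. a sketch:

SELECT *, SUM(value) OVER (PARTITION BY user_id ORDER BY id) AS cumulative_sum
FROM test
ORDER BY user_id, id;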