Today, on irc (#postgresql on freenode.net) Dim mentioned about writing median calculation code.
It got me thinking, and consequently writing my version of median calculation code.
Before I will go any further, let me just say – after I finished, and was quite happy with what I wrote, RhodiumToad showed his approach which is (of course) much better:
15:04 < RhodiumToad> SELECT avg(x) FROM (SELECT x, ROW_NUMBER() OVER (ORDER BY x),COUNT(*) OVER () FROM ...) s WHERE ROW_NUMBER BETWEEN FLOOR((COUNT::f 15:04 < RhodiumToad> loat8-1)/2+1) AND CEIL((COUNT::float8-1)/2+1);
Rewritten:
SELECT avg(x) FROM ( SELECT x, ROW_NUMBER() OVER (ORDER BY x),COUNT(*) OVER () FROM ... ) s WHERE ROW_NUMBER BETWEEN FLOOR((COUNT::float8-1)/2+1) AND CEIL((COUNT::float8-1)/2+1)
I love his approach. But, it will not make me not show you my way 🙂
Basically – I wanted to make it as an aggregate.
The idea is, so that you will be able to write query like this:
SELECT c, median(i) FROM test GROUP BY c
So, the code looks like this:
CREATE OR REPLACE FUNCTION median_aggregate_f( in_array NUMERIC[] ) RETURNS NUMERIC AS $$ DECLARE element_count INT4; get_rows INT4 := 1; reply NUMERIC; BEGIN element_count := array_upper(in_array, 1) - array_lower(in_array, 1); IF element_count IS NULL THEN RETURN NULL; END IF; get_rows := get_rows + ( element_count % 2 ); SELECT avg(e) INTO reply FROM ( SELECT unnest(in_array) AS e ORDER BY e LIMIT get_rows OFFSET FLOOR(element_count / 2) ) x; RETURN reply; END; $$ LANGUAGE plpgsql; CREATE aggregate median ( NUMERIC ) ( SFUNC = array_append, STYPE = NUMERIC[], FINALFUNC = median_aggregate_f, INITCOND = '{}' );
The way it works is pretty simple. Let's start from the end:
CREATE aggregate median ( NUMERIC ) ( SFUNC = array_append, STYPE = NUMERIC[], FINALFUNC = median_aggregate_f, INITCOND = '{}' );
This creates the actual aggregate. The aggregate is named “median" and works on numeric values.
For every value, it calls array_append function (taken from some other aggregate, but since it does what we need – why not use it?
This function (array_append) will convert all numeric values (in one group) into numeric[] – that is array of numeric values.
After the last row of group has been fed to array_append, PostgreSQL, will run median_aggregate_f function which does the necessary calculation:
CREATE OR REPLACE FUNCTION median_aggregate_f( in_array NUMERIC[] ) RETURNS NUMERIC AS $$ DECLARE element_count INT4; get_rows INT4 := 1; reply NUMERIC; BEGIN element_count := array_upper(in_array, 1) - array_lower(in_array, 1); IF element_count IS NULL THEN RETURN NULL; END IF; get_rows := get_rows + ( element_count % 2 ); SELECT avg(e) INTO reply FROM ( SELECT unnest(in_array) AS e ORDER BY e LIMIT get_rows OFFSET FLOOR(element_count / 2) ) x; RETURN reply; END; $$ LANGUAGE plpgsql;
First, the function checks for element_count in given array, and if there are no elements – it returns NULL.
Then it decided if it has to get 1 or 2 rows to calculate median (median for {1,2,3} is 2, but median for {1,2,3,4} is average of (2,3) – so we have to take 2 elements).
Afterward code just unnests given array, orders it by value, get appropriate number of rows, and calculates average out of them.
That's basically all.
One could change the code to accept ints or floats, but numeric is very versatile, so I think it's better to start with version of median() that works on it, instead of any other data type.
yes, but this all is still highly inefficient for large set of data. Because you need to store in an array huge amount of data, and lets be honest, arrays are slow, and require loads of memory.
I wish there was a way to do it, using temporary tables instead…