Cezio wrote post about removing elements from arrays in PostgreSQL.
Unfortunately his blog engine requires registration before comment, which I don't like, so I decided to comment using my own blogspace.
The approach Cezio showed is pretty cool – having more functions to handle arrays is definitely cool, but it might be better to use sql functions instead of pl/pgsql.
I did a quick test with 2 functions. Pl/PgSQL was taken from Cezio post – I just changed the name, and I wrote my own version in SQL. Sources:
CREATE OR REPLACE FUNCTION array_remove_plpgsql(arr int8[], other_arr int8[]) RETURNS int8[] AS $$ DECLARE out_arr int8[]; el_idx INT; BEGIN IF arr IS NULL OR other_arr IS NULL THEN RETURN arr; END IF; FOR el_idx IN array_lower(arr, 1)..array_upper(arr, 1) loop IF NOT arr[el_idx] =any(other_arr) THEN out_arr = array_append(out_arr, arr[el_idx]); END IF; END loop; RETURN out_arr; END; $$ LANGUAGE plpgsql immutable;
CREATE OR REPLACE FUNCTION array_remove_sql(int8[], int8[]) RETURNS int8[] AS $$ SELECT array(SELECT $1[i] FROM generate_series(array_lower($1,1), array_upper($1,1)) i WHERE $1[i] <> ALL($2)); $$ LANGUAGE SQL immutable;
Then I ran it many times. 100000 times to be exact:
EXPLAIN analyze SELECT array_remove_plpgsql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); EXPLAIN analyze SELECT array_remove_sql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000);
Results:
PL/PgSQL version:
# EXPLAIN analyze SELECT array_remove_plpgsql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- FUNCTION Scan ON generate_series (cost=0.00..12.50 ROWS=1000 width=0) (actual TIME=36.095..176.331 ROWS=100000 loops=1) Total runtime: 308.723 ms (2 ROWS) # EXPLAIN analyze SELECT array_remove_plpgsql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- FUNCTION Scan ON generate_series (cost=0.00..12.50 ROWS=1000 width=0) (actual TIME=35.962..173.675 ROWS=100000 loops=1) Total runtime: 304.880 ms (2 ROWS) # EXPLAIN analyze SELECT array_remove_plpgsql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- FUNCTION Scan ON generate_series (cost=0.00..12.50 ROWS=1000 width=0) (actual TIME=36.279..175.507 ROWS=100000 loops=1) Total runtime: 308.333 ms (2 ROWS)
Pretty stable result, around 305ms.
SQL version:
# EXPLAIN analyze SELECT array_remove_sql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- FUNCTION Scan ON generate_series (cost=0.00..12.50 ROWS=1000 width=0) (actual TIME=13.459..149.143 ROWS=100000 loops=1) Total runtime: 277.871 ms (2 ROWS) # EXPLAIN analyze SELECT array_remove_sql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- FUNCTION Scan ON generate_series (cost=0.00..12.50 ROWS=1000 width=0) (actual TIME=13.429..144.672 ROWS=100000 loops=1) Total runtime: 270.746 ms (2 ROWS) # EXPLAIN analyze SELECT array_remove_sql(array[1,2,3,4,5,6,7,8,9,10]::int8[],array[9,7,5,3,1]::int8[]) FROM generate_series(1,100000); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- FUNCTION Scan ON generate_series (cost=0.00..12.50 ROWS=1000 width=0) (actual TIME=13.511..148.526 ROWS=100000 loops=1) Total runtime: 274.809 ms (2 ROWS)
Not bad. PlPgSQL is only 10% slower, while being more readable. Is the trade-off worth it – you have to decide for yourself.
Additionally, one factual correction. Cezio said “Jeśli używacie PostgreSQL'a w wersji >=8.3, to w contribie jest pakiet intarray" which means “If you use PostgreSQL 8.3 or newer, there is contrib module intarray".
Actually – intarray has been in contrib since at least 7.3. But Cezio is 100% right – it's definitely worth checking, as it is simply awesome for handling int4[].
Difference between plpgsql and sql will be bigger with larger arrays (array update is expensive operation). for n=10000 is plpgsql very bad choice.
On my machine with Postgres 8.2 the plpgsql version of your example runs 25% faster.
Great Post! Been trying to figure out how to do this for the last 3 days!
I did some testing and tweaking.
The Test on postgres 8.3:
306.370ms – plpgsql (Winner) AMD Quad-Core??
1849.881ms – sql
The Tweek:
I re-wrote this to allow for any data type.
CREATE OR REPLACE FUNCTION array_difference(anyarray, anyarray)
RETURNS anyarray AS
$BODY$
declare
out_arr $1%TYPE;
el_idx int;
begin
if $1 is null or $2 is null then
return $1;
end if;
for el_idx in array_lower($1, 1)..array_upper($1, 1) loop
if not $1[el_idx] =any($2) then
out_arr = array_append(out_arr, $1[el_idx]);
end if;
end loop;
return out_arr;
end;
$BODY$
LANGUAGE ‘plpgsql’ IMMUTABLE
COST 100;
ALTER FUNCTION array_difference(anyarray, anyarray) OWNER TO postgres;