mage_ from #postgresql had an interesting problem today.
he has a table with 2 date fields, and he wants a list of all years appearing in either field – together, as one list.
his approach:
SELECT date_part('year', date1) FROM test UNION SELECT date_part('year', date2) FROM test;
is hardly satisfactory – it takes too long.
any way to speed it up?
ok, i asked some questions, got some data.
here's what we're starting with:
- total number of rows: 56990
- date1 is in range : 1937-01-01 to 2006-08-31
- date2 is in range : 1937-12-31 to 1996-12-31
- data will grow over time, but there will never be rows with dates prior to 1800-01-01, and the max of the dates will be close to now()
- there are indexes on the date1 and date2 columns (two separate, standard, btree indexes)
ok. let's make a test table:
CREATE TABLE test (id serial PRIMARY KEY, date1 DATE, date2 DATE);
now let's create some data:
INSERT INTO test (date1, date2)
    SELECT random() * 40000 * '1 day'::INTERVAL + '1900-01-01'::DATE,
           random() * 40000 * '1 day'::INTERVAL + '1900-01-01'::DATE
    FROM generate_series(1, 60000);
and the base indexes:
# CREATE INDEX d1 ON test (date1);
# CREATE INDEX d2 ON test (date2);
ok. now we have 60000 rows, with min/max values:
# SELECT MIN(date1), MAX(date1), MIN(date2), MAX(date2) FROM test;
    min     |    max     |    min     |    max
------------+------------+------------+------------
 1900-01-02 | 2009-07-06 | 1900-01-01 | 2009-07-06
(1 row)
it's not a perfect copy of mage's table, but it's close enough.
now, let's try his approach just to get an idea of the timing on my machine:
# EXPLAIN ANALYZE SELECT date_part('year', date1) FROM test UNION SELECT date_part('year', date2) FROM test;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=15152.10..15752.10 rows=120000 width=4) (actual time=1235.563..1731.876 rows=110 loops=1)
   ->  Sort  (cost=15152.10..15452.10 rows=120000 width=4) (actual time=1235.556..1505.904 rows=120000 loops=1)
         Sort Key: (date_part('year'::text, (public.test.date1)::timestamp without time zone))
         Sort Method:  external merge  Disk: 2336kB
         ->  Append  (cost=0.00..3590.00 rows=120000 width=4) (actual time=0.040..760.877 rows=120000 loops=1)
               ->  Seq Scan on test  (cost=0.00..1195.00 rows=60000 width=4) (actual time=0.036..171.935 rows=60000 loops=1)
               ->  Seq Scan on test  (cost=0.00..1195.00 rows=60000 width=4) (actual time=0.018..150.961 rows=60000 loops=1)
 Total runtime: 1763.287 ms
(8 rows)
ok, so we have all that we need.
first – let's check if postgresql will be smart enough not to sort 120000 rows, but instead use indexes on the expressions and do a merge-sort:
# CREATE INDEX q1 ON test (date_part('year', date1));
# CREATE INDEX q2 ON test (date_part('year', date2));
and retry the query.
result – no change.
hmm .. let's try different approach.
i know the range of possible years – it is 1800 to now(). since now() changes, let's assume that now() == 2100, just to have a safe ~90 years of using the query.
for this – i will use standard indexes, so the extra ones can be removed:
# DROP INDEX q1;
# DROP INDEX q2;
so, now for the query. we will use our magic-tool: generate_series:
SELECT i
FROM generate_series(1800, 2100) i
WHERE EXISTS (
    SELECT * FROM test t
    WHERE t.date1 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
) OR EXISTS (
    SELECT * FROM test t
    WHERE t.date2 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
);
if it's not obvious, the algorithm is quite simple:
- for every year in (1800 .. 2100), check if there is any row that has date1 between 'YEAR-01-01' and 'YEAR-12-31'.
- if there is such a row – return this year, and go to the next year
- if there is no such row – check again, but this time using the date2 field.
- if there is such a row – return the year
that's pretty much it.
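the per-year probing above can also be sketched outside the database – here's a minimal python model (my illustration, not mage's code), with sorted lists standing in for the btree indexes on date1/date2:

```python
import bisect
from datetime import date

def years_present(date1s, date2s, first_year, last_year):
    """Return years Y for which any date1 or date2 falls in [Y-01-01, Y-12-31]."""
    cols = [sorted(date1s), sorted(date2s)]
    result = []
    for year in range(first_year, last_year + 1):
        lo, hi = date(year, 1, 1), date(year, 12, 31)
        for col in cols:
            # "EXISTS" = a single index probe: find the first date >= lo
            # and check that it does not exceed hi
            i = bisect.bisect_left(col, lo)
            if i < len(col) and col[i] <= hi:
                result.append(year)
                break  # year found in one column – skip the other check
    return result
```

the point is the same as in the sql version: one cheap range probe per year per column, instead of scanning all rows.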
how does it work?
explain analyze showed this:
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series i  (cost=0.00..5437.40 rows=750 width=4) (actual time=4.384..76.243 rows=110 loops=1)
   Filter: ((subplan) OR (subplan))
   SubPlan
     ->  Index Scan using d2 on test t  (cost=0.03..806.27 rows=300 width=12) (actual time=0.008..0.008 rows=0 loops=191)
           Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date))
     ->  Index Scan using d1 on test t  (cost=0.03..806.25 rows=300 width=12) (actual time=0.227..0.227 rows=0 loops=301)
           Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date))
 Total runtime: 76.642 ms
(8 rows)
pretty neat 🙂
the problem with it is that it doesn't look good. (1800, 2100)? and what if i wanted to add something from 1799? or 2200? i would have to modify the query.
also – in our current case – we do a lot of empty runs – since i know that there are no rows from before 1900, or after 2009.
any chance to get it done right?
first – let's find the lowest year number:
# EXPLAIN ANALYZE SELECT least(MIN(date1), MIN(date2)) FROM test;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.09..0.10 rows=1 width=0) (actual time=0.115..0.117 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.054..0.056 rows=1 loops=1)
           ->  Index Scan using d1 on test  (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.049..0.049 rows=1 loops=1)
                 Filter: (date1 IS NOT NULL)
     ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.040..0.043 rows=1 loops=1)
           ->  Index Scan using d2 on test  (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.035..0.035 rows=1 loops=1)
                 Filter: (date2 IS NOT NULL)
 Total runtime: 0.198 ms
(9 rows)
this looks reasonably fast.
now, the last possible year – quite simple as well:
SELECT greatest(MAX(date1), MAX(date2)) FROM test;
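both bounds can of course be fetched in one statement – a trivial combination of the two queries above (the `first_year`/`last_year` names are just illustrative):

```sql
SELECT EXTRACT('year' FROM least(MIN(date1), MIN(date2)))::int4    AS first_year,
       EXTRACT('year' FROM greatest(MAX(date1), MAX(date2)))::int4 AS last_year
FROM test;
```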
now i can rewrite the generate_series query to make it:
- data-change-proof
- not searching in years that definitely don't have any data.
EXPLAIN ANALYZE
SELECT i
FROM generate_series(
    (SELECT EXTRACT('year' FROM least(MIN(date1), MIN(date2)))::int4 FROM test),
    (SELECT EXTRACT('year' FROM greatest(MAX(date1), MAX(date2)))::int4 FROM test)
) i
WHERE EXISTS (
    SELECT * FROM test t
    WHERE t.date1 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
) OR EXISTS (
    SELECT * FROM test t
    WHERE t.date2 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
);
                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series i  (cost=0.21..5437.61 rows=750 width=4) (actual time=0.236..4.409 rows=110 loops=1)
   Filter: ((subplan) OR (subplan))
   InitPlan
     ->  Result  (cost=0.09..0.10 rows=1 width=0) (actual time=0.082..0.084 rows=1 loops=1)
           InitPlan
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.035..0.037 rows=1 loops=1)
                   ->  Index Scan using d1 on test  (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.031..0.031 rows=1 loops=1)
                         Filter: (date1 IS NOT NULL)
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
                   ->  Index Scan using d2 on test  (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.018..0.018 rows=1 loops=1)
                         Filter: (date2 IS NOT NULL)
     ->  Result  (cost=0.09..0.10 rows=1 width=0) (actual time=0.032..0.034 rows=1 loops=1)
           InitPlan
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.010..0.011 rows=1 loops=1)
                   ->  Index Scan Backward using d1 on test  (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.006..0.006 rows=1 loops=1)
                         Filter: (date1 IS NOT NULL)
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.009..0.011 rows=1 loops=1)
                   ->  Index Scan Backward using d2 on test  (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.005..0.005 rows=1 loops=1)
                         Filter: (date2 IS NOT NULL)
   SubPlan
     ->  Index Scan using d2 on test t  (cost=0.03..806.27 rows=300 width=12) (never executed)
           Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date))
     ->  Index Scan using d1 on test t  (cost=0.03..806.25 rows=300 width=12) (actual time=0.030..0.030 rows=1 loops=110)
           Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date))
 Total runtime: 4.735 ms
so i went from 1760 ms to 4.7 ms. not bad.
of course the code is ugly. and much harder to understand. but hey – you can't win them all 🙂
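one way to hide the ugliness (a sketch of my own, not something from the discussion above) would be to wrap the final query in a view, so applications only ever see the readable part:

```sql
-- hypothetical wrapper view; assumes the same "test" table and indexes d1/d2
CREATE VIEW used_years AS
SELECT i AS year
FROM generate_series(
    (SELECT EXTRACT('year' FROM least(MIN(date1), MIN(date2)))::int4 FROM test),
    (SELECT EXTRACT('year' FROM greatest(MAX(date1), MAX(date2)))::int4 FROM test)
) i
WHERE EXISTS (
    SELECT * FROM test t
    WHERE t.date1 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
) OR EXISTS (
    SELECT * FROM test t
    WHERE t.date2 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
);

-- usage: SELECT year FROM used_years;
```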
Now I'm impressed ;). Partial indexes are simply poor in this case. I tried the query
SELECT DISTINCT date_part('year', …
which did use the index, but the execution time was only cut in half compared to the original query.
Oops, should be in English. I'm impressed. Expression indexes are used by SELECT DISTINCT date_part… but it's only 50% faster than the original query. So, you're the winner 😉