mage_ from #postgresql had an interesting problem today.
he has a table with 2 date fields, and he wants a list of all years that appear in either field. together. as one list.
his approach:
select date_part('year', date1) from test union select date_part('year', date2) from test;
is hardly satisfactory – it takes too long.
any way to speed it up?
ok, i asked some questions, got some data.
here's what we're starting with:
- total number of rows: 56990
- date1 is in range : 1937-01-01 to 2006-08-31
- date2 is in range : 1937-12-31 to 1996-12-31
- data will grow over time, but there will never be rows with a date prior to 1800-01-01, and the max of the dates will be close to now()
- there are indexes on the date1 and date2 columns (two separate, standard, btree indexes)
ok. let's make a test table:
create table test (id serial primary key, date1 date, date2 date);
now let's create some data:
insert into test (date1, date2)
    select
        random() * 40000 * '1 day'::interval + '1900-01-01'::date,
        random() * 40000 * '1 day'::interval + '1900-01-01'::date
    from generate_series(1, 60000);
and the base indexes:
# create index d1 on test (date1);
# create index d2 on test (date2);
ok. now we have 60000 rows, with min/max values:
# select min(date1), max(date1), min(date2), max(date2) from test;
    min     |    max     |    min     |    max
------------+------------+------------+------------
 1900-01-02 | 2009-07-06 | 1900-01-01 | 2009-07-06
(1 row)
it's not a perfect copy of mage's table, but it's close enough.
now. let's try his approach just to get some idea on timing on my machine:
# explain analyze select date_part('year', date1) from test union select date_part('year', date2) from test;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=15152.10..15752.10 rows=120000 width=4) (actual time=1235.563..1731.876 rows=110 loops=1)
   ->  Sort  (cost=15152.10..15452.10 rows=120000 width=4) (actual time=1235.556..1505.904 rows=120000 loops=1)
         Sort Key: (date_part('year'::text, (public.test.date1)::timestamp without time zone))
         Sort Method:  external merge  Disk: 2336kB
         ->  Append  (cost=0.00..3590.00 rows=120000 width=4) (actual time=0.040..760.877 rows=120000 loops=1)
               ->  Seq Scan on test  (cost=0.00..1195.00 rows=60000 width=4) (actual time=0.036..171.935 rows=60000 loops=1)
               ->  Seq Scan on test  (cost=0.00..1195.00 rows=60000 width=4) (actual time=0.018..150.961 rows=60000 loops=1)
 Total runtime: 1763.287 ms
(8 rows)
ok, so we have all that we need.
first – let's check if postgresql will be smart enough not to sort 120000 rows, but to use indexes on the expressions and do a merge-sort:
# create index q1 on test (date_part('year', date1));
# create index q2 on test (date_part('year', date2));
and retry the query.
result – no change.
hmm .. let's try different approach.
i know the range of possible years – it is 1800 to now(). since now() can change, let's assume that now() == 2100 – just to have a safe 90 years of using the query.
for this – i will use standard indexes, so the extra ones can be removed:
# drop index q1;
# drop index q2;
so, now for the query. we will use our magic-tool: generate_series:
select i
from generate_series(1800, 2100) i
where exists (
    select * from test t
    where t.date1 between cast(i::text || '-01-01' as date) and cast(i::text || '-12-31' as date)
)
or exists (
    select * from test t
    where t.date2 between cast(i::text || '-01-01' as date) and cast(i::text || '-12-31' as date)
);
if it's not obvious, the algorithm is quite simple:
- for every year in (1800 .. 2100), check if there is any row that has date1 between 'YEAR-01-01' and 'YEAR-12-31'.
- if there is such a row – return this year, and go to the next year
- if there is no such row – check again, but this time using the date2 field.
- if there is such a row – return the year
that's pretty much it.
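to make the per-year check concrete, here's one iteration of the loop expanded by hand (the year 1950 is just an arbitrary example, not taken from mage's data):

```sql
-- what a single generate_series iteration boils down to,
-- with the year hardcoded to 1950 for illustration:
select 1950 as i
where exists (
    select * from test t
    where t.date1 between '1950-01-01'::date and '1950-12-31'::date
)
or exists (
    select * from test t
    where t.date2 between '1950-01-01'::date and '1950-12-31'::date
);
```

both between conditions compare the indexed column directly to constants, so each exists is just a cheap index probe that stops at the first matching row.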
how does it work?
explain analyze showed this:
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series i  (cost=0.00..5437.40 rows=750 width=4) (actual time=4.384..76.243 rows=110 loops=1)
   Filter: ((subplan) OR (subplan))
   SubPlan
     ->  Index Scan using d2 on test t  (cost=0.03..806.27 rows=300 width=12) (actual time=0.008..0.008 rows=0 loops=191)
           Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date))
     ->  Index Scan using d1 on test t  (cost=0.03..806.25 rows=300 width=12) (actual time=0.227..0.227 rows=0 loops=301)
           Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date))
 Total runtime: 76.642 ms
(8 rows)
pretty neat 🙂
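side note: the string concatenation used to build the year boundaries works fine, but on newer postgresql (9.4 and later) make_date() gives equivalent sargable conditions without the text casts. this is just a sketch of an alternative spelling, not what was actually run here:

```sql
-- equivalent year check using make_date() (postgresql 9.4+);
-- the half-open form avoids spelling out december 31st:
select i
from generate_series(1800, 2100) i
where exists (
    select * from test t
    where t.date1 >= make_date(i, 1, 1)
      and t.date1 <  make_date(i + 1, 1, 1)
)
or exists (
    select * from test t
    where t.date2 >= make_date(i, 1, 1)
      and t.date2 <  make_date(i + 1, 1, 1)
);
```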
the problem with it is that it doesn't look good. (1800, 2100)? and what if i'd want to add something from 1799? or 2200? i would have to modify the query.
also – in our current case – we do a lot of empty runs – as i know that there is no row from before 1900, or after 2009.
any chance to get it done right?
first – let's find the lowest year number:
# explain analyze select least(min(date1), min(date2)) from test;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.09..0.10 rows=1 width=0) (actual time=0.115..0.117 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.054..0.056 rows=1 loops=1)
           ->  Index Scan using d1 on test  (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.049..0.049 rows=1 loops=1)
                 Filter: (date1 IS NOT NULL)
     ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.040..0.043 rows=1 loops=1)
           ->  Index Scan using d2 on test  (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.035..0.035 rows=1 loops=1)
                 Filter: (date2 IS NOT NULL)
 Total runtime: 0.198 ms
(9 rows)
this looks reasonably fast.
now, the last possible year – quite simple as well:
select greatest(max(date1), max(date2)) from test;
now i can rewrite the generate_series query to make it:
- data-change-proof
- not searching in years that definitely don't have any data.
explain analyze
select i
from generate_series(
    (select extract('year' from least(min(date1), min(date2)))::int4 from test),
    (select extract('year' from greatest(max(date1), max(date2)))::int4 from test)
) i
where exists (
    select * from test t
    where t.date1 between cast(i::text || '-01-01' as date) and cast(i::text || '-12-31' as date)
)
or exists (
    select * from test t
    where t.date2 between cast(i::text || '-01-01' as date) and cast(i::text || '-12-31' as date)
);
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series i  (cost=0.21..5437.61 rows=750 width=4) (actual time=0.236..4.409 rows=110 loops=1)
   Filter: ((subplan) OR (subplan))
   InitPlan
     ->  Result  (cost=0.09..0.10 rows=1 width=0) (actual time=0.082..0.084 rows=1 loops=1)
           InitPlan
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.035..0.037 rows=1 loops=1)
                   ->  Index Scan using d1 on test  (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.031..0.031 rows=1 loops=1)
                         Filter: (date1 IS NOT NULL)
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
                   ->  Index Scan using d2 on test  (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.018..0.018 rows=1 loops=1)
                         Filter: (date2 IS NOT NULL)
     ->  Result  (cost=0.09..0.10 rows=1 width=0) (actual time=0.032..0.034 rows=1 loops=1)
           InitPlan
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.010..0.011 rows=1 loops=1)
                   ->  Index Scan Backward using d1 on test  (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.006..0.006 rows=1 loops=1)
                         Filter: (date1 IS NOT NULL)
             ->  Limit  (cost=0.00..0.04 rows=1 width=8) (actual time=0.009..0.011 rows=1 loops=1)
                   ->  Index Scan Backward using d2 on test  (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.005..0.005 rows=1 loops=1)
                         Filter: (date2 IS NOT NULL)
   SubPlan
     ->  Index Scan using d2 on test t  (cost=0.03..806.27 rows=300 width=12) (never executed)
           Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date))
     ->  Index Scan using d1 on test t  (cost=0.03..806.25 rows=300 width=12) (actual time=0.030..0.030 rows=1 loops=110)
           Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date))
 Total runtime: 4.735 ms
so i went from ~1760 ms down to 4.7 ms. not bad.
of course the code is ugly. and much harder to understand. but hey – you can't win them all 🙂
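one way to hide the ugliness (just a sketch – the function name list_years() is made up for this example, it's not from the original post) is to wrap the whole thing in an sql function:

```sql
-- hypothetical wrapper around the final query; list_years() is an invented name
create or replace function list_years() returns setof int4 as $$
    select i
    from generate_series(
        (select extract('year' from least(min(date1), min(date2)))::int4 from test),
        (select extract('year' from greatest(max(date1), max(date2)))::int4 from test)
    ) i
    where exists (
        select * from test t
        where t.date1 between cast(i::text || '-01-01' as date)
                          and cast(i::text || '-12-31' as date)
    )
    or exists (
        select * from test t
        where t.date2 between cast(i::text || '-01-01' as date)
                          and cast(i::text || '-12-31' as date)
    );
$$ language sql stable;
```

then getting the list is simply: select * from list_years();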
(translated from Polish:) Well, I'm impressed ;). Partial indexes are simply poor in this case. I tried the query
SELECT DISTINCT date_part('year', …
which did use the index, but the execution time only dropped by half compared to the original query.
Oops, should be in English. I'm impressed. Expression indexes are used in SELECT DISTINCT date_part… but it's only 50% faster than the original query. So, you're the winner 😉