I wrote about similar things a couple of times, but recently I found a thread on the pgsql-general mailing list that made me think about it again.
Summary of the problem from the mail: we have a table, ~ 800 million rows, with at least 2 columns:
- station – 170 distinct values
- channel – generally 1-3 channels per station
And then we want to run:
SELECT station, array_agg(DISTINCT channel) AS channels FROM data GROUP BY station;
Which, on Israel's (the original poster's) machine, took ~ 5 minutes.
And this is with an index on data (station, channel).
Can we do better?
To have a way to test it, I made myself a table:
=$ CREATE TABLE data ( ts timestamptz, station int4, channel int4, value float8 );
and loaded 800 million rows that have the following stats (a sketch of how similar data could be generated follows the list):
- ts – semi-randomly generated timestamp, from 2019-03-15 12:10:51+01 to 2021-09-26 15:29:36+02. Timestamps are always rounded to full seconds, and there are ~ 10 records per second.
- station – value from 1 to 200
- channel – generally a value from 1 to 150. Each station has a fixed list of channels – from 50 to 145 of them, picked randomly.
- value – random float value with up to 2 decimals
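To be clear, I'm not showing the exact load script, but a scaled-down sketch of generating similar data could look more or less like this (it doesn't reproduce the fixed per-station channel lists, and it makes 1 million rows instead of 800 million):
=$ INSERT INTO data (ts, station, channel, value)
    SELECT
        '2019-03-15 12:10:51+01'::timestamptz + (i / 10) * interval '1 second',  -- ~10 rows per second
        1 + (random() * 199)::int4,   -- station 1..200
        1 + (random() * 149)::int4,   -- channel 1..150
        round((random() * 1000000)::numeric, 2)::float8  -- random value, 2 decimals
    FROM generate_series(0, 999999) AS i;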
Sample:
=$ SELECT * FROM data LIMIT 10;
           ts           │ station │ channel │   value
────────────────────────┼─────────┼─────────┼───────────
 2019-03-15 12:10:52+01 │      38 │      26 │ 894387.96
 2019-03-15 12:10:53+01 │      42 │      58 │ 337180.77
 2019-03-15 12:10:53+01 │      49 │      16 │ 939004.82
 2019-03-15 12:10:52+01 │     108 │      14 │ 343292.31
 2019-03-15 12:10:53+01 │     108 │      36 │ 852999.79
 2019-03-15 12:10:52+01 │     138 │      56 │ 375714.57
 2019-03-15 12:10:52+01 │      30 │      29 │ 195615.94
 2019-03-15 12:10:51+01 │      97 │      61 │ 945488.73
 2019-03-15 12:10:53+01 │     135 │      33 │ 263851.91
 2019-03-15 12:10:51+01 │      36 │      10 │ 626327.65
(10 rows)
After loading the data, I made an index, just like the original mail had:
=$ CREATE INDEX data_station_channel_idx ON data (station, channel);
After vacuuming, I ran the query twice, getting the same time on both runs: 136 seconds.
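(If you want to look at the plan yourself, the standard way would of course be:)
=$ EXPLAIN (ANALYZE, BUFFERS) SELECT station, array_agg(DISTINCT channel) AS channels FROM data GROUP BY station;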
Now, this is obviously bad. The fastest approach would be to precache this data using triggers. Since my table is static, I can simply:
=$ CREATE TABLE distinct_station_channel AS
    SELECT station, array_agg(DISTINCT channel) AS channels FROM data GROUP BY station;
=$ ALTER TABLE distinct_station_channel ADD PRIMARY KEY (station);
Afterwards, getting the data would be, of course, virtually instantaneous: less than 1ms.
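For example, fetching the channels for a single station is just a primary key lookup:
=$ SELECT channels FROM distinct_station_channel WHERE station = 1;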
With such a table (well, with one more column) we could then use triggers on the data table to keep it updated. I showed such triggers previously.
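For the record, a minimal sketch of the INSERT side of such a trigger could look like this (function and trigger names are made up for this example, and a real version would need the extra per-(station, channel) counter column to handle DELETEs sanely):
=$ CREATE FUNCTION add_station_channel() RETURNS trigger AS $$
BEGIN
    -- Add the new channel to this station's array, unless it's already there.
    INSERT INTO distinct_station_channel (station, channels)
        VALUES (NEW.station, ARRAY[ NEW.channel ])
        ON CONFLICT (station) DO UPDATE
            SET channels = distinct_station_channel.channels || excluded.channels
            WHERE NOT distinct_station_channel.channels @> excluded.channels;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
=$ CREATE TRIGGER add_station_channel AFTER INSERT ON data
    FOR EACH ROW EXECUTE FUNCTION add_station_channel();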
But, let's assume that this is a no-go. What else can we do?
We could do something called skip scan, or loose scan, or loose index scan. To do it, we will need to use recursive queries.
First, we start by getting the first station/channel combo, sorted by (station, channel):
=$ SELECT station, channel FROM data ORDER BY station, channel LIMIT 1;
 station | channel
---------+---------
       1 |       1
(1 row)
This query is very fast: ~ 1ms. Now that we have the first row, we can find the next one. Since the first pair is (1,1), the next one will have to be > (1,1):
=$ SELECT station, channel FROM data WHERE (station, channel) > (1, 1) ORDER BY station, channel LIMIT 1;
 station | channel
---------+---------
       1 |       5
(1 row)
Again, runtime is ~ 1ms.
Now we can build a recursive query that will give us all (station, channel) combinations:
=$ WITH RECURSIVE dis AS (
    ( SELECT station, channel FROM data ORDER BY station, channel LIMIT 1 )
    UNION ALL
    (
        SELECT d.station, d.channel
        FROM
            dis p,
            LATERAL (
                SELECT x.station, x.channel
                FROM data x
                WHERE (x.station, x.channel) > (p.station, p.channel)
                ORDER BY x.station, x.channel LIMIT 1
            ) d
    )
)
SELECT station, array_agg(channel) FROM dis GROUP BY station;
This query returns data in ~ 157ms. Obviously longer than getting the data from the precalculated table, but not too bad in comparison with the original ~ 2 minutes.
Query explanation:
- the first subselect (the one before UNION ALL) simply gets the first combo of station and channel
- the LATERAL subselect in the recursive part gets the next combo, using values from the previously calculated row
- the final SELECT simply groups channels into a single array per station; note that it doesn't need DISTINCT, since the recursive part returns every combination exactly once, already sorted
Simple, and works without any data changes.
Are there any issues with it? Well, for starters, it will only work well if there are very few distinct values in the data set. So, if you have 800 million rows, but 200 million distinct values – it will not help.
Also – you can't really use it with additional conditions based on range/inequality. So, I couldn't use it for checking what the distinct (station, channel) combinations were within a specific time frame. For this I'd need some other approach – for example, triggers that aggregate data into precomputed summaries for some time ranges – daily, or hourly.
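For example, a hypothetical query like this one couldn't benefit from the trick, because the ts condition doesn't compose with the recursive step's (station, channel) index condition:
=$ SELECT station, array_agg(DISTINCT channel) AS channels
    FROM data
    WHERE ts >= '2021-01-01' AND ts < '2021-02-01'
    GROUP BY station;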
But, for the problem that was reported on the mailing list – this solution will work great.
It seems bizarre that the query planner fails to come up with something similar to this approach – that the solution is essentially to figure out how to construct a query which, at the SQL level, expresses a particular low-level index scanning behaviour. I am forced to think of https://blog.nelhage.com/post/some-opinionated-sql-takes/
How does the following perform?
@Aristotle:
I no longer have the test table, and I'm unwilling to generate 800 million rows again.
But it will either do an index-only scan to get the data in order and then remove almost all rows, or a seq scan + sort, and then remove almost all rows.
It will *NOT* do a skip scan.
The obvious comment on this is that the data model is "less than optimal", and there should be two tables: 'sc_pairs' and 'sc_values', with a foreign key in 'sc_values' referencing the primary key of 'sc_pairs'; the result sought is then simply the contents of the 'sc_pairs' table. The index is effectively equivalent to the 'sc_pairs' table, retrofitted into the data model.
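Roughly, something like this (sketch):
CREATE TABLE sc_pairs (
    station int4 NOT NULL,
    channel int4 NOT NULL,
    PRIMARY KEY (station, channel)
);
CREATE TABLE sc_values (
    ts timestamptz,
    station int4 NOT NULL,
    channel int4 NOT NULL,
    value float8,
    FOREIGN KEY (station, channel) REFERENCES sc_pairs
);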
The recursive query is a way to effectively query the index without querying the underlying relation. I suspect that another way could be found by grouping by 'station, channel' and 'EXISTS', or 'COUNT(*) > 0'; the query is in effect "list all station, channel pairs for which there is at least one recorded value".
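I.e., something like (sketch):
SELECT station, channel
FROM data
GROUP BY station, channel
HAVING count(*) > 0;  -- the HAVING is actually redundant: GROUP BY only emits groups that have rows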
But the paying customer is always right :-).
«you can’t really use it with additional conditions based on range/inequality»
The main additional condition is that the table, or at least the index, must be entirely resident on low-latency media to get 157ms, because with "station – 170 distinct values" and "channel – generally 1-3 channels per station" there are likely 300-500 combinations to return, positioned randomly among 800 million records – unless we are lucky and all the station-channel combinations occur in the first few thousand records, as is most likely the case in the generated test data.