I missed it completely, but on 24th of March 2017, Alvaro Herrera committed patch:
Implement multivariate n-distinct coefficients

Add support for explicitly declared statistic objects (CREATE STATISTICS), allowing collection of statistics on more complex combinations that individual table columns. Companion commands DROP STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are added too. All this DDL has been designed so that more statistic types can be added later on, such as multivariate most-common-values and multivariate histograms between columns of a single table, leaving room for permitting columns on multiple tables, too, as well as expressions.

This commit only adds support for collection of n-distinct coefficient on user-specified sets of columns in a single table. This is useful to estimate number of distinct groups in GROUP BY and DISTINCT clauses; estimation errors there can cause over-allocation of memory in hashed aggregates, for instance, so it's a worthwhile problem to solve. A new special pseudo-type pg_ndistinct is used.

(num-distinct estimation was deemed sufficiently useful by itself that this is worthwhile even if no further statistic types are added immediately; so much so that another version of essentially the same functionality was submitted by Kyotaro Horiguchi: https://postgr.es/m/.173334..horiguchi.kyotaro@lab.ntt.co.jp though this commit does not use that code.)

Author: Tomas Vondra. Some code rework by Álvaro.
Ideriha Takeshi
Discussion: https://postgr.es/m/.4080608@fuzzy.cz
https://postgr.es/m/.ixlaueanxegqd5gr@alvherre.pgsql
Afterwards, there were a couple more commits related to it:
- On 5th of April 2017, patch committed by Simon Riggs
- On 17th of April 2017, patch committed by Alvaro Herrera
- On 12th of May 2017, patch committed by Alvaro Herrera
So what is this about? The phrase from the first line, “Implement multivariate n-distinct coefficients”, seems pretty opaque.
In reality it's rather simple. As you know, PostgreSQL decides how to execute a query (that is, how to actually get the results) based on estimates of how many rows will match various parts of the query.
So far PostgreSQL kept statistics only on single columns, and when it had to estimate a condition spanning two columns, it used some black magic (well, math, but it's pretty similar to magic): it assumed the columns are independent and simply multiplied the individual selectivities.
You can see the problem using a table like this:
$ CREATE TABLE test AS
    SELECT i, i % 100 AS i_100, i % 500 AS i_500
    FROM generate_series(1,100000) i;
SELECT 100000
The data looks like this:
$ SELECT * FROM test ORDER BY random() LIMIT 20;
   i   | i_100 | i_500
-------+-------+-------
 21153 |    53 |   153
 59239 |    39 |   239
 64049 |    49 |    49
 33313 |    13 |   313
 25366 |    66 |   366
 31794 |    94 |   294
 39991 |    91 |   491
 49162 |    62 |   162
 71951 |    51 |   451
 82639 |    39 |   139
 61799 |    99 |   299
  5174 |    74 |   174
 13963 |    63 |   463
 31412 |    12 |   412
  4685 |    85 |   185
 35444 |    44 |   444
 45713 |    13 |   213
 13991 |    91 |   491
 15509 |     9 |     9
 99022 |    22 |    22
(20 rows)
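By the way, you can check what the planner knows about each column on its own in the pg_stats view (once the table has been analyzed, which autovacuum will do on its own eventually). For this data I'd expect n_distinct to come out close to 100 and 500:

$ SELECT attname, n_distinct
    FROM pg_stats
    WHERE tablename = 'test'
    ORDER BY attname;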
Values seem pretty obvious: every value of i_100 shows up 1000 times, and every value of i_500 shows up 200 times. And Pg can estimate row counts pretty well in simple cases:
$ EXPLAIN ANALYZE SELECT * FROM test WHERE i_100 = 34;
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..1791.00 rows=1010 width=12) (actual time=0.035..23.716 rows=1000 loops=1)
   Filter: (i_100 = 34)
   Rows Removed by Filter: 99000
 Planning time: 0.422 ms
 Execution time: 23.879 ms
(5 rows)

$ EXPLAIN ANALYZE SELECT * FROM test WHERE i_500 = 78;
                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..1791.00 rows=200 width=12) (actual time=0.022..11.706 rows=200 loops=1)
   Filter: (i_500 = 78)
   Rows Removed by Filter: 99800
 Planning time: 0.057 ms
 Execution time: 11.737 ms
(5 rows)
Please note that the estimated “rows=” and the actual “rows=” (in the “actual” part of the explain) are very close to each other.
But what will happen if I try to filter on two columns at once?
$ EXPLAIN ANALYZE SELECT * FROM test WHERE i_100 = 91 AND i_500 = 491;
                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..2041.00 rows=2 width=12) (actual time=0.172..24.154 rows=200 loops=1)
   Filter: ((i_100 = 91) AND (i_500 = 491))
   Rows Removed by Filter: 99800
 Planning time: 0.453 ms
 Execution time: 24.233 ms
(5 rows)
The estimate is really off: Pg thinks there will be two rows, but there are 200 of them.
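Where does the 2 come from? From the independence assumption: Pg multiplies the selectivities of the two conditions, so it expects roughly 100000 × (1/100) × (1/500) = 2 rows. You can redo the planner's arithmetic yourself:

$ SELECT ((SELECT count(*) FROM test WHERE i_100 = 91)
        * (SELECT count(*) FROM test WHERE i_500 = 491))::numeric
       / 100000;
-- 1000 * 200 / 100000 = 2, which is exactly the planner's estimate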
It can also be thrown off by values that simply can't appear together:
$ EXPLAIN ANALYZE SELECT * FROM test WHERE i_100 = 91 AND i_500 = 17;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..2041.00 rows=2 width=12) (actual time=24.990..24.990 rows=0 loops=1)
   Filter: ((i_100 = 91) AND (i_500 = 17))
   Rows Removed by Filter: 100000
 Planning time: 0.458 ms
 Execution time: 25.045 ms
(5 rows)
The difference (2 vs. 0) might not seem like much, but it would be so much better if we could get proper estimates in both cases.
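As a side note, the reason such impossible combinations exist at all is that i_100 is fully determined by i_500: since 100 divides 500, i % 100 always equals (i % 500) % 100. A quick sanity check:

$ SELECT bool_and(i_500 % 100 = i_100) FROM test;
-- returns true: i_500 = 17 can only ever go together with i_100 = 17, never 91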
Well, now, in Pg 10, there will be a chance:
$ CREATE STATISTICS test_stats ON i_100, i_500 FROM test;

$ ANALYZE test;
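If you're curious what actually got collected, in PG 10 the serialized statistics live directly in the pg_statistic_ext catalog (in later versions they were moved to a separate pg_statistic_ext_data catalog), so you can peek at them like this:

$ SELECT stxname, stxkind, stxndistinct, stxdependencies
    FROM pg_statistic_ext
    WHERE stxname = 'test_stats';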
And now the “bad” queries:
$ EXPLAIN ANALYZE SELECT * FROM test WHERE i_100 = 91 AND i_500 = 491;
                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..2041.00 rows=199 width=12) (actual time=0.141..25.531 rows=200 loops=1)
   Filter: ((i_100 = 91) AND (i_500 = 491))
   Rows Removed by Filter: 99800
 Planning time: 0.545 ms
 Execution time: 25.617 ms
(5 rows)

$ EXPLAIN ANALYZE SELECT * FROM test WHERE i_100 = 91 AND i_500 = 17;
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..2041.00 rows=199 width=12) (actual time=11.803..11.803 rows=0 loops=1)
   Filter: ((i_100 = 91) AND (i_500 = 17))
   Rows Removed by Filter: 100000
 Planning time: 0.081 ms
 Execution time: 11.825 ms
(5 rows)
The first query shows a clearly improved estimate. The second, not so much. Why? To be honest, no idea. If someone can enlighten me, it would be really great.

In any case, these are just the first steps towards proper multi-column statistics, so I'm very happy about it, and enthusiastic about the future, even if some bits of the functionality don't (yet) work as I expected. It's a long road, but I'm very grateful that we started down it. Thanks a lot to all involved.
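PS. Since the commit message says n-distinct coefficients are mainly meant to improve estimates of the number of groups in GROUP BY and DISTINCT, that's worth a test too. My expectation (not measured output): without the statistics object the planner multiplies the per-column distinct counts and assumes roughly 100 × 500 = 50000 groups, while with it the estimate should land near the true 500:

$ EXPLAIN SELECT i_100, i_500 FROM test GROUP BY i_100, i_500;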
The multivariate feature in PG 10 only records a single integer indicating the correlation of the columns — it doesn’t record the probability of specific column combinations. That feature is hopefully coming in PG 11.
Your second-to-last query hit a combination that actually occurs (491 mod 100 = 91, matching i_100 = 91), while your last query has no matching rows (17 mod 100 = 17, which never goes together with i_100 = 91).
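For completeness, the companion DDL from the commit message works as you'd expect; the new name here is just an example:

$ ALTER STATISTICS test_stats RENAME TO test_stats_multi;
$ DROP STATISTICS test_stats_multi;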