On 15th of May, Simon Riggs committed patch:
TABLESAMPLE, SQL Standard and extensible

Add a TABLESAMPLE clause to SELECT statements that allows
user to specify random BERNOULLI sampling or block level
SYSTEM sampling. Implementation allows for extensible
sampling functions to be written, using a standard API.
Basic version follows SQLStandard exactly. Usable
concrete use cases for the sampling API follow in later
commits.

Petr Jelinek

Reviewed by Michael Paquier and Simon Riggs
Getting a random sample of a table looks potentially interesting, but how does it work?
Let's make a table with some random data:
CREATE TABLE test (
    id serial PRIMARY KEY,
    some_timestamp timestamptz,
    some_text text
);
CREATE TABLE

INSERT INTO test (some_timestamp, some_text)
    SELECT now() - random() * '1 year'::INTERVAL,
           'depesz #' || i
    FROM generate_series(1,100000) i;
INSERT 0 100000
The table is around 6MB:
                      List of relations
 Schema | Name | Type  | Owner  |  Size   | Description
--------+------+-------+--------+---------+-------------
 public | test | table | depesz | 5920 kB |
(1 row)
TABLESAMPLE has two built-in methods: SYSTEM and BERNOULLI.
Before we go any further, we need to know how large the table is, in pages:
SELECT relpages FROM pg_class WHERE relname = 'test';

 relpages
----------
      736
(1 row)
OK. So, we have 736 pages and 100,000 rows, which means that, on average, a single page holds ~136 rows.
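As a sanity check, we can compute the same average straight from the statistics in pg_class (reltuples is the planner's row-count estimate, so this assumes the statistics are reasonably fresh):

-- ~100000 / 736, so this should print something close to 136,
-- assuming autovacuum/ANALYZE has updated the statistics.
SELECT reltuples::numeric / relpages AS avg_rows_per_page
FROM pg_class
WHERE relname = 'test';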
Let's say we'd like to get just 10 rows. 10 rows out of 100,000 means we want 0.0001 of the table, i.e. 0.01%, and TABLESAMPLE takes its argument as a percentage, so:
EXPLAIN analyze SELECT * FROM test TABLESAMPLE SYSTEM ( 0.01 );

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.08 rows=8 width=44) (actual time=0.016..0.021 rows=136 loops=1)
 Planning time: 0.102 ms
 Execution time: 0.045 ms
(3 rows)
That's too much – 136 rows instead of 10. Why is it so?
Well, the SYSTEM sampling method randomly picks whole pages, and returns all rows from each picked page. This means it will be fast: pick a random page number (1-736 in our case), load that 8kB page, and return all rows from it.
But it can't return less data than a single page's worth of rows.
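We can see this page-level behavior ourselves using the ctid system column (nothing specific to the new feature; this is just a sketch, and the text/point cast is merely a trick to pull out the page number). With system(1), the row count should be roughly 136 times the number of distinct pages:

-- ctid identifies a row as (page, row-within-page); casting through
-- text to point is just a trick to extract the page number.
SELECT count(DISTINCT (ctid::text::point)[0]) AS distinct_pages,
       count(*) AS sample_rows
FROM test TABLESAMPLE SYSTEM ( 1 );

Every sampled page contributes all of its rows, so the row count moves in jumps of whole pages.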
Of course, we can then apply secondary randomization:
EXPLAIN analyze
WITH x AS (SELECT * FROM test TABLESAMPLE SYSTEM ( 0.01 ))
SELECT * FROM x ORDER BY random() LIMIT 10;

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.39..0.41 rows=8 width=44) (actual time=0.088..0.090 rows=10 loops=1)
   CTE x
     ->  Sample Scan (system) on test  (cost=0.00..0.08 rows=8 width=44) (actual time=0.015..0.027 rows=136 loops=1)
   ->  Sort  (cost=0.31..0.33 rows=8 width=44) (actual time=0.087..0.087 rows=10 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on x  (cost=0.00..0.19 rows=8 width=44) (actual time=0.017..0.048 rows=136 loops=1)
 Planning time: 0.136 ms
 Execution time: 0.115 ms
(9 rows)
Usually, using "ORDER BY random()" is slow, but here we're ordering only 136 rows, so it's fast enough.
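Side note: each run of these queries returns a different sample. The SQL standard also defines a REPEATABLE clause for pinning the seed, which (as far as I can tell) this patch implements, so the same seed gives the same sample as long as the table doesn't change:

-- Same seed => same sample (assuming the table does not change between runs).
SELECT count(*) FROM test TABLESAMPLE SYSTEM ( 0.01 ) REPEATABLE ( 42 );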
There is a second method, BERNOULLI, which can return a smaller number of rows:
EXPLAIN analyze SELECT * FROM test TABLESAMPLE BERNOULLI ( 0.01 );

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on test  (cost=0.00..736.08 rows=8 width=44) (actual time=0.465..2.742 rows=10 loops=1)
 Planning time: 0.107 ms
 Execution time: 2.758 ms
(3 rows)
Looks great – the number of rows is what I wanted (it will not always be exactly the given percentage, as it's random). But notice what happens when I add more data:
INSERT INTO test (some_timestamp, some_text)
    SELECT now() - random() * '1 year'::INTERVAL,
           'depesz #' || i
    FROM generate_series(100001, 1000000) i;
INSERT 0 900000
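If we repeat the earlier page-count check (running ANALYZE first so the statistics are fresh), relpages should now show roughly ten times the earlier 736:

-- Refresh statistics, then re-check the page count.
ANALYZE test;
SELECT relpages FROM pg_class WHERE relname = 'test';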
The table is now roughly 10 times larger. The timings:
EXPLAIN analyze SELECT * FROM test TABLESAMPLE SYSTEM ( 0.001 );

                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.10 rows=10 width=25) (actual time=0.029..0.042 rows=136 loops=1)
 Planning time: 0.125 ms
 Execution time: 0.061 ms
(3 rows)

EXPLAIN analyze SELECT * FROM test TABLESAMPLE BERNOULLI ( 0.001 );

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on test  (cost=0.00..7353.10 rows=10 width=25) (actual time=2.779..27.288 rows=15 loops=1)
 Planning time: 0.112 ms
 Execution time: 27.312 ms
(3 rows)
The time for the SYSTEM tablesample is more or less the same, but in the case of BERNOULLI it's 10x longer. Why? Basically, BERNOULLI has to seq scan the whole table, and pick rows using some math so that it returns more or less the number of rows we asked for.
This means that while it is more precise, and gives every row exactly the same chance of being picked, it is slower.
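To build intuition: BERNOULLI behaves as if it tossed a biased coin for every single row. A conceptual SQL equivalent of bernoulli(0.001) (this is just an illustration, not what the sampling code actually does internally) would be:

-- Visit every row; keep each one with probability 0.001%
-- (0.00001 as a fraction). The full-table visit is what makes
-- it scale with table size.
SELECT * FROM test WHERE random() < 0.00001;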
In most cases, I think, SYSTEM sampling will be best, but it has to be understood that it works at the page level.
You also have to remember that TABLESAMPLE is applied before any WHERE conditions, so this query:
EXPLAIN analyze SELECT * FROM test TABLESAMPLE SYSTEM ( 1 ) WHERE id < 10;
will usually not return any rows – it will first pick 1% of the pages, and only then filter them by id < 10; we would have to be lucky enough to have page number 1 among the randomly picked pages for the filter to find any matching rows. All in all, I think it is useful. Thanks.