On 9th of November 2017, Robert Haas committed patch:
Add hash partitioning. Hash partitioning is useful when you want to partition a growing data set evenly. This can be useful to keep table sizes reasonable, which makes maintenance operations such as VACUUM faster, or to enable partition-wise join. At present, we still depend on constraint exclusion for partitioning pruning, and the shape of the partition constraints for hash partitioning is such that that doesn't work. Work is underway to fix that, which should both improve performance and make partitioning pruning work with hash partitioning. Amul Sul, reviewed and tested by Dilip Kumar, Ashutosh Bapat, Yugo Nagata, Rajkumar Raghuwanshi, Jesper Pedersen, and by me. A few final tweaks also by me. Discussion: http://postgr.es/m/CAAJ_b96fhpJAP=ALbETmeLk1Uni_GFZD938zgenhF49qgDTjaQ@mail.gmail.com
Some time ago, I wrote that Pg 10 will get native partitioning.
Back then we were limited to partitions by range and by list. These are the most commonly used, and they might even be the most useful. But there is also one more type of partition – by hash.
This can be used to divide rows (more or less) equally into multiple partitions. So, for example, we can have 1 billion rows in 100 partitions, and each will contain 10 million rows (more or less).
Let's see how this can be achieved now, in Pg 11. For my test let's assume we have users table with this base structure:
$ CREATE TABLE users ( username text PRIMARY KEY, password text, created_on timestamptz NOT NULL DEFAULT now(), id_admin bool NOT NULL DEFAULT FALSE );
Making it partitioned by hash (of username) could be done like this:
$ CREATE TABLE users ( username text NOT NULL, password text, created_on timestamptz NOT NULL DEFAULT now(), id_admin bool NOT NULL DEFAULT FALSE ) PARTITION BY HASH (username);
Partitions are then defined using 2 numbers:
- MODULUS
- REMAINDER
In the most simple case – MODULUS is number of partitions, and REMAINDER is number, 0 or more, but less than MODULUS.
For example, assuming we'd want to partition into eight partitions, I could:
$ CREATE TABLE users_p0 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 0); $ CREATE TABLE users_p1 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 1); $ CREATE TABLE users_p2 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 2); $ CREATE TABLE users_p3 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 3); $ CREATE TABLE users_p4 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 4); $ CREATE TABLE users_p5 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 5); $ CREATE TABLE users_p6 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 6); $ CREATE TABLE users_p7 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 7);
With this in place, I now have tables that look like this:
$ \d+ users TABLE "public.users" COLUMN | TYPE | Collation | NULLABLE | DEFAULT | Storage | Stats target | Description ------------+--------------------------+-----------+----------+---------+----------+--------------+------------- username | text | | NOT NULL | | extended | | password | text | | | | extended | | created_on | TIMESTAMP WITH TIME zone | | NOT NULL | now() | plain | | id_admin | BOOLEAN | | NOT NULL | FALSE | plain | | Partition KEY: HASH (username) Partitions: users_p0 FOR VALUES WITH (modulus 8, remainder 0), users_p1 FOR VALUES WITH (modulus 8, remainder 1), users_p2 FOR VALUES WITH (modulus 8, remainder 2), users_p3 FOR VALUES WITH (modulus 8, remainder 3), users_p4 FOR VALUES WITH (modulus 8, remainder 4), users_p5 FOR VALUES WITH (modulus 8, remainder 5), users_p6 FOR VALUES WITH (modulus 8, remainder 6), users_p7 FOR VALUES WITH (modulus 8, remainder 7) $ \d+ users_p1 TABLE "public.users_p1" COLUMN | TYPE | Collation | NULLABLE | DEFAULT | Storage | Stats target | Description ------------+--------------------------+-----------+----------+---------+----------+--------------+------------- username | text | | NOT NULL | | extended | | password | text | | | | extended | | created_on | TIMESTAMP WITH TIME zone | | NOT NULL | now() | plain | | id_admin | BOOLEAN | | NOT NULL | FALSE | plain | | Partition OF: users FOR VALUES WITH (modulus 8, remainder 1) Partition CONSTRAINT: satisfies_hash_partition('1161847'::oid, 8, 1, username) Indexes: "users_p1_pkey" PRIMARY KEY, btree (username)
So, let's insert some random users, and see what happens:
$ \copy users (username) FROM stdin; proffers babbles cents choose chalked redoubts pitting coddling relieves wooing codgers sinewy separate ferry crusty cursing hawkers deducted gaseous voyagers \.
All data is visible in users, of course:
$ SELECT * FROM users; username | password | created_on | id_admin ----------+----------+------------------------------+---------- coddling | | 2017-11-10 14:15:55.71057+01 | f voyagers | | 2017-11-10 14:15:55.71057+01 | f relieves | | 2017-11-10 14:15:55.71057+01 | f codgers | | 2017-11-10 14:15:55.71057+01 | f separate | | 2017-11-10 14:15:55.71057+01 | f hawkers | | 2017-11-10 14:15:55.71057+01 | f deducted | | 2017-11-10 14:15:55.71057+01 | f crusty | | 2017-11-10 14:15:55.71057+01 | f chalked | | 2017-11-10 14:15:55.71057+01 | f redoubts | | 2017-11-10 14:15:55.71057+01 | f sinewy | | 2017-11-10 14:15:55.71057+01 | f pitting | | 2017-11-10 14:15:55.71057+01 | f wooing | | 2017-11-10 14:15:55.71057+01 | f cents | | 2017-11-10 14:15:55.71057+01 | f cursing | | 2017-11-10 14:15:55.71057+01 | f gaseous | | 2017-11-10 14:15:55.71057+01 | f proffers | | 2017-11-10 14:15:55.71057+01 | f choose | | 2017-11-10 14:15:55.71057+01 | f ferry | | 2017-11-10 14:15:55.71057+01 | f babbles | | 2017-11-10 14:15:55.71057+01 | f (20 ROWS)
But we can also see that it is being distributed among partitions:
$ SELECT tableoid::regclass AS partition_name, COUNT(*) FROM users GROUP BY 1 ORDER BY 1; partition_name | COUNT ----------------+------- users_p0 | 2 users_p1 | 5 users_p2 | 1 users_p3 | 3 users_p4 | 2 users_p5 | 3 users_p6 | 3 users_p7 | 1 (8 ROWS)
Of course, with only 20 rows distribution is not perfect, but it works sanely.
Unfortunately, as of now, PostgreSQL can't prune partitions from execution, based on their values:
$ EXPLAIN analyze SELECT * FROM users WHERE username = 'redoubts'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Append (cost=0.15..65.34 ROWS=8 width=73) (actual TIME=0.030..0.049 ROWS=1 loops=1) -> INDEX Scan USING users_p0_pkey ON users_p0 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.010..0.010 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p1_pkey ON users_p1 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.004..0.004 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p2_pkey ON users_p2 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.007..0.007 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p3_pkey ON users_p3 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.007..0.008 ROWS=1 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p4_pkey ON users_p4 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.006..0.006 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p5_pkey ON users_p5 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.004..0.004 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p6_pkey ON users_p6 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.005..0.005 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) -> INDEX Scan USING users_p7_pkey ON users_p7 (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.003..0.003 ROWS=0 loops=1) INDEX Cond: (username = 'redoubts'::text) Planning TIME: 0.999 ms Execution TIME: 0.115 ms (19 ROWS)
but, from what I gather – there is work underway to make it work.
All things considered – it's great addition, one that will be definitely helpful in some cases, and it will just become better. Thanks a lot to all involved.
I don’t understand why the syntax allows each partition to specify its own modulus. I can’t see why you would ever want to mix different moduli, and in doing so surely you would end up with tuples that could exist in multiple different partitions?
Skimming through the mailing list it certainly looks like allowing different moduli is intentional, but I didn’t see any explanation for how that would work (or why it would be useful)