Some time ago I wrote a piece on speeding up the dump/restore process using a custom solution that parallelized the work.
Later on I wrote some tools (“fast dump and restore”) to do it in a more general way.
But all of them had a problem – to get a consistent dump, you need to stop all concurrent access to your database. Why, and how do we get rid of this limitation?
First why. It's actually pretty simple. Let's assume you have two tables:
CREATE TABLE users (id serial PRIMARY KEY, ...);
CREATE TABLE sessions (id serial PRIMARY KEY, user_id int4 REFERENCES users (id), ...);
Now, someone registers a new account and immediately logs in.
Registration (insert into users) and login (insert into sessions) happen in two separate transactions.
But a concurrently running parallel pg_dump could have started dumping “users" before the user was inserted (so the data dump will not have it), and “sessions" afterwards. So in the sessions data you will get a session for a user that doesn't exist in the users data.
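To make the race concrete, here is a hypothetical timeline, assuming the example tables above (with the “..." columns dropped) and a made-up id of 123:

-- Parallel dump, worker A: has already run "COPY users TO STDOUT;" at this point.

-- Application, transaction 1: registration
BEGIN;
INSERT INTO users DEFAULT VALUES RETURNING id;   -- say this returns 123
COMMIT;

-- Application, transaction 2: login
BEGIN;
INSERT INTO sessions (user_id) VALUES (123);
COMMIT;

-- Parallel dump, worker B: only now runs "COPY sessions TO STDOUT;"
-- Result: the sessions dump contains user_id = 123, but the users dump does not.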
In normal pg_dump this doesn't happen, because it runs all data fetches in a single transaction (in REPEATABLE READ isolation mode). But parallel dumps fetch each block of data over a separate connection – so each one is a separate transaction.
In 2011, Tom Lane committed a very interesting patch. I didn't write about it in the “Waiting for …" series, mostly because I assumed that there would soon be a patch adding parallelization to pg_dump, based on the one that Tom committed.
Apparently – we still don't have parallel pg_dump. But we do have the ability to write it ourselves.
How? That's simple. First, before anything else happens, we start the first transaction, like this:
$ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN

*$ SELECT pg_export_snapshot();
 pg_export_snapshot
--------------------
 00000AAF-1
(1 row)
Of course the value returned by pg_export_snapshot() will be different, but that's not important.
Now, knowing this snapshot number/id, we start a number of connections that will be used for our dumping work, and in each of them, we need to run:
$ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN

*$ SET TRANSACTION SNAPSHOT '00000AAF-1';
SET
It is important that “SET TRANSACTION SNAPSHOT" is run as the first command in the transaction.
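If you run any other query first, Pg will refuse to import the snapshot. You should get an error along these lines (the snapshot id is, again, just the example one):

$ BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN

*$ SELECT 1;
 ?column?
----------
        1
(1 row)

*$ SET TRANSACTION SNAPSHOT '00000AAF-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query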
Thanks to this, our new, just-started transaction will have the same data visibility as the older, previously started one. This provides us with many backends, all sharing the same view of the data.
The original transaction cannot be closed until all the work that needs to be done using the given snapshot is finished. But – it can be used to run data dumps just as well as any other connection. It just has to be closed last.
So the idea is that to make a data dump in parallel, you'd:
- start single connection (let's call it master)
- start new transaction in master connection, and export snapshot
- start any number of connections (workers), and in each of them – start a transaction, and switch to the master's transaction snapshot
- run the appropriate commands (basically COPY or SELECT) in all connections (see the example after this list)
- once all work is done, close worker transactions and connections
- close master transaction and connection
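For example, a single worker, reusing the example snapshot id and the two example tables from above, could simply run:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000AAF-1';
COPY users TO STDOUT;
COPY sessions TO STDOUT;
COMMIT;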
Generated dump will be consistent.
Of course you will also need to dump the schema, and find a way to load the data into some new database, but that's irrelevant for now – I have described methods for this before, and even wrote tools for it.
All in all – while we still don't have parallel pg_dump, we now have the ability to run a parallel data copy that will be “as good".
A very simplistic “pg_dump" using snapshot cloning can look like this:
#!/usr/bin/env bash

# configuration
export PGPORT=5920
export PGDATABASE=rt4

# Temporary directory for small helper files
tmp_dir="$( mktemp -d )"
trap 'rm -rf "$tmp_dir"' EXIT

# Run master psql
exec 50> >( exec psql -qAtX )
master_pid="$!"

# Start transaction
printf 'BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;\n' >&50

# Get transaction snapshot id
printf '\o %s/snapshot.id\n' "$tmp_dir" >&50
printf 'SELECT pg_export_snapshot();\n' >&50

# Get list of tables to dump
printf '\o %s/tables.list\n' "$tmp_dir" >&50
printf 'SELECT c.oid::regclass FROM pg_class c join pg_namespace n on c.relnamespace = n.oid WHERE c.relkind = $$r$$ and n.nspname !~ $$^(pg_|information_schema)$$ ORDER BY pg_table_size(c.oid) desc;\n' >&50

# Create file that marks that all is done in master
printf '\o %s/marker.file\n' "$tmp_dir" >&50
printf 'SELECT 1;\n' >&50
printf '\o\n' >&50

# Wait for marker file to appear
while true
do
    if [[ -s "$tmp_dir/marker.file" ]]
    then
        break
    fi
    sleep 0.1
done

# Get snapshot id to variable
snapshot_id="$( < "$tmp_dir/snapshot.id" )"

# Dump each table separately
while read table_name
do
    printf "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;\nSET TRANSACTION SNAPSHOT '%s';\nCOPY %s TO STDOUT;\n" "$snapshot_id" "$table_name" | psql -qAtX > "$table_name.dump" &
done < "$tmp_dir/tables.list"

# wait for all dumps to finish
wait

echo "All done."
It is not very smart – it dumps every table separately, but doesn't split tables into many parts, so if you have one table that is 95% of your database, the speedup will be rather small. But still – it's just a proof of concept.
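If you did want to split a huge table into parts, the same trick works. This is just a sketch, assuming a hypothetical big_table with an integer id primary key: each worker gets its own range of ids, but all of them import the same snapshot:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000AAF-1';
COPY ( SELECT * FROM big_table WHERE id BETWEEN 1 AND 1000000 ) TO STDOUT;
COMMIT;

Another worker would take id BETWEEN 1000001 AND 2000000, and so on. Since all workers see exactly the same data, concatenating their outputs yields a consistent copy of the whole table.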
The ability to set a transaction snapshot might also be of use in trying to parallelize *queries*, not just dumps. The original session could split the query into pieces and launch child sessions to process the pieces, all from the same point in time, and combine the results.
Hopefully, someone smarter than I is working on that.
@RobJ: that’s a really cool idea. I don’t think I can write a general-purpose splitter/combiner, but a special-case function/proof-of-concept should be simple to write.
I’m inviting you to try out my patch of a parallel pg_dump and post some comments and statistics about how it works out for you, see the latest patch at
http://www.postgresql.org/message-id/CACw0+10wAM_L6iC8KKL5eGzUtVjSpum1MEhaggWx5kU4VYn-QA@mail.gmail.com
The patch does exactly what you’re describing and can be used out-of-the-box.
a LOT of cool ideas…
@RobJ: So I gave it some thought, and I can’t find a way to do it in the db itself. The problem is that to do it, I would have to have a way to run several things in parallel – on remote connections.
So – to do it you’d have to work it out in the application. It’s possible, and not complicated, but it’s virtually the same as the dumping script I showed above, so there is not much point in doing it.
Alternatively – one could work it out with PL/Proxy ( https://www.depesz.com/2011/12/02/the-secret-ingredient-in-the-webscale-sauce/ ), but it might require modifying the PL/Proxy source.
@Joachim: Thanks for the link. I can’t really test it, because the most interesting (large) databases are on 9.1, but it’s good to know that the patch exists and is easily accessible.
@Depesz: The patch has a --no-synchronized-snapshots argument so you can run it against older databases without synchronized snapshots.
On a side note, when using synchronized snapshots, once you have done “SET TRANSACTION SNAPSHOT” in a transaction, there is no need to hold the original transaction open anymore. At that time, both transactions are equal wrt their snapshots.
Bruce Momjian did something like this for pg_upgrade: http://momjian.us/main/blogs/pgblog/2013.html#January_24_2013
Hello Depesz,
Maybe it’s a stupid question, but why doesn’t pg_dump have “--jobs” like pg_restore?
@Mrg:
Because dumping in parallel is more complicated, and the necessary machinery (snapshot cloning) was only added in 9.2. And simply no one wrote it. That is – Joachim wrote it (check the comments above), but for some reason it wasn’t pulled into the official pg sources.