Recently, on IRC, there have been some discussions with people using various pg_dump/pg_dumpall invocations to get dumps of their databases.
I voiced my ideas, but figured it could be a good subject for a blog post.
First things first – we have two tools available:
- pg_dumpall
- pg_dump
pg_dumpall dumps all databases in a given PostgreSQL installation (cluster), and writes everything to a single plain text file. Additionally, it dumps global objects – roles and tablespaces – which cannot be dumped by pg_dump.
The major benefit of pg_dumpall is that it's a single command, and you get results.
There is a huge number of drawbacks, though:
- the dump is large, because it's uncompressed
- dumping is slow, because it's done sequentially, by a single worker
- it's hard to restore just parts of the dump
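For reference, making such a dump is a single command (the output file name here is just an example):
=$ pg_dumpall -f dump.file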
To load such a dump, one would generally run:
=$ psql -f dump.file
or, better yet:
=$ psql -f dump.file -v ON_ERROR_STOP=1
so that it stops immediately after the first error – making tracking errors down easier.
pg_dump, on the other hand, can't dump globals, and can dump only one database at a time. But it can use four dump formats:
- plain
- custom
- directory
- tar
Plain is just a plain text format, like what pg_dumpall produces. You can load it with psql, and extracting parts can be complicated if the dump is large.
All other formats (custom, directory, and tar) are restored using the pg_restore program.
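A minimal restore, as a sketch – assuming the target database doesn't exist yet, -C together with -d postgres makes pg_restore create it first and then restore into it:
=$ pg_restore -C -d postgres dump-c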
To see the differences more clearly, let me make dumps of a database using all the formats:
=$ for a in p c t d
do
    echo "Format: $a"
    time pg_dump -F $a -C -f dump-$a depesz_explain
done
Format: p

real    0m17.604s
user    0m1.102s
sys     0m4.646s
Format: c

real    1m47.669s
user    1m45.056s
sys     0m1.641s
Format: t

real    0m22.308s
user    0m1.381s
sys     0m8.648s
Format: d

real    1m50.406s
user    1m47.875s
sys     0m1.473s
These times don't look good, but it's mostly the cost of compression. The plain dump used 7.2GB (just like tar), but dir and custom used only 970MB each.
When I removed compression from the custom and dir formats, using the -Z0 option, I got much more similar times: -Fc finished in 18.442s, and -Fd in 18.732s.
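For completeness, those uncompressed dumps were made along these lines (the output file names are mine):
=$ time pg_dump -F c -Z 0 -C -f dump-c-z0 depesz_explain
=$ time pg_dump -F d -Z 0 -C -f dump-d-z0 depesz_explain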
There are benefits to these formats, though.
All of them (except plain) can generate a listing of the dump's contents, and then restore just parts of the dump.
For example:
=$ pg_restore -l dump-c | head -n 20
;
; Archive created at 2019-12-10 18:20:59 CET
;     dbname: depesz_explain
;     TOC Entries: 723
;     Compression: -1
;     Dump Version: 1.14-0
;     Format: CUSTOM
;     Integer: 4 bytes
;     Offset: 8 bytes
;     Dumped from database version: 13devel
;     Dumped by pg_dump version: 13devel
;
;
; Selected TOC Entries:
;
8; 2615 370778 SCHEMA - plans depesz_explain
592; 1247 370781 TYPE public register_plan_return depesz_explain
257; 1255 370782 FUNCTION public get_random_string(integer) depesz_explain
258; 1255 370783 FUNCTION public register_plan(text, text, boolean, boolean, text) depesz_explain
259; 1255 370784 FUNCTION public register_plan(text, text, boolean, boolean, text, text) depesz_explain
You can capture the -l output to a file, edit it, and then pg_restore will restore only the elements that you listed.
For example:
=$ pg_restore -l dump-c | grep get_random_string > edited.list

=$ cat edited.list
257; 1255 370782 FUNCTION public get_random_string(integer) depesz_explain

=$ pg_restore -L edited.list -f partial.restore dump-c

=$ cat partial.restore
--
-- PostgreSQL database dump
--

-- Dumped from database version 13devel
-- Dumped by pg_dump version 13devel

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

--
-- Name: get_random_string(integer); Type: FUNCTION; Schema: public; Owner: depesz_explain
--

CREATE FUNCTION public.get_random_string(string_length integer) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
    possible_chars TEXT = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    output TEXT = '';
    i INT4;
    pos INT4;
BEGIN
    FOR i IN 1..string_length LOOP
        pos := 1 + cast( random() * ( length(possible_chars) - 1) as INT4 );
        output := output || substr(possible_chars, pos, 1);
    END LOOP;
    RETURN output;
END;
$$;

ALTER FUNCTION public.get_random_string(string_length integer) OWNER TO depesz_explain;

--
-- PostgreSQL database dump complete
--
What's more – with custom/dir dumps, we can load them using multiple parallel workers.
For example:
=$ dropdb --force depesz_explain ; time psql -qAtX -v ON_ERROR_STOP=1 -f dump-p

real    2m13.950s
user    0m2.817s
sys     0m2.537s
While loading the custom dump with 8 parallel workers:
=$ dropdb --force depesz_explain ; time pg_restore -j 8 -C -d postgres dump-c

real    0m35.152s
user    0m21.316s
sys     0m1.788s
Time for a parallel load of the dir dump was the same.
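That is, the equivalent invocation, just pointed at the directory dump:
=$ dropdb --force depesz_explain ; time pg_restore -j 8 -C -d postgres dump-d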
So, finally – there is one BIG difference in favor of the dir format – we can dump databases in parallel. For example:
=$ time pg_dump -F d -j 8 -C -f dump-j8-dir depesz_explain

real    0m24.928s
user    2m30.755s
sys     0m2.125s
24 seconds is only 7 seconds more than the plain text dump, but the dump is smaller (~970MB), we can restore it in parallel, and we can restore just parts of it.
If speed is of utmost importance:
=$ time pg_dump -F d -j 8 -C -Z 0 -f dump-j8-z0-dir depesz_explain

real    0m8.090s
user    0m1.849s
sys     0m7.780s
So, to sum it up nicely:
| Dump format          | plain | custom | tar | dir |
|----------------------|-------|--------|-----|-----|
| Dump in parallel?    | ✗     | ✗      | ✗   | ✓   |
| Restore in parallel? | ✗     | ✓      | ✗   | ✓   |
| Partial restore?     | ✗     | ✓      | ✓   | ✓   |
So, how can we use this knowledge to dump all databases, and globals?
I'll reuse the idea from an earlier blog post. That script, updated with this new knowledge, becomes:
#!/usr/bin/env bash

# Unofficial Bash Strict Mode
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail
IFS=$'\n\t'
# End of Unofficial Bash Strict Mode

# config
top_level_backup_dir="/var/tmp/backups-pg"
backup_keep_days="3"
concurrent_dumps="2"
dump_workers="5"
# config

cd "${top_level_backup_dir}"
backup_dir="$( date '+%Y-%m-%d' )"
mkdir "${backup_dir}"
cd "$backup_dir"

# Make actual backup files
# First globals
pg_dumpall -r -f roles.dump
pg_dumpall -t -f tablespaces.dump

# And now per-database dumps
psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs -d '\n' -P "${concurrent_dumps}" -I % pg_dump -F d -C -j "${dump_workers}" -f pg-%.dump %

# Add marker so that we'd know if the backup has finished correctly
touch backup.done

# Retention policy
cd "$top_level_backup_dir"
cutoff_date="$( date -d "${backup_keep_days} days ago" '+%Y-%m-%d' )"

# Iterate over all backups
for backup in [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
do
    # We need to remove - characters from dates, because otherwise we'd get:
    # >> 2019-12-08: value too great for base (error token is "08")
    [[ "${backup//-/}" -ge "${cutoff_date//-/}" ]] && continue

    # Backup is not within backup_keep_days, remove it.
    rm -rf "${backup}"
done

# All done.
exit 0
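To run it unattended, a crontab entry along these lines would do (the script path is hypothetical, adjust to where you save it):
# Run full backup every night at 2am (example path)
0 2 * * * /usr/local/bin/pg-backup.sh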
And that's all there is to it.
Hope you'll find it useful.
What about https://pgloader.io/ ?
AFAICR pgloader is a migration tool from other sources (databases and files) to PostgreSQL; it does do fast loads into PostgreSQL, however. In my situation, in the healthcare sector, it’s almost impossible to get approval to use external tools, so pg_dump/pg_dumpall and pg_restore are all we can use.
Thanks for the great post, it explains the usage perfectly.
Here, most likely, a typo was made:
> for backup in [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]
maybe something like:
for backup in $(ls -d [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
?
@grufos:
no, definitely not. NEVER do things like for … in $( ls … ). This is error-prone and will lead to problems sooner or later. Check the Bash Pitfalls page: https://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_.2A.mp3.29 or, if it doesn’t load for you: https://web.archive.org/web/20200109085120/https://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_.2A.mp3.29
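A quick illustration of why (hypothetical directory name, in an otherwise empty directory; with the default IFS, word splitting breaks on the space):
=$ mkdir '2019-12-08 old'
=$ for f in $(ls -d *) ; do echo "[$f]" ; done
[2019-12-08]
[old]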
i work on macos, bsd’s, and linux. i prefer scripts that run everywhere and this script is sadly linux-centric. i always use `#!/bin/sh` and no bashisms as bash (contrary to popular belief) is not installed everywhere.
`date -d` is not posix. why not use `find` and stop with the in-shell date manipulation that is fragile and error prone? something like:
`find ${top_level_backup_dir} -type d -mtime +${backup_keep_days} -delete -maxdepth 1`
isn’t `exit 0` superfluous with `set -e`?
i was wondering why you specify `-C` for pg_dump for `format=d`. according to the manual:
“This option is only meaningful for the plain-text format. For the archive formats, you can specify the option when you call pg_restore.”
@minusf:
that’s certainly your prerogative. I find posix sh too limited, on SO MANY levels.
Based on this – I don’t really care about date -d not being posix.
And your find command is dangerous. Consider someone putting a file/dir in the backup dir, by mistake, or to “just let it be there for a moment”.
@minusf:
(re -C) well, cargo cult is alive and well 🙂
actually i forgot to add the last bit to the find command:
`-name "[0-9]*-[0-9]*-[0-9]*"`
(which makes it exactly as dangerous as your script).