On 6th of January 2021, Tomas Vondra committed patch:
Report progress of COPY commands This commit introduces a view pg_stat_progress_copy, reporting progress of COPY commands. This allows rough estimates how far a running COPY progressed, with the caveat that the total number of bytes may not be available in some cases (e.g. when the input comes from the client). Author: Josef Šimánek Reviewed-by: Fujii Masao, Bharath Rupireddy, Vignesh C, Matthias van de Meent Discussion: https://postgr.es/m/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com Discussion: https://postgr.es/m/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg+FEKet4XaTGSW=LitKQ@mail.gmail.com
Loading huge datasets takes time, and it can be frustrating, since you don't know what is your current situation – whether you're at 1% of 99%.
Now, we will get some insight.
Let's test it. The simplest test will be to export some data from DB – preferably one that I can generate at will, and not store in DB itself.
So let's start with this query:
$ SELECT i, now() - random() * '2 years'::INTERVAL, random() FROM generate_series(1,5) i; i | ?COLUMN? | random -----+-------------------------------+---------------------- 1 | 2020-10-26 02:22:33.570694+01 | 0.7886766259618767 2 | 2020-08-09 11:51:18.114694+02 | 0.9346193980623099 3 | 2020-10-27 11:56:38.313094+01 | 0.7523086089557793 4 | 2019-07-11 02:27:29.749894+02 | 0.0374716722469941 5 | 2020-05-25 00:51:41.039494+02 | 0.857011143252528 (5 ROWS)
This generated 5 rows with some values. The good thing is that I can scale it to virtually any number of rows without using any disk space.
Now, let's put it within COPY, sending output to /dev/null to avoid filling disk space, and see the progress.
To see progress, in separate psql session I ran:
$ SELECT * FROM pg_stat_progress_copy \watch 1
Which, in the beginning was showing:
Tue 12 Jan 2021 11:19:39 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -----+-------+---------+-------+-----------------+-------------+----------------- (0 ROWS) Tue 12 Jan 2021 11:19:40 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -----+-------+---------+-------+-----------------+-------------+----------------- (0 ROWS)
No rows, since no COPY operation was in progress.
Now, with this in place, I ran larger COPY:
$ \copy ( SELECT i, now() - random() * '2 years'::INTERVAL, random() FROM generate_series(1,10000000) i ) TO /dev/NULL;
While this was running (~ 7 seconds), my watcher psql showed:
Tue 12 Jan 2021 11:19:41 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 33733834 | 0 | 601473 (1 ROW) Tue 12 Jan 2021 11:19:42 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 121581834 | 0 | 2142388 (1 ROW) Tue 12 Jan 2021 11:19:43 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 210759763 | 0 | 3699553 (1 ROW) Tue 12 Jan 2021 11:19:44 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 300090238 | 0 | 5259338 (1 ROW) Tue 12 Jan 2021 11:19:45 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 388599315 | 0 | 6804802 (1 ROW) Tue 12 Jan 2021 11:19:46 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 477635774 | 0 | 8359459 (1 ROW) Tue 12 Jan 2021 11:19:47 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -------+--------+---------+-------+-----------------+-------------+----------------- 56595 | 223019 | depesz | 0 | 565792160 | 0 | 9898785 (1 ROW) Tue 12 Jan 2021 11:19:48 AM CET (every 1s) pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed -----+-------+---------+-------+-----------------+-------------+----------------- (0 ROWS)
This is pretty cool. I saw both number of bytes that were send, and number of lines (rows).
Given that I knew that my output is 10 million rows, I could easily see where in the stream I am.
Great addition that adds yet another progress information.
Thanks a lot to all involved.
the first query generates 100 rows not 5
@minusf:
thanks, fixed
The reporting has since changed the name of one column, and has had some columns added, so you might want to update this post with the newer results.
@Matthias:
yeah, I generally don’t do it. There have been many changes through the years, and keeping stuff updated is not realistic. I assume that if/when someone will read about it, and will want to use it, they will check what is the current state of view/data.
That seems actually quite reasonable. Thanks for the great blog!