Recently I've seen case like:
- application had to add column to table.
- application ran ALTER TABLE ADD COLUMN (without default!)
- everything stopped for many MINUTES
Why? How to avoid the problem?
First, let's see what's happening when I try to alter table add column:
=# ALTER TABLE test ADD COLUMN whatever int4; ALTER TABLE TIME: 12.662 ms
As you can see adding the column is very fast. So how come it locked other queries for minutes?
Well, let's see locks obtained by alter table:
=# BEGIN; BEGIN =# ALTER TABLE test ADD COLUMN whatever2 int4; ALTER TABLE =# SELECT * FROM pg_locks WHERE pid = pg_backend_pid(); locktype | DATABASE | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | GRANTED | fastpath ---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+---------- relation | 165725 | 12143 | | | | | | | | 3/2594 | 32470 | AccessShareLock | t | t virtualxid | | | | | 3/2594 | | | | | 3/2594 | 32470 | ExclusiveLock | t | t transactionid | | | | | | 1422 | | | | 3/2594 | 32470 | ExclusiveLock | t | f relation | 165725 | 166142 | | | | | | | | 3/2594 | 32470 | AccessExclusiveLock | t | f (4 ROWS) =# ROLLBACK; ROLLBACK
Please note that there is AccessExclusiveLock – and this will lock everything until the lock is gone, which in my case is – until the transaction with alter table will finish.
But alter table add column takes 12ms. So why minutes? Well, it happened that there were some other queries happening on the table before, and then alter table has to wait with continuing till previous locks are gone.
We can see it by doing:
(SESSION 1) =# BEGIN; BEGIN (SESSION 1) =# SELECT COUNT(*) FROM test; COUNT ------- 0 (1 ROW)
then, without closing session 1, and without closing its transaction:
(SESSION 2) =# ALTER TABLE test ADD COLUMN whatever2 int4;
and this hangs – because it's waiting for lock from session 1. But – it already obtained AccessExclusiveLock on the table, so no other select can happen:
(SESSION 3) =# depesz=# SELECT * FROM test LIMIT 1;
also hangs.
So, is there any way to add this column without long locking of other queries?
YES. Using the same trick pg_reorg/pg_repack is using.
First we need to set statement timeout, and then run alter table:
=$ printf "SET statement_timeout = 50;\nALTER TABLE test add column whatever2 INT4;\n" SET statement_timeout = 50; ALTER TABLE test add column whatever2 INT4;
The timeout will make sure that alter table can't take more than 50 milliseconds.
Then, if I'll run it, via psql, I get:
=$ time printf "SET statement_timeout = 50;\nALTER TABLE test add column whatever2 INT4;\n" | psql -qX ERROR: canceling statement due to statement timeout real 0m0.054s user 0m0.000s sys 0m0.002s =$ echo $? 0
It failed, fast, which is good. But it returned code 0. Which is mark for success. But we can change that by adding:
=$ time printf "SET statement_timeout = 50;\nALTER TABLE test add column whatever2 INT4;\n" | psql -qX -v ON_ERROR_STOP=1 ERROR: canceling statement due to statement timeout real 0m0.054s user 0m0.002s sys 0m0.000s =$ echo $? 3
And this is what we need. Now I can wrap it in a loop, but to make it slightly more readable, let's first get rid of printf from each call:
=$ printf "SET statement_timeout = 50;\nALTER TABLE test add column whatever2 INT4;\n" > alter.sql
And now, I can:
=$ while true; do date; psql -qX -v ON_ERROR_STOP=1 -f alter.sql && break; sleep 1; done Thu 26 Sep 2019 03:43:52 PM CEST psql:alter.sql:2: ERROR: canceling statement due to statement timeout Thu 26 Sep 2019 03:43:53 PM CEST psql:alter.sql:2: ERROR: canceling statement due to statement timeout Thu 26 Sep 2019 03:43:54 PM CEST psql:alter.sql:2: ERROR: canceling statement due to statement timeout Thu 26 Sep 2019 03:43:55 PM CEST psql:alter.sql:2: ERROR: canceling statement due to statement timeout Thu 26 Sep 2019 03:43:56 PM CEST =$
The oneliner expanded:
while true do date psql -qX -v ON_ERROR_STOP=1 -f alter.sql && break sleep 1 done
date is there just so that I'll see when I started, and when it finally worked. Of course one could use another value of sleep.
If I were to use some smarter language than bash I could do it all in one connection, but for these kinds of tasks, I don't really think it matters much to start new psql every second or so.
Obviously adding the column might take long time now – but at no point in time other queries will be locked for over 50 ms.
Please note that if you'd like to do more stuff to the table, you can do it using transaction and alter.sql containing:
BEGIN; SET statement_timeout = 50; LOCK TABLE ONLY test IN ACCESS EXCLUSIVE MODE; SET statement_timeout = 0; ALTER TABLE test ....; -- do whatever you want, timeout is removed. commit;
But you have to remember that if the lock will be obtained, no other connection will be able to use the table.
You can also do a variation of this where you start a transaction and then attempt to take a NOWAIT table lock on the table you need to alter. It will error the transaction immediately if it can’t take the lock and then you can loop with whatever wait time you want.
Has the downside that you will never succeed if you can’t find a time that doesn’t have a conflicting lock already, but has the upside of you don’t block for 50ms (or whatever your statement_timeout needs to be) repeatedly.
Still a good idea to put a statement_timeout on this method also just in case your change takes longer than you thought it should, or you attempted to bundle something else in that transaction that you expected to be fast that isn’t.
Curious what you see as pro/con on statement_timeout vs lock_timeout here?
@Jeremy Schneider:
well, now that I’ve read it – it looks that lock_time is/would be even better. Forgot about it 🙂
Is there any specific reason why PL/pgSQL LOOP was not used? If I apply my Oracle experience to this task using that lock_timeout Jeremy Schneider pointed out, I would rather do something like this:
@Mikhail Velikikh:
No real reason, aside form the fact that I somehow try to steer clear of exception handling in plpgsql. Probably in this case it wasn’t necessary.
On the other hand – my code seems to be shorter 🙂
This is genius, I love it! I took the approach a couple of days ago of doing the DDL change out of hours, but this would have been much better.
There are basically 2 ways to do this:
1. Using timeout logic as explained in this blog, or
2. Using NOWAIT feature, shown below:
@Michael:
yeah, but NOWAIT will exit immediately, not even leaving the chance to wait for 50ms (or whatever is acceptable). That’s why I didn’t use it.
Good point, @depesz, but the loop variable iteration tries to account for that by trying multiple times before giving up, so you could adjust it higher. Don’t get me wrong, I like your way as well!
@Michael:
sure, you can add loop, but – the way it’s written will cause “busy-loop” using all of CPU, and if we’d add sleeps – it is very possible that checking for locks would never succeed on busy systems, while a waiting solution might.
@Depesz, not trying to belabor this point too much but in your opinion can you think of a scenario in which using NOWAIT might be a better solution?
@Michael:
only if you really don’t want to wait. As in: if it locks, do something, otherwise do something else.