How to ALTER tables without breaking application?

I once wrote about this problem, but given that we now have DO blocks and procedures I can make nicer, easy to use example.

Over the years there have been many improvements to how long ALTER TABLE can take. You can now (in some cases) change datatype without rewrite or add default value.

Regardless how fast the thing works, it still needs extremely heavy (though shortlived) lock: Access Exclusive. We can see it easily:

=$ CREATE TABLE z (x text );
CREATE TABLE
 
=$ BEGIN;
BEGIN
 
=$ ALTER TABLE z ADD COLUMN i int4;
ALTER TABLE
 
=$ SELECT locktype, DATABASE, relation::regclass, mode, GRANTED FROM pg_locks WHERE pid = pg_backend_pid();
   locktype    | DATABASE | relation |        mode         | GRANTED 
---------------+----------+----------+---------------------+---------
 relation      |   539439 | pg_locks | AccessShareLock     | t
 virtualxid    |          |          | ExclusiveLock       | t
 relation      |   539439 | z        | AccessExclusiveLock | t
 transactionid |          |          | ExclusiveLock       | t
(4 ROWS)
 
=$ ROLLBACK;
ROLLBACK

If this is short lived, then what is the problem you might ask? Well, the problem is that even when lock is not granted, transactions that try to get some kind of lock that would conflict with it, and that tried to get the lock after we try to get Access Exclusive – they have to wait.

Of course this doesn't work just with Access Exclusive Lock – same thing happens with everything. Trying to get lock works like this:

  • Mark that current transaction is trying to get the lock
  • Wait for all previously running locks that would conflict with “my" lock will finish
  • Mark the lock as granted
  • Continue with work that depends on the lock.

Normally, if the application has just short transactions, it doesn't really matter. You might wait 100ms or a second to get the lock, and other connections (that would need to use this table) wait for you, but it's short, so it's ok.

The problem is: what if there is some long running transaction that touched the table that you want to lock? Then your ALTER TABLE will have to wait for this long transaction to end. And all other connections/transactions will wait for ALTER TABLE to finish.

The solution to this problem is to limit how long we wait, and if getting lock wouldn't work – simply stop trying, wait a bit, and retry.

Doing this will make the whole process longer, but, at the very least, no other connection will have to wait to select/insert/update/delete from the table we want to alter.

So you now know the theory. In my previous blogpost I described how to solve this using bash loop and calling psql repeatedly. This time let's try to do it in pure sql, without depending on external tools.

First solution is DO block.

We can, for example, write something like this:

  1. DO $$
  2. BEGIN
  3.     SET LOCAL lock_timeout = 10;
  4.     LOOP
  5.         BEGIN
  6.             ALTER TABLE z ADD COLUMN test TEXT;
  7.             EXIT;
  8.         EXCEPTION WHEN lock_not_available THEN
  9.             perform pg_sleep(random() * 5 + 2);
  10.         END;
  11.     END LOOP;
  12.     RESET lock_timeout;
  13. END;
  14. $$;

How does that work? That's simple. First, in line 3, we set lock_timeout config variable to 10 milliseconds. This will make each try to lock fail after 10ms, if the lock wasn't obtained.

Then, in line 4 we start infinite loop that does:

  1. start (sub) transaction (line 5)
  2. run the alter table that we need (line 6)
  3. if alter table worked, EXIT – which exits the infinite loop (line 7)
  4. if alter table didn't work, but the exception is lock_not_available: sleep for some random time (2-7 seconds) (line 9)
  5. if there was any other exception, given that we don't have EXCEPTION block that would catch it – it will abort this DO block, and raise exception to client
  6. line 11 ends the loop, and repeats from line 5

After the loop will end (which means that alter table succeeded, I reset lock_timeout to whatever it should be.

And that's all that you need.

The thing is that it doesn't really look good to have to retype the same block over and over again. We can do better. Specifically we can write procedure that will just take ALTER TABLE statement as argument, and will try to run it for as long as needed:

  1. CREATE PROCEDURE execute_with_minimal_locking( IN p_sql TEXT) LANGUAGE plpgsql AS $$
  2. BEGIN
  3.     SET LOCAL lock_timeout = 10;
  4.     LOOP
  5.         BEGIN
  6.             EXECUTE p_sql;
  7.             EXIT;
  8.         EXCEPTION WHEN lock_not_available THEN
  9.             perform pg_sleep(random() * 5 + 2);
  10.         END;
  11.     END LOOP;
  12.     RESET lock_timeout;
  13. END;
  14. $$;

That's all you need. Then you can simply run:

CALL execute_with_minimal_locking('ALTER TABLE z add column test TEXT');

EXECUTE command in line 6 will execute it, and the rest of logic is the same.

We could add there support for variable lock_timeout settings, or limit for number of tries, or any other magic that you could possibly dream of, but this thing, as it is, solves the problem for blocking concurrent access to table while waiting for lock by alter table.

There is one final note, though. If your alter table will take non-trivial time (for example, you're changing datatype in a way that requires table rewrite) – while getting the lock will not block other queries, once the lock is obtained, all other accesses to the table are locked away until the alter table will finish.

If you want to do complicated changes, where, even without concurrent locks, alter table takes long time – you just have to figure out how to do it without such long processing. Or split the processing into parts that don't require long-standing locks.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.