This might not interest many of you, but I recently heard about at least two people who stumbled upon the same problems I did, so I figured I'd write about the problems we discovered, and how we solved them (or didn't).
When we began our journey, the latest Pg was 14.x, which is why we're upgrading to 14, not 15. But I suspect upgrading to 15 wouldn't change much …
First, let's introduce the environment. At the place where I was in charge of doing the upgrade, there are a couple hundred clusters. Each cluster consists of at least four servers:
- primary
- two streaming replicas
- disaster recovery node, replicating from the WAL archive
Each such cluster has only one DB in it. Sounds simple, but this is where the complications start.
The application itself usually has, let's say, 400-500 tables; the variance depends on partitioning. The application is provided to customers as a service, and we don't put data from different customers into the same tables.
So, we use schemas a lot. Some clusters host one really big customer, so there's only one schema. But there are clusters with many hundreds of schemas.
This leads to cases where we have, for example, over 200k tables, with the total number of db objects above 2 million.
Disk-space-wise, the dbs aren't all that big – from a couple hundred megabytes to around 6 TB.
All of these were on Ubuntu Bionic (18.04), and we wanted to upgrade to Focal (20.04). On top of it all, while we can have downtime for the upgrade, the downtime window is limited to two hours.
So, let the games begin.
First idea – use pg_upgrade. Unfortunately, our previous experience with pg_upgrade (from the 9.5 => 12 upgrade) showed that while, in link mode, the size of the data doesn't matter, the number of objects does. And on our clusters with millions of objects, upgrade time in link mode was easily over 4-5 hours!
We could work around that by getting temporary super-beefy hardware that would do it faster, but then we hit problem #1: locale data changes. Basically, the system locales changed between Bionic and Focal, which means that if we upgraded with pg_upgrade, we would have to reindex every index on textual data, or lose data consistency. Or stop seeing data.
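To get a feel for how much reindexing that would mean, a query along these lines (my quick sketch, not something from our actual tooling) lists the indexes that depend on a collatable column, i.e. the candidates for a rebuild:

=$ SELECT DISTINCT i.indexrelid::regclass AS index_name
   FROM pg_index i
   JOIN pg_attribute a ON a.attrelid = i.indexrelid    -- index columns have pg_attribute rows too
   JOIN pg_collation c ON c.oid = a.attcollation
   WHERE c.collcollate NOT IN ('C', 'POSIX')           -- C/POSIX sort bytewise, so they are safe
   ORDER BY 1;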
So, that's a no-go.
Well, we can use replication. Not binary, of course, but logical. In the end we did 9.5 -> 12 with pglogical, but now, in pg12, we have normal, built-in LOGICAL REPLICATION. Life is great. Right? Well, not so much.
Initial tests on small test dbs worked nicely. To speed things up we used max_sync_workers_per_subscription. The idea and prototype were rather simple (a rough sketch of the commands follows the list):
- copy empty tables with primary keys only (no other indexes/constraints/triggers)
- create publication
- subscribe it
- add all tables
- wait for it to replicate, and be up to date
- add missing indexes/triggers/whatever
- when the time to switch is upon us, verify that everything is up to date, and make the pg14 logical replica the primary.
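In terms of commands, the core of it looked more or less like this (publication, subscription, schema and table names here are made up for the example):

-- on the pg12 primary (publisher): an empty publication that tables get added to
=$ CREATE PUBLICATION upgrade_pub;

-- on the pg14 server (subscriber): the connection string is obviously fake
=$ CREATE SUBSCRIPTION upgrade_sub
       CONNECTION 'host=old-primary dbname=appdb user=upgrade'
       PUBLICATION upgrade_pub;

-- back on the publisher: add tables to the publication
=$ ALTER PUBLICATION upgrade_pub ADD TABLE customer1.orders, customer1.order_items;

-- on the subscriber: pick up the newly added tables - this is what starts the initial sync
=$ ALTER SUBSCRIPTION upgrade_sub REFRESH PUBLICATION;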
Simple? Well, as you can guess: no. Problem #2 – Pg created WAY too many replication slots. We had the limit (max_wal_senders) set to 50, and even with max_sync_workers_per_subscription=2 it created all 50 slots, and then broke.
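Watching pg_replication_slots on the publisher made the pile-up easy to see; the per-table sync slots show up there next to the subscription's main slot (a trivial check, but for completeness):

=$ SELECT slot_name, active, restart_lsn
   FROM pg_replication_slots
   ORDER BY slot_name;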
The problem has apparently been fixed since, but upgrading so many clusters to a fixed version is not an easy task, and would add literally months to our schedule. So, let's look for an alternative.
We could make it replicate using only a single connection. Could we? Well, for larger dbs (in terms of disk space) it proved to be too expensive, because WAL had to be kept on the primary long enough that we had to extend disk space, and while extending is simple(ish), shrinking it afterwards ain't easy.
The problem with the extra slots is, as far as I understand, that while the replica lets the primary know "I'm done with this slot, you can close it", it doesn't wait for the slot to actually be closed, and immediately starts a new temporary "initial-sync" slot.
So, how about making it impossible to get too many slots?
Revised plan (again, a rough sketch follows the list):
- copy empty tables with primary keys only (no other indexes/constraints/triggers)
- create publication
- subscribe it
- add up to 40 tables (we use max_wal_senders = 50, so 40 at a time is safe for us)
- wait for it to replicate, and be up to date
- wait for all temp sync slots to disappear
- if there are some not-yet-added tables, go to step 4
- add missing indexes/triggers/whatever
- when the time to switch is upon us, verify that everything is up to date, and make the pg14 logical replica the primary.
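The per-batch part of the loop, sketched out (again with made-up table names; the "sync slots are gone" check relies on the per-table sync slots having names like pg_…_sync_…):

-- on the publisher: add the next batch of up to 40 tables
=$ ALTER PUBLICATION upgrade_pub ADD TABLE customer17.orders, customer17.order_items;

-- on the subscriber: start the initial sync of the newly added tables
=$ ALTER SUBSCRIPTION upgrade_sub REFRESH PUBLICATION;

-- on the subscriber: wait until this returns 0; 'r' means the table is ready,
-- anything else means it is still in some stage of the initial sync
=$ SELECT count(*) FROM pg_subscription_rel WHERE srsubstate <> 'r';

-- on the publisher: wait until the per-table sync slots are gone
=$ SELECT count(*) FROM pg_replication_slots WHERE slot_name ~ '_sync_';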
Simple. Easy. YAY 🙂 Well, no.
Adding the missing indexes/triggers/whatever let me find problem #3. Since we wanted to add the indexes/constraints/stuff fast, we used pg_restore -j N, where N was generally from 5 to 10. And this caused pg_restore to deadlock with itself.
This problem got solved by simply adding, by hand, the fkeys that it failed on. Not a big deal. Can be scripted.
At this point, it all looked well. We actually started upgrades. Upgraded a couple of clusters. The sun was shining, people were happy. And then … problem #4.
Generally, if you have WAL-based replication, the primary has to keep WAL for as long as some replication slot needs it. Once a replica fetches WAL up to location "120" and applies it, it lets the primary know that it is now at least at 120, so everything before that can be deleted. Amazing. Brilliant. Superb. Until it doesn't work.
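A quick way to see how much WAL each slot is still forcing the primary to keep is a query like this (my usual check; pg_wal_lsn_diff() does the LSN arithmetic that a plain minus can't):

=$ SELECT slot_name,
          restart_lsn,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
   FROM pg_replication_slots
   ORDER BY restart_lsn;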
In some cases, Pg started removing WAL that was still needed. I'm not entirely sure what was at fault. Pg? AWS networking? Magic? Did I anger some ancient gods?
Whatever the case, once the problem happened, the replica (on pg14) was not salvageable, and we had to do the whole thing from scratch. Initial sync and all.
Not a big deal. We can redo it. But for some clusters the problem happened every time. Sometimes after 10 hours, sometimes after 3 days. We can't take the app down whenever we want; we have to let the client know. So the work plan is: prepare new servers, let the client know, wait out the notice period (generally a week), and then switch over. But some clusters didn't make it through that week without failing. Whatever we did. Rebuild 10 times in a row? Nothing helped.
At this point the whole upgrade project was paused to find a viable solution. And I dove nose-deep into depression. I love PostgreSQL, but these things were killing me.
Out of despair I figured I'd try pglogical again. It worked for 9.5 -> 12, so it should work now. Right? Well, kinda, maybe, no.
The problem we found (#5?) was that pglogical can't do the initial sync in parallel. So it sends all tables in one transaction. Even a transaction that took (only) 6 hours had a significant effect on other parts of the system (think databases that routinely hit 30k tps), so long transactions weren't really an option.
So, back to logical replication. Can we somehow make that work?
After some time, it dawned on me: in all the cases where the "lost WAL" problem manifested, the number of WAL files removed was rather small. Fewer than 8. I can't make wal_keep_segments ridiculously high, because of write traffic. But if I could somehow make Pg remove WAL files only after they are no longer used and, let's say, an extra 1GB of data has passed since, that might help.
So, I wrote a test script that:
- calls select * from pg_create_logical_replication_slot('depesz', 'test_decoding'); to create a replication slot
- periodically checks what the oldest WAL location used by any slot (aside from the depesz slot) is, and does some math to calculate that minus 1GB
- calls select * from pg_replication_slot_advance('depesz', 'this_location'); to make sure that the depesz slot is always around 1GB behind the oldest slot
So far, it seems to work. We haven't upgraded any cluster with it yet, but at the very least, one of the problematic clusters has been keeping up with replication for over 3 days with no crash.
The fun bit is that in Pg 12 I can't simply get a WAL location (a.k.a. LSN) and subtract 1GB from it:
=> SELECT pg_current_wal_lsn() - ( 1024 * 1024 * 1024 );
ERROR:  operator does not exist: pg_lsn - integer
LINE 1: SELECT pg_current_wal_lsn() - ( 1024 * 1024 * 1024 );
                                    ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
So I had to write myself a helper function:
=$ CREATE FUNCTION buffered_lsn( IN p_lsn pg_lsn, IN p_offset INT8 DEFAULT 1024^3 ) RETURNS pg_lsn AS $$
DECLARE
    v_new_pos NUMERIC := p_lsn - '0/0'::pg_lsn;
    v_upper   INT8;
    v_lower   INT8;
BEGIN
    IF p_offset < 0 THEN
        raise exception 'offset for dba.buffered_lsn() can''t be negative.';
    END IF;
    -- Actually subtract offset
    v_new_pos := v_new_pos - p_offset;
    -- Make sure we don't enter the realm of negative lsns.
    IF v_new_pos < 0 THEN
        RETURN '0/0';
    END IF;
    v_lower := v_new_pos % 4294967296;
    v_upper := div( v_new_pos, 4294967296 );
    RETURN format('%s/%s', to_hex(v_upper), to_hex(v_lower))::pg_lsn;
END;
$$ LANGUAGE plpgsql;
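With the helper in place, the heart of the guard-slot script boils down to something along these lines. The slot name is the one from above; how I pick the oldest "real" slot is simplified here, and a production version also has to skip the call when the computed position isn't ahead of the guard slot's current one, because pg_replication_slot_advance() refuses to move a slot backwards:

-- one-time setup: the "guard" slot that will deliberately lag behind the real ones
=$ SELECT * FROM pg_create_logical_replication_slot('depesz', 'test_decoding');

-- run periodically: move the guard slot to ~1GB behind the oldest other slot
=$ SELECT * FROM pg_replication_slot_advance(
       'depesz',
       buffered_lsn(
           ( SELECT restart_lsn
             FROM pg_replication_slots
             WHERE slot_name <> 'depesz'
             ORDER BY restart_lsn
             LIMIT 1 )
       )
   );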
Sooo. That's about it. This is my current state.
I've been working on the upgrade process for over a year now (there were also other things I was doing), and while it looks OK-ish now, given the previously encountered problems I'm not 100% sure that it will work fine this way. But there is hope.
I am not on the level you are handling here, but have you tried going from Pg12 to Pg13 first?
@mistige:
This is not relevant. The problems arise because of (what I assume to be) bugs in logical replication. And pg_restore. And missing functionality in pglogical.
It doesn't matter whether I'd set up the replication to pg14, 13, 12, 15, or 16. The problem lies in the server that is doing the logical replication publication.
Excellent article, nice to know that weird problems happen to masters as well 🙂
I wish you good luck with the upgrades!
I know that with the old version of Ubuntu you are using on the old cluster, you don't have access to reflinks like in newer versions of XFS.
But this is the best way I have found to do a pg_upgrade with almost zero downtime, with the bonus of preserving the old cluster intact, using the same space. All of my new Postgres installations I do on XFS with reflink.
Best regards
@Alexandre Paes:
Not sure what you mean by "XFS with reflink". We're using XFS, though.
And in our tests, as I wrote in this blogpost, `pg_upgrade --link` is generally very fast, until we need to run it on a db with many tables.
Perhaps this reflink that you mention is something else, I’m not sure.
I'm facing a similar challenge, having to migrate a fleet of PG12 clusters from Ubuntu 18.04 to 22.04 with the new libc collation. For some smaller clusters we will try pg_dump/restore migrations, but the larger ones will require a logical replication setup & cutover as you describe.
This is the kind of thing that keeps me up with worry. Thanks for this article.
@Don Seiler:
For what it's worth, it looks like our hack with the "purposely lagging" slot helped, and the problematic cluster (one of them) is replicating nicely. Upgrade time is scheduled, so there is a chance we will be able to resume our upgrades.
@depesz:
Sorry for the delay in replying. Reflinks are like a snapshot of a file. The old cluster will be preserved, and the new cluster only has a pointer to the old data plus the changed data, incrementally. If you need to start the old cluster again, no problem. To use it, you need a filesystem that supports reflinks, and you run pg_upgrade --clone instead of --link.
@Alexandre – in this case *of course* we tried. I wrote about it:
“First idea – use pg_upgrade. Unfortunately our previous experiences with pg_upgrade showed (at the time of 9.5 => 12 upgrade) that while, in link mode …”
@depesz:
I did read that you tried pg_upgrade with the --link option, which uses *hardlinks*.
pg_upgrade with the --clone option uses *reflinks*, which are a different approach, but it requires that the filesystem have reflinks enabled (mkfs.xfs -m reflink=1), which I doubt you have on Ubuntu Bionic. I talked about reflinks (and pg_upgrade --clone) in my first post because it is a great feature to use in new installations.
Anyway, congrats on the valuable work with this site!
@Alexandre:
So, I didn't try it. But I can't imagine how it could be faster, given that the docs say that "it gives speed advantages of hard links…".
Plus, in our case it's not the size (gigabytes) of data that is the problem, but the number of objects.
It's a cool feature, though, one that gives you the speed of --link while still keeping the old pgdata undamaged…
pg_upgrade is not an option in any case due to the collation change after Ubuntu Bionic unless you have a downtime window to rebuild all your text indexes immediately afterward.
I also hit all of problems 1-5, with a 1.2TB db with 150+ tables and 2.9 billion rows. Resolved the not-enough-slots issue by going from 25 to 200. Then hit the missing WAL. Fought this for a few days until we found a cron job that was running pg_archivecleanup every 10 minutes. This explained why the WALs were disappearing from the subscriber's perspective. We commented out that cron job and got all our tables syncing and finishing correctly without erroring. Thought I'd add this so everyone else who comes here can benefit as well. Good luck everyone.