Case study: optimization of a weirdly picked bad plan

We recently hit an interesting case where the planner picked the wrong plan. Kinda. I figured it could be used to show how to deal with such cases.

So, we have some databases on PostgreSQL 14 (yes, I know, we should upgrade, but upgrading is a LONG project, so it's not really on the table now).

Last week someone noticed problems with the website. These were tracked down to queries getting killed because of a statement_timeout of 3 minutes.
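For reference, that timeout is just a regular setting; I'm not showing our exact configuration here, but the session-level version of it is simply:

=$ SET statement_timeout = '3min';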

The query was relatively simple (most of the identifiers were obfuscated, but I tried to keep them somewhat memorable):

SELECT
    MAX( voluntaries.cached_due_date )
FROM
    changes51_gauge_8273.voluntaries
WHERE
    voluntaries.mutilation_id = 12122656 AND
    voluntaries.workflow_state <> 'deleted';

The interesting thing was that on the primary DB the query returned in a couple of milliseconds, but on the replica it got killed after 3 minutes. Repeatably.

The first thing I did was run EXPLAIN on the query. Got this:

                                                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Result  (cost=8473.04..8473.05 rows=1 width=8)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.58..8473.04 rows=1 width=8)
           ->  Index Scan Backward using index_voluntaries_on_cached_due_date on voluntaries  (cost=0.58..39905284.07 rows=4710 width=8)
                 Index Cond: (cached_due_date IS NOT NULL)
                 Filter: (((workflow_state)::text <> 'deleted'::text) AND (mutilation_id = 12122656))
(6 rows)


So, it's using an index. What can I see about it, and the table itself? Well, the table is rather large: 232GB of data in the table, an additional 2GB in TOAST, and ~380GB in indexes. Many indexes.
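I didn't save the exact query I used to check the sizes, but it boils down to the usual helper functions, something like this (pg_table_size includes TOAST, pg_indexes_size sums all the indexes):

=$ SELECT
    pg_size_pretty( pg_table_size( 'changes51_gauge_8273.voluntaries' ) ) AS table_size,
    pg_size_pretty( pg_indexes_size( 'changes51_gauge_8273.voluntaries' ) ) AS indexes_size,
    pg_size_pretty( pg_total_relation_size( 'changes51_gauge_8273.voluntaries' ) ) AS total_size;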

In total, the table has 50 columns of various datatypes, and 30 indexes.

The interesting columns and indexes look like this:

              Column              │            Type             │ Collation │ Nullable │ Default
══════════════════════════════════╪═════════════════════════════╪═══════════╪══════════╪═════════
 mutilation_id                    │ bigint                      │           │ not null │
 workflow_state                   │ character varying(255)      │           │ not null │
 cached_due_date                  │ timestamp without time zone │           │          │
Indexes:
    "index_active_voluntaries" btree (mutilation_id, workman_exhale_id) WHERE workflow_state::text <> 'deleted'::text
    "index_voluntaries_on_mutilation_id_and_automation_type" btree (mutilation_id, automation_type)
    "index_voluntaries_on_mutilation_id_and_user_id" btree (mutilation_id, user_id)
    "index_voluntaries_on_cached_due_date" btree (cached_due_date)

At the moment, I don't recall which index was used on the primary, but I think it was index_active_voluntaries.

Let's think about what happened.

On the primary, where everything was fast, Pg decided to use an index to filter on mutilation_id and workflow_state, and then it got the max value of cached_due_date by simply reading it from all rows that matched the mutilation_id/workflow_state combo.

But on the replica, it decided to scan the data in cached_due_date order (backward, via index_voluntaries_on_cached_due_date) and stop at the first row that matches mutilation_id/workflow_state.

The problem with this is that if rows matching mutilation_id/workflow_state are relatively rare (or don't exist), PostgreSQL will have to read the whole table (230GB) in essentially random order (well, in cached_due_date order) before it finds the first row with the appropriate mutilation_id/workflow_state.
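To make it concrete: with this plan the MAX() is, more or less, turned into something like the query below, where only the cached_due_date condition comes from the index, and the other two are applied as a filter, row by row, while walking the index backward:

SELECT
    voluntaries.cached_due_date
FROM
    changes51_gauge_8273.voluntaries
WHERE
    voluntaries.cached_due_date IS NOT NULL AND  -- satisfied by the backward index scan
    voluntaries.mutilation_id = 12122656 AND     -- checked as a filter on every visited row
    voluntaries.workflow_state <> 'deleted'      -- ditto
ORDER BY
    voluntaries.cached_due_date DESC
LIMIT 1;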

If you don't understand it, consider a phone book of millions of people, sorted by “firstname”.

In such a phone book, if I'd like to find “max(lastname)” for “firstname = 'hubert'”, it will be fast. Index scan to find hubert, and then find the max lastname. Trivial.

But what if the phone book is sorted by “lastname”? Well, if a person with lastname “zzzzzzzz” happened to have “hubert” as their firstname, it would be trivial. We scan the book backward, and find the first hubert. But what will happen if the only hubert in the book is “hubert abc”? Finding him by scanning backward from the z's will take a VERY LONG TIME.

This is what happened.

So, what can we do about it? The first idea was to run ANALYZE. We did, and it didn't help. Which, in hindsight, wasn't surprising: the primary had the same stats, and was picking a different index.
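For the record, that was nothing fancy, just a plain ANALYZE run on the primary (which is where it has to run anyway, since replicas are read-only, and the updated stats reach them via WAL):

=$ ANALYZE changes51_gauge_8273.voluntaries;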

Given this, we had to find another way. There were actually two ways to fix the query: change the query so that Pg will first have to find rows with the mutilation_id/workflow_state combo that we want, and only then look for the max cached_due_date. Or, alternatively: add a specialized index.

The first approach, forcing Pg to pick the rows we need first, is trivial thanks to the MATERIALIZED clause of common table expressions. I rewrote the query to:

=$ WITH x AS MATERIALIZED (
    SELECT
        voluntaries.cached_due_date
    FROM
        changes51_gauge_8273.voluntaries
    WHERE
        voluntaries.mutilation_id = 12122656 AND
        voluntaries.workflow_state <> 'deleted'
)
SELECT
    MAX( cached_due_date )
FROM
    x;

Unfortunately, in the heat of the moment, with a broken-ish site, I didn't think about saving the EXPLAIN ANALYZE plan from this. But it generally ran in ~5ms, if my memory serves.
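If you want to measure it yourself, it's just a matter of prefixing the query with EXPLAIN (ANALYZE, BUFFERS), along the lines of:

=$ EXPLAIN ( ANALYZE, BUFFERS )
WITH x AS MATERIALIZED (
    SELECT voluntaries.cached_due_date
    FROM changes51_gauge_8273.voluntaries
    WHERE voluntaries.mutilation_id = 12122656 AND
          voluntaries.workflow_state <> 'deleted'
)
SELECT MAX( cached_due_date ) FROM x;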

Of course, if there were millions of rows matching the mutilation_id/workflow_state condition, it would take longer, as it would have to scan more rows to find the one with max(cached_due_date).

5ms isn't bad, but the problem with handling many rows caused me to look into a better index.

Luckily this is rather simple. Our query had one interesting condition using the = operator, one condition using <>, which can be used as the WHERE clause of a partial index, and one more column used for sorting/getting the max.

This means that the index should be:

=$ CREATE INDEX depesz_bandaid ON changes51_gauge_8273.voluntaries (mutilation_id, cached_due_date) WHERE workflow_state <> 'deleted';
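If the table is being actively written to, it might be safer to build such an index with CONCURRENTLY, which takes longer but doesn't block writes while it runs:

=$ CREATE INDEX CONCURRENTLY depesz_bandaid ON changes51_gauge_8273.voluntaries (mutilation_id, cached_due_date) WHERE workflow_state <> 'deleted';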

Making this index took a while (after all, it had to read data from the whole table). But with this index, the old query was running in:

                                                                         QUERY PLAN
═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Result  (cost=0.62..0.63 rows=1 width=8) (actual time=0.014..0.015 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.58..0.62 rows=1 width=8) (actual time=0.013..0.013 rows=0 loops=1)
           ->  Index Only Scan Backward using depesz_bandaid on voluntaries  (cost=0.58..154.95 rows=3311 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                 Index Cond: ((mutilation_id = 12122656) AND (cached_due_date IS NOT NULL))
                 Heap Fetches: 0
 Planning Time: 0.321 ms
 Execution Time: 0.028 ms
(8 rows)

And the new query, with MATERIALIZED, was only slightly slower:

                                                                   QUERY PLAN
════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Aggregate  (cost=1024.69..1024.70 rows=1 width=8) (actual time=0.051..0.052 rows=1 loops=1)
   CTE x
     ->  Index Only Scan using depesz_bandaid on voluntaries  (cost=0.58..676.60 rows=15471 width=8) (actual time=0.015..0.032 rows=55 loops=1)
           Index Cond: (mutilation_id = 12122656)
           Heap Fetches: 55
   ->  CTE Scan on x  (cost=0.00..309.42 rows=15471 width=8) (actual time=0.016..0.045 rows=55 loops=1)
 Planning Time: 0.193 ms
 Execution Time: 0.070 ms
(8 rows)

I would have expected exactly this kind of difference: the MATERIALIZED version has to read all matching rows before aggregating, while the plain query can stop at the first (that is, max) cached_due_date it finds. Either way, both approaches, with the new index, are good enough.

After the whole thing was solved (in terms of site performance), I looked into why the replica was picking the wrong index. And… well, I'm not sure if I found anything. The primary DB was on 48 cores with 392GB of RAM. The replica was on a bigger machine: 64 cores and 512GB of RAM.

The only changes in configuration were to reflect that we had more memory:

                        Primary     Secondary
 effective_cache_size   185.6 GB    248.8 GB
 shared_buffers         93.3 GB     124.4 GB
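If you want to do the same comparison on your own primary/replica pair, the values are easy to get from pg_settings:

=$ SELECT name, setting, unit
FROM pg_settings
WHERE name IN ( 'shared_buffers', 'effective_cache_size' );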

Honestly, I wouldn't have expected that giving a box more memory could cause problems, but apparently that can happen as well.

Hope you can find it useful 🙂
