SQL best practices – don’t compare count(*) with 0

Every now and then I see something like this:

SELECT u.* FROM users u
WHERE 0 = (SELECT COUNT(*) FROM addresses a WHERE a.user_id = u.id);

and it kinda pains me. So I figured I’d write about what the problem with it is, and how to avoid such constructs.

Let’s consider what went through someone’s mind when they wrote it. This query will return all users that have zero addresses. In other words – users that never provided any address.

Why is that bad? After all, if we need to find users without addresses, it does what needs to be done, right? No.

The problem is the counting. What will happen if a user has a million addresses? Pg will dutifully count all of them, only to then reject the row, because the count wasn’t zero.

The thing is that you don’t need to have a million addresses. Even if there are only two – they still need to be counted, and while the time cost will be lower than counting a million rows, it is still there, and it’s 100% useless work.

So, the question is: how to do the thing properly? It’s easy – there is the EXISTS expression:

SELECT u.* FROM users u
WHERE NOT EXISTS (SELECT FROM addresses a WHERE a.user_id = u.id);

EXISTS (well, in this case NOT EXISTS) will abandon the check as soon as one row is found. No need to count.
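
If you want to verify this on your own data, a minimal sketch (reusing the example users/addresses tables from above) is to run EXPLAIN on both forms and compare the plans and timings:

EXPLAIN (analyze, buffers)
SELECT u.* FROM users u
WHERE 0 = (SELECT COUNT(*) FROM addresses a WHERE a.user_id = u.id);

EXPLAIN (analyze, buffers)
SELECT u.* FROM users u
WHERE NOT EXISTS (SELECT FROM addresses a WHERE a.user_id = u.id);

The count(*) version has to run the counting subplan over the matching rows, while the NOT EXISTS version can be planned as an anti-join, which never has to count anything.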

Of course, in some (most?) cases where people use this construct (0 == count), someone can say: “but it doesn’t matter in our case, because…”. Well, this is where the “best practices” part of the title comes from. One shouldn’t do wrong things, even if they don’t matter in this particular case, at the very least not to get into the habit of doing it, and then accidentally make the same mistake when it actually is important.
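
As a side note, the same rule applies to the opposite test – checking whether a user does have at least one address. A quick sketch, using the same example tables:

SELECT u.* FROM users u
WHERE EXISTS (SELECT FROM addresses a WHERE a.user_id = u.id);

beats comparing count(*) with anything (count(*) > 0, count(*) >= 1, and so on) for exactly the same reason.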

12 thoughts on “SQL best practices – don’t compare count(*) with 0”

  1. I use the analogy of asking someone how many people are in a meeting room vs asking if the meeting room is occupied. If you’re looking for an unoccupied room to hold a meeting you don’t care how many are in the room.

  2. great to point that out!

    One remark: there is still a sub select in the resulting query which is generally a slowdown. Replacing it with a LEFT JOIN is a better choice

    SELECT u.* FROM users u
    LEFT JOIN addresses a ON a.user_id = u.id
    WHERE a.user_id IS NULL;

    What do you think?

  3. @Manuel:

    1. the idea that “subselects are bad, so use ‘xxx’ instead” is insanely harmful
    2. as with almost everything in dbs – it depends on data distribution. sometimes join will be better, sometimes not.

  4. @Manuel: The subquery with NOT EXISTS will be executed as an anti-join by the PostgreSQL optimizer, so it will perform at least as well as the outer join you suggested.

  5. I thought the big idea with SQL (and why we tolerate its declarative nature which makes things so opaque for humans) is that it should rewrite our poorly optimized queries like this. Is anyone sure that this isn’t actually optimized in major DBs?

  6. @depesz can you elaborate on why avoiding subselects when possible is harmful? If possible some simple examples would go a long way. I was taught this too and would love to know more/better.

    @Laurenz_Albe points out that the NOT EXISTS subselect will be optimized by PostgreSQL. I do not doubt that and love that PostgreSQL can do such a thing! However, I and many devs I’ve worked with do not know that instinctively and would rewrite it as @Manuel suggests. In fact, that was the solution I thought of even before I finished reading the post.

  7. What is the difference between LEFT JOIN and NOT EXISTS in this article? Can you list them and write a separate article to explain them.

  8. @agrberg:
    Each tool has its uses. It’s like saying: avoid using hammers, because if you hit yourself on a finger, it will hurt.

    Subqueries aren’t inherently bad. Neither are IN(), JOINs, or anything else. Each should be used in the right situation.

    There are people that say that “x” is always bad, so always use “anything other than x” – which I believe is a bad approach, as it is, basically, cargo cult.

  9. 业余草:
    Sorry, I don’t do requests. Unless you want to sponsor articles, but I don’t think it’s realistic.

    What is left join: https://duckduckgo.com/?t=ffab&q=sql+tutorial+left+join&ia=web

    What is not exists: https://duckduckgo.com/?q=sql+tutorial+exists&t=ffab&ia=web

    And as with everything – try. You have a free database that provides great documentation, and an amazing tool: explain. Use it to learn. Don’t expect others to respond to chat-gpt-style prompts to answer your questions.

  10. This is not a good example.

    Sure, NOT EXISTS is better than count(*) = 0.
    But a NOT EXISTS subquery in the WHERE condition is harmful in this example.
    To ensure that for each returned record the subquery returns nothing, every row in this query needs to perform this subquery once.
    When there are millions of users, it will be very slow.
    A LEFT JOIN where the join target is NULL will be much faster in this case, and that’s why Postgres automatically rewrites this with a left join.

    Avoiding an unnecessary subquery for EVERY row is very important. So if there is another filter condition that would use a good index, and the actual fetched row count is low, then the per-row NOT EXISTS subquery check is fine, and slightly better than count(*) = 0.

    So your assertion is correct, but this example is not helpful for your readers. If your DB does not automatically rewrite it with a left join, it will be very, very slow, or even unusable, when there are tons of data, as in this example.

  11. @Pondin:

    > To ensure that for each returned record the subquery returns nothing, every row in this query needs to perform this subquery once.

    Well, this is clearly nonsense. If you can’t be bothered to test it yourself, just read the comment from Laurenz above.

    Your post is a clear example of the approach that I was commenting on earlier: “subqueries are wrong”. Do some tests, and stop spreading FUD.

    Just so that I will be clear:

    =$ CREATE TABLE x (id serial PRIMARY KEY);
    =$ CREATE TABLE y (id serial PRIMARY KEY, x_id int4 REFERENCES x (id));
    =$ CREATE INDEX yi ON y (x_id);
    =$ INSERT INTO x SELECT generate_series(1,100000) i;
    =$ INSERT INTO y(x_id) SELECT 1 + (random() * 100000)::int4 FROM generate_series(1,50000) ;
    =$ EXPLAIN (analyze, buffers) SELECT * FROM x WHERE NOT EXISTS (SELECT * FROM y WHERE y.x_id = x.id);
                                                         QUERY PLAN
    ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Hash Right Anti Join  (cost=2693.00..3827.50 rows=50000 width=4) (actual time=35.896..47.804 rows=60769 loops=1)
       Hash Cond: (y.x_id = x.id)
       Buffers: shared hit=665
       ->  Seq Scan on y  (cost=0.00..723.72 rows=50172 width=4) (actual time=0.013..2.973 rows=50000 loops=1)
             Buffers: shared hit=222
       ->  Hash  (cost=1443.00..1443.00 rows=100000 width=4) (actual time=10.148..10.163 rows=100000 loops=1)
             Buckets: 131072  Batches: 1  Memory Usage: 4540kB
             Buffers: shared hit=443
             ->  Seq Scan on x  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.022..3.281 rows=100000 loops=1)
                   Buffers: shared hit=443
     Planning Time: 0.103 ms
     Execution Time: 50.030 ms
    (12 rows)

    As you can see, no part of this query was run “for every row”.

    And just as a final note, here is the explain for the query rewritten as an outer join:

    =$ EXPLAIN (analyze, buffers) SELECT * FROM x LEFT OUTER JOIN y ON x.id = y.x_id WHERE y.id IS NULL;
                                                         QUERY PLAN
    ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Hash Right Join  (cost=2693.00..3546.25 rows=1 width=12) (actual time=37.864..48.290 rows=60769 loops=1)
       Hash Cond: (y.x_id = x.id)
       Filter: (y.id IS NULL)
       Rows Removed by Filter: 50000
       Buffers: shared hit=665
       ->  Seq Scan on y  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.012..1.972 rows=50000 loops=1)
             Buffers: shared hit=222
       ->  Hash  (cost=1443.00..1443.00 rows=100000 width=4) (actual time=17.567..17.570 rows=100000 loops=1)
             Buckets: 131072  Batches: 1  Memory Usage: 4540kB
             Buffers: shared hit=443
             ->  Seq Scan on x  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.009..4.855 rows=100000 loops=1)
                   Buffers: shared hit=443
     Planning:
       Buffers: shared hit=17
     Planning Time: 0.351 ms
     Execution Time: 49.677 ms
    (16 rows)

    Stop with the “subqueries are evil” nonsense! Or, at the very least *do some minimal research*.
