Every now and then I see something like this:
SELECT u.* FROM users u WHERE 0 = (SELECT COUNT(*) FROM addresses a WHERE a.user_id = u.id);
and it kinda pains me. So figured, I'll write about what is the problem with it, and how to avoid such constructs.
Lets consider what went through someones mind when they wrote it. This query will return all users, that have zero addresses. In other words – users that never provided any address.
Why is that bad? After all, if we need to find users without addresses, it does what needs to be done? Right. No.
The problem is counting. What will happen if an user has million addresses? Pg will dutifully count all of them, only then to reject the row, because there was something.
The thing is that you don't need to have million addresses. Even if there can be two – they still need to be counted, and while the time cost will be lower than counting million rows, it is still there, and it's 100% useless work.
So, the question can be: how to do the thing properly? It's easy – there is EXISTS expression:
SELECT u.* FROM users u WHERE NOT EXISTS (SELECT FROM addresses a WHERE a.user_id = u.id);
EXISTS (well, in this case NOT EXISTS) will check abandoning the test as soon as one row will be found. No need to count.
Of course, in some (most?) cases, where people use this construct (0 == count), someone can say: “but it doesn't matter in our case, because…". Well, this is where best practices part of the title comes from. One shouldn't do wrong things, even if they don't matter in this particular case, at the very least not to get into habit of doing it, and then accidentally make the same mistake when it will actually be important.
I use the analogy of asking someone how many people are in a meeting room vs asking if the meeting room is occupied. If you’re looking for an unoccupied room to hold a meeting you don’t care how many are in the room.
great to point that out!
One remark: there is still a sub select in the resulting query which is generally a slowdown. Replacing it with a LEFT JOIN is a better choice
SELECT u.* FROM users u
LEFT JOIN addresses a ON a.user_id = u.id
WHERE a.user_id IS NULL;
What do you think?
@Manuel:
1. the idea that “subselects are bad, so use ‘xxx’ instead” is insanely harmful
2. as with almost everything in dbs – it depends on data distribution. sometimes join will be better, sometimes not.
@Manuel: The subquery with NOT EXISTS will be executed as a anto-join by the PostgreSQL optimizer, so it will perform at least as well as the outer join you suggested.
I thought the big idea with SQL (and why we tolerate its declarative nature which makes things so opaque for humans) is that it should rewrite our poorly optimized queries like this. Is anyone sure that this isn’t actually optimized in major DBs?
I find the subquery less easy to read than a join.