Every now and then I see something like this:
SELECT u.* FROM users u WHERE 0 = (SELECT COUNT(*) FROM addresses a WHERE a.user_id = u.id);
and it kinda pains me. So figured, I'll write about what is the problem with it, and how to avoid such constructs.
Lets consider what went through someones mind when they wrote it. This query will return all users, that have zero addresses. In other words – users that never provided any address.
Why is that bad? After all, if we need to find users without addresses, it does what needs to be done? Right. No.
The problem is counting. What will happen if an user has million addresses? Pg will dutifully count all of them, only then to reject the row, because there was something.
The thing is that you don't need to have million addresses. Even if there can be two – they still need to be counted, and while the time cost will be lower than counting million rows, it is still there, and it's 100% useless work.
So, the question can be: how to do the thing properly? It's easy – there is EXISTS expression:
SELECT u.* FROM users u WHERE NOT EXISTS (SELECT FROM addresses a WHERE a.user_id = u.id);
EXISTS (well, in this case NOT EXISTS) will check abandoning the test as soon as one row will be found. No need to count.
Of course, in some (most?) cases, where people use this construct (0 == count), someone can say: “but it doesn't matter in our case, because…". Well, this is where best practices part of the title comes from. One shouldn't do wrong things, even if they don't matter in this particular case, at the very least not to get into habit of doing it, and then accidentally make the same mistake when it will actually be important.
I use the analogy of asking someone how many people are in a meeting room vs asking if the meeting room is occupied. If you’re looking for an unoccupied room to hold a meeting you don’t care how many are in the room.
great to point that out!
One remark: there is still a sub select in the resulting query which is generally a slowdown. Replacing it with a LEFT JOIN is a better choice
SELECT u.* FROM users u
LEFT JOIN addresses a ON a.user_id = u.id
WHERE a.user_id IS NULL;
What do you think?
@Manuel:
1. the idea that “subselects are bad, so use ‘xxx’ instead” is insanely harmful
2. as with almost everything in dbs – it depends on data distribution. sometimes join will be better, sometimes not.
@Manuel: The subquery with NOT EXISTS will be executed as a anto-join by the PostgreSQL optimizer, so it will perform at least as well as the outer join you suggested.
I thought the big idea with SQL (and why we tolerate its declarative nature which makes things so opaque for humans) is that it should rewrite our poorly optimized queries like this. Is anyone sure that this isn’t actually optimized in major DBs?
I find the subquery less easy to read than a join.
@depesz can you elaborate on why avoiding subselects when possible is harmful? If possible some simple examples would go a long way. I was taught this too and would love to know more/better.
@Laurenz_Albe points out that the NOT EXISTS subselect will be optimized by PostgreSQL. I do not doubt that and love that PostgreSQL can do such a thing! However, I and many devs I’ve worked with do not know that instinctively and would rewrite it as @Manuel suggests. In fact, that was the solution I thought of even before I finished reading the post.
What is the difference between LEFT JOIN and NOT EXISTS in this article? Can you list them and write a separate article to explain them.
@agrberg:
Each tool has it’s uses. It’s like saying: avoid using hammers, because if you hit yourself in a finger, it will hurt.
Subqueries aren’t inherently bad. Neither are IN(), JOINs or anything else. Everything should be used in their situation.
There are people that say that “x” is always bad, so always use “anything other than x” – which I believe is bad approach, as it is, basically, cargo cult.
业余草:
Sorry, I don’t do requests. Unless you want to sponsor articles, but I don’t think it’s realistic.
What is left join: https://duckduckgo.com/?t=ffab&q=sql+tutorial+left+join&ia=web
What is not exists: https://duckduckgo.com/?q=sql+tutorial+exists&t=ffab&ia=web
And as with everything – try. You have free database that provides great documentation, and amazing tool: explain. Use it to learn. Don’t expect others to respond to chat-gpt style prompts to answer your questions.
This is not a good example.
Sure, NOT EXISTS is better than count(*) = 0.
But, NOT EXISTS subquery in where condition is harmful in this example.
To ENSURE each record returned NOT EXISTS the following subquery, means every row in this query need to perform this subquery once.
When there’s millions of users, then it will be very slow.
Left Join where join target is null will be much faster in this case, and that’s why Postgres auto rewrite this with left join.
Aviod non necessary subquery for EVERY row is very important. So if there’s other filter condition that would use good index, and the actual fetched row count is low then the NOT EXISTS every row subqurry check is fine, and slightly better that count(*) = 0.
So your assertion is correct but this example is not helpful for your readers. If your DB not automatically rewrite with left join, it’s very very slow or even not working when there’s tons of data in this example.
@Pondin:
> To ENSURE each record returned NOT EXISTS the following subquery, means every row in this query need to perform this subquery once.
Well, this is clearly nonsense. If you can’t be bothered to test it yourself just read the comment from Laurenz above.
Your post is clear example of the approach that I was commenting on in my comments earlier: “subqueries are wrong”. Do some tests, and stop spreading FUD.
Just so that I will be clear:
As you can see none part of this query was ran “for every row”.
And just as a final note, this is explain for the query rewritten as outer join:
Stop with the “subqueries are evil” nonsense! Or, at the very least *do some minimal research*.