so there you go: you have some “categories” and some objects. for simplicity, let's assume one object can be in only one category.
if this is too theoretical for you – think of “mails in folders”, “photos in galleries”, “posts in categories” or “auctions in categories on ebay”.
everything clear? now, let's assume you want to know how many “objects” are in a given “category”.
most basic way to do it is:
SELECT COUNT(*) FROM objects WHERE category = some_category;
but this method is far from optimal. now, we'll learn how to do it better.
one warning for those of you who read the rss feed – if you say “yeah, i know the code, it's simple" – ask yourself – is your code deadlock-proof?
the best way to get the counts fast is not to count them – i.e. to have them counted for you in advance, ready to be used.
to do so we will need some triggers.
basically we need triggers on 3 actions:
- insert of new object (counter for given category has to be incremented)
- delete of object (counter for given category has to be decremented)
- update of object – in this case we have to check if category has changed, and, if so, update two counters appropriately.
now, some of you might be tempted to skip the third trigger – don't!
in some cases there is no actual “delete” of an object – instead you change some status field. this means you have to modify the “on update” code accordingly, but you should keep the “on delete” trigger as well, just in case somebody really deletes an “active” object.
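to make the “change status instead of delete” case concrete, here is a partial sketch of such an “on update” trigger function. the status column and the 'active' value are made up for illustration, and this sketch deliberately ignores the case where the category changes in the same UPDATE:

```sql
-- hypothetical: objects has "status TEXT NOT NULL DEFAULT 'active'",
-- and only rows with status = 'active' should be counted
CREATE OR REPLACE FUNCTION count_trg_u() RETURNS TRIGGER AS $BODY$
BEGIN
    IF OLD.status = 'active' AND NEW.status <> 'active' THEN
        -- object got "soft-deleted": remove it from the count
        UPDATE categories SET object_count = object_count - 1
            WHERE id = OLD.category_id;
    ELSIF OLD.status <> 'active' AND NEW.status = 'active' THEN
        -- object got "undeleted": add it back
        UPDATE categories SET object_count = object_count + 1
            WHERE id = NEW.category_id;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;
```

a real version also has to handle category changes, and the nasty case where both status and category change in one UPDATE.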
so, now comes the big question – one trigger to rule them all, or 3 separate triggers? i tested it and found that 3 separate triggers (for insert, update and delete) are faster. the difference is not big (1.5% – 4.5%), but it is there (i tested by repeating 350 times: insert a lot of rows, update a lot of rows, delete a lot of rows).
let's see how the code will look.
tables for our exercise are very simple, nothing fancy:
CREATE TABLE categories (
    id           INT8 PRIMARY KEY,
    object_count INT4 NOT NULL DEFAULT 0
);
CREATE TABLE objects (
    id          serial PRIMARY KEY,
    category_id INT8 NOT NULL REFERENCES categories
);
i think names of tables and fields are self explanatory, so no description is necessary.
triggers. first the trigger for “on insert":
CREATE OR REPLACE FUNCTION count_trg_i() RETURNS TRIGGER AS $BODY$
BEGIN
    UPDATE categories SET object_count = object_count + 1 WHERE id = NEW.category_id;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;
CREATE TRIGGER count_trg_i AFTER INSERT ON objects FOR EACH ROW EXECUTE PROCEDURE count_trg_i();
on update:
CREATE OR REPLACE FUNCTION count_trg_u() RETURNS TRIGGER AS $BODY$
BEGIN
    IF NEW.category_id <> OLD.category_id THEN
        UPDATE categories SET object_count = object_count + 1 WHERE id = NEW.category_id;
        UPDATE categories SET object_count = object_count - 1 WHERE id = OLD.category_id;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;
CREATE TRIGGER count_trg_u AFTER UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE count_trg_u();
and the last one, on delete:
CREATE OR REPLACE FUNCTION count_trg_d() RETURNS TRIGGER AS $BODY$
BEGIN
    UPDATE categories SET object_count = object_count - 1 WHERE id = OLD.category_id;
    RETURN OLD;
END;
$BODY$ LANGUAGE plpgsql;
CREATE TRIGGER count_trg_d AFTER DELETE ON objects FOR EACH ROW EXECUTE PROCEDURE count_trg_d();
doesn't look scary, does it?
so, let's test it:
first, i'll insert some categories:
INSERT INTO categories (id) SELECT i FROM generate_series(1, 1000) i;
and now i have 1000 categories. with no objects (yet). so let's create some objects:
INSERT INTO objects (category_id) SELECT 1 + CAST(random() * 999 AS INT4) FROM generate_series(1, 5000) i;
5000 objects ready 🙂
how about some updates?
UPDATE objects SET category_id = 1 + CAST(random() * 999 AS INT4);
and then, let's have some fun with delete:
DELETE FROM objects WHERE random() < 0.8;
around 80% of objects got removed.
now, let's see if the maintained counters are correct. to check the real counts i will use this query:
SELECT c.id, MIN(object_count), COUNT(o.id)
    FROM categories c
    LEFT OUTER JOIN objects o ON c.id = o.category_id
    GROUP BY c.id
    HAVING MIN(object_count) <> COUNT(o.id);
if it returns no rows – the counts in “categories.object_count” are correct.
the problem is that the “on update” trigger as written above can lead to deadlocks. why? the order of updates is not guaranteed, so it's perfectly possible that 2 concurrent updates will lock the same categories' records, but in “opposite” order, like:
- process 1 updates (and locks) category with id = 123;
- process 2 updates (and locks) category with id = 234;
- process 1 waits for lock on category 234
- process 2 waits for lock on category 123
- deadlock
is there any chance to prevent it? sure. let's change the “on update” trigger function to:
CREATE OR REPLACE FUNCTION count_trg_u() RETURNS TRIGGER AS $BODY$
BEGIN
    IF NEW.category_id = OLD.category_id THEN
        RETURN NEW;
    END IF;
    IF NEW.category_id < OLD.category_id THEN
        UPDATE categories SET object_count = object_count + 1 WHERE id = NEW.category_id;
        UPDATE categories SET object_count = object_count - 1 WHERE id = OLD.category_id;
    ELSE
        UPDATE categories SET object_count = object_count - 1 WHERE id = OLD.category_id;
        UPDATE categories SET object_count = object_count + 1 WHERE id = NEW.category_id;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;
this way the locking order is guaranteed – the first lock always goes to the category with the smaller id. so the same 2 processes from the deadlock scenario above will work like this:
- process 1 updates (and locks) category with id = 123;
- process 2 waits for lock on category 123
- process 1 updates (and locks) category with id = 234;
- process 1 commits transaction (and releases the locks)
- process 2 updates category with id = 123;
- process 2 updates category with id = 234;
- process 2 commits transaction
everybody is happy 🙂
of course some of you might say that such a situation (leading to deadlock) is very rare, or “will never happen to me". yeah. right. think again.
deadlock-proof?
could you explain what a deadlock is? ;]
It’s when two things (like updates) happen at the same time, both work with the same data, and each ends up waiting for a lock the other one holds. Everything grinds to a halt until the database kills one of them.
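a minimal way to reproduce a deadlock yourself, using the categories table from the post and two concurrent psql sessions (the ids are arbitrary):

```sql
-- session 1:
BEGIN;
UPDATE categories SET object_count = object_count WHERE id = 123;  -- locks row 123

-- session 2:
BEGIN;
UPDATE categories SET object_count = object_count WHERE id = 234;  -- locks row 234
UPDATE categories SET object_count = object_count WHERE id = 123;  -- blocks, waiting for session 1

-- back in session 1:
UPDATE categories SET object_count = object_count WHERE id = 234;  -- now both wait for each other
-- after deadlock_timeout (1s by default) postgresql detects the cycle
-- and aborts one of the transactions with: ERROR:  deadlock detected
```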
This is good – I never thought about this happening with the triggers listed first.
How about locking the two records at the beginning of the update trigger:
SELECT id FROM categories WHERE id IN (OLD.category_id, NEW.category_id) FOR UPDATE;
I’m curious whether it really locks the two records at the same time, or whether I can still run into a deadlock with that.
Anyway, thank you for the idea.
Row locks are always acquired one at a time in some order, although which order is not necessarily guaranteed and the time interval between locks might be extremely short.
Two SELECT queries like the one in the comment above should always lock the rows in the same order, although that order is based on where the data happens to be stored in the table. Most of the time the planner will generate the same execution plan, although in some cases (admittedly cases which don’t usually occur with primary keys or other unique columns) two apparently similar queries can generate radically different plans.
One gotcha is that the locks will interact with rows that are in the index but not necessarily committed or visible to the current transaction. This is how you can get deadlocks from INSERT queries and indexes with UNIQUE constraints.
One thing of note: while that implementation is correct and safe in that no deadlocks will happen, if you have a lot of traffic on your objects table, you may run into the problem of the trigger actions “stacking”. A viable solution is to have all of the trigger actions insert data into some interim table with the update value for the category table (+/- x) and have a separate process that repeatedly sweeps and aggregates those updates to cut down on the number of updates happening on the categories table.
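the “interim table” idea from the comment above could be sketched like this (table and column names are made up for illustration; the writable-CTE sweep needs PostgreSQL 9.1 or newer):

```sql
-- deltas appended by the triggers instead of updating categories directly;
-- e.g. the "on insert" trigger would just do:
--   INSERT INTO category_count_deltas (category_id, delta) VALUES (NEW.category_id, 1);
CREATE TABLE category_count_deltas (
    category_id INT8 NOT NULL,
    delta       INT4 NOT NULL
);

-- periodic sweep (run from cron or similar): consume all pending deltas
-- and apply them to categories in one aggregated update per category
WITH consumed AS (
    DELETE FROM category_count_deltas RETURNING category_id, delta
)
UPDATE categories c
   SET object_count = c.object_count + d.total
  FROM (SELECT category_id, SUM(delta) AS total
          FROM consumed GROUP BY category_id) d
 WHERE c.id = d.category_id;
```

plain inserts never conflict with each other, so the triggers can't deadlock, and the categories table gets at most one update per category per sweep instead of one per object change.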