just recently we got another array for out main production database. this means – we will be able to add new tablespace, thus making everything go faster.
in theory – it's nice. but which tables to move to the other?
the basic assumption is simple – index on table should not be on the same tablespace as the table itself. that's easy. but – should we really put all tables on one tablespace, and all indexes on another?
we decided that the important things that should be “boosted" are seeks and writes. sequential reads are (in our situation) more or less irrelevant.
read on to check how we split the load.
but how do i know how many seeks and writes i have for every table?
luckily – postgres tells it to me:
# SELECT relname, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del FROM pg_stat_user_tables ORDER BY relname; relname | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del ----------------------------+---------------+-----------+-----------+----------- browse | 0 | 0 | 0 | 0 LAST | 249 | 248 | 0 | 247 mailaddresses | 0 | 0 | 0 | 0 messages | [NULL] | 17577 | 0 | 37 presence | 250 | 1844 | 0 | 1843 presence_history | 0 | 3687 | 0 | 0 privacy | 0 | 0 | 0 | 0 private | 0 | 0 | 0 | 0 roster | 1231 | 295 | 0 | 294 storedsubscriptionrequests | 44 | 35 | 0 | 30 users | 6158 | 1 | 0 | 0 vcard | 821 | 5 | 0 | 4 (12 ROWS)
this is example output from my jabber database. it shows that the most seeks (which are related to idx_tup_fetch) are in users table. as for writes – i have to sumammarize inserted, updated and deleted rows, but it's not really complicated:
# SELECT relname, idx_tup_fetch AS seeks, n_tup_ins + n_tup_upd + n_tup_del AS writes FROM pg_stat_user_tables ORDER BY relname; relname | seeks | writes ----------------------------+--------+-------- browse | 0 | 0 LAST | 249 | 495 mailaddresses | 0 | 0 messages | [NULL] | 17614 presence | 250 | 3687 presence_history | 0 | 3687 privacy | 0 | 0 private | 0 | 0 roster | 1231 | 589 storedsubscriptionrequests | 44 | 65 users | 6158 | 1 vcard | 821 | 9 (12 ROWS)
now, the best option to split all these tables would be to do so in a way that summarized seeks on both tablespaces will be the same (it would mean that we had splitthe seeks load exactly 50/50). same goes for writes.
for example:
let's assume that tablespace “primary" will get tables browse, last and users. while new, secondary tablespace will get rest of them.
primary load would be:
- seeks = 0 (browse) + 249 (last) + 6158 (users) = 6407.
- writes = 0 + 495 + 1 = 496
secondary load:
- seeks = 0 (messages) + 250 (persence) + 1231 (roster) + 44 (storedsubscriptionrequests) + 821 (vcard) = 2346
- writes = 17614 + 3687 + 3687 + 589 + 65 + 9 = 25651
so, let's see how the load is balanced. seeks: 6407 vs. 2346. which means smaller is around 36.6% of bigger.
in case of writes the lack of balance is even more visible: 496 vs. 25651, so the smaller is only 1.9%.
now – in the best situation – there would be no smaller, so both tablespaces would have “100%" in terms of seeks and writes. combined – it would show 200% (seeks + writes).
based on this assumption i wrote a program which tries to find the best layout.
it connects to database, gets list of tables, their respective seeks and writes numbers, and tries to find the best layout.
output of this program looks like this:
tablespace: primary ||| tablespace: secondary ||| score --------------------------------------------+++--------------------------------------------+++---------- tables | idx_tup_fetch | writes ||| tables | idx_tup_fetch | writes ||| score --------+-----------------+-----------------+++--------+-----------------+-----------------+++---------- ------: 132 | 24 100 792 839 | 30 964 283 ||| 148 | 12 386 640 794 | 17 829 644 ||| 1.089765 1000: 132 | 20 153 660 201 | 24 973 620 ||| 148 | 16 333 773 432 | 23 820 307 ||| 1.764281 2000: 132 | 20 153 659 880 | 24 973 597 ||| 148 | 16 333 773 753 | 23 820 330 ||| 1.764282 3000: 132 | 20 153 659 880 | 24 973 488 ||| 148 | 16 333 773 753 | 23 820 439 ||| 1.764291 4000: 132 | 20 153 660 109 | 24 973 485 ||| 148 | 16 333 773 524 | 23 820 442 ||| 1.764291 5000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 6000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 7000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 8000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 9000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 10000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 11000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 12000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 13000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 14000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 15000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 16000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 17000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 18000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 19000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 20000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 21000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291 tablespace: primary ||| tablespace: secondary ||| score --------------------------------------------+++--------------------------------------------+++---------- tables | idx_tup_fetch | writes ||| tables | idx_tup_fetch | writes ||| score --------+-----------------+-----------------+++--------+-----------------+-----------------+++---------- ------: 133 | 12 947 666 336 | 39 956 104 ||| 147 | 23 539 824 882 | 8 837 991 ||| 0.771225 1000: 133 | 18 158 881 443 | 24 397 048 ||| 147 | 18 328 609 775 | 24 397 047 ||| 1.990740 2000: 133 | 18 158 881 443 | 24 397 048 ||| 147 | 18 328 609 775 | 24 397 047 ||| 1.990740
what it all means?
first the program assigns tables to tablespaces (virtually, of course!) totally randomly. then it calculates achieved “balance".
this is this line:
------: 132 | 24 100 792 839 | 30 964 283 ||| 148 | 12 386 640 794 | 17 829 644 ||| 1.089765
it means that after randomized layout, it did put 132 tables to primary tablespace, and 148 to secondary.
summarized seeks are 24 100 792 839 in case of primary, and 12 386 640 794 for secondary. for writes the numbers were 30 964 283 vs. 17 829 644.
number 1.089… is “balance" score which is calculated like i showed above but not in percents – so score 1.089, means 108.9% (this is summarized for 2 factors).
now, my program tries to randomly swap tables between tablespaces. if new (after swap) score is higher than previous – the move is made. if it is not higher – it is rollbacked (again, virtually).
every 1000 iterations program prints current score:
1000: 132 | 20 153 660 201 | 24 973 620 ||| 148 | 16 333 773 432 | 23 820 307 ||| 1.764281
as this can be seen, after 1000 iterations it did found a nice layout – with score 1.76, and really close numbers of seeks and writes.
after printing out results, program checks if it didn't reach the target for current check – in this case, target was 1.99. if the target is not reached, it makes next swaps.
if 21000 swaps happened, and the target is still not reached, program starts from scratch – gets new data from database, randomizes layout, and starts random swaps.
when target score is reached – program stops.
since every time new line is printed, program outputs current layout (in form of ready sql script) – after finishing the program i have nice sql which can be used to put the tables to their respective tablespaces (assuming they are named primary and secondary).
simple. yet, doing it “by hand" for 300 tables is not something i would call “fun".
additionally the program can be told not to put 2 tables on the same tablespace. this is useful for example if you have triggers that on every insert/update/delete to table “a", automatically update table “b". in such a case putting them (tables “a" and “b") on separate tablespaces will be a good idea.
now, you can get the program from my GitHub (it is named: find.best.tablespace.split.pl).
as usual there is no manual, and the code is ugly, but it should be easily customizable.
important parts:
- database connection info is in get_data_from_db() function (last one in code)
- list of which tables should not be on the same tablespace as another is at the very beginning, and is stored in %cant_be_together (hash, dictionary).
- sql script with commands to move tables to tablespaces is saved as tablespace.split.out in current directory
if you have any questions about it – feel free to mail or comment on blog 🙂
Would simulated annealing (or a similar algorithm) help you find a solution without needing to restart?
i.e. add one function, and change one line to:
if ($new_result
if ($new_result is less than $current_result + slowly_reducing_temperature()) {