Waiting for 9.3 – Dramatically reduce System V shared memory consumption.

On 28th of June, Robert Haas committed a patch:

Dramatically reduce System V shared memory consumption.
 
Except when compiling with EXEC_BACKEND, we'll now allocate only a tiny
amount of System V shared memory (as an interlock to protect the data
directory) and allocate the rest as anonymous shared memory via mmap.
This will hopefully spare most users the hassle of adjusting operating
system parameters before being able to start PostgreSQL with a
reasonable value for shared_buffers.
 
There are a bunch of documentation updates needed here, and we might
need to adjust some of the HINT messages related to shared memory as
well.  But it's not 100% clear how portable this is, so before we
write the documentation, let's give it a spin on the buildfarm and
see what turns red.

This patch doesn't add any new functionality, but removes one thing that had caused some issues.

As you perhaps know, PostgreSQL has so-called “shared_buffers”. This is where it stores various data, most importantly copies of data pages (8kB blocks).

The problem with shared_buffers is that you usually start by setting it to something like 20%-25% of available RAM, which on current multi-gigabyte servers is a non-trivial amount.

And most of the systems I've seen have very conservative limits on how much shared memory there can be. For example – my desktop Ubuntu 12.04 has the limit set to:

=$ cat /proc/sys/kernel/shmmax
33554432

A “whopping” 32MB.

This means that when you configure PostgreSQL to put the machine's memory to good use, i.e. for shared_buffers, you have to configure your kernel too.

And if you forget, or something fails to re-configure it on reboot – PostgreSQL will not start, showing errors like:

2012-07-12 12:25:13 CEST [] [7510]: [1-1] user=,db=,e=XX000: FATAL:  could not create shared memory segment: Invalid argument
2012-07-12 12:25:13 CEST [] [7510]: [2-1] user=,db=,e=XX000: DETAIL:  Failed system call was shmget(key=5910001, size=3318874112, 03600).
2012-07-12 12:25:13 CEST [] [7510]: [3-1] user=,db=,e=XX000: HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter.  You can either reduce the request size or reconfigure the kernel with larger SHMMAX.  To reduce the request size (currently 3318874112 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
        If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
        The PostgreSQL documentation contains more information about shared memory configuration.

The error message is pretty helpful, so fixing the problem is usually not hard. But why have the error in the first place, when you can skip it altogether?
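
To put numbers on that message, here is a small sketch that compares a requested segment size with the kernel's SHMMAX. It assumes Linux, where the limit is exposed in /proc/sys/kernel/shmmax; the function names are mine and purely illustrative, not anything PostgreSQL provides.

```python
from pathlib import Path

# Linux-specific location of the limit; other systems expose it differently.
SHMMAX_PATH = Path("/proc/sys/kernel/shmmax")


def read_shmmax(path=SHMMAX_PATH):
    """Return the kernel's SHMMAX in bytes, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def shmmax_shortfall(request_bytes, shmmax_bytes):
    """Return by how many bytes a System V request exceeds SHMMAX (0 if it fits)."""
    return max(0, request_bytes - shmmax_bytes)


if __name__ == "__main__":
    request = 3318874112  # size taken from the shmget() call in the log above
    limit = read_shmmax()
    if limit is None:
        print("could not read SHMMAX on this system")
    else:
        print(f"SHMMAX={limit}, shortfall={shmmax_shortfall(request, limit)} bytes")
```

With the 32MB limit from my desktop, the 3318874112-byte request is over the limit by 3285319680 bytes, which is why shmget() refuses it.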

Robert's commit does exactly that. Instead of putting everything in so-called “System V shared memory” (which is subject to the SHMMAX limit), PostgreSQL now keeps only a tiny System V segment as an interlock and allocates the rest as anonymous shared memory via mmap.
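
If you have never used anonymous shared memory, the idea can be sketched in a few lines. This is not PostgreSQL's code (that is C), just a minimal Python illustration, assuming a Unix system with fork(): a mapping created with fileno -1 is anonymous, Python's default flags on Unix are MAP_SHARED, so a child process forked afterwards sees the same memory as its parent. The function name is made up for this example.

```python
import mmap
import os


def child_writes_parent_reads(message, size=mmap.PAGESIZE):
    """Create an anonymous shared mapping, let a forked child write to it,
    and return what the parent then reads back."""
    if len(message) > size:
        raise ValueError("message does not fit in the mapping")

    # fileno=-1 requests anonymous memory; on Unix the default is MAP_SHARED,
    # so the mapping is shared with processes forked after this point.
    shm = mmap.mmap(-1, size)

    pid = os.fork()
    if pid == 0:
        # Child: write into the shared region and exit immediately.
        shm[:len(message)] = message
        os._exit(0)

    # Parent: wait for the child, then read what it wrote.
    os.waitpid(pid, 0)
    data = bytes(shm[:len(message)])
    shm.close()
    return data


if __name__ == "__main__":
    print(child_writes_parent_reads(b"written by the child"))
```

No System V segment is involved here, so SHMMAX never comes into play. This also hints at why the commit message excludes EXEC_BACKEND builds: a mapping like this is only inherited through fork().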

Thanks to this, on the same machine, with the same 32MB limit for SHMMAX, I can start PostgreSQL 9.3, with shared_buffers = 3GB, and it works:

=$ ps -u pgdba f u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pgdba     7993  0.0  0.7 3268388 86136 ?       S    12:28   0:00 /home/pgdba/work/bin/postgres
pgdba     7997  0.0  0.0  24792   560 ?        Ss   12:28   0:00  \_ postgres: logger process
pgdba     7999  0.0  0.0 3269928  984 ?        Ss   12:28   0:00  \_ postgres: checkpointer process
pgdba     8000  0.0  0.1 3269928 15772 ?       Ss   12:28   0:00  \_ postgres: writer process
pgdba     8001  0.0  0.0 3269928  984 ?        Ss   12:28   0:00  \_ postgres: wal writer process
pgdba     8002  0.0  0.0 3270932 2456 ?        Ss   12:28   0:00  \_ postgres: autovacuum launcher process
pgdba     8003  0.0  0.0  26888   632 ?        Ss   12:28   0:00  \_ postgres: archiver process
pgdba     8004  0.0  0.0  27184  1240 ?        Ss   12:28   0:00  \_ postgres: stats collector process

One less thing to worry about, and one less reason why starting Pg might fail. Of course it can still fail if you configure shared_buffers larger than your actual memory size, but that's much less likely.

8 thoughts on “Waiting for 9.3 – Dramatically reduce System V shared memory consumption.”

Colin 't Hart says:

2012-07-12 at 14:20

Has there been any fallout in the buildfarm as a result of this change?

depesz says:

2012-07-12 at 14:50

@Colin: as far as I know – no.

Sean Chittenden says:

2012-07-12 at 19:23

When this feature was committed just after the 9.2 branch was cut, a part of me cried. 9.3 can’t get here any faster. This change makes it significantly easier to administrate hosts that run many clusters on the same server.

Andreas says:

2012-07-13 at 12:36

@Sean: The patch is quite small so you can backport the relevant commits quite easily.

Scott C. says:

2012-08-31 at 05:15

Does this make it impossible to use huge pages in the future?

For Oracle, huge pages have a big performance impact once its SGA goes past ~ 8GB. I would imagine that for Postgres shared_buffers the same would apply. I noticed that Tom Lane tried to use huge pages a while back with 1GB and saw no benefit, but if Oracle is any guide, that is not large enough to see benefit from huge pages.

depesz says:

2012-08-31 at 10:28

@Scott:
Such a question should be asked on pgsql-hackers – I am far from a specialist when it comes to the internals and how the inner elements of Pg work.

Craig Ringer says:

2012-10-23 at 10:29

Since most platforms also ulimit() the amount of mlock()able memory Pg presumably isn’t pinning shared_buffers in RAM, which is a really nice effect of this change.

This should make running multiple clusters on a single host a lot nicer than the current pain with clusters fighting over shm and wasting resources with all that pinned shared memory.

David Gould says:

2013-02-09 at 10:58

Huge pages make a big difference with high connection counts. If you are trying to map a 16GB buffer cache into 2000 processes using 4K pages it eats up to 64GB for page tables.
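
The arithmetic behind that figure can be checked with a rough sketch. The assumptions are mine, not from the comment: 8-byte page table entries (as on x86-64), only the lowest page table level counted, and every process touching every page of the buffer cache. The function name is illustrative.

```python
def page_table_bytes(shared_bytes, page_size, processes, pte_size=8):
    """Estimate total leaf page table memory when every process maps
    the whole shared region (assumes pte_size bytes per mapped page)."""
    pages = -(-shared_bytes // page_size)  # ceiling division
    return pages * pte_size * processes


if __name__ == "__main__":
    GiB = 2**30
    small = page_table_bytes(16 * GiB, 4 * 2**10, 2000)  # 4K pages
    huge = page_table_bytes(16 * GiB, 2 * 2**20, 2000)   # 2MB huge pages
    print(f"4K pages:  {small / GiB:.1f} GiB of page tables")
    print(f"2MB pages: {huge / 2**20:.1f} MiB of page tables")
```

Under these assumptions, 4K pages cost 32MB of page tables per process, roughly 62.5 GiB across 2000 processes, which matches the "up to 64GB" estimate. The same cache mapped with 2MB huge pages needs about 125 MiB in total.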
