Word of warning: this blogpost is about thing related to Bash (well, maybe other shells too, didn't really test), but since I found it while doing Pg work, and it might bite someone else doing Pg related work, I decided to add it to “postgresql" tag.
So, due to some work I had to do, I needed a quick, repeatable way to setup some Pg instances, replication between them, and some data loader. All very simple, no real problems. At least that's what I thought…
Let's start with two simple bash scripts:
setup.sh:
#!/usr/bin/env bash pg_ctl -D test/master start
and generate-load.sh:
#!/usr/bin/env bash i=0 j=0 while true do seq 1 10 | sed 's/.*/select * from increment();/' | psql -qAtX -p 5432 -U test -d postgres echo -n '#' i=$(( i + 1 )) if (( i == 100 )) then j=$(( j + 1 )) echo " : $j" i=0 fi sleep 0.1 done
First script is just one line, that starts Pg, but the other looks more complex. What is does is very simple – it runs psql, calls increment() function in postgres database (some helper function, irrelevant , could be select 1) – 10 times (i.e. 10 times calls the function), then sleeps 0.1s, and repeats.
Every repeat it prints “#" character and every 100 repeats it moves cursor to new line.
Very simple.
When I start it manually, from shell, it looks like this:
=$ ./setup.sh server starting =$ ./generate-load.sh ################################^C =$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 19409 0.0 0.0 90532 2268 ? S May15 0:01 sshd: test@pts/4 test 19410 0.0 0.0 25084 5264 pts/4 Ss May15 0:00 \_ -bash test 12152 0.0 0.0 19668 1308 pts/4 R+ 00:14 0:00 \_ ps uxf test 11925 0.1 0.1 173492 12940 pts/4 S 00:13 0:00 /home/pgdba/work/bin/postgres -D test/master test 11929 0.0 0.0 27016 804 ? Ss 00:13 0:00 \_ postgres: logger process test 11931 0.0 0.0 173492 976 ? Ss 00:13 0:00 \_ postgres: checkpointer process test 11932 0.0 0.0 173492 984 ? Ss 00:13 0:00 \_ postgres: writer process test 11933 0.0 0.0 173492 976 ? Ss 00:13 0:00 \_ postgres: wal writer process test 11934 0.0 0.0 174308 2180 ? Ss 00:13 0:00 \_ postgres: autovacuum launcher process test 11935 0.0 0.0 29112 880 ? Ss 00:13 0:00 \_ postgres: archiver process last was 000000010000000000000003 test 11936 0.0 0.0 29248 1076 ? Ss 00:13 0:00 \_ postgres: stats collector process
All fine. But I'm lazy. So at the end of setup.sh (in my real script, it was longer, but that's irrelevant), I added: ./generate-load.sh, and to make it wait properly, added also -w flag to pg_ctl. So finally setup.sh looks like:
#!/usr/bin/env bash pg_ctl -D test/master -w START ./generate-LOAD.sh
Theoretically all is fine, there shouldn't be any problems. Right?
Well:
=$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 19409 0.0 0.0 90532 2268 ? S May15 0:01 sshd: test@pts/4 test 19410 0.0 0.0 25088 5268 pts/4 Ss May15 0:00 \_ -bash test 14129 0.0 0.0 19668 1304 pts/4 R+ 00:16 0:00 \_ ps uxf =$ ./setup.sh waiting for server to start.... done server started ##########################^C =$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 19409 0.0 0.0 90532 2268 ? S May15 0:01 sshd: test@pts/4 test 19410 0.0 0.0 25088 5268 pts/4 Ss May15 0:00 \_ -bash test 14291 0.0 0.0 19668 1304 pts/4 R+ 00:16 0:00 \_ ps uxf
Pg got killed when I pressed ctrl-c when executing generate-load.sh!
I tried all kinds of “stuff". Removed -w (and added sleep to prevent generate-load.sh from starting too soon). Added /dev/null redirections for all standard streams. Added nohup. Added “&" at the end.
All for nothing.
When script runs Pg, and later on is CTRL-C'ed – Pg gets stopped too.
That was major WTF.
It looked like SIGINT (ctrl-c) got propagated to already-started, happily running background processes. But why?
RhodiumToad on irc found out why exactly, and what to do to prevent it from happening.
Before I will go into fixing it, let's try to simplify the case. Let's write new script – test.sh, that will do something “like" running pg_ctl:
#!/usr/bin/env bash perl -e 'fork and exit; sleep(1000000)' & echo "press ctrl-c, run ps, and notice lack of perl script" sleep 1000000
When I run it, I get:
=$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 15064 0.0 0.0 90380 2232 ? S 00:22 0:00 sshd: test@pts/2 test 15065 0.2 0.0 24880 5056 pts/2 Ss 00:22 0:00 \_ -bash test 15323 0.0 0.0 19668 1308 pts/2 R+ 00:23 0:00 \_ ps uxf =$ ./test.sh press ctrl-c, run ps, and notice lack of perl script ^C =$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 15064 0.0 0.0 90380 2232 ? S 00:22 0:00 sshd: test@pts/2 test 15065 0.2 0.0 24880 5056 pts/2 Ss 00:22 0:00 \_ -bash test 15334 0.0 0.0 19668 1304 pts/2 R+ 00:23 0:00 \_ ps uxf test 15330 0.0 0.0 17348 276 pts/2 S 00:23 0:00 perl -e fork and exit; sleep(1000000)
WTF?! Why is it there? Clearly bash left it to its work, but why?
Well, Bash didn't leave it alone. It's just that Perl, when started in background ( & at the end ) simply ignores SIGINT (which is what is generated when ctrl-c is pressed). But postgres doesn't ignore it – it exits. Of course, I could modify PostgreSQL sources to make it invulnerable, but somehow this didn't strike me as particularly good idea.
Instead, decided to make my perl “stand-in for PostgreSQL" die on sigint, which is luckily simple:
$ cat test.sh #!/usr/bin/env bash perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT\n" }; sleep(1000000)' & echo "press ctrl-c, run ps, and notice lack of perl script" sleep 1000000 =$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 15064 0.0 0.0 90380 2232 ? S 00:22 0:00 sshd: test@pts/2 test 15065 0.0 0.0 24880 5056 pts/2 Ss 00:22 0:00 \_ -bash test 15575 0.0 0.0 19668 1304 pts/2 R+ 00:26 0:00 \_ ps uxf =$ ./test.sh press ctrl-c, run ps, and notice lack of perl script ^CSIGINT =$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 15064 0.0 0.0 90380 2232 ? S 00:22 0:00 sshd: test@pts/2 test 15065 0.0 0.0 24880 5056 pts/2 Ss 00:22 0:00 \_ -bash test 15624 0.0 0.0 19668 1304 pts/2 R+ 00:27 0:00 \_ ps uxf
Now – perl is gone, and it even printed “SIGINT" as requested.
So – clearly Bash is propagating those sigints to previously started processes.
Simple solution is to use so called “monitor" mode in Bash. This is also what interactive shell uses by default – it enables normal job control.
When I'll change test.sh to:
#!/usr/bin/env bash set -o monitor perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT\n" }; sleep(1000000)' & echo "press ctrl-c, run ps, and notice lack of perl script" sleep 1000000
When I run it, I get:
=$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 15064 0.0 0.0 90380 2232 ? S 00:22 0:00 sshd: test@pts/2 test 15065 0.0 0.0 24880 5056 pts/2 Ss 00:22 0:00 \_ -bash test 15784 0.0 0.0 19668 1304 pts/2 R+ 00:29 0:00 \_ ps uxf =$ ./test.sh press ctrl-c, run ps, and notice lack of perl script ^C =$ ps uxf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND test 15064 0.0 0.0 90380 2232 ? S 00:22 0:00 sshd: test@pts/2 test 15065 0.0 0.0 24880 5056 pts/2 Ss 00:22 0:00 \_ -bash test 15796 0.0 0.0 19668 1304 pts/2 R+ 00:29 0:00 \_ ps uxf test 15792 0.0 0.0 17348 480 pts/2 S 00:29 0:00 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT\n" }; sleep(1000000)
Nice. Better. But why exactly did it help? What has changed? Let's use ps to view some details of processes from within the script:
#!/usr/bin/env bash perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)' & { set -o monitor perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)' } & sleep 1 ps xf o pid,ppid,sid,pgrp,tt,cmd echo "press ctrl-c" sleep 1000000
1st perl (SIGINT #1) is called in normal Bash environment.
The second one though is ran separately, with monitor mode enabled.
So, let's see how it works:
=$ ps xf o pid,ppid,sid,pgrp,tt,cmd PID PPID SID PGRP TT CMD 15064 14793 14793 14793 ? sshd: test@pts/2 15065 15064 15065 15065 pts/2 \_ -bash 16547 15065 15065 16547 pts/2 \_ ps xf o pid,ppid,sid,pgrp,tt,cmd =$ ./test.sh PID PPID SID PGRP TT CMD 15064 14793 14793 14793 ? sshd: test@pts/2 15065 15064 15065 15065 pts/2 \_ -bash 16552 15065 15065 16552 pts/2 \_ bash ./test.sh 16559 16552 15065 16552 pts/2 \_ ps xf o pid,ppid,sid,pgrp,tt,cmd 16558 1 15065 16556 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10) 16557 1 15065 16552 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10) press ctrl-c ^CSIGINT #1 =$ ps xf o pid,ppid,sid,pgrp,tt,cmd PID PPID SID PGRP TT CMD 15064 14793 14793 14793 ? sshd: test@pts/2 15065 15064 15065 15065 pts/2 \_ -bash 16564 15065 15065 16564 pts/2 \_ ps xf o pid,ppid,sid,pgrp,tt,cmd 16558 1 15065 16556 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)
Please notice the values in PGRP column (process group id) for both Perls.
The one ran without monitor mode has the same PGRP as bash script itself. On the other hand – the one that was started within monitor mode has different pgrp (and it's not artefact of using {}s, checked it).
Looks like Bash is sending SIGINT to processes it has within the same PGRP.
This theory looks great until we'll do one slight change:
#!/usr/bin/env bash perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)' & set -o monitor perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)' & sleep 1 ps xf o pid,ppid,sid,pgrp,tt,cmd echo "press ctrl-c" sleep 1000000
And running it shows:
=$ ps xf o pid,ppid,sid,pgrp,tt,cmd PID PPID SID PGRP TT CMD 15064 14793 14793 14793 ? sshd: test@pts/2 15065 15064 15065 15065 pts/2 \_ -bash 17010 15065 15065 17010 pts/2 \_ ps xf o pid,ppid,sid,pgrp,tt,cmd =$ ./test.sh PID PPID SID PGRP TT CMD 15064 14793 14793 14793 ? sshd: test@pts/2 15065 15064 15065 15065 pts/2 \_ -bash 17015 15065 15065 17015 pts/2 \_ bash ./test.sh 17021 17015 15065 17021 pts/2 \_ ps xf o pid,ppid,sid,pgrp,tt,cmd 17020 1 15065 17017 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10) 17019 1 15065 17015 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10) press ctrl-c ^C =$ ps xf o pid,ppid,sid,pgrp,tt,cmd PID PPID SID PGRP TT CMD 15064 14793 14793 14793 ? sshd: test@pts/2 15065 15064 15065 15065 pts/2 \_ -bash 17031 15065 15065 17031 pts/2 \_ ps xf o pid,ppid,sid,pgrp,tt,cmd 17020 1 15065 17017 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10) 17019 1 15065 17015 pts/2 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)
Apparently globally enabled monitor mode stops Bash from sending SIGINT even to processes in it's group.
To be honest, I don't fully understand what's going on, and why monitor mode (apparently) disables all SIGINT propagation, but the final thought is: if you want to start Pg (or anything else that dies on SIGINT) from a script, and this script can be later on killed by ctrl-c, make sure you use monitor-mode.
This can be used also, as I showed, for single command:
... { set -o monitor pg_ctl ... } & ...
Hope it helps.
You’re looking for the ‘setsid’ command. On my machine this is in the ‘util-linux’ package, so I guess it might not exist on other OSs. It’s a pretty simple wrapper around the POSIX setsid() API, though, which is available by default in both Perl and Python, so it shouldn’t be too hard to write such a script yourself.
@Screwtape:
portability is a problem. And sure, I could write simple “daemonizer” script in Perl, but it just seems weird that I had to go through all the problem with starting daemon from *shell* – which supposedly exists to run other programs 🙂
Anyway, I got my script fixed, and setsid is definitely a possibility if one is lucky enough to be sure that all he needs is Linux.
nohup? Or simply disown (at least in zsh).
@Gotar:
did you check it?
I just did, and both “nohup” and disown did *not* help – the process is killed.
I really did test all:
– redirections of std*
– nohup
– disown
and all of these keep killing (sigint) the process.
Tutaj chyba jest wyjaśnienie:
http://www.vidarholen.net/contents/blog/?p=34
Hello,
I use a custom script to start postgres relying on a bash specific internal function close to nohup, but which remove the command from the job table: disown.
I found this while messing with “pg_ctl -w” with no success to be able to know exactly if postgres succed to start or not. Last time I checked, this was still a running discussion on pgsql-hackers.
Main part of my script about this is:
postmaster -D “$PGDATA” >> “$PGLOGS/postgresql-$1.log” 2>&1 &
PGPID=$!
disown $PGPID
…then some more code related to linux to wait until I know it start or failed. I can share my script if needed.
Cheers,
@ioguix:
it doesn’t help in any way with the problem I have.
Do what I was describing: start pg, in any way you want, and then put in the same script – sleep 10000. and ctrl-c the script execution while sleeping.
you will notice that the pg will get killed too, just as I described. disown does not help with it, and simplest possible test shows. like this one:
Oh, I kept the page opened since a bunch of time and didn’t noticed the other answers.
I will make some further tests with disown, but according to the bash manual and my experience with this script (which has some years now), it should certainly not receive the SIGINT and stay up.
The only difference with your use case is that I use this code in a bash function I source-ed from my .bashrc, not from a script. But when I close a terminal where I started one or more pgsql cluster, they are definitely not killed and I can see the processes re-ordered with a “ps fax”.
@ioguix:
I KNOW.
That’s the whole point. If the script that started pg has ended, or you’re running in interactive shell, all works fine.
But if you are using a script (i.e. not interactive shell), and you do stuff after pg has been started, and you ctrl-c the script – it kills pg too.
Sorry – I just don’t know how to be clearer on this – all the comments so far suggest testing disown/nohup/redirections – where I thought that I was clear that none of these does work. And even 2 line shell script can prove that SIGINT is propagated regardless of nohup/disown/stream redirections.
Wow, you’re right.
So I guess there’s a difference between calling disown from the script itself or from a function.
My bad, thanks for sharing this information!
The difference is on whether you’re using interactive shell or a script. That’s why “set -o monitor” helps – because it makes bash “assume” it’s interactive even when it’s not.
Yes, it makes sense to me now.
Thank you again 😉