
The signal() system call is inconsistent, unreliable and deprecated. It has been replaced by sigaction(), which is standardised, robust but also more complicated.
sigaction() is in a group of system and standard library calls that use or manipulate sets of signals ("sigsets"):
- sigaction(): set or clear signal handler
- sigprocmask(), sigsuspend(): block or unblock signals
- sigaddset(), sigdelset(), ...: manipulate sigsets
We refer to this group of system and standard library calls as "the sigset ecosystem".
I learned about the sigset ecosystem during a recent project that launched a lot of child processes and I thought that other people, particularly those of the C/pre-Linux generation like myself, might benefit from what I learned.
This article covers:
- signal()
There are a lot of signals (see signal(7) or run kill -l for a list) but this article is concerned with handling only a few of them:
- SIGINT: when the controlling terminal detects that the interrupt key (usually Ctrl-C) has been pressed, the kernel sends this signal to the foreground process
- SIGTERM: the default signal sent by the kill command; it should be interpreted by the receiving process as "please commit suicide but clean up before you do"
- SIGCHLD: when a child process exits (for whatever reason), then the kernel sends this signal to the parent process
- SIGALRM: when a process calls alarm( _secs_ ) then the kernel sends this signal to the same process secs seconds later
- SIGUSR1: one of two signals available to elicit user-defined behaviour in a process
Regarding my coding style:
Many of the example C programs below contain empty comment blocks like this:
/*
* Type and struct definitions
*/
/*
* Global variables
*/
As well as being orderly, these provide diff or meld (or whatever
diffing tool you might use) with more
synchronisation points so these tools can do a better job of aligning the
contents of their file arguments, which means that you can more easily
identify the differences between one source file in this article and the
next.
In order to keep the example source code as uncluttered as possible: return codes are rarely checked; there is no protection against buffer overflows; type casts are rarely made explicit; functions and variables that could be static are generally global; signal handlers handle only one signal type (even though they could handle multiple types).
Shell sessions for compiling and running programs show: my shell prompt
(lagane$), input in bold, output in roman (i.e. not bold); additional
newlines may have been added in order to make output more readable.
Finally, if you see a mistake in this article, then please let me know. Thanks!
### signal()
The Linux signal(2) man page states:
In the original UNIX systems, when a handler that was established using signal() was invoked by the delivery of a signal, the disposition of the signal would be reset to SIG_DFL, and the system did not block delivery of further instances of the signal. ...
System V also provides these semantics for signal(). This was bad because the signal might be delivered again before the handler had a chance to reestablish itself. Furthermore, rapid deliveries of the same signal could result in recursive invocations of the handler.
BSD improved on this situation, but unfortunately also changed the semantics of the existing signal() interface while doing so. On BSD, when a signal handler is invoked, the signal disposition is not reset, and further instances of the signal are blocked from being delivered while the handler is executing. Furthermore, certain blocking system calls are automatically restarted if interrupted by a signal handler (see signal(7)). The BSD semantics are equivalent to calling sigaction(2) with the following flags:
sa.sa_flags = SA_RESTART;
... The [Linux] kernel's signal() system call provides System V semantics.
... the [Linux] signal() wrapper function [i.e. not the kernel's signal() system call] does not invoke the kernel system call. Instead, it [supplies] BSD semantics.
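To make the quoted equivalence concrete, here is a minimal sketch of establishing a handler with sigaction() and SA_RESTART, which is what the man page says the BSD-style signal() amounts to. The names usr1_handler, install_bsd_style_handler and raise_twice_and_count are mine, chosen for illustration; they do not come from the example programs in this article.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Counts handler invocations; sig_atomic_t is safe to update from a handler. */
static volatile sig_atomic_t usr1_count = 0;

static void usr1_handler(int signum)
{
    (void) signum;
    usr1_count++;
}

/* Install usr1_handler for SIGUSR1 with BSD semantics; returns 0 on success, -1 on error. */
int install_bsd_style_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = usr1_handler;
    sigemptyset(&sa.sa_mask);    /* block no additional signals while the handler runs */
    sa.sa_flags = SA_RESTART;    /* restart interrupted calls, as the BSD signal() does */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Send SIGUSR1 to ourselves twice and report how many times the handler ran. */
int raise_twice_and_count(void)
{
    usr1_count = 0;
    raise(SIGUSR1);
    raise(SIGUSR1);
    return (int) usr1_count;
}
```

If the disposition were reset to SIG_DFL after the first delivery (the System V behaviour described above), the second raise() would terminate the process instead of incrementing the counter.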
If a system call is "automatically restarted", it effectively becomes a wrapper to the real system call like this:
int system_call_x(...)
{
    int rc;

    /* keep retrying while the real call fails only because a signal interrupted it */
    while ((rc = the_real_system_call_x(...)) == ERROR && errno == EINTR)
        ;
    return rc;
}
The points I wanted to illustrate with that pseudocode are:
- nanosleep())
Regarding which blocking system calls behave like this and when, the Linux signal(7) man page states:
If a signal handler is invoked while a system call or library function call is blocked, then either:
- the call is automatically restarted after the signal handler returns; or
- the call fails with the error EINTR.
Which of these two behaviors occurs depends on the interface and whether or not the signal handler was established using the SA_RESTART flag ... The details vary across UNIX systems; below, the details for Linux.
If a blocked call to one of the following interfaces is interrupted by a signal handler, then the call will be automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise the call will fail with the error EINTR:
- read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices.
Related to that last point:
- accept()
- sleep() or its underlying system call nanosleep()
- fgets() but that is no longer the case
Before we launch into a series of examples, check that you can download a test program, compile it and run it.
lagane$ gcc -o test funcs.c test.c
lagane$ ./test
0.000000: main: forty-two in digits is 42
0.000183: main: this messages goes to stderr
lagane$
In order to demonstrate a program being interrupted, we need that program to be doing something that can be interrupted. That something should:
- fgets() is not suitable)
- system() is not suitable)
sleep() or fork()+wait() are good options.
lagane$ gcc -o sleep funcs.c sleep.c
lagane$ ./sleep
0.000000: main: setting up signal handlers ...
0.000371: main: before calling sleep()
^C
3.932070: sigint_handler: received SIGINT
3.932087: main: after calling sleep()
3.932108: main: sleep() returned early due to: Interrupted system call
3.932113: main: cleaning up and exiting ...
lagane$ ./sleep &
[1] 1663
lagane$
0.000000: main: setting up signal handlers ...
0.000274: main: before calling sleep()
lagane$
lagane$ kill %1
lagane$
7.293202: sigterm_handler: received SIGTERM
7.293218: main: after calling sleep()
7.293235: main: sleep() returned early due to: Interrupted system call
7.293238: main: cleaning up and exiting ...
[1]+ Done ./sleep
lagane$
The points I wanted to illustrate with that example are:
- SIGINT; the kill command sends SIGTERM by default
- sigint_handler() and sigterm_handler() were called and then main() continued and eventually main() exited
- sigint_handler() or sigterm_handler() or in main() after inspecting a global variable that the signal handlers would set to communicate the need for this task to be performed)
Shortly we will look at a program that waits for one of two different events: a timeout expiring or a child process exiting. But let's look at just timeouts first.
But what operation do we want to time out? The simplest is to call sleep() and to pretend it represents some other "long"-but-interruptible operation.
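As a minimal sketch of that idea (this is not the timeout.c used in the session that follows), the function below arms an alarm, calls sleep() as the stand-in for a long operation, and reports which finished first. The name sleep_with_timeout and the timed_out flag are my own for illustration. Note that mixing alarm() and sleep() is safe on Linux, where sleep() is built on nanosleep(), but the sleep(3) man page warns that it is not portable.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t timed_out = 0;

static void sigalrm_handler(int signum)
{
    (void) signum;
    timed_out = 1;
}

/*
 * Pretend sleep(long_secs) is a long operation and give it timeout_secs to
 * finish. Returns 1 if the timeout fired first, 0 if the "operation" completed.
 */
int sleep_with_timeout(unsigned int long_secs, unsigned int timeout_secs)
{
    unsigned int unslept;

    timed_out = 0;
    signal(SIGALRM, sigalrm_handler);
    alarm(timeout_secs);           /* schedule SIGALRM for ourselves */
    unslept = sleep(long_secs);    /* returns early, with the seconds left, if a signal arrives */
    alarm(0);                      /* cancel the alarm in case the sleep finished first */
    return (timed_out && unslept > 0) ? 1 : 0;
}
```

For example, sleep_with_timeout(3600, 5) would return 1 after about 5 seconds, whereas sleep_with_timeout(1, 5) would return 0 after about 1 second.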
lagane$ gcc -o timeout timeout.c funcs.o
lagane$ ./timeout
0.000000: main: setting up signal handlers ...
0.000366: main: scheduling timeout alarm ...
0.000535: main: before calling sleep()
5.000690: sigalrm_handler: received SIGALRM
5.000731: main: after calling sleep()
5.000757: main: sleep() returned early due to: Interrupted system call
5.000761: main: cleaning up and exiting ...
lagane$
The points I wanted to illustrate with that example are:
- alarm() scheduled SIGALRM to be delivered to the process itself 5s later
- sleep(3600) slept for only 5s
- sleep() returned early and why it did so
### system()
Shortly we will look at a program that waits for one of two different events: a timeout expiring or a child process exiting. But let's look just at monitoring a child process first.
We will do this in a few steps. Firstly a version without signal handlers.
Copy and paste this source code into system1.c:
Compile and run the program as follows:
lagane$ gcc -o system1 system1.c funcs.o
lagane$ ./system1
0.000000: main: parent starting one child ...
0.000418: main: parent sees child has pid 3778 and waits for it ...
child running ...
child exiting ...
10.002626: main: parent cleaning up and exiting ...
lagane$
The points I wanted to illustrate with that example are:
- wait() will wait until a child process exits (or it is interrupted by the arrival of a signal)
- main() from doing any other tasks concurrently
In system1.c, comment out pid = wait(&wstatus); and uncomment sleep(A_LONG_TIME); simply by changing the definition of WAIT_INSTEAD_OF_SLEEP from this:
#define WAIT_INSTEAD_OF_SLEEP TRUE
to this:
#define WAIT_INSTEAD_OF_SLEEP FALSE
and then recompile and run the program as follows:
lagane$ gcc -o system1 system1.c funcs.o
lagane$ ./system1 &
[2] 4714
lagane$
0.000000: main: parent starting one child ...
0.000464: main: parent sees child has pid 4715 and waits for it ...
child running ...
child exiting ...
lagane$
lagane$ ps -lp 4715
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 Z 1000 4715 4714 0 80 0 - 0 - pts/7 00:00:00 sh
lagane$ kill 4714
lagane$
[2]- Terminated ./system1
lagane$
The points I wanted to illustrate with that example are:
- ps shows the child process's state is Z (zombie)
- wait() results in a zombie process therefore we must call wait()
- wait() to clear up a zombie process's leftovers is called reaping
However, there is no need to call wait() as soon as we launch the child process; instead we can delay calling wait() until we know that a child process has already exited and is reapable. So then what should main() do in the meantime? Shortly we will look at a main() doing something more complicated but, for now, let's just make it loop until the handler has set a global variable to indicate that it has called wait().
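The handler-sets-a-flag idea can be sketched as follows. This is a minimal illustration under my own assumptions, not the system2.c source: the function name run_child_and_wait_for_flag is hypothetical, and the child simply sleeps rather than running a shell command.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t child_reaped = 0;

static void sigchld_handler(int signum)
{
    int wstatus;

    (void) signum;
    wait(&wstatus);        /* returns immediately: a child is known to have exited */
    child_reaped = 1;
}

/*
 * Fork one child that exits after child_secs, then loop until the handler
 * reports that it has reaped the child. Returns 0 on success, -1 if fork() fails.
 */
int run_child_and_wait_for_flag(unsigned int child_secs)
{
    pid_t pid;

    child_reaped = 0;
    signal(SIGCHLD, sigchld_handler);
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        sleep(child_secs);
        _exit(0);
    }
    while (!child_reaped)
        sleep(1);          /* cut short when SIGCHLD arrives */
    return 0;
}
```

The loop sleeps in 1s steps rather than for a long time so that, if SIGCHLD happens to arrive between testing the flag and calling sleep(), the delay is at most one second.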
Copy and paste this source code into system2.c:
Compile and run the program as follows:
lagane$ gcc -o system2 system2.c funcs.o
lagane$ ./system2
0.000000: main: parent setting up signal handlers ...
0.000319: main: parent starting one child ...
0.000576: main: parent sees child has pid 6034 and loops checking flag ...
child running ...
child exiting ...
10.003294: sigchld_handler: parent received SIGCHLD; reaping and setting child_reaped flag ...
10.003330: main: parent sees child_reaped flag and so stops looping ...
10.003335: main: parent cleaning up and exiting ...
lagane$
The points I wanted to illustrate with that example are:
- signal(SIGCHLD, sigchld_handler) means "call sigchld_handler() whenever the OS informs us that a child process has just exited"
- wait() when we know that the child process has just exited and is reapable, then wait() returns immediately
### system() with timeouts
Now we combine both system2.c and timeout.c to monitor a child process and to kill it if it runs for longer than a specified timeout.
Copy and paste this source code into system-with-timeout.c:
Note that the child process will run for 10s; that's this bit:
#define CHILD_RUN_TIME 10
...
sprintf(buf, "echo \"child running ...\";"
"sleep %d;"
"echo \"child exiting ...\"",
CHILD_RUN_TIME);
execlp("/bin/sh", "sh", "-c", buf, (char *) NULL);
and the timeout is 15s; that's this bit:
#define TIMEOUT 15
...
alarm(TIMEOUT);
Compile and run the program as follows:
lagane$ gcc -o system-with-timeout system-with-timeout.c funcs.o
lagane$ ./system-with-timeout
0.000000: main: parent setting up signal handlers ...
0.000284: main: parent starting one child ...
0.000484: main: parent scheduling timeout alarm ...
0.000620: main: parent entering monitoring loop ...
0.000726: main: parent sleeping until signal arrives ...
child running ...
child exiting ...
10.003069: sigchld_handler: parent received SIGCHLD; reaping and setting child_reaped flag ...
10.003111: main: parent cancelling alarm ...
10.003119: main: parent sees child_reaped flag and so stops looping ...
10.003124: main: parent cleaning up and exiting ...
lagane$
The points I wanted to illustrate with that example are:
- SIGCHLD caused by the child process exiting arrived before the SIGALRM caused by the timeout would have done
- alarm(0)
- sleep(A_LONG_TIME) but what we really mean is "do nothing until a signal (SIGCHLD or SIGALRM) arrives"; remember: sleep() is interrupted if a signal arrives
Change the child run time to 20s by setting:
#define CHILD_RUN_TIME 20
and recompile and run the program as follows:
lagane$ gcc -o system-with-timeout system-with-timeout.c funcs.o
lagane$ ./system-with-timeout
0.000000: main: parent setting up signal handlers ...
0.000325: main: parent starting one child ...
0.000614: main: parent scheduling timeout alarm ...
0.000820: main: parent entering monitoring loop ...
0.000987: main: parent sleeping until signal arrives ...
child running ...
15.001000: sigalrm_handler: parent received SIGALRM; setting timed_out flag ...
15.001051: main: parent cancelling alarm ...
15.001058: main: parent sees timed_out flag and so kills child ...
15.001073: main: parent sleeping until signal arrives ...
15.001476: sigchld_handler: parent received SIGCHLD; reaping and setting child_reaped flag ...
15.001503: main: parent cancelling alarm ...
15.001509: main: parent sees child_reaped flag and so stops looping ...
15.001513: main: parent cleaning up and exiting ...
lagane$
The points I wanted to illustrate with that example are:
- SIGALRM caused by the timeout arrived before the SIGCHLD caused by the child process exiting
- wait() for the child process
- kill(pid, SIGTERM) or we really kill it with kill(pid, SIGKILL).
### system() with timeouts and multi-child support
The previous example used two global variables to record the state of one child process (has it timed out? has it been reaped?). If we want to launch more child processes in parallel then two global variables are not going to be enough. Instead we define a structure and then allocate an array of that structure to store information about some arbitrary number of child processes:
#define MAX_CHILDREN 1000
...
struct child {
pid_t pid;
time_t start;
};
...
struct child children[MAX_CHILDREN];
Note that instead of recording whether a process has timed out, we could just record its start time. This would mean:
- SIGTERM and set its start time to zero; this way we would know not to repeatedly signal it
A complication is that we can't schedule one alarm per running child process because there is only one alarm clock. So we need to work out the interval after which the next-to-time-out child process will time out and that is when we set the alarm for.
Another complication is that alarm() takes an integer argument so if we want it to schedule the SIGALRM signal to arrive in 0.5s then we have a problem. The solution presented here is that we work entirely with integer times (except when displaying timestamps for debugging).
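The "work out when the next child will time out" calculation can be sketched like this. It is an illustration under my own assumptions rather than the code from the program below: the function name next_alarm_interval is hypothetical, a pid of 0 is taken to mean "slot unused or child already reaped", and all times are whole seconds.

```c
#include <sys/types.h>
#include <time.h>

struct child {
    pid_t pid;       /* 0 means the slot is unused (assumption for this sketch) */
    time_t start;    /* when the child was started */
};

/*
 * Return the number of whole seconds until the next running child times out,
 * or 0 if no child is running. An already-overdue child yields 1, so that the
 * result can be passed to alarm() without accidentally cancelling the alarm.
 */
unsigned int next_alarm_interval(const struct child *children, int count,
                                 time_t now, time_t timeout)
{
    unsigned int soonest = 0;
    int i;

    for (i = 0; i < count; i++) {
        time_t remaining;

        if (children[i].pid == 0)
            continue;
        remaining = children[i].start + timeout - now;
        if (remaining < 1)
            remaining = 1;
        if (soonest == 0 || (unsigned int) remaining < soonest)
            soonest = (unsigned int) remaining;
    }
    return soonest;
}
```

The monitoring loop could then call alarm(next_alarm_interval(children, MAX_CHILDREN, time(NULL), TIMEOUT)) each time round; a return value of 0 doubles as "nothing left to wait for".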
Copy and paste this source code into system-with-timeout-and-multichild-support1.c:
Compile and run the program as follows:
lagane$ gcc -o system-with-timeout-and-multichild-support1 \
            system-with-timeout-and-multichild-support1.c funcs.o
lagane$ ./system-with-timeout-and-multichild-support1
0.000000: main: parent initialising children status table ...
0.000377: main: parent setting up signal handlers ...
0.000543: main: parent starting 5 children ...
0.001276: main: parent entering monitoring loop ...
0.001543: main: parent checking for running children ...
0.001717: main: parent sees 5 children still running
0.001884: main: parent scheduling timeout alarm ...
0.002092: main: parent scheduling alarm for 30s ...
0.002285: main: parent sleeping until signal arrives ...
child 8123 started ...
child 8122 started ...
child 8120 started ...
child 8121 started ...
child 8119 started ...
child 8119 exiting ...
0.010015: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
0.010048: main: parent cancelling alarm ...
0.010055: main: parent checking for running children ...
0.010063: main: parent sees 4 children still running
0.010071: main: parent scheduling timeout alarm ...
0.010079: main: parent scheduling alarm for 30s ...
0.010084: main: parent sleeping until signal arrives ...
child 8120 exiting ...
1.009713: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
1.009738: main: parent cancelling alarm ...
1.009745: main: parent checking for running children ...
1.009756: main: parent sees 3 children still running
1.009766: main: parent scheduling timeout alarm ...
1.009775: main: parent scheduling alarm for 29s ...
1.009781: main: parent sleeping until signal arrives ...
child 8121 exiting ...
2.009266: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
2.009291: main: parent cancelling alarm ...
2.009331: main: parent checking for running children ...
2.009343: main: parent sees 2 children still running
2.009353: main: parent scheduling timeout alarm ...
2.009362: main: parent scheduling alarm for 28s ...
2.009369: main: parent sleeping until signal arrives ...
child 8122 exiting ...
3.008095: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
3.008121: main: parent cancelling alarm ...
3.008128: main: parent checking for running children ...
3.008138: main: parent sees 1 children still running
3.008147: main: parent scheduling timeout alarm ...
3.008157: main: parent scheduling alarm for 27s ...
3.008163: main: parent sleeping until signal arrives ...
child 8123 exiting ...
4.008600: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
4.008626: main: parent cancelling alarm ...
4.008633: main: parent checking for running children ...
4.008644: main: parent sees 0 children still running
4.008648: main: parent cleaning up and exiting ...
lagane$
The points I wanted to illustrate with that example are:
2. so each of the 5 child processes (indexed 0, 1, 2, 3, 4) runs for a _different_ amount of time (0s, 1s, 2s, 3s, 4s)
3. so each of the `SIGCHLD` signals arrive at different times (~0s, ~1s, ~2s, ~3s, ~4s)
4. with a ~1s interval between the signals there is ample time to process each signal before the next signal arrives
### Old-school signal processing in C: but now it starts to go wrong
In system-with-timeout-and-multichild-support1.c, change this line:
to this:
i.e. all 5 child processes should exit simultaneously after 1s.
Recompile and run the program as follows:
lagane$ gcc -o system-with-timeout-and-multichild-support1 \
            system-with-timeout-and-multichild-support1.c funcs.o
lagane$ ./system-with-timeout-and-multichild-support1
...
If we are lucky then the program exits ~1s later. If we are unlucky then it will do something like this:
lagane$ ./system-with-timeout-and-multichild-support1
0.000000: main: parent initialising children status table ...
0.000330: main: parent setting up signal handlers ...
0.000500: main: parent starting 5 children ...
0.001201: main: parent entering monitoring loop ...
0.001403: main: parent checking for running children ...
0.001595: main: parent sees 5 children still running
0.001759: main: parent scheduling timeout alarm ...
0.001951: main: parent scheduling alarm for 30s ...
0.002126: main: parent sleeping until signal arrives ...
child 9015 started ...
child 9014 started ...
child 9013 started ...
child 9012 started ...
child 9011 started ...
child 9013 exiting ...
1.010212: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
1.010250: main: parent cancelling alarm ...
1.010257: main: parent checking for running children ...
1.010267: main: parent sees 4 children still running
1.010277: main: parent scheduling timeout alarm ...
1.010286: main: parent scheduling alarm for 29s ...
1.010292: main: parent sleeping until signal arrives ...
child 9014 exiting ...
child 9012 exiting ...
child 9011 exiting ...
child 9015 exiting ...
1.011580: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
1.011597: main: parent cancelling alarm ...
1.011603: main: parent checking for running children ...
1.011612: main: parent sees 3 children still running
1.011622: main: parent scheduling timeout alarm ...
1.011631: main: parent scheduling alarm for 29s ...
1.011637: main: parent sleeping until signal arrives ...
<several seconds go by with no output>
^C
lagane$
The points I wanted to illustrate with that example are:
- child ... exiting" messages)
- parent sees 3 children still running" message)
The problem is that standard signals are not queued: when several child processes exit almost simultaneously, the kernel can merge their SIGCHLD signals into a single pending SIGCHLD, so the signal handler is called only once but is expected to deal with every child process that has exited.
The workaround is pretty simple: a single call to the signal handler should reap all reapable child processes. wait() is not sophisticated enough to support doing this, but waitpid() is.
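A minimal sketch of such a reap-everything loop is shown below. The function names are hypothetical and this is not the code enabled by WAITPID_LOOP in the article's program; the second function exists only so that the sketch can be exercised on its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child that has already exited, without blocking.
 * Returns the number of children reaped by this call.
 */
int reap_all_exited_children(void)
{
    int reaped = 0;
    int wstatus;

    /* WNOHANG: return 0 instead of blocking when children exist but none has exited */
    while (waitpid(-1, &wstatus, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* Fork count children that exit at once; returns how many were started. */
int start_short_lived_children(int count)
{
    int started = 0;
    int i;

    for (i = 0; i < count; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);
        if (pid > 0)
            started++;
    }
    return started;
}
```

In a real SIGCHLD handler the body of the while loop would also record which PID was reaped; waitpid() returns it, and it is discarded here only to keep the sketch short.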
In system-with-timeout-and-multichild-support1.c, replace sigchld_handler()'s logic with a loop by changing this:
#define WAITPID_LOOP FALSE
to this:
#define WAITPID_LOOP TRUE
Recompile and run the program as above. This time it should always exit after ~1s.
So now we're going to:
- SIGUSR1-triggered dump of the child processes' statuses so that we can inspect the statuses if things look like they've gone wrong
Copy and paste this source code into system-with-timeout-and-multichild-support2.c:
Compile and run the program as follows:
lagane$ gcc -o system-with-timeout-and-multichild-support2 \
            system-with-timeout-and-multichild-support2.c funcs.o
lagane$ ./system-with-timeout-and-multichild-support2 >/dev/null
...
If we are lucky, which we usually are, then the program exits ~1s later. If we are unlucky then it does not. We will now run it repeatedly until we are unlucky using a wrapper script.
Copy and paste this source code into hanger.sh:
Install and run the script as follows:
lagane$ cat hanger.sh >hanger
lagane$ chmod a+x hanger
lagane$ hanger system-with-timeout-and-multichild-support2
Soon I got this:
...
system-with-timeout-and-multichild-support-support2 (pid 24829) hung on 4th attempt; children dump follows ...
2.998520: sigusr1_handler: parent received SIGUSR1
2.998706: sigusr1_handler: slot:481; pid:07244, start=1632241800
2.998710: sigusr1_handler: slot:482; pid:07245, start=1632241800
2.998714: sigusr1_handler: slot:483; pid:07246, start=1632241800
2.998719: sigusr1_handler: slot:484; pid:07247, start=1632241800
2.998723: sigusr1_handler: slot:485; pid:07248, start=1632241800
2.998727: sigusr1_handler: slot:486; pid:07249, start=1632241800
2.998731: sigusr1_handler: slot:487; pid:07250, start=1632241800
2.998736: sigusr1_handler: slot:488; pid:07251, start=1632241800
2.998740: sigusr1_handler: slot:489; pid:07252, start=1632241800
2.998744: sigusr1_handler: slot:490; pid:07253, start=1632241800
2.998749: sigusr1_handler: slot:491; pid:07254, start=1632241800
2.998753: sigusr1_handler: slot:492; pid:07255, start=1632241800
2.998757: sigusr1_handler: slot:493; pid:07256, start=1632241800
2.998761: sigusr1_handler: slot:494; pid:07257, start=1632241800
2.998766: sigusr1_handler: slot:495; pid:07258, start=1632241800
2.998770: sigusr1_handler: slot:496; pid:07259, start=1632241800
2.998775: sigusr1_handler: slot:497; pid:07260, start=1632241800
2.998779: sigusr1_handler: slot:498; pid:07261, start=1632241800
2.998784: sigusr1_handler: slot:499; pid:07262, start=1632241800
lagane$
On another occasion I got this:
...
system-with-timeout-and-multichild-support-support2 (pid 21121) hung on 24th attempt; children dump follows ...
2.998824: sigusr1_handler: parent received SIGUSR1
lagane$
The points I wanted to illustrate with these examples are:
- children[] does not reflect this
- children[] does reflect this but the program still failed to exit at the right time
- SIGUSR1 causes the program to exit!
The race condition needs a bit of explanation. Let's imagine that the execution of the program has proceeded to the point where only one child process - let's call it child#499 - is still running and it is about to exit; main() is calling sleep(A_LONG_TIME) ...
- SIGCHLD to system-with-timeout-and-multichild-support2 to inform it that child#499 has exited
- main()’s call to sleep(A_LONG_TIME) is interrupted by the arrival of SIGCHLD
- sigchld_handler() is called to handle the signal
- main() continues!
- main() calls alarm(0) to clear the pending alarm and jumps to the top of the loop
- sigchld_handler() calls waitpid() to reap the just-exited child process and to determine its PID - let's call it PID#499
- main() starts searching through children[], to see if any child processes are still marked as running
- sigchld_handler() starts searching through children[], looking for the entry pertaining to PID#499 (it doesn't know that it's in children[499].pid yet)
- main() finds one entry regarding a running pid in children[499].pid
- sigchld_handler() finds PID#499 in children[499].pid
- sigchld_handler() sets children[499].pid = 0 to indicate the child#499 has exited
- main() calls sleep(A_LONG_TIME) even though all children have exited and the children[] array indicates that!
It should be clear that the problem is due to main() and sigchld_handler() accessing children[] concurrently.
Regarding race conditions, Wikipedia says:
Critical race conditions often happen when the processes or threads depend on some shared state. Operations upon shared states are done in critical sections that must be mutually exclusive. Failure to obey this rule can corrupt the shared state.
We could implement mutual exclusion using a semaphore or other atomic locking mechanism:
- sigchld_handler() takes the semaphore or waits to do so if it is not immediately available, then it updates children[] and then it releases the semaphore
- main() either: (a) takes the semaphore or waits to do so if it is not immediately available, then it updates children[] and then it releases the semaphore or (b) if the semaphore is not immediately available it does other tasks instead but then does not go back to sleep, looping round until the semaphore is immediately available
But this starts to get ugly: global variables are required so that the signal handler knows what semaphore to take.
Besides, there are also other problems with doing things the old way.
Copy and paste this source code into tcp-server1.c:
Compile and run the program as follows:
lagane$ gcc -o tcp-server1 tcp-server1.c funcs.o
lagane$ ./tcp-server1 &
[1] 31254
lagane$
0.000000: main: parent setting up listening socket ...
0.000410: main: parent initialising children status table ...
0.000591: main: parent setting up signal handlers ...
0.000753: main: parent entering monitoring loop ...
0.000909: main: parent awaiting incoming connection ...
lagane$
lagane$ nc localhost 2345 < /dev/null
30.412644: main: parent starting one child ...
30.412805: main: parent awaiting incoming connection ...
30.413026: start_child_server: child is pid 31262
30.413167: start_child_server: child sending message to client ...
this is a message from the server to the client
30.413399: start_child_server: child sleeping a bit ...
50.413532: start_child_server: child exiting ...
50.413946: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
lagane$
The points I wanted to illustrate with that example are:
- accept() automatically restarts if a signal arrives; we know this because after the SIGCHLD arrived we did not see the message accept() failed
Here we change from a server that forks a child process on an incoming TCP connection to a server that forks child processes every 10s. Copy and paste this source code into interval-server.c:
Compile and run the program as follows:
lagane$ gcc -o interval-server interval-server.c funcs.o
lagane$ ./interval-server
0.000000: main: parent initialising children status table ...
0.000353: main: parent setting up signal handlers ...
0.000516: main: parent entering monitoring loop ...
0.000675: main: parent sees some children not started yet
0.000871: main: parent starting one child ...
0.001149: main: parent sleeping 10s ...
0.001432: start_child_sleep: child running for 15s ...
child 1477 started ...
10.001590: main: parent sees some children not started yet
10.001641: main: parent starting one child ...
10.001800: main: parent sleeping 10s ...
10.002173: start_child_sleep: child running for 3s ...
child 1480 started ...
child 1480 exiting ...
13.005047: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
13.005072: main: parent sees some children not started yet
13.005078: main: parent starting one child ...
13.005189: main: parent sleeping 10s ...
13.005585: start_child_sleep: child running for 10s ...
child 1482 started ...
child 1477 exiting ...
15.006176: sigchld_handler: parent received SIGCHLD; reaping and clearing child data ...
15.006203: main: parent sees some children not started yet
15.006208: main: parent starting one child ...
15.006318: main: parent sleeping 10s ...
15.006743: start_child_sleep: child running for 17s ...
child 1484 started ...
^C
The points I wanted to illustrate with that example are:
- sleep() does not automatically restart if a signal arrives
We could work around that problem with something like:
start_sleep_time = time(NULL);
while (TRUE) {
    /* how long is left to sleep? */
    desired_sleep_time = start_sleep_time + INTER_CHILD_INTERVAL - time(NULL);
    /* if the interval has already elapsed then no need to sleep more */
    if (desired_sleep_time <= 0)
        break;
    /* sleep() returns 0 if it slept the full amount, otherwise the seconds left unslept */
    if (sleep(desired_sleep_time) == 0)
        break;
    /* interrupted by a signal: loop round and sleep for whatever time remains */
}
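An alternative sketch of the same workaround uses nanosleep(), which reports the unslept time itself, so there is no need to recompute it from time(). The function name sleep_uninterrupted is mine, for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/*
 * Sleep for the full number of seconds even if signal handlers interrupt us.
 * Returns 0 once the whole interval has elapsed, -1 on any error other than EINTR.
 */
int sleep_uninterrupted(time_t seconds)
{
    struct timespec remaining;

    remaining.tv_sec = seconds;
    remaining.tv_nsec = 0;
    /* on EINTR, nanosleep() stores the unslept time in its second argument */
    while (nanosleep(&remaining, &remaining) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

The nanosleep(2) man page notes that repeatedly restarting a relative sleep like this can drift if signals arrive often; for this article's purposes that is tolerable, but clock_nanosleep() with an absolute deadline avoids it.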
Obviously, sleeping for a fixed interval is a contrived example. A more realistic server would fork a child process in response to something more complicated than the completion of a time interval, for example by calling select() to monitor several file descriptors. That would make the code much more complicated and the interruption might become harder to work around.
As Wikipedia says:
The sigaction() function provides an interface for reliable signals in replacement of the unreliable and deprecated signal() function.
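Before looking at the full program, here is a minimal sketch of how sigaction(), sigprocmask() and sigsuspend() can fit together to close the race described earlier. It is written under my own assumptions (the name run_children_without_race is hypothetical, the children exit immediately, and only a counter is shared with the handler) and is not the article's source code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t children_reaped = 0;

static void sigchld_handler(int signum)
{
    int wstatus;

    (void) signum;
    while (waitpid(-1, &wstatus, WNOHANG) > 0)
        children_reaped++;
}

/*
 * Start count children and wait until all that started have been reaped.
 * SIGCHLD stays blocked except while sigsuspend() is waiting, so the handler
 * can never run between testing children_reaped and going to sleep.
 * Returns the number of children reaped, or -1 on a setup error.
 */
int run_children_without_race(int count)
{
    struct sigaction act;
    sigset_t block_set, old_set, wait_set;
    int started = 0;
    int i;

    children_reaped = 0;
    memset(&act, 0, sizeof(act));
    act.sa_handler = sigchld_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (sigaction(SIGCHLD, &act, NULL) == -1)
        return -1;

    /* block SIGCHLD: from here on it can only become pending, not be delivered */
    sigemptyset(&block_set);
    sigaddset(&block_set, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block_set, &old_set) == -1)
        return -1;

    for (i = 0; i < count; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);
        if (pid > 0)
            started++;
    }

    /* the mask used while waiting: the original mask, with SIGCHLD definitely unblocked */
    wait_set = old_set;
    sigdelset(&wait_set, SIGCHLD);

    /* sigsuspend() atomically installs wait_set, sleeps, then restores the blocking mask */
    while (children_reaped < started)
        sigsuspend(&wait_set);

    sigprocmask(SIG_SETMASK, &old_set, NULL);
    return (int) children_reaped;
}
```

The key design choice is that the test of shared state and the wait for a signal are never separated by a window in which the handler could run; that window is exactly what sleep(A_LONG_TIME) left open.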
Here is system-with-timeout-and-multichild-support2.c updated to use sigaction() instead of signal() and sigsuspend() instead of sleep(A_LONG_TIME). Copy and paste this source code into system-with-timeout-and-multichild-support3.c:
Compile and run the program as follows:
lagane$ gcc -o system-with-timeout-and-multichild-support3 \
            system-with-timeout-and-multichild-support3.c funcs.o
lagane$
lagane$ hanger system-with-timeout-and-multichild-support3
<after-an-hour-still-no-output>
^C
lagane$
The points I wanted to illustrate with that example are:
- #define HANDLE_MULTIPLE_PENDING_SIGCHLDS_IN_ONE_GO FALSE to see this)
- sigprocmask() calls)
- sigprocmask() call (see the first sigprocmask() call)
- sigsuspend() call)
- sa_mask in the act variable, which is passed to sigaction())
- signal sets allow us to do this (see the manual page for sigsetops())
- sigprocmask() to block signals during critical code but we can flip this around:
  - sigprocmask() to block signals
  - sigprocmask() to unblock signals
  - sigsuspend() to handle recently-dispatched-but-currently-blocked signals
- sigprocmask() by unblocking it in the second call to sigprocmask()
- sigaction() and sigprocmask() is complicated; even O'Reilly got it wrong in the first edition of their Perl Cookbook (that call to sigprocmask(SIG_UNBLOCK, $old_sigset) should have been either sigprocmask(SIG_UNBLOCK, $sigset) or better still sigprocmask(SIG_SETMASK, $old_sigset))
- system-with-timeout-and-multichild-support3.c program can be improved a bit:
  - printf() and echo in the monitoring loop then it will run faster and we will stand an even better chance of getting it to hang
However, before we do that ...
### SA_RESTART and sigprocmask()
The tcp-server1.c program above used signal() to establish the signal handler. The signal() man page states:
the signal() wrapper function ... calls sigaction(2) using flags that supply BSD semantics
...
The BSD semantics are equivalent to calling sigaction(2) with the following flags:
sa.sa_flags = SA_RESTART;
and the sigaction() man page explains:
SA_RESTART [provides] behavior compatible with BSD signal semantics by making certain system calls restartable across signals
The accept() man page states:
EINTR [indicates] the system call was interrupted by a signal that was caught before a valid connection arrived
all of which suggests that accept() might be a system call that is affected by SA_RESTART (either when explicitly specified in a call to sigaction() or implicitly specified in a call to signal()).
In tcp-server1, and as noted above, the exiting of the child server process and the consequent dispatch of a SIGCHLD signal did not cause the accept() call to return prematurely, so either accept() restarted automatically or accept() ignores or blocks the signal.
In order to determine which of these happened, we can clone tcp-server1.c to tcp-server2.c and replace calls to signal() with calls to sigaction() but leave sa_flags set to 0.
Copy and paste this source code into tcp-server2.c:
Note the #define macros at the top of the source.
#define USE_SA_RESTART FALSE
#define BLOCK_SIGCHLD FALSE
Compile and run the program as follows:
lagane$ gcc -o tcp-server2 tcp-server2.c funcs.o
lagane$ ./tcp-server2 &
[1] 18925
lagane$ 0.000000: main: parent setting up listening socket ...
0.000325: main: parent initialising children status table ...
0.000489: main: parent setting up signal handlers ...
0.000600: main: parent entering monitoring loop ...
0.000704: main: parent awaiting incoming connection ...
lagane$
lagane$
lagane$ nc localhost 2345 < /dev/null
6.496349: main: parent starting one child ...
6.496527: main: parent awaiting incoming connection ...
6.496778: start_child_server: child is pid 18927
6.496988: start_child_server: child sending message to client ...
this is a message from the server to the client
6.497326: start_child_server: child sleeping a bit ...
26.497686: start_child_server: child exiting ...
26.498167: handler: parent received SIGCHLD; reaping and clearing child data ...
26.498188: main: accept() failed
26.498193: main: parent awaiting incoming connection ...
lagane$
The points I wanted to illustrate with that example are:
accept() failing
signal() is equivalent to a call to sigaction() with SA_RESTART enabled, implying that tcp-server1's call to accept() had automatic restart enabled whereas tcp-server2's call to accept() has automatic restart disabled: this confirms that accept() is a system call influenced by SA_RESTART

Now change the #define macros at the top of the tcp-server2.c source file to:
#define USE_SA_RESTART TRUE
#define BLOCK_SIGCHLD FALSE
Recompile and run the program as follows:
lagane$ gcc -o tcp-server2 tcp-server2.c funcs.o
lagane$ ./tcp-server2 &
[2] 20810
lagane$ 0.000000: main: parent setting up listening socket ...
0.000407: main: parent initialising children status table ...
0.000632: main: parent setting up signal handlers ...
0.000820: main: parent entering monitoring loop ...
0.000982: main: parent awaiting incoming connection ...
lagane$
lagane$
lagane$ nc localhost 2345 < /dev/null
5.470687: main: parent starting one child ...
5.470838: main: parent awaiting incoming connection ...
5.471082: start_child_server: child is pid 20812
5.471288: start_child_server: child sending message to client ...
this is a message from the server to the client
5.471636: start_child_server: child sleeping a bit ...
25.472001: start_child_server: child exiting ...
25.472451: handler: parent received SIGCHLD; reaping and clearing child data ...
lagane$
The points I wanted to illustrate with that example are:
accept() is internally restarted again

Now disable SA_RESTART but enable the blocking of SIGCHLD in line with the changes that were made between system-with-timeout-and-multichild-support2.c and system-with-timeout-and-multichild-support3.c by setting this in tcp-server2.c:
#define USE_SA_RESTART FALSE
#define BLOCK_SIGCHLD TRUE
Recompile and run the program as follows:
lagane$ gcc -o tcp-server2 tcp-server2.c funcs.o
lagane$ ./tcp-server2 &
[2] 21242
lagane$ 0.000000: main: parent setting up listening socket ...
0.000394: main: parent initialising children status table ...
0.000573: main: parent setting up signal handlers ...
0.000738: main: parent entering monitoring loop ...
0.000896: main: parent awaiting incoming connection ...
lagane$
lagane$
lagane$ nc localhost 2345 < /dev/null
8.333692: main: parent starting one child ...
8.333893: main: parent awaiting incoming connection ...
8.334146: start_child_server: child is pid 21249
8.334351: start_child_server: child sending message to client ...
this is a message from the server to the client
8.334701: start_child_server: child sleeping a bit ...
28.335050: start_child_server: child exiting ...
lagane$
The points I wanted to illustrate with that example are:
accept() was not interrupted when the child server eventually exited because SIGCHLD is blocked
SIGCHLD remains blocked then we do not see exited child servers being reaped

As a consequence of that second point, if we run the nc client command a few more times, we accumulate zombie processes:
lagane$ ps fax
...
1881 pts/7 S 0:00 \_ ./tcp-server2
2005 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2013 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2027 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2041 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2049 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2051 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2059 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2061 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2069 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2071 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
2076 pts/7 Z 0:00 | \_ [tcp-server2] <defunct>
...
lagane$
So here is system-with-timeout-and-multichild-support3.c reworked to:
Copy and paste this source code into system-with-timeout-and-multichild-support4.c:
That was the final C program in this article. Next we will look at other programming languages.
Copy and paste this source code into system-with-timeout-and-multichild-support-perl.pl:
Compile and run the program as follows:
lagane$ cat system-with-timeout-and-multichild-support-perl.pl > \
            system-with-timeout-and-multichild-support-perl
lagane$ chmod +x system-with-timeout-and-multichild-support-perl
lagane$ ./system-with-timeout-and-multichild-support-perl
...
The points I wanted to illustrate with that example are:
Python's signal module does not expose sigaction(), etc., so we have to do without it.
Copy and paste this source code into system-with-timeout-and-multichild-support-python.py:
Compile and run it as follows:
lagane$ cat system-with-timeout-and-multichild-support-python.py > \
            system-with-timeout-and-multichild-support-python
lagane$ chmod +x system-with-timeout-and-multichild-support-python
lagane$ ./system-with-timeout-and-multichild-support-python
...
The points I wanted to illustrate with that example are:
main() scanning children[] and finding running child processes while, effectively simultaneously, the signal handler is marking those children as having been reaped (this is exactly the same problem that system-with-timeout-and-multichild-support2 had)

We can attempt to address this problem by getting the signal handler to request the main loop to modify children[] via a reliable messaging channel, rather than modifying children[] itself. Firstly, we try this using Python's queue module.
Copy and paste this source code into system-with-timeout-and-multichild-support-python2.py:
Compile and run the program as follows:
lagane$ cat system-with-timeout-and-multichild-support-python2.py > \
            system-with-timeout-and-multichild-support-python2
lagane$ chmod +x system-with-timeout-and-multichild-support-python2
lagane$ ./system-with-timeout-and-multichild-support-python2
...
The points I wanted to illustrate with that example are:
queue module does not provide a way to implement reliable signal handling

We can try using System V IPC queues instead.
Copy and paste this source code into system-with-timeout-and-multichild-support-python3.py:
Compile and run the program as follows:
lagane$ pip3 install sysv-ipc
lagane$ cat system-with-timeout-and-multichild-support-python3.py > \
            system-with-timeout-and-multichild-support-python3
lagane$ chmod +x system-with-timeout-and-multichild-support-python3
lagane$ ./system-with-timeout-and-multichild-support-python3
...
The points I wanted to illustrate with that example are:
sysv_ipc module does not provide a way to implement reliable signal handling

pysigset provides wrappers around the OS's sigset ecosystem and it works! It's badly documented but hopefully that will be fixed.
The pysigset module may be available for your OS/distribution. If it is, install it using your package manager; otherwise install it by running:
lagane$ pip3 install pysigset
lagane$
Copy and paste this source code into system-with-timeout-and-multichild-support-python4.py:
Compile and run the program as follows:
lagane$ cat system-with-timeout-and-multichild-support-python4.py > \
            system-with-timeout-and-multichild-support-python4
lagane$ chmod +x system-with-timeout-and-multichild-support-python4
lagane$ ./hanger system-with-timeout-and-multichild-support-python4
<no-output>
The points I wanted to illustrate with that example are:
pysigset provides reliable signal handling in Python
SA_RESTART flag) than the signal() system call
sigprocmask() and sigsuspend() allow us to define a particular place in the main loop where signals can be handled safely and effectively synchronously