Signals, Traps, and Interrupts

1.0 Interrupts and Signals

Generally, a signal, or interrupt, is any exception that causes a processor to (temporarily) transfer control to another program or function. When that function completes, control is (typically) returned to the interrupted process, which resumes from the point where it was interrupted. Some interrupts can be masked, or disabled; others cannot. For instance, the RESET interrupt generated when a system powers up cannot be disabled.

Signals can originate from a variety of sources, but there are two fundamental origins: hardware and software. Hardware signals are most often referred to as interrupts, to distinguish them from software signals. Hardware interrupts are asynchronous, and they give a system an effective means of reacting to outside stimuli. Hardware interrupts are often assigned different priorities, whereas software signals are not. Consider, for example, a workstation with a real-time clock and a local printer port. It is not unreasonable to assume that each of these devices uses an interrupt to request service from the system. Of the two, it is more important to respond quickly to the real-time clock's interrupt, so it is given a higher priority than the printer port's. This means that if both events occur simultaneously, the real-time clock will always be serviced first. In contrast, software signals are not prioritized relative to one another. Signals that occur at the same time are therefore serialized, in no particular order, into a FIFO queue, and each in turn interrupts the receiving process.

1.1 Software Signals

Software signals, or software interrupts, are generated by processes and are considered asynchronous activities; that is, processes don't have to wait for signals. Signals are used to establish how a process responds to certain events in the system. When a program is designed to respond to an event, the process is said to catch, trap, or handle that event. The code that catches a signal goes by various names; we will call it a signal handler, or simply a handler. Once the signal handler completes, control returns to the point in the program where it was interrupted. In figure 1, a sample program captures the control-C signal and redirects it to the function called handler. Thus each time the control-C key combination is pressed, the main program is suspended and the handler runs. When the handler completes, control returns to the main program's forever loop. Although the program in figure 1 is simplistic, it illustrates the point.
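The following is a minimal sketch in the spirit of figure 1, assuming an ISO C/POSIX environment. Instead of a forever loop waiting for real keystrokes, the hypothetical driver simulate_control_c uses raise(SIGINT) to deliver the control-C signal to itself, so the catch-and-resume cycle can be observed; the names handler, interrupts, and simulate_control_c are illustrative only.

```c
#include <signal.h>

/* Number of control-C (SIGINT) signals caught so far. */
static volatile sig_atomic_t interrupts = 0;

/* Signal handler: record the event and return to the interrupted code. */
static void handler(int sig)
{
    /* Some systems reset the action to SIG_DFL before calling the
       handler, so re-install it to keep catching control-C. */
    signal(sig, handler);
    interrupts++;
}

/* Hypothetical driver: install the handler, then simulate `presses`
   control-C key presses; returns how many the handler caught. */
int simulate_control_c(int presses)
{
    int i;

    interrupts = 0;
    if (signal(SIGINT, handler) == SIG_ERR)
        return -1;
    for (i = 0; i < presses; i++)
        raise(SIGINT);          /* control resumes here after the handler */
    signal(SIGINT, SIG_DFL);    /* restore the default action (terminate) */
    return (int)interrupts;
}
```

Calling simulate_control_c(3) runs the handler three times and returns 3; without the handler installed, the first SIGINT would terminate the process.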

 

 

 

1.2 Default Handlers

When a typical process is loaded and linked for execution, the system assigns it a set of default signal handlers, one for each event the process can catch. For example, if a program is running interactively in the console and the control-C key combination is pressed, a signal is generated. The operating system then sends that signal to the running process, at which point the process' handler executes. In Unix, the default action for control-C is to terminate the process. Systems can have many different signals, and each is uniquely numbered for identification. The list below shows some of the common manifest constants for Unix signals. The values are those used by Solaris (System V); some of them, such as SIGUSR1, SIGUSR2, and SIGSTOP, have different values on other systems such as Linux, so programs should always use the symbolic names.

Identifier   Value   Description
SIGINT         2     interrupt (control-C)
SIGFPE         8     floating-point error
SIGKILL        9     kill**
SIGSEGV       11     segmentation (memory access) violation
SIGALRM       14     alarm clock
SIGUSR1       16     user-defined signal 1
SIGUSR2       17     user-defined signal 2
SIGSTOP       23     stop (halt) the process**

note: ** these signals cannot be caught, ignored, masked, or blocked by a process.
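The restriction in the note can be observed directly. This sketch assumes a POSIX system and uses sigaction, the POSIX interface for installing handlers: the system refuses a handler for SIGKILL or SIGSTOP with the error EINVAL. The helper names ignore_it and can_catch are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

/* A do-nothing handler used only to probe the system. */
static void ignore_it(int sig) { (void)sig; }

/* Returns 1 if the process may install a handler for `sig`,
   0 if the system refuses (EINVAL), -1 on any other error. */
int can_catch(int sig)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ignore_it;
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, &old) == -1)
        return errno == EINVAL ? 0 : -1;
    sigaction(sig, &old, NULL);   /* put the previous disposition back */
    return 1;
}
```

Similarly, POSIX specifies that an attempt to block SIGKILL or SIGSTOP with sigprocmask is silently ignored rather than reported as an error.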

Combined with the concept of a vector table, interrupts provide fixed access points to system services whose absolute addresses may change due to an upgrade or fix. The diagram below illustrates a vector table. Here, processes A and B call the programs alarm and communications, respectively. Both processes always use the manifest constants SIGALRM and COMM, which reference an offset in an array (i.e., the vector table) whose entries contain the "actual" addresses of the programs alarm and communications. Suppose, for example, that the actual location of the alarm program were to change. Rather than recompiling ProcessA to reference the alarm program's new address, the operating system simply updates the alarm's entry in the vector table.
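The indirection a vector table provides can be modeled in a single C program with an array of function pointers. This is only an illustration under that assumption, not how an operating system or processor actually lays out its table: the offsets VEC_ALARM and VEC_COMM stand in for the manifest constants SIGALRM and COMM, and alarm_v1, alarm_v2, comm_v1, call_service, install_service, and upgrade_alarm are hypothetical names.

```c
/* Hypothetical fixed offsets into the table (cf. SIGALRM and COMM). */
enum { VEC_ALARM = 0, VEC_COMM = 1, VEC_COUNT = 2 };

typedef int (*service_fn)(void);

/* Hypothetical service routines; each returns its version number. */
static int alarm_v1(void) { return 1; }
static int alarm_v2(void) { return 2; }   /* the "upgraded" alarm service */
static int comm_v1(void)  { return 1; }

/* The vector table: each entry holds the current address of a service. */
static service_fn vector_table[VEC_COUNT] = { alarm_v1, comm_v1 };

/* Callers (ProcessA, ProcessB) only ever use the fixed offset. */
int call_service(int vec)
{
    return vector_table[vec]();
}

/* The "operating system" relocates a service by updating one entry. */
void install_service(int vec, service_fn fn)
{
    vector_table[vec] = fn;
}

/* Upgrade the alarm service without touching any caller. */
void upgrade_alarm(void)
{
    install_service(VEC_ALARM, alarm_v2);
}
```

Code that calls call_service(VEC_ALARM) gets the new routine after upgrade_alarm() without being recompiled, which is the point of the table.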

 

2.0 Operations on Signals

Several operations can be performed on signals; for example, a signal can be captured (signal), initiated (kill), waited for (sigpause), masked (sigmask) or blocked (sigblock), held on a queue (sighold), or released (sigrelse). In addition, signals have been extended to support multithreaded processes. These multithreaded extensions can, for example, direct a signal to a specific thread (thr_kill) or set a thread's signal mask (thr_sigsetmask). These multithreaded features are addressed later in this chapter.
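Holding and releasing a signal can be sketched with sigprocmask, the POSIX interface that supersedes the older System V sighold and sigrelse calls (POSIX marks those as obsolescent). The sketch assumes a POSIX system; hold_and_release and on_usr1 are hypothetical names, and raise stands in for kill sent to the process itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t delivered = 0;

static void on_usr1(int sig) { (void)sig; delivered++; }

/* Hold SIGUSR1, send it, observe it waiting, then release it.
   The three out-parameters record what was observed at each step. */
void hold_and_release(int *pending_while_held, int *delivered_while_held,
                      int *delivered_after_release)
{
    struct sigaction sa;
    sigset_t usr1, pending;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);           /* capture (cf. signal) */

    delivered = 0;
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &usr1, NULL);     /* hold (cf. sighold) */

    raise(SIGUSR1);                          /* initiate (cf. kill) */
    sigpending(&pending);
    *pending_while_held = sigismember(&pending, SIGUSR1);
    *delivered_while_held = delivered;

    /* release (cf. sigrelse): POSIX delivers a pending unblocked
       signal before sigprocmask returns */
    sigprocmask(SIG_UNBLOCK, &usr1, NULL);
    *delivered_after_release = delivered;
}
```

While held, the signal shows up in the pending set but the handler has not run; it runs as soon as the signal is released.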

 

2.1 Catching Signals

Signals are a common part of the Unix operating system. They allow autonomous processes to specify how various interrupts (e.g., traps) from the system are to be handled, including ignoring a specific signal (e.g., signal(SIGINT, SIG_IGN)). There are various means to demarcate the set of signals blocked or allowed by a process; for example, one system can use a mask, while another requires explicitly calling the system for each signal to be blocked. A handler can be designed to respond to multiple signal types. This is accomplished through the system convention of always passing the current interrupt's identity number as a parameter to the handler; ergo, a switch statement can be used to select an appropriate action based on the signal type. The typical handler needs only one parameter; yet there are cases where other parameters are necessary to narrow the cause of the trap, as with NT's catching of floating-point errors. Both of these cases are illustrated in a later section of this chapter.
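Ignoring a signal, as in signal(SIGINT, SIG_IGN), can be sketched as follows, assuming ISO C signal semantics. The hypothetical function survives_ignored_sigint sends itself a control-C signal while SIGINT is ignored; had the default action still been in place, the process would have terminated instead of returning.

```c
#include <signal.h>

/* Ignore control-C (SIGINT), deliver one to ourselves, and confirm
   that execution continues. Returns 1 on success, 0 on failure. */
int survives_ignored_sigint(void)
{
    void (*previous)(int) = signal(SIGINT, SIG_IGN);

    if (previous == SIG_ERR)
        return 0;
    raise(SIGINT);              /* discarded: the action is SIG_IGN */
    signal(SIGINT, previous);   /* restore the earlier disposition */
    return 1;
}
```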

 

2.2 Process to process signaling

Process-to-process signaling is not meant to serve as a means of communication, but rather to provide a means to asynchronously flag one or more processes. This means that a process does not need to block waiting for a signal; it can execute until it receives one. In Unix, process-to-process signaling is accomplished with the kill system call. This function accepts two parameters: the first is the process ID (pid) of the destination process, and the second is the type of signal to be sent. Suppose, for example, that a process has forked a child process; the parent then knows the child's pid and can send it a signal. A parent can kill a child process by invoking the kill(child_pid, SIGKILL) system call, as shown below.

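#include <signal.h>     /* kill, SIGKILL */
#include <stdlib.h>     /* exit */
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* fork */

void worker(void);      /* the child's work, defined elsewhere */

int main(void)
{
    pid_t child_pid;
    ...
    if ( (child_pid = fork()) == 0 ) {   /* child process */
        worker();
        exit(0);
    }
    ...                                  /* parent process */
    kill(child_pid, SIGKILL);
    return 0;
}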

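The child process terminates because the signal SIGKILL cannot be caught, masked, or blocked. There are, of course, rules that the system follows in using the kill API that prevent just any unfriendly process from removing other processes from the system, preserving an aspect of system reliability and security. In POSIX, for example, an unprivileged process may signal another process only if the sender's real or effective user ID matches the receiver's real or saved set-user-ID.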
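A complete, runnable variant of the fragment above can also confirm how the child died. This sketch assumes a POSIX system; the worker here is a hypothetical stand-in that simply waits for signals, and waitpid with the WIFSIGNALED and WTERMSIG macros reports the terminating signal while also reaping the child so it does not linger as a zombie.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker: waits forever for signals. */
static void worker(void)
{
    for (;;)
        pause();
}

/* Fork a worker, kill it with SIGKILL, and return the number of the
   signal that terminated it (or -1 on error). */
int kill_child(void)
{
    int status;
    pid_t child_pid = fork();

    if (child_pid == -1)
        return -1;
    if (child_pid == 0) {                      /* child process */
        worker();
        _exit(0);                              /* not reached */
    }
    if (kill(child_pid, SIGKILL) == -1)        /* parent process */
        return -1;
    if (waitpid(child_pid, &status, 0) == -1)  /* reap the child */
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

kill_child() returns SIGKILL, because the child has no way to catch or block that signal.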

 

3.0 Recovery from Errors

In this section we will build up to a multithreaded error-recovery technique. First, however, we examine the single-threaded case of installing a handler to recover from errors, and the use of C's setjmp and longjmp run-time library APIs.

 

 

3.1 Calling a Handler

The code fragment below installs one signal handler that catches two different signals under Unix. The code fragment that follows installs a handler under NT. NT's method closely mirrors Unix's; it is included mainly to illustrate a two-parameter signal handler.

 

 

3.1.1 Unix Handler

In main, the program first catches the user-defined signal SIGUSR1 and assigns the function called handler to process it. This means another process with the proper permission (e.g., a child process) can use the kill(parent_pid, SIGUSR1) system call to invoke the handler. Next, the main function captures the signal SIGFPE, so if this process generates a floating-point error, the function handler services it too. The installed signal handler exploits the operating system's convention of placing the signal type's unique identifier in the function's argument list. This single parameter (i.e., int sig) holds the type of signal received. In this case, the handler is activated only for the values of sig that correspond to SIGFPE and SIGUSR1. The code in handler takes advantage of this by using a switch statement to decide the correct response to the individual signal that invoked it. As an illustration of a response, the handler counts the floating-point errors that have activated it. Once that count exceeds three, it terminates the process; however, it could have taken another action.
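A minimal sketch of such a handler follows, assuming a POSIX system. The names install_handlers, fpe_count, usr1_count, and MAX_FPE are hypothetical. In the accompanying test, SIGFPE is simulated with raise(), because returning from a handler for a genuine hardware floating-point trap is undefined behavior in C.

```c
#include <signal.h>
#include <stdlib.h>

#define MAX_FPE 3   /* terminate once more than three FPEs are seen */

static volatile sig_atomic_t fpe_count = 0;
static volatile sig_atomic_t usr1_count = 0;

/* One handler for two signal types; `sig` identifies which arrived. */
static void handler(int sig)
{
    signal(sig, handler);          /* re-install for systems that reset it */
    switch (sig) {
    case SIGUSR1:
        usr1_count++;
        break;
    case SIGFPE:
        if (++fpe_count > MAX_FPE)
            _Exit(EXIT_FAILURE);   /* could take another action instead */
        break;
    }
}

/* What main does first: route both signals to the same handler. */
void install_handlers(void)
{
    signal(SIGUSR1, handler);
    signal(SIGFPE, handler);
}
```

_Exit is used rather than exit because it is safe to call from a signal handler.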

 

3.1.2 NT Handler

NT does not have as rich a traditional signal environment as Unix and supports fewer signal types; for instance, neither SIGUSR1 nor SIGUSR2 is supported. Its C run-time library's signal installer is prototyped to allow a number of ways to catch signals from the operating system. When writing a handler, the run-time library reference advises avoiding certain activities, such as using standard I/O (e.g., printf) and calling routines that access the heap (e.g., malloc). The problem with these operations is that they can be interrupted and re-entered by another handler, or by a new instance of the current handler. Suppose, for example, that the handler is in the middle of a printf call when it is interrupted. If the next signal handler also calls printf, the console output could be "scrambled."

Of particular interest is the exception (or interrupt) signal that can be generated by floating-point processing. For this signal, the handler receives two integer values: the first is SIGFPE, and the second, called a subcode, identifies the exact exception that occurred during the last floating-point operation [MS Runtime Lib]. The code below uses some of the manifest-constant subcodes for the floating-point math package. In it, main installs the function called handler to catch floating-point errors (FPEs). The main program then executes some floating-point math, during which an FPE signal may or may not occur. After the computation, the function fp_check() is called to check whether a problem occurred during the preceding computation. In this case, if there was a problem, it simply displays a message. An alternative would be to attempt to fix the problem and then loop back to the label TRYAGN to re-attempt the previously fouled computation.

The handler itself records the signal's subcode in the global variable fp_error and resets the floating-point math package. The math software is reset through the library function _fpreset(). Resetting the floating-point math package is necessary because it is in an unknown state after an error [61].

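#include <float.h>    /* _fpreset and the _FPE_ subcodes */
#include <signal.h>   /* signal, SIGFPE */
#include <stdio.h>    /* printf */

volatile int fp_error = 0;   /* FPE subcode recorded by the handler */

/* creating the handler: NT passes the FPE subcode as a second argument */
void handler(int sig, int subcode)
{
    fp_error = subcode;
    _fpreset();              /* the math package is in an unknown state */
    /* the run-time library resets the action to SIG_DFL before calling
       the handler, so re-install it to catch later errors */
    signal(SIGFPE, (void (__cdecl *)(int))handler);
}

void fp_check(void)
{
    switch (fp_error) {
    case _FPE_INVALID:
        printf("Invalid number\n");
        break;
    case _FPE_OVERFLOW:
        printf("over flow action\n");
        break;
    case _FPE_UNDERFLOW:
        printf("under flow\n");
        break;
    }
    fp_error = 0;            /* clear the subcode for the next computation */
}

/* installing the handler */
int main(void)
{
    ...
    /* signal() expects a one-argument handler, so cast the two-argument one */
    signal(SIGFPE, (void (__cdecl *)(int))handler);
    ...
TRYAGN:
    /* do math */
    fp_check();
    ...
    return 0;
}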

 

4.0 Set Jump and Long Jump

The functions setjmp and longjmp are part of the standard C run-time library, and both NT and Unix support them. In this section we discuss the basics of setjmp and longjmp; the next section builds on them to develop a signal handler that uses these APIs for what is referred to as continuation semantics. When called, setjmp captures the current environment, including the program counter, in a buffer of type jmp_buf. The longjmp (long jump) run-time function restores the environment saved by setjmp in that jmp_buf. Together, these functions allow programmers to mark a "spot" in their code and return to that "spot" later in the computation. This is more powerful than a simple goto, because longjmp also restores the stack environment and can therefore return across function boundaries.

The direct call to setjmp always returns zero and fills in the environment buffer. When longjmp is called, it returns control to the setjmp environment it is given and passes an integer that becomes setjmp's return value. This value indicates that this is not the first return from setjmp, but rather a return to setjmp from a longjmp. Because zero is reserved for the first return, a longjmp that attempts to pass zero (i.e., longjmp(buf, 0)) causes setjmp to return 1. Any active function calls between setjmp and longjmp are bypassed. These two calls are typically used to attempt recovery from errors. When a parsing error occurs, for example, instead of aborting the parse, the program can display an error message and continue parsing. The fragments below illustrate the relationship between setjmp and longjmp.
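The parse-recovery pattern just described can be sketched in standard C. In this sketch, under the assumption that a "parse error" is simply a negative item, the hypothetical function check jumps back to the recovery point in parse_all, bypassing parse_item; the loop variables are volatile because they change between setjmp and longjmp. The hypothetical zero_becomes_one shows the longjmp(buf, 0) rule.

```c
#include <setjmp.h>

static jmp_buf env;

/* Hypothetical validity check: a negative item is a "parse error". */
static void check(int value)
{
    if (value < 0)
        longjmp(env, 1);            /* setjmp in parse_all returns 1 */
}

/* Intermediate call that is bypassed when check() jumps. */
static int parse_item(int value)
{
    check(value);
    return value;
}

/* Sum the valid items; count the rejected ones in *errors. */
int parse_all(const int *items, int n, int *errors)
{
    volatile int i = 0, sum = 0;    /* modified between setjmp and longjmp */

    *errors = 0;
    if (setjmp(env) != 0) {         /* arrived via longjmp: recover */
        (*errors)++;
        i++;                        /* skip the bad item and continue */
    }
    for (; i < n; i++)
        sum += parse_item(items[i]);
    return sum;
}

static jmp_buf zenv;

/* longjmp(zenv, 0) is converted: setjmp then returns 1, never 0. */
int zero_becomes_one(void)
{
    if (setjmp(zenv) == 1)
        return 1;                   /* second return */
    longjmp(zenv, 0);               /* first return (0): jump back with 0 */
}
```

For the items {1, -2, 3, -4, 5}, parse_all returns 9 and reports two errors.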

 

 

 

Using Signals with Set Jump and Long Jump

While setjmp and longjmp work with NT's signaling mechanisms, Solaris provides explicit extensions to their traditional functionality for multithreading and signals. For example, setjmp has a signal-aware version called sigsetjmp, which saves the current processor registers and the stack environment. This extended version also takes a savemask parameter; if it is non-zero, the caller's signal mask and scheduling information are saved as well. A later call to siglongjmp restores the environment, including the processor registers and, optionally, the signal mask and scheduling information.
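The effect of savemask can be demonstrated with the POSIX sigsetjmp and siglongjmp; the Solaris-specific saving of scheduling information is not shown. While a handler runs, the system blocks the signal being handled, and a normal return from the handler unblocks it. Jumping out of the handler skips that return, so the mask is restored only if sigsetjmp saved it. The names on_usr1 and usr1_blocked_after_jump are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf env;

static void on_usr1(int sig)
{
    (void)sig;
    siglongjmp(env, 1);          /* leave the handler without returning */
}

/* Returns 1 if SIGUSR1 is still blocked after jumping out of its
   handler, 0 if it is not. */
int usr1_blocked_after_jump(int savemask)
{
    struct sigaction sa;
    sigset_t current, usr1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);

    if (sigsetjmp(env, savemask) == 0) {
        raise(SIGUSR1);          /* handler runs with SIGUSR1 blocked */
        return -1;               /* not reached: the handler jumps away */
    }
    sigprocmask(SIG_BLOCK, NULL, &current);     /* query the mask only */
    if (sigismember(&current, SIGUSR1)) {
        sigprocmask(SIG_UNBLOCK, &usr1, NULL);  /* clean up */
        return 1;
    }
    return 0;
}
```

With a non-zero savemask the signal is unblocked again after the jump; with zero it stays blocked, and later SIGUSR1 signals would silently remain pending.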

Multithreading with Set Jump and Long Jump

While the set jump and long jump APIs can be used within a multithreaded program, care must be taken to ensure that each thread has its own environment variable. This also means that long jumps cannot leap across threads: if, for example, thread A calls setjmp but thread B calls longjmp using the buffer set by thread A, the behavior is undefined, and the typical result is a memory access violation.

Continuance Semantics [Solaris96]

The program below employs what is called completion (or continuance) semantics, with an abort option. Completion semantics refers to abandoning an execution block in order to run a viable alternative. The alternative attempts to complete what has failed, or provides a fail-safe state. The following two code fragments illustrate this idea. Below, the function sigsetjmp marks the step just prior to a critical computation and returns zero, so that a siglongjmp can later return to this point.

If there is a floating-point error, the exception handler grace() is invoked in place of the system's default handler, which would have terminated the process and dumped core. The installed handler, grace, counts the floating-point errors and returns control to the point just prior to the errant computation. If the count of floating-point errors exceeds MAX_TRIES, the process is terminated. When a floating-point error occurs and error_count does not exceed MAX_TRIES, sigsetjmp returns a non-zero value (i.e., 1). This causes the if-then statement to set the variable y to 1, a fail-safe value, "fixing" the problem.
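A minimal sketch of this scheme follows, assuming POSIX sigaction and sigsetjmp rather than the original Solaris code. The critical computation compute and the driver guarded_compute are hypothetical, and the floating-point error is simulated with raise(SIGFPE) so the example behaves the same on every processor. Passing a non-zero savemask matters here: it makes siglongjmp unblock SIGFPE again, so a second error can still be caught.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TRIES 3

static sigjmp_buf env;
static volatile sig_atomic_t error_count = 0;

/* Installed in place of the default action (terminate, dump core). */
static void grace(int sig)
{
    (void)sig;
    if (++error_count > MAX_TRIES)
        _Exit(EXIT_FAILURE);     /* abort option: too many errors */
    siglongjmp(env, 1);          /* back to just before the computation */
}

/* Hypothetical critical computation; `fail` simulates an FPE. */
static double compute(int fail)
{
    if (fail)
        raise(SIGFPE);
    return 42.0;
}

double guarded_compute(int fail)
{
    struct sigaction sa;
    volatile double y;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = grace;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, NULL);

    if (sigsetjmp(env, 1) == 0)  /* mark the step before the computation */
        y = compute(fail);
    else
        y = 1.0;                 /* fail-safe value "fixes" the problem */
    return y;
}
```

guarded_compute(0) returns 42.0, while guarded_compute(1) recovers through grace and returns the fail-safe 1.0.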

 

Exception handling in Win32 [Win32 Pgm ref, vol.2]

In this section we introduce Win32's preferred mechanism for addressing exceptions within a process. The previous sections discussed the application of signal handlers to catch exceptions, that is, as a method to trap errors that would otherwise generate a run-time error. The signaling method used there modeled the underlying interrupt system of the typical computer. The Win32 subsystem, however, provides a mechanism designed to process exceptions in support of a structured programming method. Historically, C programmers used a return-code paradigm, in which a function returns a code to indicate the success or failure of the call. This traditional method of error trapping has problems; for example, it is often difficult to select an appropriate return code. Consider C's getchar() function: it is not unreasonable to expect it to return a char, yet it returns an int so that it can also return an end-of-file flag (EOF). Another problem is that additional data is often needed to narrow the cause of an error, so programmers need to use an extra-logical variable such as errno. An alternative to the traditional method is to embed exception handling into a control structure [__]. To that end, Win32 supports structured error/exception handling.

 

Capturing an Exception In Win32

When an exception occurs in a process, the state of the offending thread is saved in a structure called CONTEXT; that is, when a thread raises an exception, the thread's context is placed in this structure, enabling the thread to continue later, if possible. The details of the exception are stored in a separate structure called EXCEPTION_RECORD, which includes:

• a unique exception code;

• flags to indicate if the interrupted code can continue;

• a pointer to a linked list of exception records (to keep track of nested exceptions);

• the address where the exception occurred; and,

• an array of specific information about the current exception.

Unlike the previously discussed signaling methods, in which a programmer writes a function to catch an exception, Win32 is designed for the programmer to apply language constructs [60]. The method is structured in that Microsoft's C compiler provides structured exception handling extensions to the language: the try-exception and try-finally statements, written with the __try, __except, and __finally keywords. Each defines a block of code that is textually "local" to where the exception might occur or, at least, within a syntactically enclosing block of code. The language construct that provides this structured exception handling is called try-exception; its syntax is:

__try {
    guarded body of code
}
__except (filter expression) {
    exception-handler block (the redemptive action)
}

Its semantics are as follows: execute the guarded body, and if an exception occurs, evaluate the filter expression (i.e., the expression in __except(expression)). If it evaluates to EXCEPTION_EXECUTE_HANDLER (1, i.e., TRUE), take the redemptive action; if it evaluates to EXCEPTION_CONTINUE_SEARCH (0, i.e., FALSE), the system continues searching the enclosing frames, as described next. This structure can be nested within other try-exception structures. With each level of nesting, the system stacks another exception handler; each instance is called a frame. When an exception occurs, the system backtracks through the frames until a handler is found, or until the frame stack is exhausted and the system terminates the process. A handler is considered found if it catches the exception. The try-exception construct is a block structure that can be used as part of a function, for example:

 

A variant is the try-finally structure.

Its semantics are simple: always execute the finally block of code, however the guarded block is exited. It is useful in parallel systems that use locks, because it can ensure that locks are released in the event of a failure in a process or thread. In the code fragment below, a lock is acquired and is guaranteed to be released, short of a catastrophic failure.

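#include <windows.h>

CRITICAL_SECTION lock;   /* initialized elsewhere with InitializeCriticalSection */

void update(void)        /* hypothetical function that needs the lock */
{
    EnterCriticalSection(&lock);       /* lock */
    __try {
        /* do stuff */
    }
    __finally {
        LeaveCriticalSection(&lock);   /* unlock: runs however the block exits */
    }
}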

 

Continuance Semantics with Win32

Win32 uses what might be called frame-based continuance semantics. This method of continuation places the handler within a well defined textual block. This is in contrast to the method of explicitly defining a function as an exception handler, whose own text may be located in another file.

Multithreading and Signals

The signaling semantics of a single-threaded process can cause problems in a multithreaded process. Traditionally, an operating system treats a process as one system entity. Thus, if some thread causes a trap, or a signal is generated for a process, the question arises of which thread of control should be interrupted to receive the signal. The semantics for signaling a process therefore need to be extended to include multiple threads of control within a process, that is, a multithreaded process.

There are two possible solutions to the problem of selecting a thread to catch a signal: 1) arbitrarily select a thread, or 2) designate a thread within the process to receive all signals. Arbitrarily selecting a thread of control to execute a signal handler can lead to a number of difficult synchronization problems. Consider, for instance, a situation in which a client thread is holding a number of locks when it is interrupted by a Ctrl-C signal. While the thread or process will be able to terminate, this abrupt and random thread selection can make it difficult, if not impossible, to program a thread so that it releases the locks it holds. Selecting a particular thread, by contrast, adds an element of predictability. There are also instances when a signal clearly "belongs" to a thread; for example, when a thread attempts a division by zero. Consequently, from a process' point of view there are two distinct origins of signals: those from outside the thread (e.g., Ctrl-C) and those from inside the thread (e.g., SIGFPE). These are known as asynchronous and synchronous signals, respectively.
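The second solution, a designated signal thread, can be sketched with POSIX threads; pthread_sigmask plays the role of Solaris thr_sigsetmask. Under this sketch's assumptions, the main thread blocks SIGUSR1 before creating any threads, so every thread inherits the mask and no arbitrary thread can be interrupted; the designated thread then accepts the signal synchronously with sigwait. The names handled, received, signal_thread, and designated_thread_demo are hypothetical, and SIGUSR1 stands in for an asynchronous signal such as Ctrl-C.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static sigset_t handled;     /* signals owned by the designated thread */
static int received = 0;     /* written by that thread, read after join */

/* The designated thread: accepts SIGUSR1 synchronously via sigwait. */
static void *signal_thread(void *arg)
{
    int sig = 0;

    (void)arg;
    if (sigwait(&handled, &sig) == 0)
        received = sig;
    return NULL;
}

/* Block SIGUSR1 everywhere, start the designated thread, signal the
   whole process, and return the signal number that thread received. */
int designated_thread_demo(void)
{
    pthread_t tid;

    sigemptyset(&handled);
    sigaddset(&handled, SIGUSR1);
    /* Block before creating threads: new threads inherit this mask. */
    pthread_sigmask(SIG_BLOCK, &handled, NULL);

    if (pthread_create(&tid, NULL, signal_thread, NULL) != 0)
        return -1;
    kill(getpid(), SIGUSR1);     /* process-directed, like Ctrl-C */
    pthread_join(tid, NULL);
    pthread_sigmask(SIG_UNBLOCK, &handled, NULL);
    return received;
}
```

Because the signal is blocked everywhere, it stays pending until the designated thread collects it, even if kill runs before that thread reaches sigwait.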

 

Asynchronous Signals (async-signal)

These interrupts are generated from outside the thread; SIGINT is an example. In general, threads have the option of masking any asynchronous signal, and thus do not have to respond. In systems such as Solaris, a masked signal remains pending until either some thread within the process enables (unmasks) the signal or the process is terminated. If more than one thread is able to respond to a given signal, the system selects one arbitrarily.

Synchronous Signals

These are signals that are generated either directly by a thread or as a result of a thread's action (e.g., a segmentation fault), and they are particular to that thread. The ability of a thread to catch its own traps is a useful extension. Generally, intra-process interrupts (e.g., thr_terminate) are also considered synchronous, and the signal is sent to the targeted thread. As an illustration of the usefulness of directing a signal to a specific thread, consider the following: deep in a branch-and-bound problem, an unforeseen overflow has corrupted a computation, and there has been a segmentation fault, so the entire process dumps core! A more elegant solution might be to have the thread catch the error and "gracefully" terminate itself, permitting the search to continue in, say, other threads. Another alternative is completion semantics, in which the offending thread initially uses setjmp to capture the current "good" state; later, when a failure occurs, the handler can apply some fixes and long-jump back to that "good" state in an effort to complete the current computation successfully.
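Directing a signal to one specific thread can be sketched with the POSIX pthread_kill, the counterpart of Solaris thr_kill. In this sketch the handler records which thread it ran in; the names caught_in, caught, on_usr1, worker, and directed_signal_demo are hypothetical, and the worker simply spins until the directed signal arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>

static pthread_t caught_in;                /* thread that ran the handler */
static volatile sig_atomic_t caught = 0;

static void on_usr1(int sig)
{
    (void)sig;
    caught_in = pthread_self();   /* record which thread was interrupted */
    caught = 1;
}

/* Hypothetical worker: spins until the directed signal arrives. */
static void *worker(void *arg)
{
    (void)arg;
    while (!caught)
        ;
    return NULL;
}

/* Send SIGUSR1 to one specific worker; return 1 if the handler ran in
   that worker, 0 if it ran elsewhere, -1 on error. */
int directed_signal_demo(void)
{
    struct sigaction sa;
    pthread_t tid;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    pthread_kill(tid, SIGUSR1);   /* thread-directed (cf. thr_kill) */
    pthread_join(tid, NULL);
    return pthread_equal(caught_in, tid) ? 1 : 0;
}
```

The handler is installed process-wide, but pthread_kill guarantees it runs in the targeted thread rather than in an arbitrarily chosen one.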

The following is a program fragment that maintains a separate signal-set-jump environment for each of its worker threads, using Solaris' thread-specific data APIs. The program operates as follows. The main function first creates a thread-specific key, which the signal handler uses to retrieve the signaled thread's environment, and then creates the workers. When a worker is first created, it uses the key to check whether it already has thread-specific storage allocated (i.e., whether tsd_env == NULL). If it does not, it allocates space with malloc. This newly allocated space is where it stores its signal-set-jump environment.
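The same design can be sketched with the POSIX thread-specific data calls pthread_key_create, pthread_getspecific, and pthread_setspecific, which correspond to Solaris thr_keycreate, thr_getspecific, and thr_setspecific. This is an illustration under stated assumptions, not the original program: the names env_key, handler, worker, and run_workers are hypothetical, and the synchronous error is simulated with raise(SIGFPE), which in a threaded program is delivered to the calling thread. Note that pthread_getspecific is not on POSIX's list of async-signal-safe functions; calling it from a handler works on common implementations but is not guaranteed.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

static pthread_key_t env_key;     /* key for each thread's environment */

/* Handler: fetch the signaled thread's own environment and jump to it. */
static void handler(int sig)
{
    sigjmp_buf *tsd_env = pthread_getspecific(env_key);

    (void)sig;
    if (tsd_env != NULL)
        siglongjmp(*tsd_env, 1);
    _Exit(EXIT_FAILURE);          /* no environment saved: give up */
}

/* Worker: allocate its own environment on first use, then compute. */
static void *worker(void *arg)
{
    long fail = (long)arg;        /* non-zero: simulate a synchronous FPE */
    sigjmp_buf *tsd_env = pthread_getspecific(env_key);

    if (tsd_env == NULL) {
        tsd_env = malloc(sizeof *tsd_env);
        if (tsd_env == NULL)
            return (void *)0L;
        pthread_setspecific(env_key, tsd_env);
    }
    if (sigsetjmp(*tsd_env, 1) != 0)
        return (void *)1L;        /* recovered within this thread */
    if (fail)
        raise(SIGFPE);            /* delivered to this thread only */
    return (void *)0L;
}

/* Run one failing and one healthy worker; return how many recovered. */
int run_workers(void)
{
    struct sigaction sa;
    pthread_t t[2];
    void *r0, *r1;

    pthread_key_create(&env_key, free);   /* free each env at thread exit */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, NULL);

    pthread_create(&t[0], NULL, worker, (void *)1L);
    pthread_create(&t[1], NULL, worker, (void *)0L);
    pthread_join(t[0], &r0);
    pthread_join(t[1], &r1);
    return (int)((long)r0 + (long)r1);
}
```

Only the failing worker jumps back to its own saved state; the healthy worker is never interrupted, so run_workers() returns 1.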