
4 Fault Tolerance Mechanisms

4.3 Acceptance Test Techniques

The fault detection mechanism used influences the remainder of the fault tolerance activities (diagnosis, containment, masking, compensation, and recovery). The two common mechanisms for fault detection are acceptance tests and comparison.

4.3.1 Fault Detection


Acceptance tests are the more general fault detection mechanism in that they can be used even if the system is composed of a single (non-redundant) processor. The program or sub-program is executed and the result is subjected to a test. If the result passes the test, execution continues normally. A failed acceptance test is a symptom of a fault. An acceptance test is most effective if it is based on criteria that can be derived independently of the function being tested and can be calculated more simply than the function being tested (e.g., multiplying a result by itself to verify the result of a square root function).
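The square root example can be sketched as follows. This is a minimal illustration, not part of the original framework: the names `sqrt_acceptance_test` and `checked_sqrt` and the tolerance values are assumptions chosen for the example.

```python
import math

def sqrt_acceptance_test(x, result, rel_tol=1e-9):
    """Accept `result` as sqrt(x) if it is non-negative and squares back to x.

    The check is one multiplication and one comparison, so it is simpler
    than the function under test and derived independently of it.
    The tolerances are illustrative assumptions.
    """
    return result >= 0 and math.isclose(result * result, x,
                                        rel_tol=rel_tol, abs_tol=1e-12)

def checked_sqrt(x, sqrt_fn=math.sqrt):
    """Run the (possibly faulty) square root routine, then test its result."""
    result = sqrt_fn(x)
    if not sqrt_acceptance_test(x, result):
        # A failed acceptance test is only a symptom: it does not say why.
        raise ValueError("acceptance test failed for sqrt(%r)" % (x,))
    return result
```

Passing a deliberately wrong routine, such as `checked_sqrt(16.0, sqrt_fn=lambda v: v / 2)`, raises `ValueError`, while the default routine passes the test.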

4.3.2 Fault Diagnosis


An acceptance test cannot generally be used to determine what has gone wrong. It can only indicate that something has gone wrong.

4.3.3 Fault Containment


An acceptance test provides a barrier to the continued propagation of a fault. Further execution of the program being tested is not allowed until some form of retry successfully passes the acceptance test. If no alternatives pass the acceptance test, the subsystem fails, preferably silently. The silent failure of faulty components allows the rest of the system to continue in operation (where possible) without having to worry about erroneous output from the faulty component [Schlichting 83].
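The containment behaviour described above might look like the following sketch. The class name `FailSilentComponent` and its interface are hypothetical, and returning `None` is just one assumed way of representing "no output" from a silently failed subsystem.

```python
class FailSilentComponent:
    """Release a result only if it passes the acceptance test.

    If no variant passes, the component marks itself failed and emits
    nothing from then on, so erroneous output cannot propagate.
    """

    def __init__(self, variants, acceptance_test):
        self.variants = list(variants)      # primary first, then alternates
        self.acceptance_test = acceptance_test
        self.failed = False

    def step(self, value):
        if self.failed:
            return None                     # stay silent once failed
        for variant in self.variants:
            try:
                result = variant(value)
            except Exception:
                continue                    # a crash counts as a failed attempt
            if self.acceptance_test(value, result):
                return result               # only tested results leave the barrier
        self.failed = True                  # no alternative passed: fail silently
        return None
```

The rest of the system can treat a `None` output as "component unavailable" rather than having to validate every value it receives.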

4.3.4 Fault Masking


An acceptance test successfully masks a bad value if a retry or an alternate produces a new, correct result within the time limit set for declaring failure.
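As a rough sketch of masking under a deadline, the function below tries each variant in turn and reports success only if a result passes the acceptance test before the time limit expires. The name `masked_result`, its parameters, and the `(masked, result)` return convention are assumptions made for this illustration.

```python
import time

def masked_result(variants, acceptance_test, value, time_limit_s):
    """Return (True, result) if some variant passes the test in time,
    otherwise (False, None), meaning failure must be declared."""
    deadline = time.monotonic() + time_limit_s
    for variant in variants:
        if time.monotonic() >= deadline:
            break                           # out of time: stop retrying
        try:
            result = variant(value)
        except Exception:
            continue                        # treat a crash as a failed attempt
        if acceptance_test(value, result) and time.monotonic() <= deadline:
            return True, result             # bad value masked by a good one
    return False, None
```

The deadline is only checked between attempts, so this sketch cannot interrupt a variant that hangs; a real system would need a watchdog or similar mechanism for that case.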

4.3.5 Fault Compensation


A program that fails an acceptance test can be replaced by an alternate, as described in the next section. If the alternate passes the acceptance test, its result may be used to compensate for the original result. Notice that the alternate program run during a retry may be a very simple one that just outputs a "safe" value to compensate for the faulty subsystem. A common approach in control systems is to "coast" the result by providing the value calculated from the last known good cycle.
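The "coasting" approach can be illustrated with a short sketch. The class name `CoastingOutput`, the `cycle` method, and the need for an initial safe value are assumptions introduced for the example.

```python
class CoastingOutput:
    """Hold the last known good value whenever the current cycle's
    result fails the acceptance test."""

    def __init__(self, compute, acceptance_test, initial_safe_value):
        self.compute = compute
        self.acceptance_test = acceptance_test
        # Used if the very first cycle fails (an assumed "safe" default).
        self.last_good = initial_safe_value

    def cycle(self, sample):
        try:
            result = self.compute(sample)
        except Exception:
            return self.last_good           # coast through a crashed cycle
        if self.acceptance_test(result):
            self.last_good = result         # remember the newest good value
            return result
        return self.last_good               # compensate with the old value
```

For example, with `compute=lambda s: s * 2` and a range check of 0 to 100 as the acceptance test, samples 10, 500, 30 would yield outputs 20, 20, 60: the out-of-range middle result is replaced by the previous good one.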

4.3.6 Fault Repair


Acceptance tests are usually used in a construct known as a recovery block. A recovery block provides backward fault recovery by rolling program execution back to the state before the faulty function was executed. This repairs the faulty state and the result. When a result fails an acceptance test, the program can be executed again before leaving the recovery block. If the new result passes the acceptance test, it can be assumed that the fault originally detected was transient. If the software is suspect, an alternative can be executed in place of the original program fragment.

If a single processor is used, the state of the processor must be reset to the beginning of the function in question. A mechanism called the recovery cache has been proposed to accomplish this [Anderson 76]. A recovery cache records the state of the processor at the entrance to each recovery block. Although a recovery cache is best implemented in hardware, implementations to date have been limited to experimental software.

Where multiple processors are available, the retry may take the form of starting the program on a backup processor and shutting down the failed processor. Recovery blocks can be cascaded so that multiple alternatives can be tried when an alternate result also fails the acceptance test.
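A recovery block can be sketched in software as follows. This is an illustrative approximation only: the function `recovery_block`, the exception `RecoveryBlockError`, and the use of a deep copy of a dictionary as a stand-in for the recovery cache are assumptions, not the mechanism proposed in [Anderson 76].

```python
import copy

class RecoveryBlockError(Exception):
    """Raised when no variant passes the acceptance test."""

def _restore(state, checkpoint):
    # Roll the mutable state back to the checkpointed contents, in place.
    state.clear()
    state.update(copy.deepcopy(checkpoint))

def recovery_block(state, variants, acceptance_test, retries=1):
    """Try each variant (primary first), rolling `state` back before
    every attempt so that a faulty run leaves no trace.

    `state` is a dict; the checkpoint plays the role of a recovery cache.
    Each variant is re-run up to `retries` extra times to ride out
    transient faults before moving on to the next alternate.
    """
    checkpoint = copy.deepcopy(state)       # recorded on entry to the block
    for variant in variants:
        for _ in range(1 + retries):
            _restore(state, checkpoint)     # backward recovery
            try:
                result = variant(state)
            except Exception:
                continue
            if acceptance_test(state, result):
                return result
    _restore(state, checkpoint)             # leave a clean state on failure
    raise RecoveryBlockError("all variants failed the acceptance test")
```

Cascading falls out of the `variants` list: each entry is tried only after the previous ones have failed. A multiprocessor version would replace the in-place rollback with a restart on a backup processor, which this single-process sketch does not attempt to model.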


A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95