
4 Fault Tolerance Mechanisms

4.4 Comparison Techniques

4.4.1 Fault Detection


Comparison is an alternative to acceptance tests for detecting faults. If the principal fault source is processor hardware, then multiple processors are used to execute the same program. As results are calculated, they are compared across processors. A mismatch indicates the presence of a fault. This comparison can be pair-wise, or it may involve three or more processors simultaneously. In the latter case the mechanism is generally referred to as voting. If software design faults are a major consideration, then the comparison is made between the results from multiple versions of the software in question, a mechanism known as n-version programming [Chen 78]. This is discussed further in Section 4.5.
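
As a minimal sketch of comparison-based detection, the functions below flag a fault when replicated results disagree. The names `pair_mismatch` and `any_mismatch` are hypothetical, and the sketch assumes that replicas produce results that can be compared for exact equality.

```python
from itertools import combinations


def pair_mismatch(result_a, result_b):
    """Pair-wise comparison: True when the two results differ."""
    return result_a != result_b


def any_mismatch(results):
    """Comparison across three or more processors.

    True when at least two of the results differ, which indicates
    the presence of a fault somewhere in the group.
    """
    return any(x != y for x, y in combinations(results, 2))
```

Detection alone only says that a fault exists. Deciding which processor is at fault is a separate step.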

4.4.2 Fault Diagnosis


Fault diagnosis with comparison depends on whether pair-wise or voting comparison is used:

pair-wise
When a mismatch occurs within a pair, it is impossible to tell which of the two processors has failed. The entire pair must be declared faulty.

voting
When three or more processors are running the same program, the processor whose values do not match the others is easily diagnosed as the faulty one.
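
The two diagnosis cases above can be sketched as follows. This is an illustration only: the function names and the `results` mapping (processor id to value) are assumptions, values are assumed to be hashable, and the voting case assumes a strict majority is needed before any processor is blamed.

```python
from collections import Counter


def diagnose_pair(results):
    """Pair-wise diagnosis for a mapping of exactly two ids to values.

    On a mismatch both members are declared faulty, because the
    comparison cannot tell which one failed.
    """
    (id_a, val_a), (id_b, val_b) = results.items()
    return set() if val_a == val_b else {id_a, id_b}


def diagnose_by_vote(results):
    """Voting diagnosis for three or more processors.

    Returns the ids whose value differs from the strict majority
    value. Raises ValueError when no strict majority exists.
    """
    value, count = Counter(results.values()).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority value; cannot diagnose")
    return {pid for pid, v in results.items() if v != value}
```

For example, `diagnose_by_vote({"A": 7, "B": 7, "C": 9})` singles out `"C"`, whereas `diagnose_pair({"A": 1, "B": 2})` has to blame both members.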

4.4.2.1 Voting Issues


Voting may be centralized or decentralized. Centralized voting is easy to mechanize, in either software or hardware, but it creates a single point of failure, which violates many qualitative requirements specifications. It is possible to compensate for total voter failure using a master-slave approach that replaces a silent voter with a standby voter, as in the pair-and-spare approach. Decentralized voting avoids the single point of failure, but requires a consensus among multiple voting agents, either hardware or software, in order to avoid the replication faults mentioned in Section 3.4.1.3. In order to reach consensus, the distributed voters must synchronize to exchange several rounds of messages. In the worst case, where up to f faulty processors are allowed to send misleading results to other processors participating in the consensus process, 3f+1 distributed voters must be provided to reach a state known as interactive consistency [Pease 80]. Interactive consistency requires that each non-faulty processor provides a value, that all non-faulty processors agree on the same set of values, and that the values in that set are correct for each of the non-faulty processors. Similar processes are required to maintain a consensus as to the number of members remaining in a group of distributed processors [Cristian 88].
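
The 3f+1 bound translates into simple sizing arithmetic. The helper names below are hypothetical, and the sketch covers only the counting, not the message exchange needed to reach interactive consistency.

```python
def min_voters(f):
    """Voters needed to tolerate up to f processors sending misleading results."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_tolerated_faults(n):
    """Largest f such that 3f + 1 <= n, for a group of n voters."""
    return max((n - 1) // 3, 0)
```

Under this bound, tolerating one such processor takes four voters, and a group of three cannot tolerate any.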

4.4.3 Fault Containment


When pair-wise comparison is used, containment is achieved by stopping all activity in the mismatching pair. Any other pairs in operation can continue executing the application, undisturbed. They detect the failure of the miscomparing pair through time-outs.
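
One way to picture pair-wise containment is a pair that goes silent on a mismatch, with the remaining pairs noticing the silence through a time-out. The sketch below is an assumption-laden illustration: `PairMonitor`, `run_pair`, and the caller-supplied timestamps are all hypothetical, and a real system would use its own clock and messaging mechanism.

```python
class PairMonitor:
    """Records when each pair last produced output (times come from the caller)."""

    def __init__(self, pair_ids, timeout):
        self.timeout = timeout
        self.last_seen = {pid: 0.0 for pid in pair_ids}

    def heard_from(self, pair_id, now):
        self.last_seen[pair_id] = now

    def failed_pairs(self, now):
        """Pairs that have been silent for longer than the time-out."""
        return {pid for pid, seen in self.last_seen.items()
                if now - seen > self.timeout}


def run_pair(pair_id, result_a, result_b, monitor, now):
    """Emit a result only when both halves of the pair agree.

    On a mismatch the pair stops: it returns nothing and reports
    nothing, so the fault stays inside the pair.
    """
    if result_a != result_b:
        return None
    monitor.heard_from(pair_id, now)
    return result_a
```

The design choice here is that a failed pair never sends a "bad" value. Its absence is the only signal the other pairs receive.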

When voting is used, containment is achieved by ignoring the failed processor and reconfiguring it out of the system.

4.4.4 Fault Masking


In a comparison-based system, fault masking is achievable in two ways. When voting is used, the voter allows only the correct value to pass on. If hardware voters are used, this usually occurs quickly enough to meet any response deadlines. If the voting is done by software voters that must reach a consensus, adequate time may not be available.

Pair-wise comparison requires the existence of multiple pairs of processors to mask faults. In this case the faulty pair of processors is halted, and values are obtained from the functional, good pairs.
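
Masking with multiple pairs can be sketched as taking the output of the first pair whose two results agree. The function name `masked_output` and the list-of-tuples input are assumptions made for illustration.

```python
def masked_output(pairs):
    """Return the value from the first pair whose two results agree.

    `pairs` is a sequence of (result_a, result_b) tuples, one per
    processor pair. Mismatching pairs are skipped, which models
    halting them. Raises RuntimeError if no pair agrees.
    """
    for result_a, result_b in pairs:
        if result_a == result_b:
            return result_a
    raise RuntimeError("no functional pair remains")
```

With a single pair there is nothing left to fall back on, which is why masking in this style needs at least two pairs.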

4.4.5 Fault Compensation


The value provided by a voter may be the majority value, the median value, a plurality value, or some other predetermined satisfactory value. While this choice is application-dependent, the most common choice is the median value. Choosing the median guarantees that the value selected was calculated by at least one of the participating processors and that it is not an extreme value.
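
The three named selection rules can be sketched as below. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: a tie or missing majority yields `None`, and the median uses `statistics.median_low` so that the result is always one of the submitted values, even with an even number of inputs.

```python
from collections import Counter
from statistics import median_low


def majority_value(values):
    """Value reported by more than half of the inputs, or None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None


def plurality_value(values):
    """Most frequently reported value, or None when the top count is tied."""
    ranked = Counter(values).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def median_value(values):
    """Middle value of the sorted inputs (lower middle for even counts)."""
    return median_low(values)
```

All three functions assume a non-empty input. The median suits numeric results that may differ slightly between processors, while majority and plurality need values that compare exactly equal.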

4.4.6 Fault Repair


In a comparison-based system with a single pair of processors, there is no recovery from a fault. With multiple pairs of processors, recovery consists of using the values from a "good" pair. Some systems provide mechanisms to restart the miscomparing pair with data from a "good" pair. If the miscomparing pair subsequently produces results that match for an adequate period of time, it may be configured back into the system.

When voting is used, recovery from a failed processor is accomplished by utilizing the "good" values from the other processors. A processor that is outvoted may be allowed to continue execution and may be configured back into the system if it successfully matches in a specified number of subsequent votes.
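
The reintegration rule for voting can be sketched as a counter of consecutive matching votes per excluded processor. The class name `Reintegrator`, its parameters, and the assumption that the voted value is computed elsewhere from the active processors are all hypothetical.

```python
class Reintegrator:
    """Readmits an outvoted processor after enough consecutive matching votes."""

    def __init__(self, required_matches):
        self.required = required_matches
        self.streak = {}  # excluded processor id -> consecutive matches

    def record_vote(self, results, voted_value, active):
        """Apply one vote and return the new set of active processor ids.

        `results` maps every processor id (active or excluded) to the
        value it produced; `voted_value` is the value the voter chose.
        """
        active = set(active)
        for pid, value in results.items():
            if pid in active:
                if value != voted_value:
                    # Outvoted: configure out and start counting matches.
                    active.discard(pid)
                    self.streak[pid] = 0
            elif value == voted_value:
                self.streak[pid] = self.streak.get(pid, 0) + 1
                if self.streak[pid] >= self.required:
                    active.add(pid)
                    del self.streak[pid]
            else:
                # Any mismatch while excluded resets the count.
                self.streak[pid] = 0
        return active
```

A usage example: with `required_matches=2`, a processor that is outvoted once rejoins the configuration after matching in the next two votes.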


A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95