[Next] [Previous] [Up] [Top]

4 Fault Tolerance Mechanisms

4.1 Characteristics Unique to Digital Computer Systems

Digital computer systems have special characteristics that determine how these systems fail and what fault tolerance mechanisms are appropriate. First, digital systems are discrete systems. Unlike continuous systems, such as analog control systems, they operate in discontinuous steps. Second, digital systems encode information. Unlike continuous systems, values are represented by a series of encoded symbols. Third, digital systems can modify their behavior based on the information they process.

Since digital systems are discrete systems, results may be tested or compared before they are released to the outside world. While analog systems must continuously apply redundant or limiting values, a digital system may substitute an alternative result before sending an output value. While it is possible to build digital computers that operate asynchronously (without a master clock to sequence internal operations), in practice all digital computers are sequenced from a clock signal. This dependency on a clock makes an accurate clock source as important as a source of power, but it also means that identical sequences of instructions take essentially the same amount of time. One of the most common fault tolerance mechanisms, the time-out, uses this property to measure program activity (or lack of activity).

The fact that digital systems encode information is extremely important. The most important implication of information encoding is that digital systems can accurately store information for a long period of time, a capability not available in analog systems, which are subject to value drift. This also means that digital systems can store identical copies of information and expect the stored copies to still be identical after a substantial period of time. This makes the comparison techniques discussed in Section 4.4 possible.

Information encoding in digital systems may be redundant, with several codes representing the same value. Redundant encoding is the most powerful tool available to ensure that information in a digital system has not been changed during storage or transmission. Redundant encoding may be implemented at several levels in a digital system. At the lowest levels, carefully designed code patterns attached to blocks of digital information can allow special-purpose hardware to correct for a number of different communication or storage faults, including changes to single bits or changes to several adjacent bits. Parity for random access memory is a common example of this use of encoding. Since a single bit of information can have significant consequences at the higher levels, a programmer may encode sensitive information, such as indicators for critical modes, as special symbols unlikely to be created by a single-bit error.


A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95
[Next] [Previous] [Up] [Top]

Generated with CERN WebMaker