1 Introduction

A system exists in an environment (e.g., a space probe in deep space), and has operators and users (possibly the same). The system provides feedback to the operator and services to the user. Operators are shown inside the system because operator procedures are usually a part of the system design, and many system functions, including fault recovery, may involve operator action. Not shown in the figure, but of equal importance, are the system's designers and maintainers.
Systems are developed to satisfy a set of requirements that meet a need. A requirement that is important in some systems is that they be highly dependable. Fault tolerance is a means of achieving dependability.
There are three levels at which fault tolerance can be applied. Traditionally, fault tolerance has been used to compensate for faults in computing resources (hardware). By managing extra hardware resources, the computer subsystem increases its ability to continue operation. Hardware fault tolerance measures include redundant communications, replicated processors, additional memory, and redundant power/energy supplies. Hardware fault tolerance was particularly important in the early days of computing, when the time between machine failures was measured in minutes.
A second level of fault tolerance recognizes that a fault tolerant hardware platform does not, in itself, guarantee high availability to the system user. It is still important to structure the computer software to compensate for faults such as changes in program or data structures due to transients or design errors. This is software fault tolerance. Mechanisms such as checkpoint/restart, recovery blocks and multiple-version programs are often used at this level.
At a third level, the computer subsystem may provide functions that compensate for failures in other system facilities that are not computer-based. This is system fault tolerance. For example, software can detect and compensate for failures in sensors. Measures at this level are usually application-specific. It is important that fault tolerance measures at all levels be compatible, hence the focus on system-level issues in this document.
Generated with CERN WebMaker