3 Fault Tolerance Concepts With Examples
The definition of failure as the deviation of the service delivered by a system from the system specification essentially eliminates "specification" faults or errors. While this approach may appear to be avoiding the problem by defining it away, it is important to have some reference for the definition of failure, and the specification is a logical choice. The specification can be considered as a boundary to the system's region of concern, discussed later. It is important to recognize that every system has an explicit specification, which is written, and an implicit specification: that the system should behave at least as well as a reasonable person could expect based on experience with similar systems and with the world in general. Clearly, it is important to make as much of the specification explicit as possible.
It has become the practice to define faults in terms of failure(s). The concept closest to the common understanding of the word fault is one that defines a fault as the adjudged cause of a failure. This fits with a common application of the verb form of the word fault, which involves determining cause or affixing blame. However, this requires an understanding of how failures are caused. An alternate view of faults is to consider them failures in other systems that interact with the system under consideration--either a subsystem internal to the system under consideration, a component of the system under consideration, or an external system that interacts with the system under consideration (the environment). In the first instance, the link between faults and failures is cause; in the second case it is level of abstraction or location.
The advantages of defining faults as failures of component/interacting systems are: (1) one can consider faults without the need to establish a direct connection with a failure, so we can discuss faults that do not cause failures, i.e., the system is naturally fault tolerant, (2) the definition of a fault is the same as the definition of a failure with only the boundary of the relevant system or subsystem being different. This means that we can consider an obvious internal defect to be a fault without having to establish a causal relationship between the defect and a failure at the system boundary.
In light of the preceding discussion, a fault will be defined as the failure of (1) a component of the system, (2) a subsystem of the system, or (3) another system which has interacted or is interacting with the considered system. Every fault is a failure from some point of view. A fault can lead to other faults, or to a failure, or neither.
A system with faults may continue to provide its service, that is, not fail. Such a system is said to be fault tolerant. Thus, an important motivation for differentiating between faults and failures is the need to describe the fault tolerance of a system. An observer inspecting the internals of the system would say that the faulty component had failed, because the observer's viewpoint is now at a lower level of detail.
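The distinction between a fault and a failure can be sketched in code by borrowing the highway bridge that this document uses as its running example. The sketch below is only an illustration under stated assumptions: the class names `Strut` and `Bridge`, the load-sharing rule (strut capacities simply add up), and all tonnage figures are hypothetical and do not come from the text. Each object is judged against its own specification, so the same event is a failure at the strut's boundary and a fault at the bridge's boundary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Strut:
    rated_capacity_tons: float   # what the strut's own specification promises
    actual_capacity_tons: float  # what it can really carry (hypothetical figure)

    def has_failed(self) -> bool:
        # Failure judged at the strut's boundary: delivered service
        # deviates from the strut's specification.
        return self.actual_capacity_tons < self.rated_capacity_tons


@dataclass
class Bridge:
    rated_load_tons: float  # the bridge's specified service
    struts: List[Strut]

    def faults(self) -> List[Strut]:
        # Every component failure is a fault from the bridge's point of view.
        return [s for s in self.struts if s.has_failed()]

    def fails_under(self, load_tons: float) -> bool:
        # Failure judged at the bridge's boundary: a load that is within
        # the specification is not carried. Assumes capacities simply add up.
        within_spec = load_tons <= self.rated_load_tons
        carried = load_tons <= sum(s.actual_capacity_tons for s in self.struts)
        return within_spec and not carried


# One weak strut: a fault exists, but the bridge still meets its specification.
tolerant = Bridge(20.0, [Strut(10.0, 10.0), Strut(10.0, 10.0),
                         Strut(10.0, 10.0), Strut(10.0, 2.0)])

# Every strut is weak: the faults now lead to a failure at the bridge boundary.
fragile = Bridge(20.0, [Strut(10.0, 4.0) for _ in range(4)])

print(len(tolerant.faults()), tolerant.fails_under(20.0))
print(len(fragile.faults()), fragile.fails_under(20.0))
```

In this sketch a 50 ton load on either bridge is not reported as a failure, because the load lies outside the specification, which is the reference against which failure is defined.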
The observable effect of a fault at the system boundary is called a symptom. The most extreme symptom of a fault is a failure, but it might also be something as benign as a high reading on a temperature gauge. Symptoms are discussed in greater detail later.
As we have seen, differentiation between failures and faults is essential for fault tolerant systems. A third term, error, adds little to this distinction and can be a source of confusion. Consequently, we substitute the term fault for the common uses of the term error. Generally, references to the term "error" in the literature can be fitted to the context of this document by substituting the term "fault."
When designing the bridge, the designer must consider a myriad of details regarding requirements and the environment in which the bridge will operate. Suppose a 20 ton truck drives onto the bridge and the bridge collapses. From the truck's point of view, the bridge has failed. But what is the fault that led to the failure? There are many possible answers to this:
Scenarios like this can be generated ad infinitum. Note that a fault does not lead to a failure unless the result is observable by the user, and leads to the bridge becoming unable to deliver its specified service. This means that one person's fault is another person's failure. For instance, in example 4 above, from the point of view of the highway department the erroneous documentation was a fault that led to an operator failure. From the point of view of the user of the bridge the erroneous documentation was a documentation fault that led to an operator fault which led to a bridge failure.
The reasons for the memory fault could be manifold. The chip used might not have been manufactured to specification (a manufacturing fault), the hardware design may have caused too much power to be applied to the chip (a system design fault), the chip design may be prone to such faults (a chip design fault), a field engineer may have inadvertently shorted two lines while performing preventive maintenance (a maintenance fault), etc.
3.2.1.1 Concept Definition
Over time, failure has come to be defined in terms of specified service delivered by a system. This avoids circular definitions involving essentially synonymous terms such as defect, etc. This distinction appears to have been first proposed by Melliar-Smith [Melliar-Smith 75]. A system is said to have a failure if the service it delivers to the user deviates from compliance with the system specification for a specified period of time. While it may be difficult to arrive at an unambiguous specification of the service to be delivered by any system, the concept of an agreed-to specification is the most reasonable of the options for defining satisfactory service and the absence of satisfactory service, failure.
A Digression on Errors
The term error often is used in addition to the terms fault and failure, as in the article by Melliar-Smith previously cited. Often, errors are defined to be the result of faults, leading to failures. Informally, errors seem to be a passive concept associated with incorrect values in the system state. However, it is extremely difficult to develop unambiguous criteria for differentiating between faults and errors. Many researchers refer to value faults, which are also clearly erroneous values. The connection between error and failure is even more difficult to describe.
3.2.1.2 Bridge Example
To help understand these definitions, consider the example of a highway bridge over a river. Some time after this example was developed, Alfred Spector pointed out that a precedent for using it exists in an article comparing practices in bridge design with practices in software design [Spector 86].
As an example of a fault which does not lead to a failure, consider the same bridge with a crack in its concrete roadbed. There is no failure involved if the bridge continues to carry the loads requested of it in spite of this fault. It may be the result of normal wear and tear on the roadbed. However, a thorough inspection of the bridge might discover that the crack in the roadbed was a symptom of a faulty strut, only observable by x-raying the strut. From the point of view of the bridge inspector, the strut would have failed. This component failure is an internal fault.
3.2.1.3 Computer System Example
Consider a computer system running a program to control the temperature of a boiler by calculating the firing rate of the burner for the boiler. If a bit in memory becomes stuck at one, that is a fault. If the memory fault affects the operation of the program in such a way that the computer system outputs cause the boiler temperature to rise out of the normal zone, that is a computer system failure and a fault in the overall boiler system. If there is a gauge showing the temperature of the boiler, and its needle moves into the "yellow" zone (abnormal, but acceptable), that is a symptom of the system fault. On the other hand, if the boiler explodes because of the faulty firing calculation, that is a (catastrophic) system failure.
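The boiler scenario can be walked through with a small sketch. Everything numeric here is an assumption made for illustration: the zone limits `NORMAL_MAX_C` and `YELLOW_MAX_C`, the stored setpoint of 150, and the simplification that the boiler temperature ends up equal to whatever setpoint the program reads from memory. The function names are hypothetical as well. The point is only to show one stuck-at-one bit producing no visible effect, a symptom, or a failure, depending on which bit is stuck.

```python
NORMAL_MAX_C = 180   # assumed upper limit of the normal zone
YELLOW_MAX_C = 220   # assumed upper limit of the "yellow" (abnormal, acceptable) zone


def read_with_stuck_bit(stored_value: int, stuck_bit: int) -> int:
    # A stuck-at-one fault: the given bit always reads as 1,
    # regardless of the value that was written.
    return stored_value | (1 << stuck_bit)


def observe_boiler(temp_c: int) -> str:
    # What is visible at the boiler system's boundary.
    if temp_c <= NORMAL_MAX_C:
        return "normal"
    if temp_c <= YELLOW_MAX_C:
        return "symptom"   # gauge in the yellow zone
    return "failure"       # temperature beyond the acceptable range


SETPOINT_C = 150  # binary 0b10010110, assumed to be the value written to memory

# Simplification: the controlled temperature equals the setpoint as read.
for bit in (1, 5, 8):
    temp = read_with_stuck_bit(SETPOINT_C, bit)
    print(bit, temp, observe_boiler(temp))
```

With bit 1 stuck, the stored value already has that bit set, so the fault exists but nothing changes. With bit 5 stuck, the reading becomes 182 and the gauge enters the yellow zone. With bit 8 stuck, the reading becomes 406 and the boiler system fails.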
A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95