
3 Fault Tolerance Concepts With Examples

3.2 Faults and Failures

3.2.1 Definitions


The terms failure and fault are key to any understanding of system reliability. Yet they are often misused. One describes the situation(s) to be avoided, while the other describes the problem(s) to be circumvented.

3.2.1.1 Concept Definition


Over time, failure has come to be defined in terms of the specified service delivered by a system. This avoids circular definitions involving essentially synonymous terms such as defect. This distinction appears to have been first proposed by Melliar-Smith [Melliar-Smith 75]. A system is said to have a failure if the service it delivers to the user deviates from compliance with the system specification for a specified period of time. While it may be difficult to arrive at an unambiguous specification of the service to be delivered by any system, the concept of an agreed-to specification is the most reasonable of the options for defining satisfactory service and its absence, failure.

The definition of failure as the deviation of the service delivered by a system from the system specification essentially eliminates "specification" faults or errors. While this approach may appear to be avoiding the problem by defining it away, it is important to have some reference for the definition of failure, and the specification is a logical choice. The specification can be considered as a boundary to the system's region of concern, discussed later. It is important to recognize that every system has an explicit specification, which is written, and an implicit specification that the system should at least behave as well as a reasonable person could expect based on experience with similar systems and with the world in general. Clearly, it is important to make as much of the specification as explicit as possible.

It has become the practice to define faults in terms of failure(s). The concept closest to the common understanding of the word fault is one that defines a fault as the adjudged cause of a failure. This fits with a common application of the verb form of the word fault, which involves determining cause or affixing blame. However, this requires an understanding of how failures are caused. An alternative view of faults is to consider them failures in other systems that interact with the system under consideration: a subsystem internal to it, one of its components, or an external system that interacts with it (the environment). In the first view, the link between faults and failures is cause; in the second, it is level of abstraction or location.

The advantages of defining faults as failures of component or interacting systems are: (1) one can consider faults without the need to establish a direct connection with a failure, so we can discuss faults that do not cause failures, i.e., cases in which the system is naturally fault tolerant, and (2) the definition of a fault is the same as the definition of a failure, with only the boundary of the relevant system or subsystem being different. This means that we can consider an obvious internal defect to be a fault without having to establish a causal relationship between the defect and a failure at the system boundary.

In light of the preceding discussion, a fault will be defined as the failure of (1) a component of the system, (2) a subsystem of the system, or (3) another system which has interacted or is interacting with the considered system. Every fault is a failure from some point of view. A fault can lead to other faults, or to a failure, or neither.

A system with faults may continue to provide its service, that is, not fail. Such a system is said to be fault tolerant. Thus, an important motivation for differentiating between faults and failures is the need to describe the fault tolerance of a system. An observer inspecting the internals of the system would say that the faulty component had failed, because the observer's viewpoint is now at a lower level of detail.
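
The relationship between faults, failures, and fault tolerance described above can be illustrated with a minimal Python sketch. The class names (Component, RedundantSystem), the required threshold, and the pump scenario are hypothetical and are not part of the framework; the sketch assumes a simplified specification under which a system delivers its service as long as a minimum number of its components have not failed.

```python
class Component:
    """A leaf system: it either delivers its specified service or it does not."""

    def __init__(self, name, working=True):
        self.name = name
        self.working = working

    def failed(self):
        return not self.working


class RedundantSystem:
    """A system that meets its specification while at least `required`
    of its components have not failed (an assumed, simplified specification)."""

    def __init__(self, name, components, required):
        self.name = name
        self.components = components
        self.required = required

    def faults(self):
        # A fault of this system is a failure of one of its components.
        return [c.name for c in self.components if c.failed()]

    def failed(self):
        # Failure is judged only at this system's own boundary.
        working = sum(1 for c in self.components if not c.failed())
        return working < self.required


pump_a = Component("pump A")
pump_b = Component("pump B", working=False)
pump_bank = RedundantSystem("pump bank", [pump_a, pump_b], required=1)
plant = RedundantSystem("plant", [pump_bank, Component("controller")], required=2)

print(pump_bank.faults(), pump_bank.failed())  # a fault, but no failure
print(plant.faults(), plant.failed())          # the plant sees no fault at all

pump_a.working = False                         # the second pump fails as well
print(pump_bank.failed())                      # the pump bank itself now fails
print(plant.faults(), plant.failed())          # and that failure is a fault of the plant
```

In this sketch the pump bank is fault tolerant with respect to the loss of one pump. Once both pumps are lost, the failure of the pump bank appears as a fault one level up, which mirrors the statement that every fault is a failure from some point of view.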

The observable effect of a fault at the system boundary is called a symptom. The most extreme symptom of a fault is a failure, but it might also be something as benign as a high reading on a temperature gauge. Symptoms are discussed in greater detail later.

A Digression on Errors


The term error often is used in addition to the terms fault and failure, as in the article by Melliar-Smith previously cited. Often, errors are defined to be the result of faults, leading to failures. Informally, errors seem to be a passive concept associated with incorrect values in the system state. However, it is extremely difficult to develop unambiguous criteria for differentiating between faults and errors. Many researchers refer to value faults, which are also clearly erroneous values. The connection between error and failure is even more difficult to describe.

As we have seen, differentiation between failures and faults is essential for fault tolerant systems. A third term, error, adds little to this distinction and can be a source of confusion. Consequently, we substitute the term fault for the common uses of the term error. Generally, references to the term "error" in the literature can be fitted to the context of this document by substituting the term "fault."

3.2.1.2 Bridge Example


To help understand these definitions, consider the example of a highway bridge over a river. Some time after this example was developed, Alfred Spector pointed out that a precedent for it exists in an article comparing practices in bridge design with practices in software design [Spector 86].

When designing the bridge, the designer must consider myriad details regarding the requirements and the environment in which the bridge will operate. Suppose a 20 ton truck drives onto the bridge and the bridge collapses. From the truck's point of view, the bridge has failed. But what is the fault that led to the failure? There are many possible answers:

  1. The designer of the bridge did not allow for appropriate bridge loading. This could be:

    1. A specification fault, if the highway department did not anticipate that 20 ton trucks would need to use the bridge,

    2. A design fault, if the specification called for the bridge to carry 20 ton trucks, or

    3. An implementation fault, if the fabricator did not correctly follow the design.

  2. The truck driver ignored a "Load Limit" sign. This would be a user fault.

  3. A worker for the highway department posted an erroneous "Load Limit" sign. This would be an operator fault.

  4. The people preparing the documentation for the bridge mistakenly indicated that the bridge would support 20 tons, when in fact it was only designed to support 10 tons. The highway department erected a 20 ton "Load Limit" sign. This would be a documentation fault, followed by an operator fault.

  5. Previously a 30 ton truck crossed the bridge and sufficiently weakened the structure so that the subsequent 20 ton truck caused the bridge to fail. This, again, would be a user fault (the prior user).

  6. Inadequate maintenance caused the bridge to develop structural flaws which led to it being unable to support a 20 ton truck. This would be another operator fault.

  7. A barge on the river hit the bridge and knocked out a support. Or a 100 year flood came along and washed the bridge out, or a meteor crashed through the bridge. These would be environmental faults.

As an example of a fault which does not lead to a failure, consider the same bridge with a crack in its concrete roadbed. There is no failure involved if the bridge continues to carry the loads requested of it in spite of this fault. It may be the result of normal wear and tear on the roadbed. However, a thorough inspection of the bridge might discover that the crack in the roadbed was a symptom of a faulty strut, only observable by x-raying the strut. From the point of view of the bridge inspector, the strut would have failed. This component failure is an internal fault.

Scenarios like this can be generated ad infinitum. Note that a fault does not lead to a failure unless the result is observable by the user, and leads to the bridge becoming unable to deliver its specified service. This means that one person's fault is another person's failure. For instance, in example 4 above, from the point of view of the highway department the erroneous documentation was a fault that led to an operator failure. From the point of view of the user of the bridge the erroneous documentation was a documentation fault that led to an operator fault which led to a bridge failure.

3.2.1.3 Computer System Example


Consider a computer system running a program to control the temperature of a boiler by calculating the firing rate of the burner for the boiler. If a bit in memory becomes stuck at one, that is a fault. If the memory fault affects the operation of the program in such a way that the computer system outputs cause the boiler temperature to rise out of the normal zone, that is a computer system failure and a fault in the overall boiler system. If there is a gauge showing the temperature of the boiler, and its needle moves into the "yellow" zone (abnormal, but acceptable), that is a symptom of the system fault. On the other hand, if the boiler explodes because of the faulty firing calculation, that is a (catastrophic) system failure.
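
The boiler scenario can be made concrete with a small Python sketch. Everything numeric here is an assumption made for illustration: the linear temperature model, the zone limits of 200 and 260, the 8-bit firing rate, and the function names are hypothetical and do not come from any real controller. The sketch shows how the same kind of memory fault can be invisible, can produce a symptom, or can produce a failure at the boiler system boundary.

```python
def read_byte(stored, stuck_at_one_mask=0):
    """Return the byte as the program sees it; bits in the mask always read as 1."""
    return (stored | stuck_at_one_mask) & 0xFF


def boiler_temperature(firing_rate):
    """Toy plant model (assumption): temperature grows linearly with firing rate."""
    return 100 + 2 * firing_rate


def observe(temperature):
    """Classify what is visible at the boiler system boundary (assumed limits)."""
    if temperature <= 200:
        return "normal"
    if temperature <= 260:
        return "symptom (yellow zone)"
    return "failure"


def run(stored_rate, stuck_at_one_mask=0):
    """Read the stored firing rate through possibly faulty memory and observe the boiler."""
    rate = read_byte(stored_rate, stuck_at_one_mask)
    return observe(boiler_temperature(rate))


print(run(40))          # no memory fault
print(run(48, 1 << 4))  # bit 4 stuck at one, but the stored value already has it set
print(run(40, 1 << 4))  # bit 4 stuck at one changes the rate from 40 to 56
print(run(40, 1 << 6))  # bit 6 stuck at one changes the rate from 40 to 104
```

The second call is a fault that never reaches the system boundary, the third is a fault whose only observable effect is a symptom, and the fourth is a fault that leads to a system failure.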

The reasons for the memory fault could be manifold. The chip used might not have been manufactured to specification (a manufacturing fault), the hardware design may have caused too much power to be applied to the chip (a system design fault), the chip design may be prone to such faults (a chip design fault), a field engineer may have inadvertently shorted two lines while performing preventive maintenance (a maintenance fault), etc.


A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95