Terminology and Concepts

Prof. Naga Kandasamy

1 Goals of Fault Tolerance

Dependability is an umbrella term encompassing the concepts of reliability, availability, performability, safety, and testability. We now define these terms in an intuitive fashion.

1.1 Reliability

The reliability R(t) of a system is a function of time, defined as the conditional probability that the system performs correctly throughout the interval [t_0, t], given that it was performing correctly at time t_0. In other words, reliability is the probability that the system operates correctly throughout a complete interval of time. Reliability is used to characterize systems in which even momentary periods of incorrect performance are unacceptable, or in which it is impossible to repair the system (e.g., a spacecraft, where the time interval of concern may be years). In other applications, such as flight control, the time interval of concern may be a few hours. Fault tolerance can improve a system's reliability by keeping the system operational when hardware and software failures occur.

1.2 Availability

Availability A(t) is a function of time, defined as the probability that a system is operating correctly and is available to perform its functions at the instant of time t. Availability differs from reliability in that reliability depends on an interval of time, whereas availability is taken at an instant of time. Consequently, a system can be highly available yet experience frequent periods of downtime, as long as each downtime period is very short. The most common measure of availability is the expected fraction of time that a system is available to correctly perform its functions.

1.3 Performability

In many cases, it is possible to design systems that continue to perform correctly after the occurrence of hardware/software failures, albeit at a diminished level of performance.
The performability P(L, t) of a system is a function of time, defined as the probability that the system performance is at, or above, some level L at the instant of time t. Performability differs from reliability in that reliability measures the likelihood that all of the system's functions are performed correctly, whereas performability measures the likelihood that some subset of the functions is performed correctly. Graceful degradation is the ability of a system to automatically decrease its level of performance to compensate for hardware/software failures. Fault tolerance can provide graceful degradation, and thus improve performability, by eliminating failed hardware/software components and allowing the system to operate at some reduced level of performance.

These notes are adapted from: B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
Fig. 1: The full adder circuit used to illustrate the distinction between faults and errors.

1.4 Safety

Safety S(t) is the probability that a system will either perform its functions correctly or will discontinue its functions in a manner that does not compromise the safety of any people associated with the system (fail-safe capability). Safety and reliability differ because reliability is the probability that a system performs its functions correctly, whereas safety is the probability that a system either performs its functions correctly or discontinues them in a fail-safe manner.

1.5 Maintainability and Testability

Maintainability is a measure of the ease with which a system can be repaired once it has failed. The maintainability M(t) is the probability that a failed system will be restored to an operational state within a specified period of time t. The restoration process includes locating and diagnosing the problem, repairing and reconfiguring the system, and bringing the system back to its operational state.

2 Faults, errors, and system failures

We define the following basic terms. A fault is a defect within the system. An error is a deviation from the required operation of the (sub)system. A system failure occurs when the system delivers a function deviating from the one specified. There is a cause-and-effect relationship between faults, errors, and failures: faults result in errors, and errors can lead to system failures. In other words, errors are the effect of faults, and failures are the effect of errors. The full-adder circuit shown in Fig. 1 illustrates the distinction between faults and errors. The inputs A_i and B_i are the two operand bits and C_i is the carry bit. The truth table showing the correct behavior of this circuit is given in Fig. 2.
If a short occurs between line L and the power supply line, resulting in line L becoming permanently fixed at a logic 1 value, then a fault (or defect) has occurred in the circuit. The fault is the actual short within the circuit. Fig. 3 shows the truth table of the circuit containing this physical fault. Comparing Figs. 2 and 3, we see that the circuit performs correctly for the input combinations 001, 011, 101, and 111, but not for 000, 010, 100, and 110. So, whenever an input pattern is supplied to the circuit that results in an incorrect output,
Fig. 2: The truth table for the fault-free full-adder circuit.

an error has occurred. If the output of the circuit is used to control a relay, and the relay is opened when it should be closed, a failure has occurred.

2.1 Characteristics of Faults

System faults can be classified based on their duration:

- Permanent faults remain in existence indefinitely if no corrective action is taken. Though many are residual design or manufacturing faults, they can also be caused by catastrophic events such as an accident.
- Intermittent faults appear, disappear, and reappear repeatedly. They are difficult to predict, but their effects are highly correlated. Most intermittent faults are due to marginal design, testing, or manufacturing, and manifest themselves under certain environmental or system conditions.
- Transient faults appear and disappear quickly, and are not correlated with each other. They are most commonly induced by random environmental disturbances such as electromagnetic interference.

Faults can also be characterized based on their underlying cause: (1) specification mistakes; (2) implementation mistakes; (3) external disturbances; and (4) component failures.

Fig. 3: The truth table for the full-adder circuit when line L is stuck at logic value 1.
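The fault-versus-error distinction can be sketched in code. The simulation below is illustrative and not from the notes; it assumes the shorted line L is the carry input C_i, which is consistent with the erroneous input patterns listed above (exactly those with C_i = 0).

```python
from itertools import product

def full_adder(a, b, c, line_stuck_at_1=False):
    """Gate-level full adder (sketch of Fig. 1). 'line_stuck_at_1' models
    the hypothetical short fixing the carry-in line at logic 1."""
    if line_stuck_at_1:
        c = 1                         # the fault: line permanently at logic 1
    s = a ^ b ^ c                     # sum output
    c_out = (a & b) | (c & (a ^ b))   # carry output
    return s, c_out

# The fault is always physically present, but an *error* occurs only for
# input patterns that expose it -- here, exactly those with carry-in = 0.
for a, b, c in product((0, 1), repeat=3):
    good = full_adder(a, b, c)
    bad = full_adder(a, b, c, line_stuck_at_1=True)
    print(f"{a}{b}{c}: {'error' if good != bad else 'correct'}")
```

Running the sketch reports errors for 000, 010, 100, and 110 and correct operation for 001, 011, 101, and 111, matching the comparison of Figs. 2 and 3.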
Fig. 4: A taxonomy of possible failure-response strategies.

2.2 Fault Models

To design a fault-tolerant system, it is necessary to assume that the underlying faults behave according to some fault model. Even though, in practice, faults can be transient in nature and exhibit complex behavior, fault models make the problem of designing fault-tolerant systems more manageable and restrict our attention to a subset of all faults that can occur. A commonly used fault model capturing the behavior of faulty digital circuits is the logical stuck-at fault model.

2.3 Failure Response Strategies

A taxonomy of the primary techniques used to design systems that operate in a fault-prone environment is shown in Fig. 4. Broadly speaking, there are three primary methods: fault avoidance (e.g., shielding from EMI), fault masking (e.g., TMR systems), and fault tolerance. Fault detection does not tolerate faults, but provides a warning that a fault has occurred. Masking redundancy (also called static redundancy) tolerates failures, but provides no warning of them. Dynamic redundancy covers those systems whose configuration can be dynamically changed in response to a fault, or in which masking redundancy is enhanced by on-line fault detection, which allows on-line repair.

3 Quantitative Evaluation of System Reliability

The reliability R(t) of a system is defined as the probability of a component or system functioning correctly over a given time period [t_0, t] under a given set of operating conditions. Consider a set of N identical components, all of which begin operating at the same time. At some time t, let the number of components operating correctly be N_o(t) and the number of failed components be N_f(t). Then, the reliability of a
component at time t is given by

R(t) = N_o(t) / N = N_o(t) / (N_o(t) + N_f(t))

which is simply the probability that a component has survived the interval [t_0, t]. We can also define the unreliability Q(t) as the probability that a system does not function correctly over a given period of time; this is also called the probability of failure. In terms of the number of failed components N_f(t),

Q(t) = N_f(t) / N = N_f(t) / (N_o(t) + N_f(t))

From the definitions of reliability and unreliability, we obtain

Q(t) = 1 - R(t)

If we write the reliability function as

R(t) = 1 - N_f(t) / N

and differentiate R(t) with respect to time, we obtain

dR(t)/dt = -(1/N) dN_f(t)/dt

which can be rewritten as

dN_f(t)/dt = -N dR(t)/dt

The derivative dN_f(t)/dt is simply the instantaneous rate at which components are failing. At time t, there are still N_o(t) components operating correctly. Dividing dN_f(t)/dt by N_o(t), we obtain

z(t) = (1/N_o(t)) dN_f(t)/dt

where z(t) is called the hazard function, hazard rate, or failure rate. The unit of the failure-rate function is failures per unit of time. The failure-rate function can also be written in terms of the reliability function R(t). Since N_o(t) = N R(t),

z(t) = (1/N_o(t)) dN_f(t)/dt = -(N/N_o(t)) dR(t)/dt = -(1/R(t)) dR(t)/dt

Rearranging, we obtain the following differential equation:

dR(t)/dt = -z(t) R(t)

The failure-rate function z(t) of electronic components exhibits a bathtub curve, shown in Fig. 5, comprising three distinct regions: burn-in, useful life, and wear-out. It is typically assumed that the failure rate is constant during a component's useful life and is given by z(t) = λ. So, the differential equation becomes

dR(t)/dt = -λ R(t)

Solving the above equation gives us

R(t) = e^{-λt}    (1)

This exponential relationship between reliability and time is known as the exponential failure law. Thus, the probability of a system working correctly throughout a given period of time decreases exponentially with the length of this time period.
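The exponential failure law can be checked numerically. The sketch below, with an assumed illustrative failure rate (not taken from the notes), verifies by finite differences that R(t) = e^{-λt} satisfies the differential equation dR(t)/dt = -λR(t).

```python
import math

lam = 0.001   # assumed constant failure rate, failures per hour (illustrative)

def reliability(t):
    """Exponential failure law, Eq. (1): R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

# Check the defining differential equation dR/dt = -lambda * R(t)
# with a central finite difference at an arbitrary time.
t, h = 500.0, 1e-3
numeric_deriv = (reliability(t + h) - reliability(t - h)) / (2 * h)
assert abs(numeric_deriv - (-lam * reliability(t))) < 1e-9

print(reliability(0.0), reliability(1000.0))  # R(0) = 1; R(1/lam) = e**-1
```

Note how reliability starts at 1 and decays exponentially: after one mean lifetime (t = 1/λ) it has already fallen to e^{-1} ≈ 0.37.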
The exponential failure law is extremely valuable for the analysis of electronic components, and is by far the most commonly used relationship between reliability and time.
Fig. 5: The bathtub form of the failure-rate curve.

Mean time to failure. The mean time to failure (MTTF) is another way to quantify system reliability. The MTTF gives the expected time that a system will operate before the first failure occurs. If we have N identical components operating at time t = 0, and we measure the time each component operates before failing, the average of these times is the MTTF. If each component i operates for a time t_i before encountering its first failure, the MTTF is given by

MTTF = (1/N) Σ_{i=1}^{N} t_i

We can calculate the MTTF by finding the expected value of the time to failure. From probability theory, we know that the expected value of a random variable X is

E[X] = ∫_{-∞}^{∞} x f(x) dx

where f(x) is the probability density function. From a reliability viewpoint, we are interested in the MTTF, so

MTTF = ∫_0^∞ t f(t) dt

where f(t) is the failure density function, given by

f(t) = dQ(t)/dt = -dR(t)/dt

So, the MTTF can be written as

MTTF = -∫_0^∞ t (dR(t)/dt) dt

Using integration by parts, we can show that

MTTF = [-t R(t)]_0^∞ + ∫_0^∞ R(t) dt = ∫_0^∞ R(t) dt

If the reliability function obeys the exponential failure law, the MTTF is given by

MTTF = ∫_0^∞ e^{-λt} dt = 1/λ

The above equation leads to the very simple result that the MTTF is the inverse of the system failure rate. Thus, a system with a constant failure rate of 0.001 failures per hour will have a mean time to failure of 1000 hours. Finally, under the exponential failure law, the reliability of a system at a time equal to its MTTF is

R(MTTF) = e^{-λ(1/λ)} = e^{-1} ≈ 0.37    (2)

In other words, the system has only a 37% chance of operating correctly for a length of time equal to its MTTF.
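The two results above (MTTF = 1/λ and R(MTTF) ≈ 0.37) can be illustrated with a small Monte Carlo sketch. The failure rate and sample size are assumed for illustration, not taken from the notes.

```python
import random

random.seed(1)
lam = 0.001        # assumed constant failure rate (failures per hour)
N = 100_000        # number of identical components observed from t = 0

# Under the exponential failure law, each component's time to first
# failure is an exponentially distributed random variable with rate lam.
lifetimes = [random.expovariate(lam) for _ in range(N)]

mttf_estimate = sum(lifetimes) / N                    # sample mean -> 1/lam
still_up = sum(t > 1 / lam for t in lifetimes) / N    # fraction surviving to MTTF

print(mttf_estimate)   # close to 1/lam = 1000 hours
print(still_up)        # close to e**-1, i.e. about 0.37
```

The surviving fraction at t = MTTF comes out near 0.37, illustrating that a component has only a 37% chance of lasting as long as its own mean time to failure.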
Mean time to repair. The mean time to repair (MTTR) is the average time taken to repair a failed system. Just as we describe the reliability of a system using its failure rate λ, we can quantify the repairability of a system using its repair rate µ. The MTTR is then 1/µ.

Mean time between failures. If a failed system can be repaired and made as good as new, then the mean time between failures (MTBF) is given by

MTBF = MTTF + MTTR    (3)

The availability of the system is the probability that the system is functioning correctly at any given time; in other words, it is the fraction of time for which the system is operational:

Availability = (Time system is operational) / (Total time) = MTTF / (MTTF + MTTR) = MTTF / MTBF    (4)
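As a concrete check of Eqs. (3) and (4), the sketch below plugs illustrative (assumed) numbers for a repairable system into the MTBF and availability relations.

```python
mttf = 1000.0   # mean time to failure in hours (assumed, = 1/lambda)
mttr = 2.0      # mean time to repair in hours (assumed, = 1/mu)

mtbf = mttf + mttr                     # Eq. (3): MTBF = MTTF + MTTR
availability = mttf / (mttf + mttr)    # Eq. (4); equivalently mttf / mtbf

# A short MTTR keeps availability high even though failures do occur,
# echoing the earlier point that an available system may still fail often.
print(mtbf, availability)
```

With these numbers the system is up about 99.8% of the time, even though it fails, on average, every 1000 hours.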