Combinational Techniques for Reliability Modeling


Prof. Naga Kandasamy, ECE Department, Drexel University, Philadelphia, PA
January 24, 2009

The following material is derived from these textbooks:

D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems, 3rd Edition, A. K. Peters, Natick, Massachusetts.
M. L. Shooman, Reliability of Computer Systems and Networks, John Wiley & Sons.

When designing a system, it is important to be able to predict the reliability of the final system, which contains many components. The two most common methods of estimating the reliability of complex systems are combinational modeling and Markov state modeling.

1 Canonical Structures

We will first consider some canonical structures and discuss how their reliability can be quantified using combinational techniques.

1.1 Series and Parallel Systems

In a series combination of components, the failure of any one component results in the failure of the overall system. If a system contains $N$ components arranged in series, and if the components fail independently, then the system's failure rate $\lambda$ is given by

\[ \lambda = \sum_{i=1}^{N} \lambda_i \]

where $\lambda_i$ is the failure rate of the $i$-th component.

The reliability of the series arrangement may also be expressed in terms of the reliability of the individual components. If $R_i(t)$ is the reliability of the $i$-th component in the system, the overall system reliability $R(t)$ is given by

\[ R(t) = R_1(t) R_2(t) \cdots R_N(t) \]

which may be written as

\[ R(t) = \prod_{i=1}^{N} R_i(t) \]

The reliability of a parallel combination of components is given by

\[ R(t) = 1 - \prod_{i=1}^{N} \left[ 1 - R_i(t) \right] \]
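The two product formulas above translate directly into code. The following is a minimal Python sketch; the function names `series_reliability` and `parallel_reliability` and the sample values are illustrative choices, not part of the original notes, and the components are assumed to be independent.

```python
from math import prod


def series_reliability(reliabilities):
    """Series system: works only if every component works."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel system: fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    parts = [0.90, 0.95]  # illustrative component reliabilities
    print("series  :", series_reliability(parts))
    print("parallel:", parallel_reliability(parts))
```

Adding a component can only lower the series result and can only raise the parallel result, which is why the two structures serve as the building blocks for the reductions that follow.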

Fig. 1: Series and parallel combination of system components.

1.2 Series-Parallel Combinations

Consider the system shown in Fig. 2. Suppose the reliability of module M1 is 0.99, that of M2, M3, and M4 is 0.80, that of M5 and M6 is 0.90, that of M7 and M8 is 0.95, and that of M9 is 0.94. Then the reliability of the parallel combination of modules M2, M3, and M4 is given by

\[ R(t) = 1 - [1 - 0.8]^3 = 0.992 \]

The series combination of M5 and M7 (and likewise of M6 and M8) has reliability $0.90 \times 0.95$, or 0.855, and the parallel combination of these two paths is

\[ R(t) = 1 - [1 - 0.855]^2 = 0.979 \]

The system can be simplified as shown in Fig. 2, where M10 stands for the reduced M2-M3-M4 block and M11 for the reduced M5-M8 block. The series combination of M10 and M11 has reliability 0.971, and this combination in parallel with M9 gives us

\[ R(t) = 1 - [1 - 0.971][1 - 0.94] = 0.998 \]

1.3 Nonseries/Nonparallel Models

Sometimes, a success diagram is used to describe the operational mode of a system. A success diagram may not be directly reducible by applying the series/parallel formulas. In such cases, one can obtain a lower bound on system reliability in terms of the minimal cut sets of the system. We define a cut set of a graph as a set of branches which, when removed from the graph, interrupts all connections between the input and the output. The minimal cut sets are the distinct cut sets containing the minimum number of branches, that is, cut sets from which no branch can be dropped. Every system failure can be represented by the removal of at least one minimal cut set from the graph. The probability of system failure is, therefore, the probability that at least one minimal cut set fails. Let $Q_{cut_i}$ denote the probability that the $i$-th minimal cut set fails. The lower bound on system reliability is then given by

\[ R_{sys} \ge \prod_i \left( 1 - Q_{cut_i} \right) \]
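Returning to the Fig. 2 example of Section 1.2, the step-by-step reduction can be checked numerically. The sketch below is illustrative only: the helper names `series` and `parallel` are not from the notes, and M9 is taken as 0.94, the value that appears in the final parallel step. Carrying full precision instead of the rounded intermediate 0.971 changes the result only in the fifth decimal place.

```python
from math import prod


def series(*r):
    """Reliability of independent components in series."""
    return prod(r)


def parallel(*r):
    """Reliability of independent components in parallel."""
    return 1.0 - prod(1.0 - x for x in r)


# Module reliabilities from the example (M9 = 0.94 is inferred, see above).
m10 = parallel(0.80, 0.80, 0.80)   # M2, M3, M4 in parallel
path = series(0.90, 0.95)          # M5-M7 (and, identically, M6-M8)
m11 = parallel(path, path)         # the two series paths in parallel
m10_m11 = series(m10, m11)         # reduced blocks in series
with_m9 = parallel(m10_m11, 0.94)  # in parallel with M9

if __name__ == "__main__":
    for name, value in [("M10", m10), ("path", path), ("M11", m11),
                        ("M10-M11", m10_m11), ("with M9", with_m9)]:
        print(f"{name:8s} {value:.4f}")
```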

Fig. 2: Series/parallel combination of components (the original network of modules M1 through M9, and its reduced form using M1, M10, M11, and M9).

Fig. 3: A success diagram of a system.

The minimum cut sets in Fig. 3 are {M1, M5}, {M1, M3}, {M4}, {M3, M6}, and {M2, M5, M6}. Assuming all modules have identical reliability $R$, a cut set of $k$ modules fails with probability $(1 - R)^k$, so

\[ R_{sys} \ge R \left[ 1 - (1 - R)^2 \right]^3 \left[ 1 - (1 - R)^3 \right] \]
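The cut-set bound for Fig. 3 can be compared against the exact reliability by brute force. The sketch below is an illustration under stated assumptions: modules are independent with a common reliability `r`, the system is taken to fail exactly when every module of at least one listed minimal cut set has failed, and the function names are hypothetical. With six modules there are only 64 states, so full enumeration is cheap.

```python
from itertools import product
from math import prod

MODULES = ["M1", "M2", "M3", "M4", "M5", "M6"]
MIN_CUT_SETS = [{"M1", "M5"}, {"M1", "M3"}, {"M4"},
                {"M3", "M6"}, {"M2", "M5", "M6"}]


def cut_set_lower_bound(r):
    """Product over minimal cut sets of (1 - Q_cut), with Q_cut = (1 - r)^k."""
    return prod(1.0 - (1.0 - r) ** len(cut) for cut in MIN_CUT_SETS)


def exact_reliability(r):
    """Sum the probability of every module state in which no cut set has fully failed."""
    total = 0.0
    for state in product([True, False], repeat=len(MODULES)):
        failed = {m for m, ok in zip(MODULES, state) if not ok}
        if any(cut <= failed for cut in MIN_CUT_SETS):
            continue  # some minimal cut set is entirely down: system failure
        n_up = len(MODULES) - len(failed)
        total += r ** n_up * (1.0 - r) ** len(failed)
    return total


if __name__ == "__main__":
    for r in (0.5, 0.9, 0.99):
        print(r, cut_set_lower_bound(r), exact_reliability(r))
```

The gap between the bound and the exact value shrinks as `r` approaches 1, which is the regime where such bounds are normally used.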

Fig. 4: The reliability of an NMR system comprising 2n + 1 modules as a function of time.

A tie set is a group of branches which, when traversed, forms a connection between the input and the output. A minimal tie set is one containing a minimum number of elements. If no node is traversed more than once in tracing out the tie set, then the tie set is minimal. If $R_{path_i}$ denotes the serial reliability of path $i$, then an upper bound on system reliability is

\[ R_{sys} \le 1 - \prod_i \left( 1 - R_{path_i} \right) \]

The minimum tie sets in Fig. 3 are {M1, M2, M3, M4}, {M1, M6, M4}, and {M5, M3, M4}. Assuming all modules have identical reliability $R$, and bounding the expression above by the sum of the path reliabilities, the system reliability satisfies

\[ R_{sys} \le R^4 + 2R^3 \]

2 Masking Redundancy using M-out-of-N Structures

Another simple structure that serves as a useful model for many reliability problems is an M-out-of-N structure. Such a model represents a system of $N$ components in which at least $M$ of the $N$ components must be good for the system to succeed. The probability that exactly $M$ out of $N$ identical and independent components succeed is given by

\[ \binom{N}{M} p^M (1 - p)^{N - M} \]

Here, $p$ denotes the (identical) reliability of each component. For a constant failure rate $\lambda$, and using the exponential failure law $p = e^{-\lambda t}$ for each item, the probability that at least $M$ out of $N$ items succeed is given by

\[ R(t) = \sum_{i=M}^{N} \binom{N}{i} e^{-i\lambda t} \left( 1 - e^{-\lambda t} \right)^{N - i} \]

In general, $N$ is an odd integer. However, as we shall soon see, if we can diagnose and lock out faulty modules, it is feasible to let $N$ be an even integer. If we let $N = 2n + 1$, $n \ge 1$, then, in a simple masking scheme, we need a majority of the modules to work correctly, that is, $M = n + 1$. Fig. 4 shows the reliability function for various values of $n$, assuming $p = e^{-\lambda t}$. The figure shows that NMR is superior to a single unit in the high-reliability region; specifically, NMR is superior to the single unit for $\lambda t < \ln 2 \approx 0.69$, that is, for $p > 0.5$. Therefore, when designing any system, we must carefully evaluate the reliability values obtained over the range $0 < t <$ maximum mission time for various values of $n$ and $\lambda$.
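The M-out-of-N sum is easy to evaluate directly. The sketch below is a minimal illustration, assuming identical and independent modules and a perfect voter; the names `m_of_n_reliability` and `nmr_reliability` are hypothetical. It evaluates majority-voting NMR for a few values of $n$ on either side of $\lambda t = \ln 2$, where each module has $p = 0.5$.

```python
from math import comb, exp, log


def m_of_n_reliability(m, n, p):
    """Probability that at least m of n independent modules (each reliability p) work."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(m, n + 1))


def nmr_reliability(n_half, lam_t):
    """Majority-voting NMR with N = 2*n_half + 1 modules and p = exp(-lam_t)."""
    n_modules = 2 * n_half + 1
    p = exp(-lam_t)
    return m_of_n_reliability(n_half + 1, n_modules, p)


if __name__ == "__main__":
    for lam_t in (0.1, log(2), 1.0):
        row = [f"{nmr_reliability(n, lam_t):.4f}" for n in (1, 2, 3)]
        print(f"lambda*t={lam_t:.3f}  single={exp(-lam_t):.4f}  n=1,2,3 -> {row}")
```

Below the crossover, adding modules raises reliability; above it, adding modules makes the system worse than a single unit, which is why the mission time matters when choosing $n$.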

2.1 Triple Modular Redundancy

A special case of an M-out-of-N structure is triple modular redundancy, or TMR. The basic TMR structure, shown in Fig. 5, consists of three parallel modules, each provided with the same input. The outputs of the three modules are compared by the voter, which gives the majority opinion as the system output. If all three modules are operating properly, all outputs agree, and thus the system output is correct. If one module has failed and produced an incorrect output, the voter chooses the output of the two good modules as the system output because they agree, and thus the system output is still correct. If two modules have failed, the voter agrees with the majority (the two that have failed), and thus the system output is incorrect.

Fig. 5: The basic triple-modular redundancy (TMR) scheme.

A TMR system will function correctly provided that at least two modules are operational, assuming that the voter does not fail, that is, $R_v = 1$. Thus, the probability of the system working correctly is given by

\[ R = R_v \left[ \binom{3}{3} p^3 (1 - p)^0 + \binom{3}{2} p^2 (1 - p)^1 \right] = 3p^2 - 2p^3 = p^2 (3 - 2p) \]

This is, of course, the reliability expression for a two-out-of-three system. If we assume a constant failure rate $\lambda$, then each module has a reliability $p = e^{-\lambda t}$, and substituting in the above equation yields

\[ R(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \]

We can compute the MTTF for this system by integrating the reliability function:

\[ MTTF = \int_0^\infty \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right) dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda} \]

This TMR system can be called a 3-2 system because the system succeeds if 3 or 2 units are good. When a second failure occurs, the voter does not know which of the components have failed and cannot determine which is the good component. In some cases, additional information is available by such means as observation (from a human operator or a diagnostic system) of the remaining two units after the first failure occurs. If one of the two remaining units

Fig. 6: Comparison of the reliability functions of a single system/component/module, a TMR 3-2 system, and a TMR 3-2-1 system in the high-reliability region.

has behaved erratically, it would be locked out (i.e., disconnected) and the other unit would be assumed to operate properly. In such a case, the TMR system becomes a 1-out-of-3 system with a voter, which can be called a TMR 3-2-1 system. The reliability equation then becomes

\[ R(t) = 3p^2 - 2p^3 + 3p(1 - p)^2 = p^3 - 3p^2 + 3p = e^{-3\lambda t} - 3e^{-2\lambda t} + 3e^{-\lambda t} \]

and the MTTF calculation yields

\[ MTTF = \frac{1}{3\lambda} - \frac{3}{2\lambda} + \frac{3}{\lambda} = \frac{11}{6\lambda} \]

Fig. 6 shows the superiority of the TMR systems in the high-reliability region. Note that the TMR 3-2 system reliability decreases to about the same value as a single component when $\lambda t$ increases from about 0.3 to […]. Thus, TMR 3-2 is of most use for $\lambda t < 0.2$, whereas TMR 3-2-1 is of greater benefit and provides a considerably higher reliability for $\lambda t <$ […].

2.2 System Versus Component Redundancy

Suppose we wish to use NMR for a digital system composed of three components A, B, and C. We must answer the following question: do we use NMR on three full systems ($A_1 B_1 C_1$, $A_2 B_2 C_2$, and $A_3 B_3 C_3$) with one voter, or do we vote at a lower, component level, with one voter comparing $A_1 A_2 A_3$, a second comparing $B_1 B_2 B_3$, and a third comparing $C_1 C_2 C_3$? In general, two redundancy techniques that are easily classified and studied are component and system redundancy. We can, in fact, prove that component redundancy is superior to system redundancy in a wide variety of situations.

Consider the three system configurations shown in Fig. 7. The reliability of the simplex system in Fig. 7(a) is given by

\[ R_a(t) = R_{M_1}(t) R_{M_2}(t) = p^2 \]

Fig. 7: Comparison of three different systems: (a) a simplex (or non-redundant) system comprising two components, (b) system redundancy, and (c) component redundancy.

Here, the components $M_1$ and $M_2$ are independent but have identical reliability $R(t) = p$. The reliability expression for Fig. 7(b), comprising two simplex units connected in parallel, is given by

\[ R_b(t) = \binom{2}{2} R_a(t)^2 + \binom{2}{1} R_a(t) \left( 1 - R_a(t) \right) = p^2 (2 - p^2) \qquad (1) \]

For Fig. 7(c), we combine each component pair in parallel to obtain

\[ R_c(t) = \left[ p^2 + 2p(1 - p) \right]^2 = p^2 (2 - p)^2 \qquad (2) \]

To compare Equations 1 and 2, we use the ratio

\[ \frac{R_c(t)}{R_b(t)} = \frac{p^2 (2 - p)^2}{p^2 (2 - p^2)} = \frac{(2 - p)^2}{2 - p^2} \qquad (3) \]

Some algebraic manipulation yields

\[ \frac{R_c(t)}{R_b(t)} = 1 + \frac{2(1 - p)^2}{2 - p^2} \]

Since $0 < p < 1$, the term $2 - p^2 > 0$, and $R_c(t)/R_b(t) \ge 1$. Therefore, component redundancy is superior to system redundancy for this example. (They are, of course, equal at the extremes when $p = 0$ or $p = 1$.) We can extend the above analysis to $m$ components, in which case Equation 3 becomes

\[ \frac{R_c(t)}{R_b(t)} = \frac{(2 - p)^m}{2 - p^m} \qquad (4) \]

It can be shown by induction that this ratio is always greater than 1 for $0 < p < 1$ and $m \ge 2$, so component redundancy is superior regardless of the number of components. The superiority of component redundancy over system redundancy also holds true for non-identical components.
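The ratio in Equation 4 can be spot-checked numerically. The following sketch assumes $m$ identical, independent components, each duplicated once, and perfect interconnections; the function names are illustrative. It builds both configurations from the series and parallel formulas rather than from the closed-form ratio, so agreement with Equation 4 is a genuine check.

```python
def system_redundancy(p, m):
    """Two complete m-component series chains connected in parallel, as in Fig. 7(b)."""
    r_simplex = p ** m
    return 1.0 - (1.0 - r_simplex) ** 2


def component_redundancy(p, m):
    """Each of the m components duplicated in parallel, pairs in series, as in Fig. 7(c)."""
    r_pair = 1.0 - (1.0 - p) ** 2
    return r_pair ** m


if __name__ == "__main__":
    for m in (2, 3, 5):
        for p in (0.5, 0.9, 0.99):
            rb = system_redundancy(p, m)
            rc = component_redundancy(p, m)
            closed_form = (2.0 - p) ** m / (2.0 - p ** m)
            print(f"m={m} p={p}: R_b={rb:.4f} R_c={rc:.4f} "
                  f"ratio={rc / rb:.4f} Eq.4={closed_form:.4f}")
```

The advantage of component redundancy grows with $m$ and shrinks as $p$ approaches 1.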

Fig. 8: Redundancy comparison: (a) component redundancy and (b) system redundancy.

A simpler proof of the foregoing principle can be formulated by considering tie sets. In Fig. 7(b), the tie sets are $M_1 M_2$ and $M_1' M_2'$, whereas in Fig. 7(c), the tie sets are $M_1 M_2$, $M_1' M_2'$, $M_1 M_2'$, and $M_1' M_2$. Since the system reliability is the probability of the union of the tie sets, and since the redundant system in Fig. 7(c) has the same two tie sets as Fig. 7(b) as well as two additional ones, the component-redundancy configuration has a greater reliability than the configuration with two simplex units connected in parallel. This tie-set proof can be extended to the general case. The reliabilities of system and component redundancy are compared graphically in Fig. 8.

3 Voter Design Issues

This section considers various issues related to voter design, including relaxing our assumption of perfect voters. Returning to the TMR reliability equation, if the unit reliability is denoted as $p_c$ and the voter reliability by

$p_v$, then the reliability equation must be modified to yield

\[ R_{sys} = p_v p_c^2 (3 - 2p_c) \]

To achieve an overall gain in reliability, the reliability of the TMR scheme with an imperfect voter should be greater than the reliability of a single component. That is,

\[ R_{sys} > p_c \quad \text{or} \quad \frac{R_{sys}}{p_c} > 1 \]

This requires that

\[ \frac{R_{sys}}{p_c} = p_v p_c (3 - 2p_c) > 1 \]

So, the minimum value of $p_v$ can be obtained by setting $p_v p_c (3 - 2p_c) = 1$, or

\[ p_v = \frac{1}{(3 - 2p_c) p_c} \]

3.1 Use of Redundant Voters

In certain cases, it may not be possible to build individual voters with a high enough reliability to meet the requirements of an ultra-reliable system. Since the voter reliability multiplies the N-modular redundancy reliability,

\[ R(t) = p_v \sum_{i=M}^{N} \binom{N}{i} p^i (1 - p)^{N - i} \]

the system reliability can never exceed that of the voter. Moreover, if voting is done at the component level, the situation is even worse, since the reliability function will be multiplied by $p_v^n$ (the voters are arranged in series). This can significantly lower the reliability of the NMR scheme. Therefore, in such cases, we must consider the possibility of using redundant voters.

Fig. 9: A TMR system with component-level voting and redundant voters.

Fig. 9 shows a TMR system with component-level voting using redundant voters. The figure shows a system composed of $n$ different components (or sub-systems). Each sub-system is organized as a TMR structure with redundant voters. In the last stage of voting, only a single voter $V_n$ can be used. Note also that errors do not propagate more than one stage. If the sub-systems $A_1$, $B_1$, and $C_1$ are working properly, the outputs of the replicated voters will agree. If one module, say $B_1$, fails, then the three voters $V_1$, $V_1'$, and $V_1''$ will agree with the majority ($A_1$ and $C_1$). If a voter fails, say the one that feeds $B_2$, then it will provide an incorrect input to $B_2$, leading to an erroneous output. However, the next stage of voters will have correct inputs from $A_2$ and $C_2$, and the erroneous output from $B_2$ will be masked.

3.2 Exact versus Inexact Voting

Voting on the outputs can be performed in exact fashion (bit-wise voting), or in an inexact or approximate fashion. Please read the Harper and Lala paper, available from the course web site, for more information.
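Returning to the imperfect-voter condition at the start of Section 3, the minimum voter reliability is simple to tabulate. The sketch below is illustrative: the function names are hypothetical, and it assumes a single voter in series with a TMR core of identical, independent modules.

```python
def tmr_with_voter(p_c, p_v):
    """Reliability of a TMR core (module reliability p_c) in series with a voter (p_v)."""
    return p_v * p_c ** 2 * (3.0 - 2.0 * p_c)


def min_voter_reliability(p_c):
    """Voter reliability at which TMR exactly matches a single module."""
    return 1.0 / (p_c * (3.0 - 2.0 * p_c))


if __name__ == "__main__":
    for p_c in (0.5, 0.6, 0.75, 0.9, 0.99):
        print(f"p_c={p_c:.2f}  required p_v > {min_voter_reliability(p_c):.4f}")
```

For $p_c \le 0.5$ the required value is at least 1, so no voter can make TMR worthwhile. The requirement is loosest at $p_c = 0.75$, where the product $p_c(3 - 2p_c)$ peaks at 1.125 and the voter needs $p_v > 8/9 \approx 0.889$.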


Fig. 10: Synchronization in the COMTRAC railroad traffic control computer.

3.3 Synchronization Issues

Synchronizing the outputs of the replicated units is another concern. The problem of synchronization is often solved using a common (fault-tolerant) clock. Another method of synchronization is used by the COMTRAC railroad traffic control computer shown in Fig. 10 [1]. Synchronization is maintained at the program task level. The system controller (DSC) ensures that both processors are performing the same calculation. When both computers have finished the calculation, the DSC compares the two results. If a mismatch occurs, the controller forces both processors to run identical test programs. The test program exercises the entire processor during the course of calculating a single constant.

[1] Ihara et al., "Fault-Tolerant Computer System with Three Symmetric Computers," Proceedings of the IEEE, October.

4 Dynamic Redundancy

One of the drawbacks of an NMR scheme is that the fault-masking ability deteriorates as more copies fail. In its pure form, fault masking neutralizes the effects of failed units without notification of their failures. Therefore, the faulty modules can eventually outvote the good modules. However, an NMR system could continue to function longer if the known bad modules could be discounted in the vote. Two methods of reconfiguration based on NMR are: (1) hybrid redundancy, where failed modules are replaced with good spares, and (2) dynamic modification of the voting process, or adaptive voting.

4.1 Hybrid Redundancy

Fig. 11 illustrates the basic concept, in which a core group of $N$ identical modules is used at any one time, and their outputs are voted upon to produce the system output. When a disagreement is detected, the module(s) in the minority are assumed to have failed and are replaced by an equivalent number of spare modules. Initially, the system contains a total of $(N + S)$ modules. As long as the number of failed modules does not exceed

$t = \lfloor N/2 \rfloor$ in the core group before reconfiguration can take place, the system in Fig. 11 can tolerate the failure of $t + S$ of its modules.

Fig. 11: Organization of a system with hybrid redundancy.

4.2 Adaptive Voting with Lockout

When N-modular redundancy is used and $N$ is greater than three, additional considerations emerge. For example, consider the quad-redundant system shown in Fig. 12. This is the architecture used for the Space Shuttle's primary flight control system (FCS). Let us focus on the first four computers in the FCS. Here, we have an example of 4-level voting with lockout.

Fig. 12: The quad-redundant flight control system of the space shuttle.

Let us assume that unit B fails permanently. There is no reason to leave B in the system if we have a way to remove it from the voting process. The rationale is that a second failure, say that of unit C, can lead to a situation where the two failed units agree and the two good units agree, leading to a stand-off. Clearly, this can be avoided if B is locked out after its failure and the system reconfigures to become a TMR system.
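The lockout idea can be sketched in a few lines of Python. This is an illustrative model only, not the Space Shuttle FCS logic: the function name `vote_with_lockout`, the unit labels, and the rule of locking out every unit that disagrees with a strict majority are all assumptions made for the sketch.

```python
from collections import Counter


def vote_with_lockout(outputs, active):
    """Run one voting round over the units that are still active.

    outputs: dict mapping unit name -> output value for this round.
    active:  set of unit names still taking part in the vote.
    Returns (result, new_active). result is None when no strict majority
    exists (a stand-off), in which case nobody is locked out.
    """
    ballots = {unit: outputs[unit] for unit in active}
    if not ballots:
        return None, set()
    value, votes = Counter(ballots.values()).most_common(1)[0]
    if 2 * votes <= len(ballots):
        return None, set(active)  # stand-off: fault detected, not masked
    dissenters = {unit for unit, out in ballots.items() if out != value}
    return value, set(active) - dissenters  # lock out the minority


if __name__ == "__main__":
    active = {"A", "B", "C", "D"}
    rounds = [
        {"A": 1, "B": 0, "C": 1, "D": 1},  # B fails: quad -> TMR
        {"A": 1, "B": 0, "C": 0, "D": 1},  # C fails: TMR -> duplex
        {"A": 1, "B": 0, "C": 0, "D": 0},  # D fails: duplex mismatch
    ]
    for outputs in rounds:
        result, active = vote_with_lockout(outputs, active)
        print(result, sorted(active))
```

Without lockout, the second round would be a two-against-two tie between the failed pair (B, C) and the good pair (A, D). With B already removed, the same outputs are resolved two-to-one.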

Fig. 13: Reliability comparison of the various voting systems.

In the case of adaptive voting with lockout in Fig. 12, after a first failure, the system assumes a TMR configuration. After the second failure, the system assumes a duplex configuration, where the outputs of the functioning units are simply compared to detect faults. If the comparison fails, the system can shut off and the backup system takes over. If we assume that the lockout works perfectly, the system in Fig. 12 will succeed if there are 0, 1, or 2 failures. If the reliability of each module is $p$, the reliability of the overall system is given by

\[ R(2 \text{ of } 4) = p^4 + (4p^3 - 4p^4) + (6p^2 - 12p^3 + 6p^4) = 3p^4 - 8p^3 + 6p^2 \qquad (5) \]

The reliability of the system will be even higher if we can detect and isolate a third failure, that is, after the compare fails in the duplex system, the two units are taken off-line and, through a series of tests, the faulty unit is identified. In this case, we can start with Equation 5 and add the probability $4p(1 - p)^3$ that the system functions properly when only a single unit works, to obtain

\[ R(1 \text{ of } 4) = -p^4 + 4p^3 - 6p^2 + 4p \]

Fig. 13 plots the above equations for various values of $p$ (the reliability of a single unit). Note that the TMR scheme is poorer than a single element for $p < 0.5$ but better than a single element for $p > 0.5$.
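The closed-form expressions for the voting schemes can be tabulated side by side, in the spirit of Fig. 13. The sketch below is illustrative only: the function names are hypothetical, and perfect voters and perfect lockout are assumed throughout, as in the derivations above.

```python
def single(p):
    """One unit, no redundancy."""
    return p


def tmr_3_2(p):
    """TMR: at least 2 of 3 units must work."""
    return 3 * p ** 2 - 2 * p ** 3


def tmr_3_2_1(p):
    """TMR with lockout down to a single unit (1-out-of-3)."""
    return p ** 3 - 3 * p ** 2 + 3 * p


def two_of_four(p):
    """Quad system tolerating up to two failures (Equation 5)."""
    return 3 * p ** 4 - 8 * p ** 3 + 6 * p ** 2


def one_of_four(p):
    """Quad system tolerating up to three failures."""
    return -p ** 4 + 4 * p ** 3 - 6 * p ** 2 + 4 * p


if __name__ == "__main__":
    schemes = [single, tmr_3_2, tmr_3_2_1, two_of_four, one_of_four]
    print("p     " + "  ".join(f"{f.__name__:>11s}" for f in schemes))
    for p in (0.3, 0.5, 0.7, 0.9, 0.99):
        print(f"{p:<5} " + "  ".join(f"{f(p):11.6f}" for f in schemes))
```

As a consistency check, `one_of_four(p)` equals $1 - (1 - p)^4$ and `tmr_3_2_1(p)` equals $1 - (1 - p)^3$, the familiar parallel-system forms.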
