
Vol.16 No.5 J. Comput. Sci. & Technol. Sept. 2001

Node Grouping in System-Level Fault Diagnosis

ZHANG Dafang 1, XIE Gaogang 1 and MIN Yinghua 2

1 Department of Computer Science, Hunan University, Changsha 410082, P.R. China
2 Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100080, P.R. China

E-mail: dfzhang@mail.hunu.edu.cn

Received January 14, 2000; revised August 9, 2000.

Abstract With the popularization of network applications and multiprocessor systems, dependability of systems has drawn considerable attention. This paper presents a new technique of node grouping for system-level fault diagnosis that reduces the complexity of diagnosing large systems. The technique transforms a complicated system into a group network, where each group may consist of many nodes that are either all fault-free or all faulty. It is proven that the transformation leads to a unique group network, which eases system diagnosis. The paper then systematically studies the one-step t-fault diagnosis problem based on node grouping by means of the concept of independent point sets, and gives a simple necessary and sufficient condition. A diagnosis procedure for t-diagnosable systems is presented. Furthermore, an efficient probabilistic diagnosis algorithm for practical applications is proposed, based on the belief that most of the nodes in a system are fault-free. Software simulation shows that the probabilistic diagnosis provides a high probability of correct diagnosis at low diagnosis cost, and is suitable for systems of any kind of topology.

Keywords system-level fault diagnosis, one-step t-diagnosable system, node grouping, diagnosis algorithm, probabilistic diagnosis

1 Introduction

With the popularization of network applications and multiprocessor systems, the study of system-level fault diagnosis has become more and more important.
The basic idea of system-level fault diagnosis is to let processors test one another and then identify the faulty processors by logically analyzing all the test outcomes. The interconnection of processors can be adequately represented by a graph $G = (U, E)$, where each node $u_i \in U$ represents a processor and each directed edge $(u_i, u_j) \in E$ represents a link between $u_i$ and $u_j$. The processor $u_i$ can directly test $u_j$ through the link and obtain the outcome $w_{ij}$. The test result is simply a conclusion that the tested node is "faulty" or "fault-free", denoted by the label 1 or 0, respectively, on the corresponding arrow. The graph $G$ with all test outcomes is called a test graph.

There are several strategies for interconnected processors to diagnose faulty processors among themselves. One strategy was initially introduced by Preparata, Metze, and Chien (known as the PMC model) [1]. It is assumed that a fault-free node always gives a correct test result, whereas the test result given by a faulty node is unreliable. Along with the diagnostic model, Preparata et al. proposed a diagnosis strategy called one-step diagnosis, whose target is to identify the exact set of all faulty nodes before their repair or replacement. The definition of (one-step) diagnosability under this strategy is as follows.

Definition 1. Under the one-step diagnosis strategy, a system is said to be t-diagnosable if, for any syndrome $S$, there is at most one fault subset $F \subset U$ that is consistent with $S$, given that $|F| \le t$. The number $t$ is called the diagnosability of the system under consideration.

A fault subset $F$ is consistent with a syndrome $S$ if $S$ does not contradict the test results that are possible under the PMC test model when exactly the nodes in $F$ are faulty. The faulty node set $F$ identified under the above strategy is precise in the sense that all nodes in $F$ are truly faulty and all nodes in $U - F$ are fault-free. For this reason we call this strategy a precise diagnosis strategy. Clearly $t$ depends on the system topology, i.e., the way in which the nodes are linked.

This work was supported by the National Natural Science Foundation of China under the grants No.69973016 and No.69733010.

Since the PMC model was first proposed in 1967, more complete and more practical diagnostic models have been suggested over the following thirty years [2-6]. However, all past work studied the diagnosability of a system by considering units, links, or reliable paths one by one, which resulted in a great deal of repeated and complicated work. In real multiprocessor and network systems, however, there are generally many more fault-free nodes than faulty nodes. In other words, there exist some big reliable blocks in which all nodes are fault-free. Diagnosis efficiency can be improved significantly if all fault-free nodes are clustered into reliable blocks. This is the motivation of this paper.

This paper adopts the PMC model. Furthermore, we assume that the edges are bidirectional, which means that linked processors $u_i$ and $u_j$ test each other, and the weight $w_{ij}/w_{ji}$ is attached to the edge. Obviously, if $w_{ij} = w_{ji} = 0$, then $u_i$ and $u_j$ are either both fault-free or both faulty. Otherwise, at least one of the two nodes is faulty. In addition, if $w_{ij} = 0$ and $w_{ji} = 1$, then the node $u_i$ must be faulty and $u_j$ is undetermined. Indeed, if $u_i$ were fault-free, then $w_{ij} = 0$ would imply that $u_j$ is fault-free as well, and hence $w_{ji} = 0$, which is not the case.

The paper proposes, in Section 2, the concept of a node group, in which there exists a path with the test outcome 0/0 between any two nodes. All nodes in a group are then either all faulty or all fault-free. System-level fault diagnosis is studied on the basis of node grouping, and some significant results are given.
In Section 3, a diagnosis algorithm for one-step t-diagnosable systems based on node grouping is presented. In Section 4, some efficient diagnosis algorithms with practical significance are given. The simulation results show that the algorithms are suitable for large systems with a high probability of correctness.

2 Node Grouping

Definition 2. Suppose $G(U, E)$ is a test graph. A non-empty subgraph $H$ of $G$ is called a group of $G$ iff the following conditions hold.
(1) If there is more than one node in $H$, then any two nodes in $H$ are connected by at least one path whose edge weights are all 0/0.
(2) For any $u_i \in H$, if it is connected to $u_j \notin H$, then $w_{ij}$ and $w_{ji}$ are not both 0.

The following lemmas are apparent.

Lemma 1. Nodes in a group are either all fault-free or all faulty.

A group consisting of fault-free nodes is called a good group, while every node in a bad group is faulty.

Lemma 2. For a node $u_i \in G$, if no node $u_j$ connected to $u_i$ is associated with the weight 0/0, i.e., $w_{ij} \ne 0$ or $w_{ji} \ne 0$ for every such $u_j$, then $u_i$ is the unique node in its group.

Definition 3. For a test graph $G(U, E)$, its diagnosis graph $G^*(U^*, E^*)$ is defined as follows.
(1) Any node $u^* \in U^*$ is a group in $G(U, E)$.
(2) An edge $e^* \in E^*$ connects two groups $u^*_i$ and $u^*_j$ if there exist $u_i \in u^*_i$ and $u_j \in u^*_j$ that are connected by an edge in $E$.

Lemma 3. In a diagnosis graph $G^*(U^*, E^*)$, if $u^*_i$ is a good group and $u^*_j$ is connected to it, then $u^*_j$ is a bad group.

In fact, if $u^*_j$ were good, it would be included in $u^*_i$.

Lemma 4. Any edge connecting two nodes in the same good group has the weight 0/0.

In fact, if both $u_i$ and $u_j$ are in a good group $u^*$ and $w_{ij} \ne 0$, then $u_j$ is faulty, which contradicts Lemma 1.

Theorem 1. For any test graph $G(U, E)$, its diagnosis graph $G^*(U^*, E^*)$ exists and is unique.

Proof. The grouping process for any test graph $G(U, E)$ can be done as follows.

(1) Take a node $u \in U$ and mark $u$ as "checked". If there is no edge from $u$ with the weight 0/0, then $u$ alone forms the group $u^*$. Otherwise do the following.
(2) If $v$ is a node connected to $u$ with the weight 0/0, then put $v$ in $u^*$, a group in $U^*$, and mark $v$ as "checked".
(3) Check all the adjacent nodes of $u$ as in (2).
(4) Take a node $v$ in $u^*$ whose adjacent nodes have not yet been examined, and repeat (2) with $v$ in place of $u$ until all nodes in $u^*$ have been examined.
(5) Take an unchecked node in $U$ and repeat from (1) until all nodes in $U$ are checked.
(6) Make an edge $e^* \in E^*$ connecting two groups $u^*_i$ and $u^*_j$ with $w^*_{ij}\,(w^*_{ji}) = \max\{w_{ij}\,(w_{ji})\}$ over all $u_i \in u^*_i$ and $u_j \in u^*_j$.

From the above procedure, we can see that any node will be marked in (1) or (2), and can be marked only once. Notice that the grouping process is independent of the selection in Steps (1) and (3). In fact, all node pairs connected with the weight 0/0 belong to the same group, and the process goes through all nodes, although the order of checking may differ. The constructive proof shows the existence and uniqueness of the transformation from a test graph to a diagnosis graph, and the computational complexity is $O(n^2)$, where $n$ is the number of nodes in $U$.

3 Diagnosable Systems

In graph theory, $S$ is called an independent point set of a connected graph $G(U, E)$ if $S$ is a subset of $U$ and no two nodes in $S$ are adjacent. If every node in $U - S$ is adjacent to some node in $S$, then $S$ is called a maximum independent point set (MaxIPS).

Lemma 5. Given a test graph $G(U, E)$, all fault-free nodes in $U$ are transformed to a MaxIPS in the diagnosis graph $G^*(U^*, E^*)$.

In fact, fault-free nodes belong to good groups, and good groups cannot be connected with each other directly. Any bad group is connected with at least one good group. Good groups are thus surrounded by bad groups.

Lemma 6 [1]. In a one-step t-diagnosable system $G(U, E)$, the number of directed edges to a node $u \in U$ satisfies $d(u) \ge t$.
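The grouping procedure from the proof of Theorem 1 can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `group_nodes` and `diagnosis_graph`, and the representation of the test graph as a dictionary `outcomes` mapping an edge `(i, j)` to the pair `(w_ij, w_ji)`, are hypothetical choices made here. The sketch uses a breadth-first search over the 0/0 edges instead of the marking order of steps (1)-(5); as the proof notes, the resulting groups do not depend on the order in which nodes are visited.

```python
from collections import deque


def group_nodes(n, outcomes):
    """Partition nodes 0..n-1 into groups connected by 0/0 edges.

    outcomes: dict mapping an edge (i, j) to (w_ij, w_ji), where w_ij is the
    result of node i testing node j (0 = "fault-free", 1 = "faulty").
    This input format is an assumption of the sketch.
    """
    # Keep only the edges whose two test outcomes are both 0.
    zero_adj = {u: set() for u in range(n)}
    for (i, j), (w_ij, w_ji) in outcomes.items():
        if w_ij == 0 and w_ji == 0:
            zero_adj[i].add(j)
            zero_adj[j].add(i)

    group_of = [None] * n   # group index of every node
    groups = []             # list of groups, each a sorted list of nodes
    for start in range(n):
        if group_of[start] is not None:
            continue        # already placed in a group
        gid = len(groups)
        group_of[start] = gid
        members = []
        queue = deque([start])
        while queue:        # breadth-first search along 0/0 edges
            u = queue.popleft()
            members.append(u)
            for v in zero_adj[u]:
                if group_of[v] is None:
                    group_of[v] = gid
                    queue.append(v)
        groups.append(sorted(members))
    return groups, group_of


def diagnosis_graph(outcomes, group_of):
    """Build the edges of G* as {(gi, gj): (w*_ij, w*_ji)} with gi < gj.

    Each weight is the maximum over all member edges, as in step (6).
    """
    star = {}
    for (i, j), (w_ij, w_ji) in outcomes.items():
        gi, gj = group_of[i], group_of[j]
        if gi == gj:
            continue        # edge inside one group: no edge in G*
        if gi > gj:         # normalise the orientation of the group pair
            gi, gj, w_ij, w_ji = gj, gi, w_ji, w_ij
        old = star.get((gi, gj), (0, 0))
        star[(gi, gj)] = (max(old[0], w_ij), max(old[1], w_ji))
    return star
```

For a hypothetical five-node test graph with `outcomes = {(0, 1): (0, 0), (1, 2): (0, 0), (2, 3): (1, 1), (3, 4): (0, 0), (0, 4): (1, 0)}`, `group_nodes(5, outcomes)` yields the two groups `[0, 1, 2]` and `[3, 4]`, and `diagnosis_graph` joins them by a single edge of weight 1/1. With an adjacency structure like this the search runs in time linear in the number of nodes and edges, which stays within the $O(n^2)$ bound stated above.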
Theorem 2. If $G(U, E)$ is one-step t-diagnosable, then all fault-free nodes in $U$ map to a MaxIPS in $G^*(U^*, E^*)$.

Proof. By Lemma 6, every node is tested by at least $t$ nodes. A faulty node is itself one of the at most $t$ faulty nodes, so at least one of its testers is fault-free. That fault-free node is included in a good group. Any faulty node is therefore connected to a fault-free node, and is thus adjacent to the independent point set consisting of fault-free nodes. Therefore, all good groups form a MaxIPS in $G^*(U^*, E^*)$.

Suppose $T$ is a MaxIPS in $G^*(U^*, E^*)$ and $T = \{T_1, T_2, \ldots, T_s\}$, where each $T_i$ $(i = 1, 2, \ldots, s)$ is an independent point (a group) of size $|T_i| = n_i$. Then

$$|T| = \sum_{i=1}^{s} |T_i| = \sum_{i=1}^{s} n_i.$$

Theorem 3. A necessary and sufficient condition for a system $G(U, E)$ to be one-step t-diagnosable is that there is a unique MaxIPS $T$ in $G^*(U^*, E^*)$ consisting of all fault-free nodes, such that $|T| \ge n - t$, where $n = |U|$.

Proof. If such a MaxIPS $T$ exists in $G^*(U^*, E^*)$, then all nodes in $U^* - T$ are adjacent to $T$. That is, any faulty node is tested by at least one fault-free node. Therefore, the system is one-step t-diagnosable. Conversely, by Theorem 2, all fault-free nodes form a MaxIPS $T$. The MaxIPS is unique, and $|T| \ge n - t$.

Based on the above concepts and theorems, the diagnosis procedure is as follows:
1. Map the test graph $G(U, E)$ to the diagnosis graph $G^*(U^*, E^*)$ by node grouping.
2. Find the maximum independent point sets in $G^*(U^*, E^*)$.

3. Determine the corresponding fault-free node set $T$ in $G(U, E)$ with at least $n - t$ nodes.
4. $U - T$ includes all faulty nodes.

Example. Consider a $G(U, E)$ as shown in Fig.1 with 9 nodes and $t = 4$.

(1) Five groups are found: $H[a] = \{0\}$, $H[b] = \{1, 2\}$, $H[c] = \{3, 4, 5, 6\}$, $H[d] = \{7\}$, $H[e] = \{8\}$. The diagnosis graph $G^*(U^*, E^*)$ is shown in Fig.2.

Fig.1. Example of test graph.
Fig.2. Example of diagnosis graph.

(2) From group $b$, we find that $\{b, d\}$ and $\{b, e\}$ are MaxIPSs, and from group $a$ or $c$, we find that $\{a, c\}$ is also a MaxIPS.

(3) To determine the faulty nodes, the MaxIPSs with their corresponding nodes in $U$ are shown in Table 1.

Table 1. MaxIPSs
MaxIPS in $G^*(U^*, E^*)$    Corresponding nodes in $G(U, E)$
$\{b, e\}$                   $\{1, 2, 8\}$
$\{b, d\}$                   $\{1, 2, 7\}$
$\{a, c\}$                   $\{0, 3, 4, 5, 6\}$

(4) The only MaxIPS with at least $n - t = 9 - 4 = 5$ fault-free nodes is $\{a, c\}$, which corresponds to $\{0, 3, 4, 5, 6\}$. The other nodes are faulty.

4 A Probabilistic Diagnosis Procedure

Node grouping provides a complete algorithm for system-level fault diagnosis with a computational complexity of $O(n^2)$, where $n$ is the number of nodes. The algorithm is complete in the sense that the faulty nodes (at most $t$) in a one-step t-diagnosable system can be exactly located by this procedure, and node grouping simplifies the diagnosis process. However, complete algorithms require certain conditions on the number of tests and on the topology of the test graph. The study of diagnosis of heterogeneous systems with random faults is therefore significant [7, 8]. In fact, it is not necessary in practice to manage all faulty nodes in a big system. Especially in a distributed system without a system monitor, the diagnosis may be localized in a part of the system in order to schedule a particular task. In addition, most of the nodes are usually fault-free, a fact that has become more and more apparent. Thus, node grouping can stop early.
As soon as a group with at least $t + 1$ nodes is found, that group must be good, because at most $t$ nodes are faulty and all nodes in a group share the same status. In order to verify the idea, some simulation results are presented in this section. The simulation was done on a PC with a Pentium 166 processor.

It is very likely that there is a single good group, in which all fault-free nodes are connected to form the maximum group. If the number of faulty nodes is not more than $t$, the maximum group should be identified as a good group. Based on the PMC model, assume that a faulty node testing another node produces the result 0 or 1 randomly, each with probability 1/2. In order to identify such a good group, a number of bidirectional edges, denoted by $e$, are required. Given $n = 290$, $t = 10, 25, 50, 100$, and $140$, and $e$ varying in the range [1000, 2400], we randomly generate 10,000 test graphs and find the probability that all fault-free nodes form the maximum group, as shown in Fig.3. From Fig.3 we can see that even when $e \ll nt$, the probability can be very close to 1, and that the probability approaches 1 as $e$ increases. As is well known, $nt$ tests are required for deterministic diagnosis. The probability of correct diagnosis decreases with the number of faulty nodes, as shown in Fig.4 for $n = 40$ and $e = 200$. In addition, the probability is independent of the topology of the system, which is significant for most large systems whose test relationship can be flexible.

The algorithm focuses on finding the maximum group. The mapping from the test graph to the diagnosis graph is not required, so CPU time is largely saved. The CPU time for 10,000 diagnoses with different values of $n$, $t$ and $e$ is shown in Fig.5. The simulation shows that the algorithm provides a high probability of correct diagnosis with few test edges, regardless of the system topology.

Fig.3. Probability of a single good group.
Fig.4. The probability of correct diagnosis vs. number of faulty nodes.
Fig.5. CPU time for given n, t and e.

In a large distributed system, it is significant for a node to know a given number of fault-free nodes that can share its computation task, although the node may not be interested in knowing whether every node in the system is faulty or not. For example, a node with a computation task that requires 20 processors to cooperate will be interested in identifying just 20 fault-free nodes, and will not care about the other nodes for that task. The simulation results are listed in Table 2, where the number in parentheses indicates the case of single good group identification without the given number $d$.
The data show that if the target number of fault-free nodes is given, the number of test edges required can be reduced by about 74% to 85% (more than 80% in all but the first row of Table 2) compared with single good group identification for system diagnosis, while the probability of correct diagnosis remains almost the same.

Table 2. Identifying d Fault-Free Nodes
n      t     e            d     CPU Time (s)    Correct rate (%)
100    25    130 (500)    40    10.8 (17.9)     99.2 (98.2)
150    30    150 (800)    40    15.4 (34)       99.4 (98.8)
200    40    180 (1200)   40    21.3 (68.9)     98.9 (99.1)
250    45    250 (1500)   40    26 (108.0)      98.5 (98.8)
290    50    300 (1600)   40    28 (128.2)      99.2 (98.9)

From Table 2 we can see that identifying a certain number of fault-free nodes yields a significant saving in CPU time compared with whole-system diagnosis, with a similarly high probability of correctness. Fig.6 shows the comparison with the maximum group algorithm.
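The maximum-group experiment described in this section can be approximated with the following Python sketch. It is a rough illustration only, and several details are assumptions made here because the text does not specify them: the `e` bidirectional test edges are drawn uniformly at random among all node pairs, exactly `num_faulty` nodes are chosen uniformly at random as faulty, a trial counts as correct when the largest 0/0 group equals the fault-free node set, and groups are found with a union-find structure rather than with the authors' own code. The names `simulate_once` and `estimate_probability` are hypothetical.

```python
import random


def simulate_once(n, num_faulty, e, rng):
    """Run one random trial; return True if the largest group is exactly
    the set of fault-free nodes (the success criterion assumed here)."""
    assert 0 <= e <= n * (n - 1) // 2, "more edges requested than node pairs"
    faulty = set(rng.sample(range(n), num_faulty))

    # Draw e distinct undirected test edges uniformly at random (assumption).
    edges = set()
    while len(edges) < e:
        i, j = rng.sample(range(n), 2)
        edges.add((min(i, j), max(i, j)))

    def test(tester, tested):
        # PMC model: a fault-free tester reports the truth, a faulty
        # tester reports 0 or 1 with probability 1/2 each.
        if tester in faulty:
            return rng.randint(0, 1)
        return 1 if tested in faulty else 0

    # Union-find: merge the endpoints of every edge with outcome 0/0.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in edges:
        if test(i, j) == 0 and test(j, i) == 0:
            parent[find(i)] = find(j)

    # Collect the groups and pick the largest one.
    groups = {}
    for u in range(n):
        groups.setdefault(find(u), set()).add(u)
    largest = max(groups.values(), key=len)
    return largest == set(range(n)) - faulty


def estimate_probability(n, num_faulty, e, trials=1000, seed=0):
    """Fraction of trials in which the maximum group is the fault-free set."""
    rng = random.Random(seed)
    hits = sum(simulate_once(n, num_faulty, e, rng) for _ in range(trials))
    return hits / trials
```

A call such as `estimate_probability(40, 5, 200)` mirrors the setting $n = 40$, $e = 200$ used for Fig.4, although the figures it produces depend on the assumptions above and need not match the reported curves. Because only the largest group is needed, the sketch never builds the diagnosis graph, which matches the remark that the mapping from the test graph to the diagnosis graph is not required.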

Fig.6. CPU time comparison.

5 Conclusions

Dependability is becoming more and more critical for large systems with the popularization of network and multiprocessor system applications. This paper presents a new technique of node grouping for system-level fault diagnosis that reduces the complexity of diagnosing large systems. The technique transforms a complicated system into a group network, which is called a diagnosis graph. It is proven that the transformation leads to a unique diagnosis graph, which eases system diagnosis. The paper then systematically studies the one-step t-fault diagnosis problem based on node grouping by means of the concept of independent point sets, and gives a simple necessary and sufficient condition. A diagnosis procedure for t-diagnosable systems is presented. Furthermore, the paper presents an efficient probabilistic diagnosis algorithm for practical applications, based on the belief that most of the nodes in a system are fault-free. Another algorithm, which identifies a given number of fault-free nodes, is more efficient and significant in practice. The simulation results show that the probabilistic diagnosis provides a high probability of correct diagnosis at low diagnosis cost, and is suitable for systems of any kind of topology.

References

[1] Preparata F P, Metze G, Chien R T. On the connection assignment problem of diagnosable systems. IEEE Trans. Electronic Computers, 1967, 16(12): 848-854.
[2] Barsi F, Grandoni F, Maestrini P. A theory of diagnosability of digital systems. IEEE Trans. Computers, 1976, 25(7): 585-593.
[3] Chwa K Y, Hakimi S L. Schemes for fault-tolerant computing: A comparison of modularly redundant and t-diagnosable systems. Information and Control, 1981, 49(2): 212-238.
[4] Malek M. A comparison connection assignment for diagnosis of multiprocessor systems. In Proc. 7th Symp. Comput. Architecture, IEEE, France, May, 1980, pp.31-36.
[5] Maheshwari S N, Hakimi S L. On models for diagnosable system and probabilistic fault diagnosis. IEEE Trans. Computers, 1976, 25(3): 228-236.
[6] Chen T H. Fault Diagnosis and Fault Tolerance: A Systematic Approach to Special Topics. Berlin: Springer-Verlag, 1992.
[7] Andrzej Pelc. Optimal diagnosis of heterogeneous systems with random faults. IEEE Trans. Computers, 1998, 47(3): 298-304.
[8] Krzysztof Diks, Andrzej Pelc. Globally optimal diagnosis in system with random faults. IEEE Trans. Computers, 1997, 46(2): 200-204.

ZHANG Dafang, born in 1959, is a professor, Ph.D. supervisor and director of the Department of Computer Science, Hunan University. Prof. Zhang is engaged in research and teaching of test & diagnosis, fault-tolerant computing, and network and communication technology. He is now in charge of 3 projects supported by the National Natural Science Foundation of China and 1 project supported by the National 863 Programme. He has published 11 books and edited 4 books. Prof. Zhang is a member of the Technical Committee on Fault-Tolerant Computing of the China Computer Federation, deputy director of a technical group on testing and diagnosis, and a member of IEEE. He also serves as vice-president of the Research Committee of the China Computer Continuing Education.

XIE Gaogang, born in 1974, received the M.S. degree in computer science from Hunan University. He is currently pursuing his Ph.D. degree in computer science at Hunan University and the Institute of Computing Technology, The Chinese Academy of Sciences. His research interests include test & diagnosis, network and communication technology, and distributed computing. He has published more than 10 papers in journals and proceedings.