Coordinated Decentralized Protocols for Failure Diagnosis of Discrete Event Systems

Discrete Event Dynamic Systems: Theory and Applications, 10, 33-86 (2000). © 2000 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

RAMI DEBOUK, ridebouk@eecs.umich.edu
STÉPHANE LAFORTUNE, stephane@eecs.umich.edu
DEMOSTHENIS TENEKETZIS, teneket@eecs.umich.edu
Department of Electrical Engineering and Computer Science, The University of Michigan, 1301 Beal Avenue, Ann Arbor, MI 48109-2122, USA

Abstract. We address the problem of failure diagnosis in discrete event systems with decentralized information. We propose a coordinated decentralized architecture consisting of local sites communicating with a coordinator that is responsible for diagnosing the failures occurring in the system. We extend the notion of diagnosability, originally introduced in Sampath et al. (1995) for centralized systems, to the proposed coordinated decentralized architecture. We specify three protocols that realize the proposed architecture; each protocol is defined by the diagnostic information generated at the local sites, the communication rules used by the local sites, and the coordinator's decision rule. We analyze the diagnostic properties of each protocol. We also state and prove conditions for a language to be diagnosable under each protocol. These conditions are checkable off-line. The on-line diagnostic process is carried out using the diagnosers introduced in Sampath et al. (1995) or a slight variation of these diagnosers.
The key features of the proposed protocols are: (i) each of them achieves, under its own set of assumptions, the same diagnostic performance as the centralized diagnoser; and (ii) they highlight the performance vs. complexity tradeoff that arises in coordinated decentralized architectures. The correctness of two of the protocols relies on some stringent global ordering assumptions on message reception at the coordinator's site, the relaxation of which is briefly discussed.

Keywords: failure diagnosis, decentralized information, diagnostic protocols

1. Introduction

Failure detection and isolation is an important task in the automatic control of large complex systems. To guarantee reliable system performance, the control engineer must ensure that the system is running safely within its normal boundaries. Consequently, the problem of failure diagnosis has received considerable attention in the literature. Many schemes have been proposed to approach this problem, ranging from fault-tree (Lapp and Powers, 1977) and analytical redundancy (Willsky, 1976; Frank, 1990) methods to discrete event system (DES) approaches (Sampath et al., 1995; Lin, 1994; Bavishi and Chong, 1994; Holloway and Chand, 1994; Boubour et al., 1997; Cassandras and Lafortune, 1998), model-based reasoning (Davis and Hamscher, 1992), and expert systems (Scherer and White, 1987) methods. For a brief description of these methods and additional references, the interested reader is referred to Pouliezos and Stavrakakis (1994) and the introduction of Sampath et al. (1995).

Almost all of the abovementioned approaches have been developed for systems where the information used for fault diagnosis is centralized. A notable exception is Holloway and Chand (1994), where the authors present a distributed fault monitoring method based on time templates. Time-template monitoring is reported to have the advantage of being easily implemented in distributed control architectures.

Many systems are decentralized in nature; for instance, the majority of complex technological systems (computer and communication networks, manufacturing, process control and power systems, etc.) are informationally decentralized. In decentralized information systems there are several work stations (decision makers, controllers, diagnosers), each having access to its own local information. The stations may communicate and exchange limited information with each other. Since this information is exchanged in real time and over channels of limited capacity, there are propagation delays, along with faults and transmission errors. Thus, the information available to each station is incomplete, delayed, and possibly erroneous. Hence, the approaches to failure diagnosis mentioned above do not apply directly to informationally decentralized systems. Consequently, it is important to develop diagnostic methodologies for informationally decentralized systems. This fact is also recognized in Holloway and Chand (1994) and Boubour et al. (1997).

In this paper, we investigate failure diagnosis problems in DES under decentralized information. Having adopted a DES approach to failure diagnosis, we extend the notion of diagnosability, introduced in Sampath et al. (1995) for centralized systems, to a coordinated decentralized architecture consisting of local sites communicating with a coordinator that is responsible for diagnosing the failures occurring in the system.
We present three specific protocols that realize the architecture under consideration. A protocol specifies the diagnostic information generated at each local site, the communication rules used by the local sites, and the decision rule for failure diagnosis employed by the coordinator. We present and discuss the diagnostic properties of the suggested protocols. We state and prove conditions for a language to be diagnosable under these protocols and provide off-line tests to check the diagnosability property. The on-line diagnostic process is carried out by the diagnosers introduced in Sampath et al. (1995) or a slight variation of these diagnosers.

The key features of the coordinated decentralized protocols presented in this paper are: first, each of them performs as well as the centralized diagnoser under its own set of assumptions; and second, they highlight the performance vs. complexity tradeoff that arises in coordinated decentralized architectures. The correctness of two of the protocols relies on some stringent global ordering assumptions on message reception at the coordinator's site, the relaxation of which is briefly discussed.

This paper is organized as follows. In Section 2, we present some preliminary definitions and results that are critical for the development of the technical results in this paper. We provide an overview of the coordinated decentralized architecture under consideration in Section 3. We specify three protocols that realize this architecture in Sections 4, 5, and 6. We describe each protocol in detail; that is, we precisely specify the diagnostic information generated at local sites, the communication rules used between the local sites and the coordinator, and the coordinator's decision rule for failure diagnosis. We analyze the diagnostic properties of each protocol, and derive conditions that ensure diagnosability of a language under each protocol. We present and discuss the performance vs. complexity tradeoff highlighted by the three protocols and the relaxation of the ordering assumption in Section 7. We draw some conclusions and discuss the contribution of the paper in Section 8.

2. Preliminaries

2.1. The System Model

The system to be diagnosed is modeled as a finite state machine (FSM)

$$G = (X, \Sigma, \delta, x_0) \tag{1}$$

where $X$ is the state space, $\Sigma$ is the set of events, $\delta$ is the partial transition function, and $x_0$ is the initial state of the system. The model $G$ accounts for the normal and failed behavior of the system. The behavior of the system is described by the prefix-closed language (Ramadge and Wonham, 1989) $L(G)$ generated by $G$. $L(G)$ is a subset of $\Sigma^*$, where $\Sigma^*$ denotes the Kleene closure of the set $\Sigma$ (Hopcroft and Ullman, 1979). In this paper we will use the language $L(G)$, or simply $L$, and the system interchangeably.

Some of the events in $\Sigma$ are observable, i.e., their occurrence can be observed, while the rest are unobservable. Thus, the event set is partitioned as $\Sigma = \Sigma_o \,\dot\cup\, \Sigma_{uo}$, where $\Sigma_o$ represents the set of observable events and $\Sigma_{uo}$ the set of unobservable events. The observable events in the system may be one of the following: commands issued by the controller, sensor readings occurring after the execution of those commands, and changes in sensor readings. The unobservable events may be failure events or other events that cause changes in the system state not recorded by sensors (see Sampath, 1995; Sampath et al., 1996).

Let $\Sigma_f \subseteq \Sigma$ denote the set of failure events which are to be diagnosed. We assume, without loss of generality, that $\Sigma_f \subseteq \Sigma_{uo}$, since an observable failure event can be trivially diagnosed.
Our objective is to identify the occurrence, if any, of the failure events, given that in the traces generated by the system, only the events in $\Sigma_o$ are observed. In this regard, we partition the set of failure events into disjoint nonempty sets corresponding to different failure types

$$\Sigma_f = \Sigma_{f1} \,\dot\cup\, \Sigma_{f2} \,\dot\cup\, \cdots \,\dot\cup\, \Sigma_{fm}. \tag{2}$$

Let $\Pi_f$ denote this partition. For the motivation of such a partition, the reader is referred to Sampath et al. (1995) and Sampath et al. (1996). Hereafter, when we write "a failure of type $F_i$ has occurred," we will mean that some event of the set $\Sigma_{fi}$ has occurred.
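The model of Equation (1) and the event partitions above can be written down as a small data structure. The following is a minimal sketch only: the class `FSM`, the helper `check_event_partition`, and the four-state machine with events `a`, `b`, `c`, `f` are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class FSM:
    """G = (X, Sigma, delta, x0) with a partial transition function."""
    states: set
    events: set
    delta: dict  # (state, event) -> next state; missing keys mean "undefined"
    x0: object

    def generates(self, trace):
        """Return True if `trace` belongs to the prefix-closed language L(G)."""
        x = self.x0
        for event in trace:
            if (x, event) not in self.delta:
                return False
            x = self.delta[(x, event)]
        return True


def check_event_partition(events, observable, unobservable, failure_types):
    """Check Sigma = Sigma_o U Sigma_uo (disjoint), Sigma_f within Sigma_uo,
    and that the failure types are nonempty and pairwise disjoint."""
    failures = set().union(*failure_types.values())
    pairwise_disjoint = sum(len(v) for v in failure_types.values()) == len(failures)
    return (
        observable | unobservable == events
        and not (observable & unobservable)
        and failures <= unobservable
        and all(failure_types.values())
        and pairwise_disjoint
    )


# Hypothetical toy system: "f" is an unobservable failure event of type F1.
G = FSM(
    states={0, 1, 2, 3},
    events={"a", "b", "c", "f"},
    delta={(0, "f"): 1, (0, "a"): 2, (1, "a"): 3, (2, "b"): 2, (3, "c"): 3},
    x0=0,
)
partition_ok = check_event_partition(
    G.events, {"a", "b", "c"}, {"f"}, {"F1": {"f"}}
)
```

In this toy machine the traces `f a c` and `a b` are both generated, while `a c` is not, which is what makes the later observation of `c` informative about the failure.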

2.2. Notation

The empty trace is denoted by $\epsilon$. Let $\bar{s}$ denote the prefix-closure of any trace $s \in \Sigma^*$. We define $\|s\|$ to be the length of trace $s$. Whenever we say that there exists a trace $s$ of arbitrarily long length having a given property, we mean the following: for all integers $n \in \mathbb{N}$, there exists $s \in \Sigma^*$ such that $\|s\| > n$ and $s$ possesses the given property.

We denote by $L/s$ the post-language of $L$ after $s$, i.e.,

$$L/s = \{ t \in \Sigma^* : st \in L \}. \tag{3}$$

We define the projection $P : \Sigma^* \to \Sigma_o^*$ in the usual manner (Ramadge and Wonham, 1989):

$$P(\epsilon) = \epsilon, \quad P(\sigma) = \sigma \text{ if } \sigma \in \Sigma_o, \quad P(\sigma) = \epsilon \text{ if } \sigma \in \Sigma_{uo}, \quad P(s\sigma) = P(s)P(\sigma), \ s \in \Sigma^*, \ \sigma \in \Sigma. \tag{4}$$

The inverse projection operator $P_L^{-1}$ is defined as

$$P_L^{-1}(y) = \{ s \in L : P(s) = y \}. \tag{5}$$

Let $s_f$ denote the final event of trace $s$. We define

$$\Psi(\Sigma_{fi}) = \{ s\sigma_f \in L : \sigma_f \in \Sigma_{fi} \}, \tag{6}$$

i.e., $\Psi(\Sigma_{fi})$ denotes the set of all traces that end in a failure event belonging to the class $\Sigma_{fi}$. Consider $\sigma \in \Sigma$ and $s \in \Sigma^*$. We use the notation $\sigma \in s$ to denote that $\sigma$ is an event in the trace $s$. With slight abuse of notation, we write $\Sigma_{fi} \in s$ to denote the fact that $\sigma_f \in s$ for some $\sigma_f \in \Sigma_{fi}$, or formally, $\bar{s} \cap \Psi(\Sigma_{fi}) \neq \emptyset$.

We also define

$$X_o = \{x_0\} \cup \{ x \in X : x \text{ has an observable event into it} \}. \tag{7}$$

Let $L(G, x)$ denote the set of all traces that originate from state $x$ of $G$. We define

$$L_o(G, x) = \{ s \in L(G, x) : s = u\sigma, \ u \in \Sigma_{uo}^*, \ \sigma \in \Sigma_o \} \tag{8}$$

and

$$L_\sigma(G, x) = \{ s \in L_o(G, x) : s_f = \sigma \}. \tag{9}$$

$L_o(G, x)$ denotes the set of all traces that originate from state $x$ and end at the first observable event, while $L_\sigma(G, x)$ denotes those traces in $L_o(G, x)$ that end with the particular observable event $\sigma$.

The generator $G'$ (see Sampath et al., 1995; Sampath, 1995) is the nondeterministic FSM

$$G' = (X_o, \Sigma_o, \delta_{G'}, x_0), \tag{10}$$

where $X_o$, $\Sigma_o$, and $x_0$ are defined as previously, and the transition relation of $G'$ is given by $\delta_{G'} \subseteq (X_o \times \Sigma_o \times X_o)$ and is defined as follows:

$$(x, \sigma, x') \in \delta_{G'} \ \text{ if } \ \delta(x, s) = x' \text{ for some } s \in L_\sigma(G, x). \tag{11}$$

It is easy to verify that $L(G') = P(L)$, where

$$P(L) = \{ t : t = P(s) \text{ for some } s \in L \}. \tag{12}$$

2.3. Definition of Diagnosability

Loosely speaking, a language is said to be diagnosable with respect to a set of observable events and a failure partition if the occurrence of any failure can be detected within a finite delay using the history of observable events. More rigorously, diagnosability is defined as follows (Sampath et al., 1995; Sampath, 1995):

Definition 1. A prefix-closed and live language $L$ is said to be diagnosable with respect to the projection $P$ and with respect to the partition $\Pi_f$ on $\Sigma_f$ if the following holds:

$$(\forall i \in \Pi_f)(\exists n_i \in \mathbb{N})(\forall s \in \Psi(\Sigma_{fi}))(\forall t \in L/s)(\|t\| \geq n_i \Rightarrow D)$$

where the diagnosability condition $D$ is

$$\omega \in P_L^{-1}(P(st)) \Rightarrow \Sigma_{fi} \in \omega.$$

(A language $L$ is live if for all $s \in L$, there exists $\sigma \in \Sigma$ such that $s\sigma \in L$.)

Note here that the above definition is only applicable to centralized systems, since it assumes the availability of all the system information at one (centralized) site: there is only one projection $P$ that observes the behavior of the system, in addition to a single inverse projection $P_L^{-1}$, and both are used to check the diagnosability condition $D$.

2.4. The Diagnoser

The diagnoser is an FSM built from the system model $G$. This machine is used to perform diagnostics when it observes the behavior of the system on-line. We first define the set of failure labels $\Delta_f = \{F_1, F_2, \ldots, F_m\}$, where $|\Pi_f| = m$, and the complete set of possible labels

$$\Delta = \{N\} \cup 2^{\Delta_f}. \tag{13}$$

Here $N$ is to be interpreted as meaning "normal," while $F_i$, $i \in \{1, 2, \ldots, m\}$, is to be interpreted as meaning that a failure of type $F_i$ has occurred. Recall, from Equation (7), the definition of $X_o$ and define

$$Q_o = 2^{X_o \times \Delta}. \tag{14}$$
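Before turning to the construction of the diagnoser, the projection of Equation (4), the inverse projection of Equation (5), and the diagnosability condition $D$ of Definition 1 can be illustrated on a finite sample of traces. This is only a sketch: a real language $L$ is generally infinite, and the sample `L_SAMPLE`, the event names, and the helper names below are hypothetical and are not taken from the paper.

```python
def project(trace, observable):
    """P: erase every event that is not observable (Equation 4)."""
    return tuple(e for e in trace if e in observable)


def inverse_projection(language, observable, y):
    """P_L^{-1}(y): all traces of the (finite) language whose projection is y."""
    return {s for s in language if project(s, observable) == tuple(y)}


def contains_failure(trace, failure_events):
    """The paper's 'Sigma_fi in s': some failure event of the type occurs in s."""
    return any(e in failure_events for e in trace)


# Hypothetical finite, prefix-closed sample standing in for L.
SIGMA_O = {"a", "b", "c"}
SIGMA_F1 = {"f"}  # unobservable failure events of type F1
L_SAMPLE = {(), ("f",), ("a",), ("f", "a"), ("a", "b"), ("f", "a", "c")}

# After observing "a", a faulty and a non-faulty trace are both consistent.
after_a = inverse_projection(L_SAMPLE, SIGMA_O, ("a",))
verdicts_after_a = {contains_failure(w, SIGMA_F1) for w in after_a}

# After observing "a c", every consistent trace contains the failure,
# so condition D holds for this observation.
after_ac = inverse_projection(L_SAMPLE, SIGMA_O, ("a", "c"))
d_holds_after_ac = all(contains_failure(w, SIGMA_F1) for w in after_ac)
```

The diagnoser introduced next can be read as a finite-state way of tracking this set of consistent traces, summarized by state estimates and failure labels, without enumerating the language.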

The diagnoser for $G$ is the FSM

$$G_d = (Q_d, \Sigma_o, \delta_d, q_0) \tag{15}$$

where $Q_d$, $\Sigma_o$, $\delta_d$, and $q_0$ have the usual interpretation of state space, event set, transition function, and initial state. The initial state of the diagnoser is defined to be $\{(x_0, \{N\})\}$. The transition function $\delta_d$ of the diagnoser is constructed in a similar manner to the transition function of an observer of $G$ (Hopcroft and Ullman, 1979), with an additional aspect that includes attaching failure labels to the states and propagating these labels from state to state. For more information about the construction of the diagnoser, the reader is referred to Sampath et al. (1995) and Sampath (1995). The state space $Q_d$ is the resulting subset of $Q_o$ composed of the states of the diagnoser that are reachable from $q_0$ under $\delta_d$. Since the state space $Q_d$ of the diagnoser is a subset of $Q_o$, a state $q_d$ of $G_d$ is of the form

$$q_d = \{(x_1, l_1), \ldots, (x_n, l_n)\},$$

where $x_i \in X_o$ and $l_i \in \Delta$.

Next, we provide some definitions that are necessary in order to state the main diagnosability result for centralized systems in Section 2.5. For a detailed discussion and interpretation of this material the reader is referred to Sampath et al. (1995) and Sampath (1995).

Definition 2 (Definition 6-1 in Sampath et al., 1995). A state $q \in Q_d$ is said to be $F_i$-certain if $\forall (x, l) \in q$, $F_i \in l$.

Definition 3 (Definition 6-3 in Sampath et al., 1995). A state $q \in Q_d$ is said to be $F_i$-uncertain if $\exists (x, l), (y, l') \in q$, $x$ not necessarily distinct from $y$, such that $F_i \in l$ and $F_i \notin l'$.

Definition 4 (Definition 7 in Sampath et al., 1995). A set of states $x_1, x_2, \ldots, x_n \in X$ is said to form a cycle in $G$ if $\exists s \in L(G, x_1)$ such that $s = \sigma_1 \sigma_2 \cdots \sigma_n$, and $\delta(x_l, \sigma_l) = x_{(l+1) \bmod n}$, $l = 1, 2, \ldots, n$.

Definition 5 (Definition 8 in Sampath et al., 1995).
A set of $F_i$-uncertain states $q_1, q_2, \ldots, q_n \in Q_d$ is said to form an $F_i$-indeterminate cycle if

1) $q_1, q_2, \ldots, q_n$ form a cycle in $G_d$ with $\delta_d(q_l, \sigma_l) = q_{l+1}$, $l = 1, 2, \ldots, n-1$, and $\delta_d(q_n, \sigma_n) = q_1$, where $\sigma_l \in \Sigma_o$, $l = 1, 2, \ldots, n$.

2) $\exists (x_l^k, \ell_l^k), (y_l^r, \tilde{\ell}_l^r) \in q_l$, $x_l^k$ not necessarily distinct from $y_l^r$, $l = 1, 2, \ldots, n$, $k = 1, 2, \ldots, m$, and $r = 1, 2, \ldots, m'$ such that

a) $F_i \in \ell_l^k$, $F_i \notin \tilde{\ell}_l^r$ for all $l$, $k$, and $r$.

b) The sequences of states $\{x_l^k\}$, $l = 1, 2, \ldots, n$, $k = 1, 2, \ldots, m$, and $\{y_l^r\}$, $l = 1, 2, \ldots, n$, $r = 1, 2, \ldots, m'$, form cycles in $G'$ with
$(x_l^k, \sigma_l, x_{l+1}^k) \in \delta_{G'}$, $l = 1, 2, \ldots, n-1$, $k = 1, 2, \ldots, m$,
$(x_n^k, \sigma_n, x_1^{k+1}) \in \delta_{G'}$, $k = 1, 2, \ldots, m-1$, and $(x_n^m, \sigma_n, x_1^1) \in \delta_{G'}$,
and
$(y_l^r, \sigma_l, y_{l+1}^r) \in \delta_{G'}$, $l = 1, 2, \ldots, n-1$, $r = 1, 2, \ldots, m'$,
$(y_n^r, \sigma_n, y_1^{r+1}) \in \delta_{G'}$, $r = 1, 2, \ldots, m'-1$, and $(y_n^{m'}, \sigma_n, y_1^1) \in \delta_{G'}$.

An $F_i$-indeterminate cycle in $G_d$ indicates the presence in $L$ of two traces $s_1$ and $s_2$ of arbitrarily long length, such that they both have the same observable projection and $s_1$ contains a failure event from the set $\Sigma_{fi}$ while $s_2$ does not. Finally, the following lemma relates the properties of a diagnoser state to the properties of the traces in the language.

LEMMA 1 (Lemma 2 in Sampath et al., 1995)
i) Let $\delta_d(q_0, u) = q$. If $q$ is $F_i$-certain, then $\forall \omega \in P_L^{-1}(u)$, $\Sigma_{fi} \in \omega$.
ii) If a state $q \in Q_d$ is $F_i$-uncertain, then $\exists s_1, s_2 \in L$ such that $\Sigma_{fi} \in s_1$, $\Sigma_{fi} \notin s_2$, $P(s_1) = P(s_2)$, and $\delta_d(q_0, P(s_1)) = q$.

We note here that all of the notation introduced in this subsection and the previous ones assumes that the set of observable events is $\Sigma_o$. Later on, we will be using subsets of $\Sigma_o$, namely $\Sigma_{o1}$ and $\Sigma_{o2}$; the above notation will still be applicable to these subsets, with the minor change, when necessary, of adding subscripts: a subscript 1 will be used in notation related to $\Sigma_{o1}$, while a subscript 2 will be used in notation related to $\Sigma_{o2}$. In this case, we define $\Sigma_o$ to be $\Sigma_{o1} \cup \Sigma_{o2}$.

2.5. Necessary and Sufficient Conditions for Diagnosability

It is intuitive, based on the definition of diagnosability, the properties of the diagnoser, and Definition 5, that in order for a language to be diagnosable, the diagnoser should not have any $F_i$-indeterminate cycles for any failure type $F_i$. This condition is stated formally as follows:

THEOREM 1 (Theorem 2 in Sampath et al., 1995) A language $L$ is diagnosable with respect to the projection $P$ and the failure partition $\Pi_f$ on $\Sigma_f$ if and only if its diagnoser $G_d$ satisfies the following condition: there are no $F_i$-indeterminate cycles in $G_d$ for all failure types $F_i$.

3. General Specification of the Problem

3.1. A Coordinated Decentralized Architecture

In decentralized systems, the global system information is distributed at several sites. The agents at different sites may communicate and exchange information in real time, or just report some or all of their information to a center that, in general, possesses limited knowledge about the system. For each distinct information flow we obtain a distinct decentralized architecture. In this paper, we restrict attention to a coordinated decentralized architecture with two local sites communicating with a coordinator. This architecture is depicted in Figure 1. In this section, we discuss this architecture. We present protocols that realize the architecture in Sections 4, 5, and 6.

Figure 1. Coordinated decentralized architecture.

In Figure 1, the top block represents the system model, or $G$ in the notation of Section 2.1. $G$ models the synchronization of the interaction of all the components that constitute the system (see Sampath, 1995; Sampath et al., 1996). Each site is composed of two modules: an observation module and a diagnostic module. Site $i$, $i \in \{1, 2\}$, locally observes the system based on its available sensing capabilities. Therefore, a projection $P_i$ is associated with site $i$, where $P_i$ is defined on the set of observable events $\Sigma_{oi}$ (note here that $\Sigma_{o1}$ and $\Sigma_{o2}$ need not be disjoint although sites 1 and 2 may be physically apart). The union of $\Sigma_{o1}$ and $\Sigma_{o2}$ is the set of observable events $\Sigma_o$. Site $i$ locally processes its own observation and generates its diagnostic information. Both sites communicate some form of their diagnostic information to the coordinator. The type of information communicated is determined by the communication rules used by the sites. The task of the coordinator is to process, according to a prescribed decision rule, the messages received from both sites to infer occurrences of failures. If a failure is detected by the coordinator, it is broadcast to the failure recovery module.
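The observation modules of the two sites can be illustrated by applying two local projections to one trace. The sketch below borrows the event sets that the paper later uses in Example 1; the trace itself is an assumption chosen for illustration and is not claimed to belong to $L(G)$ of that example, and the function names are hypothetical.

```python
# Event sets as given in Example 1 of the paper.
SIGMA_O1 = {"a", "c", "d", "e"}   # events observable at site 1
SIGMA_O2 = {"b", "d", "e"}        # events observable at site 2
SIGMA_O = SIGMA_O1 | SIGMA_O2     # all observable events


def project(trace, observable):
    """Local projection P_i: keep only the events site i can observe."""
    return tuple(e for e in trace if e in observable)


def local_observations(trace, sites):
    """Observation seen by each site for one system trace."""
    return {site: project(trace, obs) for site, obs in sites.items()}


# Hypothetical trace; "sigma" stands for the unobservable failure event.
trace = ("b", "a", "c", "sigma", "d")
views = local_observations(trace, {1: SIGMA_O1, 2: SIGMA_O2})
centralized_view = project(trace, SIGMA_O)
shared_events = SIGMA_O1 & SIGMA_O2
```

Here neither site alone sees the full observable string, and the shared event `d` appears in both local views, which is why the coordinator cannot simply concatenate what the sites report.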

We intend to investigate diagnosability properties of the above architecture under the following assumptions.

A1 $L(G)$ is live.
A2 $G$ has no cycles of unobservable events with respect to either $\Sigma_{o1}$ or $\Sigma_{o2}$.
A3 $L(G)$ is not diagnosable with respect to $P_i$ and $\Pi_f$ on $\Sigma_f$, $i = 1, 2$.
A4 There is reliable communication between the local sites and the coordinator, i.e., all messages sent from a local site are received by the coordinator correctly and in order.
A5 Messages communicated between the local sites and the coordinator are received in the order they are sent (globally).
A6 The sets of observable events at each site are common knowledge (Aumann, 1976; Washburn and Teneketzis, 1984) to all sites.
A7 The two sites are allowed to report to the coordinator only some processed version of their raw data.
A8 The coordinator does not have a model of the system, that is, it does not know the dynamics of the system. It has a simple structure; specifically, it has limited memory and limited processing capabilities.

Assumption A1 ensures that there are no deadlocks. This assumption can be relaxed easily as discussed in Sampath (1995) and Sampath et al. (1998). Assumption A2 ensures that observations occur with some regularity with respect to both $P_1$ and $P_2$: since detection of failures is based on observable transitions of the system, we require that $G$ does not generate arbitrarily long sequences of unobservable events with respect to either $P_1$ or $P_2$. Assumption A3 eliminates the trivial case where, even though the observable events are partitioned, the system is still diagnosable (in a centralized setup) with respect to one of the projections and the failure partition. In such a case the decentralized architecture is necessarily diagnosable. Assumption A5 ensures that the global order of all messages received by the coordinator is preserved. Assumptions A4, A6, and A7 are self-explanatory.
Finally, Assumption A8 is consistent with features of hierarchical organizations. Assumptions A1-A8 will be used, even if not explicitly stated, in the derivation of all the results of this paper, unless otherwise specified.

3.2. Definition of Diagnosability

As noted in Section 2.3, the definition of diagnosability in Sampath et al. (1995) (Definition 1 in this paper) assumes centralization of the available information; hence it is not directly applicable to coordinated decentralized systems. Moreover, the coordinated decentralized architecture in Figure 1 represents a class of realizations of the same architecture, where each choice of local diagnostic rules, communication rules, and decision rules defines one realization. Therefore, to define diagnosability for coordinated decentralized systems, we need to account for the rules used to generate local diagnostic information together with the associated communication rules and the coordinator's decision rule for failure diagnosis.

In the proposed coordinated architecture the local agents do not interact with one another; they only communicate with the coordinator, which is assigned the task of detecting and isolating failures. Let $C$ denote the coordinator's diagnostic information. For each sample path of the DES, $C$ is represented by an information set that is protocol-dependent. For instance, in Protocol 3 (cf. Section 6.2.3), $C$ is described by a set of failure labels; in Protocol 2 (cf. Section 5.2.3), $C$ is described by a diagnoser state. The description of $C$ in the case of Protocol 1 is more complex and is presented in Section 4.1.3.

Definition 6. The coordinator's diagnostic information $C$ is said to be $F_i$-certain if, based on $C$, the coordinator is certain that a failure of type $F_i$ has occurred.

We mentioned earlier that a protocol realizes one instance of the coordinated decentralized architecture of Figure 1. We formalize this notion of protocol as follows:

Definition 7. Within the context of the coordinated decentralized architecture described in Section 3.1 and depicted in Figure 1, a protocol is defined by the diagnostic information generated at the local sites, the rules used by the local sites to communicate to the coordinator, and the decision rule used at the coordinator site.

Using Definitions 6 and 7 we can define diagnosability under a given protocol.

Definition 8. A prefix-closed and live language $L$ is said to be diagnosable under a protocol, a set of projections $P_1$, $P_2$, and a failure partition $\Pi_f$ on $\Sigma_f$ if the following holds:

$$(\forall i \in \Pi_f)(\exists n_i \in \mathbb{N})(\forall s \in \Psi(\Sigma_{fi}))(\forall t \in L/s)(\|t\| \geq n_i \Rightarrow C \text{ is } F_i\text{-certain}).$$

Thus diagnosability, as defined above, requires that the detection of any failure be achieved by the coordinator within a finite delay of the occurrence of that failure.

3.3. Objective

No realization of the coordinated decentralized architecture of Section 3.1 can outperform the centralized one. Hence, a desirable objective in realizing such an architecture is to diagnose all failure types that can be diagnosed by the centralized diagnoser. Therefore, the design process should determine a failure diagnosis protocol that performs as well as the centralized diagnoser would. In case this is not feasible, conditions on the system structure may be found to guarantee that the protocol diagnoses all failure types that are diagnosed by the centralized diagnoser. Note here that according to Definition 8, the set of projections and the failure partition are given and fixed; more generally, they could be included in the protocol. The next three sections describe three protocols that achieve the above objective.
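The three ingredients of a protocol in Definition 7 can be pictured as three pluggable functions driven by a common loop. The sketch below is only a structural illustration: the names `Protocol`, `run`, and the "toy" rules are hypothetical, the toy rules are deliberately trivial and are not Protocols 1, 2, or 3 of the paper, and messages are assumed to reach the coordinator in the global order in which they are sent (Assumptions A4 and A5).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Protocol:
    # (local information, observed event) -> new local information
    local_update: Callable[[Any, str], Any]
    # local information -> message for the coordinator (None = send nothing)
    comm_rule: Callable[[Any], Any]
    # (coordinator information C, site id, message) -> new C
    decision_rule: Callable[[Any, int, Any], Any]


def run(protocol, trace, observable, init_local, init_c):
    """Drive the sites and the coordinator along one system trace."""
    local = {site: init_local for site in observable}
    c = init_c
    for event in trace:
        for site in sorted(observable):
            if event in observable[site]:
                local[site] = protocol.local_update(local[site], event)
                message = protocol.comm_rule(local[site])
                if message is not None:
                    # Delivered immediately, i.e., in global sending order.
                    c = protocol.decision_rule(c, site, message)
    return c


# Toy rules (NOT from the paper): each site reports how many events it has
# seen so far, and the coordinator keeps the latest report from each site.
toy = Protocol(
    local_update=lambda info, event: info + (event,),
    comm_rule=lambda info: len(info),
    decision_rule=lambda c, site, message: {**c, site: message},
)

sites = {1: {"a", "c", "d", "e"}, 2: {"b", "d", "e"}}
final_c = run(toy, ("b", "a", "c", "sigma", "d"), sites, (), {})
```

Separating the three callables mirrors the way the paper varies the local information, the communication rule, and the decision rule from one protocol to the next while keeping the architecture fixed.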

4. A Coordinated Decentralized Protocol: Protocol 1

4.1. Specification of the Protocol

In this section, we present a protocol for the preceding coordinated decentralized architecture that is capable of diagnosing the same types of failures as the ones diagnosed using a centralized diagnoser. The specification of the protocol is done under Assumptions A1-A8 of Section 3.1. Hereafter, we will refer to this protocol as Protocol 1. We begin by specifying the type of diagnostic information generated at local sites.

4.1.1. Diagnostic Information at Local Sites

The diagnostic information at the local site is generated by the extended diagnoser defined below. The extended diagnoser for $G$ was first introduced in Sampath (1993), and it is the FSM

$$G_d^e = (Q_d^e, \Sigma_o, \delta_d^e, q_0^e) \tag{16}$$

where $Q_d^e$, $\Sigma_o$, $\delta_d^e$, and $q_0^e$ have the usual interpretation of state space, event set, transition function, and initial state. The initial state of the extended diagnoser is defined to be $\{((x_0, \{N\}), (x_0, \{N\}))\}$. A state $q \in Q_d^e$ is of the form

$$q = \{((x_1', l_1'), (x_1, l_1)), ((x_2', l_2'), (x_2, l_2)), \ldots, ((x_n', l_n'), (x_n, l_n))\}$$

where each $(x, l)$ pair is in $Q_o$, i.e., $x \in X_o$ and $l \in \Delta$. A tuple of $(x, l)$ pairs, say $((x_1', l_1'), (x_1, l_1))$, has the following meaning: $x_1$ is a component of a system state estimate after the occurrence of an observable event and $l_1$ is its failure label, while $x_1'$ is the immediate predecessor state of $x_1$ in $G'$ and $l_1'$ is its corresponding failure label.
The transition function $\delta_d^e$ of the extended diagnoser is constructed in a manner similar to the transition function of the diagnoser $G_d$, with the additional aspect that every state of $G$ that appears in a state component of $G_d$ is associated with its immediate predecessor state in $G'$ (along the sub-trace of events under consideration) and both states carry their labels; these labels are obtained by following the same label propagation rules as in Sampath et al. (1995). The state space $Q_d^e$ is the resulting subset of $Q_o \times Q_o$ composed of the states of the extended diagnoser that are reachable from $q_0^e$ under $\delta_d^e$. By construction, $L(G_d^e) = L(G_d) = P(L)$. We illustrate the construction of extended diagnosers in the following example.

Example 1. Consider the system shown in Figure 2 with $\Sigma = \{a, b, c, d, e, \sigma\}$, $\Sigma_{uo} = \{\sigma\}$, $\Sigma_{f1} = \{\sigma\}$, $\Sigma_{o1} = \{a, c, d, e\}$, and $\Sigma_{o2} = \{b, d, e\}$. The extended diagnosers $G_{d1}^e$ and $G_{d2}^e$ for this system are shown in Figure 3. Consider the state $q = \{(2N, 6N), (5N, 7N)\}$ in $G_{d1}^e$; $q$ is read as follows: the system is either in state 6 with a normal label, or it is in state 7, also with a normal label; state 6 has been reached (by an observable event, possibly preceded by unobservable events) from state 2, while state 7 has been reached (by an observable event, possibly preceded by unobservable events) from state 5. Now the next observable event is $d$: if the system is at state 6, then it transitions into state 8, and since there are no failure events along the path from state 6 to state 8, the resulting component of the new state estimate is $(6N, 8N)$; if the system is at state 7, it transitions into state 10 following the occurrence of the sequence $\sigma d$, i.e., a failure of type $F_1$ has occurred along the path, and the resulting other component of the new state estimate is $(7N, 10F1)$. Therefore the state of $G_{d1}^e$ is $\{(6N, 8N), (7N, 10F1)\}$ after the occurrence of the observable event $d$. All other extended diagnoser states are constructed by following a similar procedure.

Figure 2. The system $G$ for Example 1.

Figure 3. The extended diagnosers $G_{d1}^e$ and $G_{d2}^e$ for Example 1.

We define the state projection $SP : Q_o \times Q_o \to Q_o$ as follows:

$$q = \{((x_1', l_1'), (x_1, l_1)), \ldots, ((x_n', l_n'), (x_n, l_n))\} \ \mapsto \ SP(q) = \{(x_1, l_1), \ldots, (x_n, l_n)\}. \tag{17}$$

Then, with a slight abuse of notation, we have that $SP(G_d^e) = G_d$; hence, one diagnoser state may be associated with more than one extended diagnoser state. Therefore, an extended diagnoser state potentially carries more information than a diagnoser state. In the case of centralized systems, $G_d$ and $G_d^e$ are equivalent from the point of view of diagnosability as defined in Definition 1; it is for that reason that prior work (Sampath et al., 1995; Sampath, 1995; Sampath et al., 1996; Sampath et al., 1998) only considered the simpler $G_d$.

Example 2. Again consider the system shown in Figure 2 with $\Sigma = \{a, b, c, d, e, \sigma\}$, $\Sigma_{uo} = \{\sigma\}$, $\Sigma_{f1} = \{\sigma\}$. The diagnoser $G_d$ and extended diagnoser $G_d^e$ for this system are shown in Figure 4. We can see that the transition structure of $G_d^e$ refines that of $G_d$. In particular, state $6N$ of $G_d$ is associated with states $(4N, 6N)$ and $(8N, 6N)$ in $G_d^e$ since $SP((4N, 6N)) = SP((8N, 6N)) = 6N$.

Figure 4. The diagnoser $G_d$ and extended diagnoser $G_d^e$ for Example 1.
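The state projection of Equation (17) is easy to state in code once an extended diagnoser state is stored as a set of (predecessor pair, current pair) tuples. The sketch below uses the states quoted in Examples 1 and 2; the encoding of labels as frozensets and the helper names `state_projection`, `is_certain`, and `is_uncertain` are assumptions made for illustration, with the last two applying the tests of Definitions 2 and 3 to the projected state.

```python
N = frozenset({"N"})     # label "normal"
F1 = frozenset({"F1"})   # label "failure of type F1 has occurred"


def state_projection(q):
    """SP(q): drop the predecessor pair, keep the current (state, label) pair."""
    return frozenset(current for (_pred, current) in q)


def is_certain(pairs, failure_type):
    """Every (x, l) pair carries the failure label (Definition 2)."""
    return all(failure_type in label for (_x, label) in pairs)


def is_uncertain(pairs, failure_type):
    """Some pair carries the failure label and some pair does not (Definition 3)."""
    labels = [label for (_x, label) in pairs]
    return (any(failure_type in l for l in labels)
            and any(failure_type not in l for l in labels))


# Extended diagnoser states quoted in Example 2: (4N, 6N) and (8N, 6N).
q_a = frozenset({((4, N), (6, N))})
q_b = frozenset({((8, N), (6, N))})

# State of G^e_d1 after the observable event d in Example 1.
q_after_d = frozenset({((6, N), (8, N)), ((7, N), (10, F1))})

same_diagnoser_state = state_projection(q_a) == state_projection(q_b)
projected = state_projection(q_after_d)
```

The two extended states `q_a` and `q_b` collapse to the same diagnoser state, which is the sense in which the extended diagnoser refines the diagnoser; the projected state after `d` is $F_1$-uncertain because one component is labeled $F_1$ and the other $N$.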

We define the unobservable reach of an extended diagnoser state as follows.

Definition 9. Let $q = \{((x_1', l_1'), (x_1, l_1)), \ldots, ((x_n', l_n'), (x_n, l_n))\}$ be a state of the extended diagnoser $G_{dj}^e$, $j \in \{1, 2\}$. Define the set

$$S_j(q) = \{ s \in (\Sigma \setminus \Sigma_{oj})^* : s \in L_\sigma(G, x_k) \text{ for some } \sigma \in \Sigma_{oi}, \ i \in \{1, 2\} \setminus \{j\}, \text{ and some } k \in \{1, \ldots, n\} \}.$$

Then the unobservable reach of $q$ with respect to $\Sigma \setminus \Sigma_{oj}$ is defined as follows:

$$UR_j(q) = \{q\} \cup \bigcup_{s \in S_j(q)} \{((y_s', l_s'), (y_s, l_s))\}$$

where (i) $y_s$ is the successor of some $x_k$, $k \in \{1, \ldots, n\}$, after sub-trace $s \in S_j(q)$, (ii) $y_s'$ is the immediate predecessor along $s$ of $y_s$ in $G'$, and (iii) $l_s$, $l_s'$ are the failure labels corresponding to $y_s$, $y_s'$ obtained by propagating the label $l_k$ of $x_k$ according to the label propagation function defined in Sampath et al. (1995).

The unobservable reach appends to the components of each state of the extended diagnoser $G_{dj}^e$ some additional components (along with failure labels and predecessors) that may have been reached following an additional event or a sequence of events that are not observable by the local site $j$. Note here that in the above definition, $y_s'$ may not be equal to $x_k$. Also note that while we call $UR_j(q)$ the unobservable reach of $q$ with respect to $\Sigma \setminus \Sigma_{oj}$, its definition stipulates that the sub-traces that are used to generate it end with an event in $\Sigma_{oi}$, the other set of observable events.

Example 3. Consider the system discussed in Example 1. The extended diagnosers $G_{d1}^e$ and $G_{d2}^e$ associated with the projections $P_1$ and $P_2$ are shown in Figure 3. Consider the state $q = \{(1N, 3N), (1N, 4N)\}$ in $G_{d2}^e$. To compute the unobservable reach of $q$ with respect to $\Sigma \setminus \Sigma_{o2}$, we first find the set $S_2(q) = \{a, c, ac\}$. The successors of state 3 after sub-traces $a$ and $ac$ are 5 and 7, respectively, while the successor of state 4 after sub-trace $c$ is 6. Therefore,

$$UR_2(q) = \{(1N, 3N), (1N, 4N), (3N, 5N), (5N, 7N), (4N, 6N)\}.$$

All state labels are $N$ since there were no failure events along any sub-trace. Note here that although state 7 is a successor of state 3 along the sub-trace $ac$, the immediate predecessor of 7 in $G'$ (not pictured) is state 5, so the corresponding tuple (after adding the failure labels) is $(5N, 7N)$.

To provide the necessary and sufficient conditions of diagnosability in terms of $G_d^e$, we need the following definitions.

Definition 10. A state $q \in Q_d^e$ is said to be $F_i$-certain if $\forall (x, l) \in SP(q)$, $F_i \in l$.

Definition 11. A state $q \in Q_d^e$ is said to be $F_i$-uncertain if $\exists (x, l), (y, l') \in SP(q)$, $x$ not necessarily distinct from $y$, such that $F_i \in l$ and $F_i \notin l'$.
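The computation of Example 3 can be sketched as a graph search that follows only events site $j$ cannot observe and records a tuple whenever the last event taken is observable by the other site. The sketch below rests on several assumptions: only the transitions of $G$ that Examples 1 and 3 mention explicitly are encoded (the rest of Figure 2 is not reproduced), the label propagation in `propagate` is a simplified stand-in for the function of Sampath et al. (1995), and all function and variable names are hypothetical.

```python
N = frozenset({"N"})


def propagate(label, event, failure_types):
    """Simplified label propagation (an assumption): once a failure event of
    type F_i is traversed, the label N is replaced by F_i."""
    for ftype, events in failure_types.items():
        if event in events:
            return (label - {"N"}) | {ftype}
    return label


def unobservable_reach(q, delta, sigma_oj, sigma_oi, failure_types):
    """q: set of ((pred, pred_label), (state, label)) tuples of G^e_dj."""
    reach = set(q)
    stack = [current for (_pred, current) in q]
    seen = set(stack)
    while stack:
        x, label = stack.pop()
        for (src, event), dst in delta.items():
            if src != x or event in sigma_oj:
                continue  # site j would observe this event: stop exploring
            new_label = propagate(label, event, failure_types)
            if event in sigma_oi:
                # Sub-trace ends with an event seen by the other site.
                reach.add(((x, label), (dst, new_label)))
            if (dst, new_label) not in seen:
                seen.add((dst, new_label))
                stack.append((dst, new_label))
    return frozenset(reach)


# Partial transition function: only transitions stated in Examples 1 and 3.
DELTA = {(3, "a"): 5, (5, "c"): 7, (4, "c"): 6, (6, "d"): 8}
SIGMA_O1 = {"a", "c", "d", "e"}
SIGMA_O2 = {"b", "d", "e"}
FAILURES = {"F1": {"sigma"}}

q = frozenset({((1, N), (3, N)), ((1, N), (4, N))})
ur2 = unobservable_reach(q, DELTA, SIGMA_O2, SIGMA_O1, FAILURES)
```

On this fragment the search adds the tuples $(3N, 5N)$, $(5N, 7N)$, and $(4N, 6N)$ to the two components of $q$, in agreement with $UR_2(q)$ in Example 3; the transition on $d$ out of state 6 is not followed because site 2 observes $d$.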

Definition 12 (Definition 1 in Sampath, 1993). A set of states $q_1, q_2, \ldots, q_n \in Q^e_d$ is said to form a cycle in $G^e_d$ if the following is true: $\delta^e_d(q_l, \sigma_l) = q_{l+1}$, $l = 1, 2, \ldots, n-1$, and $\delta^e_d(q_n, \sigma_n) = q_1$ for some observable events $\sigma_i$, $i = 1, \ldots, n$.

Definition 13 (Definition 2 in Sampath, 1993). A set of $(x_i, l_i)$ pairs, where $(x_i, l_i) \in Q_o$, $i = 1, 2, \ldots, n$, is said to form a matched cycle in $G^e_d$ if $\exists\, q_i \in G^e_d$, $i = 1, 2, \ldots, n$, such that: $((x_i, l_i), (x_{i+1}, l_{i+1})) \in q_{i+1}$, $i = 1, 2, \ldots, n-1$, and $((x_n, l_n), (x_1, l_1)) \in q_1$.

Note that the existence of such a set of $(x_i, l_i)$ pairs has the following two implications (from the construction procedure of $G^e_d$):
1. $q_i$, $i = 1, 2, \ldots, n$, form a cycle in $G^e_d$.
2. $x_i$, $i = 1, 2, \ldots, n$, form a cycle in $G$.

Definition 14 (Definition 3 in Sampath, 1993). A set of states $q_1, q_2, \ldots, q_n \in Q^e_d$ forming a cycle of $F_i$-uncertain states in $G^e_d$ is said to form an $F_i$-indeterminate cycle in $G^e_d$ if the following hold:
1. $\exists$ a set of $(x_j, l_j) \in SP(q_j)$, $j = 1, 2, \ldots, n$, forming a matched cycle in $G^e_d$, with $F_i \in l_j$, $j = 1, 2, \ldots, n$, and
2. $\exists$ a set of $(y_j, l'_j) \in SP(q_j)$, $j = 1, 2, \ldots, n$, forming a matched cycle in $G^e_d$, with $F_i \notin l'_j$, $j = 1, 2, \ldots, n$.

Next we state a result that relates the existence of $F_i$-indeterminate cycles in $G^e_d$ to the existence of $F_i$-indeterminate cycles in $G_d$.

PROPOSITION 1 Consider a system $G$, its diagnoser $G_d$, and its extended diagnoser $G^e_d$. Then there are $F_i$-indeterminate cycles in $G_d$ if and only if there are $F_i$-indeterminate cycles in $G^e_d$.

Proof: Sufficiency ($\Leftarrow$). $G^e_d$ has $F_i$-indeterminate cycles. Consider a set of states $q_k$, $k = 1, \ldots, n$, that form an $F_i$-indeterminate cycle in $G^e_d$. We claim that the set of states $\{p_1, \ldots, p_m\} = SP(\{q_1, \ldots, q_n\})$, $m \leq n$, forms an $F_i$-indeterminate cycle in $G_d$. (Note here that $m \leq n$ since, as discussed earlier, one diagnoser state may be associated with more than one extended diagnoser state.)
The claim can be established as follows: by assumption, there exist two sets of states of the form $(x_j, l_j), (y_j, l'_j) \in SP(q_j)$, $j = 1, \ldots, m$, such that $F_i \in l_j$, but $F_i \notin l'_j$ (cf. Definition 14). Hence the cycle of states $\{p_1, \ldots, p_m\}$ in $G_d$ is an $F_i$-uncertain cycle. Moreover, by the implications of Definition 13, the sets $\{x_j\}$ and $\{y_j\}$ form cycles in $G$. Therefore the resulting cycle in $G_d$ is $F_i$-indeterminate by Definition 5.

Necessity ($\Rightarrow$). $G_d$ has $F_i$-indeterminate cycles. From Definition 5, there exist two traces $s$ and $s'$ in $L(G)$ such that $P(s) = P(s')$, $F_i \in s$, $F_i \notin s'$, and $s$, $s'$ are arbitrarily long. Since $L(G)$ is a regular language, the fact that $s$, $s'$ are arbitrarily long implies that the system will loop in a cycle, say $A$ (respectively $B$), if $s$ (respectively $s'$) is executed. Corresponding to $A$ (respectively $B$) there exists a cycle of pairs $(x_j, l_j)$ (respectively $(y_j, l'_j)$) in $Q_o$, $j = 1, \ldots, n$. Moreover, since $P(s) = P(s')$, the pairs $(x_j, l_j)$ and $(y_j, l'_j)$, $j = 1, \ldots, n$, belong to the same set of states $\{q_1, \ldots, q_n\}$ in $G^e_d$; hence they form matched cycles in $G^e_d$. By the implications of Definition 13 the states $\{q_1, \ldots, q_n\}$ in $G^e_d$ form a cycle, and the fact that $F_i \in l_j$ but $F_i \notin l'_j$ implies that the cycle is $F_i$-indeterminate.

Based on Proposition 1 and Definition 1 we provide a test to check the diagnosability of a language in terms of the extended diagnoser $G^e_d$:

THEOREM 2$^1$ A prefix-closed and live language $L$ is diagnosable with respect to the projection $P$ and the failure partition $\Pi_f$ on $\Sigma_f$ if and only if its extended diagnoser $G^e_d$ satisfies the following condition: there are no $F_i$-indeterminate cycles in $G^e_d$ for all failure types $F_i$.

Proof: The proof is a direct consequence of Definition 1 and Proposition 1.

Having presented the type of diagnostic information generated at the local sites, along with some of its properties, we next define the communication rules used by the diagnosers.

4.1.2. Communication Rules

To define the communication rules, we first note that right after the occurrence of an event that is observable only by one site, say $i$, the state of the extended diagnoser at site $j \neq i$ does not contain the true system state.
Therefore, for the purpose of communicating information from a local site to the coordinator, we need to augment the state of the extended diagnoser with some additional information, the unobservable reach. We define the communication rules $CR := (CR1, CR2)$ as follows:

[CRi], $i = 1, 2$: After the agent at site $i$ observes an event $\sigma \in \Sigma_{oi}$, it communicates to the coordinator the corresponding state $q_i$ of its extended diagnoser $G^e_{di}$, its unobservable reach $UR_i(q_i)$ with respect to $\Sigma \setminus \Sigma_{oi}$, and a status bit, $SB_i$, that takes the value $SB_i = 1$ when $\sigma \in \Sigma_{oj}$, $j \in \{1, 2\}$, $j \neq i$, or $SB_i = 0$ when $\sigma \notin \Sigma_{oj}$.

4.1.3. Decision Rule

The decision rule of the coordinator consists of two components: (1) a rule according to which its information is updated; and (2) a rule according to which failure occurrences are declared and broadcast to the failure recovery module.
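Before the decision rule is detailed, the report that rule [CRi] asks a site to send can be sketched as a small record. This is only an illustration under stated assumptions: the names `Report` and `make_report`, the dictionary of observable event sets, and the sample events are hypothetical, and the extended diagnoser state and its unobservable reach are taken as already computed by the caller.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Report:
    """Message sent by a local site to the coordinator under rule [CRi]."""
    site: int                      # 1 or 2
    state: FrozenSet               # q_i, state of the local extended diagnoser
    unobservable_reach: FrozenSet  # UR_i(q_i)
    status_bit: int                # SB_i


def make_report(site: int, event: str, observable: Dict[int, Set[str]],
                state: FrozenSet, unobservable_reach: FrozenSet) -> Report:
    """Build the report for an event observed at `site`.

    SB_i is 1 when the other site also observes the event, 0 otherwise.
    """
    if event not in observable[site]:
        raise ValueError("a site only reports events it observes")
    other = 2 if site == 1 else 1
    status_bit = 1 if event in observable[other] else 0
    return Report(site, state, unobservable_reach, status_bit)


# Hypothetical observable event sets: "c" is seen by both sites.
observable_events = {1: {"a", "c"}, 2: {"b", "c"}}
only_site_1 = make_report(1, "a", observable_events, frozenset(), frozenset())
both_sites = make_report(1, "c", observable_events, frozenset(), frozenset())
```

Here `only_site_1.status_bit` is 0 and `both_sites.status_bit` is 1, matching the two cases of the status bit in [CRi].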

As stated earlier, the coordinator declares that a failure of type $F_i$ has occurred when its diagnostic information $C$ is $F_i$-certain (cf. Definition 6). To specify the information update rule we first need to define the following operators (Definitions 15 and 16).

Definition 15. Let $q_1 = \{((x_1, l_1), (x'_1, l'_1)), \ldots, ((x_n, l_n), (x'_n, l'_n))\}$ and $q_2 = \{((y_1, l_1), (y'_1, l'_1)), \ldots, ((y_m, l_m), (y'_m, l'_m))\}$ be subsets of $Q_o \times Q_o$. We denote by $\cap^e_i$, $i \in \{L, R\}$, the intersection scheme that acts on $q_1$ and $q_2$, and we define it as follows:

$q_1 \cap^e_i q_2 = \{((z, l), (z', l')) \in Q_o \times Q_o : (z', l') = (x'_k, l'_k) = (y'_j, l'_j)$ for some $k \in \{1, 2, \ldots, n\}$, $j \in \{1, 2, \ldots, m\}$, and $(z, l) = (x_k, l_k)$ if $i = L$, otherwise $(z, l) = (y_j, l_j)\}$.

This intersection scheme is a regular intersection of the components of the two system state estimates along with their failure labels. However, the intersection applies to the components corresponding to the current system state estimates and not to their immediate predecessors. The components of $q_1 \cap^e_i q_2$ corresponding to the immediate predecessors are determined by the value of $i$. The intersection scheme $\cap^e_i$ introduced by Definition 15 is illustrated by the following example:

Example 4. Let $q_1 = \{(6N, 8N), (7N, 10F1)\}$ and $q_2 = \{(3N, 11N), (3N, 10F1), (4N, 8N)\}$. To compute $q_1 \cap^e_L q_2$ we find the common components in the two current system state estimates, namely $8N$ and $10F1$, and we append the predecessors of $8N$ and $10F1$ in $q_1$ to the states to get $q_1 \cap^e_L q_2 = \{(6N, 8N), (7N, 10F1)\}$. Similarly, $q_1 \cap^e_R q_2 = \{(4N, 8N), (3N, 10F1)\}$.

The second operator we introduce is another intersection scheme, and is defined as follows:

Definition 16. Let $q_1 = \{((x_1, l_1), (x'_1, l'_1)), \ldots, ((x_n, l_n), (x'_n, l'_n))\}$ and $q_0 = \{((y_1, l_1), (y'_1, l'_1)), \ldots, ((y_m, l_m), (y'_m, l'_m))\}$ be subsets of $Q_o \times Q_o$.
We denote by $\cap_c$ the intersection scheme that acts on $q_1$ and $q_0$, and we define it as follows:

$q_1 \cap_c q_0 = \{((z, l), (z', l')) \in Q_o \times Q_o : (z, l) = (x_i, l_i) = (y'_j, l'_j)$ for some $i \in \{1, 2, \ldots, n\}$, $j \in \{1, 2, \ldots, m\}$, and $(z', l') = (x'_i, l'_i)\}$.

The intersection scheme $\cap_c$ is illustrated by the following example:

Example 5. Let $q_1 = \{(6N, 8N), (7N, 10F1)\}$ and $q_0 = \{(4N, 6N)\}$. Then $q_1 \cap_c q_0 = \{(6N, 8N)\}$ since the component $10F1$ of $q_1$ was reached from the component $7N$, which is not present in $q_0$.

In addition to the above operators, we need to describe the structure of the coordinator before we precisely specify its information update rule. In addition to the register $C$ where the coordinator stores its current diagnostic information, eight supplementary registers are used for storing messages and previous relevant values necessary for the update of its information. These registers are: R1, R2, R3, R4, $C_{old}$, SB, $SB_{1old}$, and $SB_{2old}$. R1 and R2 hold the latest states of $G^e_{d1}$ and $G^e_{d2}$, respectively; R3 and R4 hold the latest unobservable reaches of $G^e_{d1}$ and $G^e_{d2}$, respectively; $C_{old}$ holds the previous coordinator diagnostic information; SB specifies whether the last observed event is observed by both sites (1) or not (0); and $SB_{1old}$, $SB_{2old}$ provide necessary information to compute the new coordinator diagnostic information.

Table 1. Information update rule at the coordinator site (Protocol 1).

Last report received from $G^e_{d1}$:

Action   SB   $SB_1$   $C$                                         New SB   New $SB_{1old}$   New $SB_{2old}$
DR1      0    0        $(R1 \cap^e_i R4) \cap_c C_{old}$           0        1                 0
DR2      0    1        Wait                                        1        Unmodified        Unmodified
         1    0        Impossible
DR3      1    1        $(R1 \cap^e_i R2) \cap_c C_{old}$           0        1                 1

Last report received from $G^e_{d2}$:

Action   SB   $SB_2$   $C$                                         New SB   New $SB_{1old}$   New $SB_{2old}$
DR4      0    0        $(R2 \cap^e_i R3) \cap_c C_{old}$           0        0                 1
DR5      0    1        Wait                                        1        Unmodified        Unmodified
         1    0        Impossible
DR6      1    1        $(R1 \cap^e_i R2) \cap_c C_{old}$           0        1                 1

Note: the $i$ superscript in $\cap^e_i$ depends on the current values of the flip-flops $SB_{1old}$ and $SB_{2old}$, not shown in this table.

The information update rule is given in Table 1. The rule picks one of the actions DR1 to DR6 depending on the available information, i.e., which site observed the last and previous to the last events, and which site sent the last message to the coordinator.

The rationale behind the actions DR1 to DR6 can be summarized as follows. Once a message from one diagnoser, say $G^e_{d1}$, reaches the coordinator after the occurrence of an observable event, the state of that diagnoser should contain the true system state. Moreover, if the message says that the event is not observed by the other site (site 2), the current unobservable reach of the diagnoser $G^e_{d2}$ also contains the true system state.
Consequently, the logical action is to intersect the state of $G^e_{d1}$ with the unobservable reach of $G^e_{d2}$ using the intersection scheme $\cap^e_i$ (the bits $SB_{1old}$ and $SB_{2old}$ specify the value of $i$ in $\cap^e_i$: if $SB_{1old} = 1$, then $i = L$, that is, the predecessors are appended from the state of $G^e_{d1}$; otherwise $i = R$), and then intersect the result with the old coordinator diagnostic information, using the intersection scheme $\cap_c$, to generate the new coordinator diagnostic information. The last intersection is needed to eliminate the possibility of including any illegal behavior in the coordinator diagnostic information. In case the event is also observed by site 2, the state of $G^e_{d2}$ contains the true system state. Therefore, the logical action in this case is to intersect the states of the diagnosers $G^e_{d1}$ and $G^e_{d2}$ by applying the $\cap^e_i$ intersection, and then refine the result by applying the intersection scheme $\cap_c$ as discussed earlier. Note here that before performing any update of the coordinator diagnostic information, the current coordinator diagnostic information is saved into the register $C_{old}$ for later use. Also, the flip-flops are modified once the update of the coordinator diagnostic information is completed. At reset, R1 and R2 are initialized with the initial states of $G^e_{d1}$ and $G^e_{d2}$, respectively, and R3 and R4 hold the initial unobservable reaches of $G^e_{d1}$ and $G^e_{d2}$, respectively.

Note that the coordinator is not aware of the rationale described above when it updates its diagnostic information and when it declares that a failure of a certain type has occurred. The coordinator simply executes the operations $\cap^e_i$ and $\cap_c$, updates all of its registers, and declares the occurrence of failures according to the decision rule described above.

In summary, the registers of the coordinator are updated according to the information update rule presented in Table 1. Once $C$, the coordinator's diagnostic information, is $F_i$-certain, the coordinator broadcasts to the failure recovery module that a failure of type $F_i$ has occurred.

4.2. Diagnostic Properties of Protocol 1

The diagnostic properties of Protocol 1 are summarized by Theorem 3, the proof of which is based on the following proposition.

PROPOSITION 2 Let $q_1$, $q_2$, and $q$ be the states of the extended diagnosers $G^e_{d1}$, $G^e_{d2}$, and $G^e_d$, respectively, after the system executed the trace $s = s_1 a u b$, where $a, b \in \Sigma_o$ $(= \Sigma_{o1} \cup \Sigma_{o2})$, $u \in \Sigma^*_{uo}$. Denote by $q_{old}$ the state of $G^e_d$ after the execution of $s_1 a$. Then the following is true:

1. if $b \in \Sigma_{o1} \cap \Sigma_{o2}$ then
   (i) $q = (q_1 \cap^e_L q_2) \cap_c q_{old}$, if $a \in \Sigma_{o1}$
   (ii) $q = (q_1 \cap^e_R q_2) \cap_c q_{old}$, otherwise
2. if $b \in \Sigma_{o1} \setminus \Sigma_{o2}$ then
   (i) $q = (q_1 \cap^e_L UR_2(q_2)) \cap_c q_{old}$, if $a \in \Sigma_{o1}$
   (ii) $q = (q_1 \cap^e_R UR_2(q_2)) \cap_c q_{old}$, otherwise
3. if $b \in \Sigma_{o2} \setminus \Sigma_{o1}$ then
   (i) $q = (UR_1(q_1) \cap^e_L q_2) \cap_c q_{old}$, if $a \in \Sigma_{o1}$
   (ii) $q = (UR_1(q_1) \cap^e_R q_2) \cap_c q_{old}$, otherwise.

Proof: Proof of 1. We first note that

$SP(q) \subseteq SP(q_i) \subseteq SP(UR_i(q_i))$, $i \in \{1, 2\}$. (18)

The first inclusion is true since the set of observable events $\Sigma_{oi}$ is a subset of the original set of observable events $\Sigma_o$ and $b \in \Sigma_{o1} \cap \Sigma_{o2}$, and the second inclusion is true by definition (cf. Definition 9). In case (i) we have

$(q_1 \cap^e_L q_2) \cap_c q_{old} = (q_1 \cap_c q_{old}) \cap^e_L q_2$ (19)

$(q_1 \cap_c q_{old}) = q$. (20)

(19) follows from the definition of the intersection operators $\cap_c$ and $\cap^e_L$. (20) is obtained as follows: by definition, $q_1 \cap_c q_{old}$ gives all state estimate tuples in $q_1$ that are reached by the observable event $b$. Since $b$ is the next observable event after $a$ in $s$ and $SP(q) \subseteq SP(q_1)$ by (18), $q_1 \cap_c q_{old}$ is the state of the diagnoser $G^e_d$, which is $q$ by definition. Combining (19) and $SP(q) \subseteq SP(q_2)$ from (18) we obtain $q = (q_1 \cap^e_L q_2) \cap_c q_{old}$.

To prove case (ii), we note that by Definition 15 we can write $q_1 \cap^e_L q_2 = q_2 \cap^e_R q_1$. So by exchanging the roles of $q_1$ and $q_2$, and using the same arguments as in case (i), we have $q = (q_1 \cap^e_R q_2) \cap_c q_{old}$.

Proof of 2. We note first that

$SP(q) \subseteq SP(q_1)$ and $SP(q) \subseteq SP(UR_2(q_2))$. (21)

The inclusions are true since the set of observable events $\Sigma_{oi}$ is a subset of the original set of observable events $\Sigma_o$ and $b \in \Sigma_{o1} \setminus \Sigma_{o2}$. The proof of case (i) proceeds in the same way as the proof of 1 (i) with the minor modification of using $UR_2(q_2)$ instead of $q_2$. To prove (ii) we have

$(q_1 \cap^e_R UR_2(q_2)) \cap_c q_{old} = q_1 \cap^e_R (UR_2(q_2) \cap_c q_{old})$ (22)

$(UR_2(q_2) \cap_c q_{old}) = q$. (23)

(22) follows from the definition of the intersection operators $\cap_c$ and $\cap^e_R$. (23) is obtained as follows: by definition, $UR_2(q_2) \cap_c q_{old}$ gives all state estimate tuples in $UR_2(q_2)$ that are reached by the observable event $b$.
This is true by Definition 9: $UR_2(q_2)$ may include state estimate tuples $((x, l), (x', l'))$ whose current state $x'$ may be reached by a sequence of observable events and not only by one observable event, as in the case of the event $b$; however, in such a case the predecessor state $x$ is by definition the immediate predecessor of $x'$ in $G$, and this predecessor does not belong to any $SP(x_i)$, where $x_i \in q_{old}$. Since $b$ is the next observable event after $a$ in $s$ and $SP(q) \subseteq SP(UR_2(q_2))$ by (21), $UR_2(q_2) \cap_c q_{old}$ is the state of the diagnoser $G^e_d$, which is $q$ by definition. Combining (22) and $SP(q) \subseteq SP(q_1)$ from (21) we obtain $q = (q_1 \cap^e_R UR_2(q_2)) \cap_c q_{old}$.

Proof of 3. Exchange the roles of $q_1$ and $UR_1(q_1)$ with $q_2$ and $UR_2(q_2)$, respectively, and proceed as in the proof of 2.

Proposition 2 can be used to prove the main result concerning the diagnostic properties of Protocol 1.
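The two intersection schemes of Definitions 15 and 16, on which Proposition 2 rests, can be checked against Examples 4 and 5 with a short sketch. It assumes the tuple layout used in those examples, (immediate predecessor, current estimate), and encodes a pair such as $8N$ as `(8, "N")`; the function names `intersect_e` and `intersect_c` are illustrative and not taken from the paper.

```python
from typing import FrozenSet, Tuple

Pair = Tuple[int, str]          # (system state, failure label), e.g. (8, "N")
ExtTuple = Tuple[Pair, Pair]    # (immediate predecessor, current estimate)
ExtState = FrozenSet[ExtTuple]


def intersect_e(q1: ExtState, q2: ExtState, side: str) -> ExtState:
    """Definition 15: intersect on current estimates.

    Predecessors are taken from q1 when side == "L" and from q2 when "R".
    """
    if side not in ("L", "R"):
        raise ValueError("side must be 'L' or 'R'")
    common = {cur for _pred, cur in q1} & {cur for _pred, cur in q2}
    source = q1 if side == "L" else q2
    return frozenset(t for t in source if t[1] in common)


def intersect_c(q1: ExtState, q0: ExtState) -> ExtState:
    """Definition 16: keep tuples of q1 whose predecessor is current in q0."""
    previous = {cur for _pred, cur in q0}
    return frozenset(t for t in q1 if t[0] in previous)


# Example 4 data.
q1 = frozenset({((6, "N"), (8, "N")), ((7, "N"), (10, "F1"))})
q2 = frozenset({((3, "N"), (11, "N")), ((3, "N"), (10, "F1")),
                ((4, "N"), (8, "N"))})
left = intersect_e(q1, q2, "L")    # {(6N, 8N), (7N, 10F1)}
right = intersect_e(q1, q2, "R")   # {(4N, 8N), (3N, 10F1)}

# Example 5 data.
q0 = frozenset({((4, "N"), (6, "N"))})
refined = intersect_c(q1, q0)      # {(6N, 8N)}
```

A coordinator update of the form $(R1 \cap^e_i R2) \cap_c C_{old}$ would then read `intersect_c(intersect_e(r1, r2, side), c_old)` in this encoding, with `side` chosen from the flip-flop values.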

THEOREM 3 (i) The coordinator's diagnostic information $C$ under Protocol 1 is the same as the state of the centralized extended diagnoser $G^e_d$. (ii) Protocol 1 achieves the same diagnostic performance as a centralized diagnoser.

Proof: (i) We prove part (i) by induction on the number of observable events (in $\Sigma_o = \Sigma_{o1} \cup \Sigma_{o2}$) in the trace $s$.

Basis of induction: Let $\|P(s)\| = \|b\| = 1$. In this case $C_{old} = \{((x_0, N), (x_0, N))\}$ by assumption, where $x_0$ is the initial state of the system. Moreover, by assumption both $G^e_{d1}$ and $G^e_{d2}$ have the same initial state $\{((x_0, N), (x_0, N))\}$. If $b \in \Sigma_{o1} \cap \Sigma_{o2}$ then $q_1 = q_2 = q$ by the construction of the diagnosers. Therefore $q_1 \cap^e_i q_2 = q$ and $q \cap_c C_{old} = q$ by definition. If $b \in \Sigma_{o1} \setminus \Sigma_{o2}$ then $q_1 = q$ by construction and $q \subseteq UR_2(q_2)$ as discussed earlier in the proof of Proposition 2. Therefore, $q_1 \cap^e_i UR_2(q_2) = q$, and $q \cap_c C_{old} = q$ by definition. The proof of the case when $b \in \Sigma_{o2} \setminus \Sigma_{o1}$ is symmetric to the case where $b \in \Sigma_{o1} \setminus \Sigma_{o2}$.

Induction step: The proof of the induction step is provided by Proposition 2 since, by Assumptions A4 and A5, every message is received in the order it was sent.

(ii) From part (i) and the specification of the coordinator's decision rule it follows that Protocol 1 achieves the same diagnostic performance as a centralized diagnoser.

Note that, according to Assumption A8, the coordinator has no knowledge of the system model, and has limited memory and limited processing capabilities. Yet, if the coordinator has the memory and processing capabilities required by the decision rule described in Section 4.1.3, it can diagnose the same types of failures as a centralized diagnoser; by receiving the extended diagnoser states (and unobservable reaches) and using the rules $\cap^e_i$, $i = L, R$, and $\cap_c$, the coordinator, in essence, can keep track of the state of the system in the same way as the centralized diagnoser. Consequently, it has the same diagnostic properties as the centralized diagnoser.

4.3. Necessary and Sufficient Conditions for Diagnosability

In Section 4.2, we showed that the information update rule used at the coordinator site reconstructs the centralized diagnoser state. Consequently, the necessary and sufficient conditions for diagnosability with respect to Protocol 1 can be stated with respect to the centralized diagnoser as follows:

THEOREM 4 A live and prefix-closed language $L$ is diagnosable with respect to Protocol 1, the set of projections $P_1$, $P_2$ and the failure partition $\Pi_f$ on $\Sigma_f$ if and only if the diagnoser $G_d$ does not have $F_i$-indeterminate cycles for all failure types $F_i$.

Proof: Sufficiency ($\Leftarrow$). Suppose $G_d$ does not have $F_i$-indeterminate cycles. Then by Proposition 1, $G^e_d$ does not have $F_i$-indeterminate cycles. Consider a trace $st \in L(G)$ such that $s \in \Psi(\Sigma_{fi})$, and $t$ is long enough, i.e., $\|t\| > n$, where $n$ can be arbitrarily large. Then, by assumption, $st'$, $t' \leq t$, does not lead to an $F_i$-indeterminate cycle in $G^e_d$. Consequently, an argument similar to the one used in the proof of Theorem 2 in Sampath et al. (1995)